keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

Bayesian oracle fails in parallel execution once initial points are exhausted #666

Closed · montanier closed this issue 2 years ago

montanier commented 2 years ago

Bug description

When running a parallel search, the Bayesian oracle fails once the initial points are exhausted. The error log is as follows:

Traceback (most recent call last):
  File "tuning.py", line 66, in <module>
    callbacks=[tf.keras.callbacks.EarlyStopping("val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/keras_tuner/engine/base_tuner.py", line 169, in search
    trial = self.oracle.create_trial(self.tuner_id)
  File "/usr/local/lib/python3.7/site-packages/keras_tuner/distribute/oracle_client.py", line 74, in create_trial
    service_pb2.CreateTrialRequest(tuner_id=tuner_id), wait_for_ready=True
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: 'GaussianProcessRegressor' object has no attribute '_x_train'"
        debug_error_string = "{"created":"@1647254964.145628000","description":"Error received from peer ipv4:127.0.0.1:8000","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"Exception calling application: 'GaussianProcessRegressor'
object has no attribute '_x_train'","grpc_status":2}"

We can see in the source code that the _x_train attribute of GaussianProcessRegressor is only initialized in fit, but the subsequent call to _vectorize_trials ends up calling predict before fit has ever been run. https://github.com/keras-team/keras-tuner/blob/a9a384ab4158edb306acbc21e2c7599f79ab8424/keras_tuner/tuners/bayesian.py#L247
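
For illustration, here is a minimal sketch of that failure mode; the class below is a simplified stand-in, not the actual keras-tuner code. Once the initial random trials are exhausted, the oracle asks the regressor for predictions, and any access to _x_train before fit has run raises the AttributeError seen in the log above.

import numpy as np

# Simplified stand-in for the regressor used by the Bayesian oracle
# (illustrative only, not the keras-tuner implementation).
class SketchGaussianProcessRegressor:
    def fit(self, x_train, y_train):
        # _x_train only comes into existence here.
        self._x_train = np.asarray(x_train)
        self._y_train = np.asarray(y_train)

    def predict(self, x):
        # Reads _x_train, so it must only be called after fit().
        x = np.asarray(x)
        nearest = np.argmin(
            np.linalg.norm(self._x_train[:, None, :] - x[None, :, :], axis=-1), axis=0
        )
        return self._y_train[nearest]

gpr = SketchGaussianProcessRegressor()
try:
    gpr.predict(np.zeros((1, 2)))  # predict before any fit, as the oracle does here
except AttributeError as err:
    print(err)  # ... object has no attribute '_x_train'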

Reproduce the bug

Files

All files are stored in the same directory

Dockerfile:

FROM python:3.7-buster
RUN pip install keras-tuner==1.1.0 tensorflow==2.8.0

docker-compose.yml:

version: "3.7"

services:
  search:
    build:
      dockerfile: Dockerfile
      args:
        - KERASTUNER_ORACLE_IP="127.0.0.1"
        - KERASTUNER_ORACLE_PORT="8000"
    volumes:
      - ".:/home"

tuning.py:

import keras_tuner as kt
import tensorflow as tf
import numpy as np

def build_model(hp):
    """Builds a convolutional model."""
    inputs = tf.keras.Input(shape=(28, 28, 1))
    x = inputs
    for i in range(hp.Int("conv_layers", 1, 3, default=3)):
        x = tf.keras.layers.Conv2D(
            filters=hp.Int("filters_" + str(i), 4, 32, step=4, default=8),
            kernel_size=hp.Int("kernel_size_" + str(i), 3, 5),
            activation="relu",
            padding="same",
        )(x)

        if hp.Choice("pooling" + str(i), ["max", "avg"]) == "max":
            x = tf.keras.layers.MaxPooling2D()(x)
        else:
            x = tf.keras.layers.AveragePooling2D()(x)

        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)

    if hp.Choice("global_pooling", ["max", "avg"]) == "max":
        x = tf.keras.layers.GlobalMaxPooling2D()(x)
    else:
        x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)

    optimizer = hp.Choice("optimizer", ["adam", "sgd"])
    model.compile(
        optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"]
    )
    return model

hyperparameters = kt.HyperParameters()

tuner = kt.Tuner(
    hypermodel=build_model,
    oracle=kt.oracles.BayesianOptimization(
        objective=kt.Objective("val_accuracy", "max"),
        hyperparameters=hyperparameters,
        num_initial_points=3,
        max_trials=30)
)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Reshape the images to have the channel dimension.
x_train = (x_train.reshape(x_train.shape + (1,)) / 255.0)[:1000]
y_train = y_train.astype(np.int64)[:1000]
x_test = (x_test.reshape(x_test.shape + (1,)) / 255.0)[:100]
y_test = y_test.astype(np.int64)[:100]

tuner.search(
    x_train,
    y_train,
    steps_per_epoch=600,
    validation_data=(x_test, y_test),
    validation_steps=100,
    callbacks=[tf.keras.callbacks.EarlyStopping("val_accuracy")],
)

tuner.results_summary(num_trials=2)

Commands

Expected behavior

We expect all workers to run without error until the end of the optimization.

Additional context

This error is aggravated when running in TFX: the failure of a single worker (or several) makes the whole tuning operation fail.

Would you like to help us fix it?

I can try. What is the strategy to fix it?

brydon commented 2 years ago

Sounds like this pull request should solve the issue. Does your problem persist if you use the version of keras-tuner from the GitHub repo rather than from pip?

montanier commented 2 years ago

Yes, it looks like this should fix the issue. However, another error seems to have been introduced:

Traceback (most recent call last):
  File "tuning.py", line 66, in <module>
    callbacks=[tf.keras.callbacks.EarlyStopping("val_accuracy")],
  File "/home/keras-tuner/keras_tuner/engine/base_tuner.py", line 178, in search
    self.on_trial_begin(trial)
  File "/home/keras-tuner/keras_tuner/engine/base_tuner.py", line 240, in on_trial_begin
    self._display.on_trial_begin(self.oracle.get_trial(trial.trial_id))
  File "/home/keras-tuner/keras_tuner/engine/tuner_utils.py", line 109, in on_trial_begin
    self.trial_number = int(trial.trial_id) + 1
ValueError: invalid literal for int() with base 10: '9f5711e7578f752031b104016b181877'
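
For illustration, this new failure is easy to reproduce on its own: the display helper in tuner_utils.py assumes sequential integer trial IDs (it computes int(trial.trial_id) + 1, as shown in the traceback), but the trial ID here is a 32-character hex string, so int() cannot parse it. A standalone sketch (not keras-tuner code):

import uuid

# The display code does int(trial.trial_id) + 1, which only works for
# sequential integer ids, not for hex ids like the one in the traceback.
trial_id = uuid.uuid4().hex  # e.g. '9f5711e7578f752031b104016b181877'
try:
    trial_number = int(trial_id) + 1
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '...'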

I was wondering if a test should have been introduced in #664. What do you think?

brydon commented 2 years ago

See #668

Re: a test in #664: #664 extends #650, where a test was introduced. (And both are unrelated to your new issue.)

montanier commented 2 years ago

Thanks :)