keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

gRPC error occurring in worker tuners stalls chief with possible trial loss #915

Open hmf opened 1 year ago

hmf commented 1 year ago

Describe the bug

While trying to refactor and correct a bug of mine, I started getting errors I had not detected before. I made an initial report here, but (as expected) the problem lies elsewhere. I have created a small example that reproduces the same exception with high probability. The problem may be due to some race condition, which I have been able to trigger in nearly every experiment. Here is the error I get in one or more workers:

Best val_loss So Far: 0.13012780249118805
Total elapsed time: 00h 00m 01s
Traceback (most recent call last):
  File "/workspaces/Unsupervised-Anomaly-Detection-with-SSIM-AE/KerasTunerEx1.py", line 101, in <module>
    tuner.search(
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 220, in search
    trial = self.oracle.create_trial(self.tuner_id)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/distribute/oracle_client.py", line 69, in create
_trial
    response = self.stub.CreateTrial(
  File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: too many indices for array: array is 0-dimensional, but 1 were indexed"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Exception calling application: too many indices for array: array is 0-dimensional, but 1 were indexed", grpc_status:2, created_time:"2023-07-12T14:12:41.621822148+00:00"}"
>

Usually, when all workers get several trials to perform, the chief shows the message:

Oracle server on chief is exiting in 40s.The chief will go on with post-search code.

and then terminates gracefully; in that case, no error occurs. Another thing I have noticed is that when the error does occur, the chief usually emits a ConvergenceWarning, as shown below:

$ ./tune_master_ex1.sh 
2023-07-12 15:30:28.331151: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-12 15:30:28.942709: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Keras Tuner Ex1
/home/vscode/.local/lib/python3.10/site-packages/sklearn/gaussian_process/kernels.py:419: ConvergenceWarning: The optimal value found for dimension 0 of parameter length_scale is close to the specified lower bound 1e-05. Decreasing the bound and calling fit again may find a better value.
  warnings.warn(
Oracle server on chief is exiting in 40s.The chief will go on with post-search code.

When the exception above does occur, the chief still prints the message that it will exit in 40 seconds. However, it then stalls indefinitely, and I need to kill it with Ctrl-C or a kill command.

The probability that this error occurs increases with the number of trials, so the code I provide runs 20 trials over a single hyperparameter with about 10 values. I limit the number of epochs per trial to keep testing fast. The exception is not guaranteed to occur in every experiment, but it happens very often; in my original experiments it occurs every time.
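For reference, the setup in the zipped example is roughly the following. This is a minimal sketch, assuming a Gaussian-process-based BayesianOptimization tuner (consistent with the sklearn ConvergenceWarning above); the layer sizes, hyperparameter range, and other values are illustrative and may differ from KerasTunerEx1.py:

import keras
import keras_tuner


def build_model(hp):
    # One hyperparameter with roughly 10 candidate values (8, 16, ..., 80).
    units = hp.Int("units", min_value=8, max_value=80, step=8)
    model = keras.Sequential(
        [keras.layers.Dense(units, activation="relu"), keras.layers.Dense(1)]
    )
    model.compile(optimizer="adam", loss="mse")
    return model


tuner = keras_tuner.BayesianOptimization(
    hypermodel=build_model,
    objective="val_loss",
    max_trials=20,  # the failure becomes more likely as this grows
    overwrite=True,
    directory="../results/keras_tuner_ex1",
    project_name="project_keras_tuner_ex1",
)

# Each trial is limited to a few epochs to keep the experiment short, e.g.:
# tuner.search(x_train, y_train, epochs=2, validation_data=(x_val, y_val))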

To Reproduce

I first launch 3 workers as follows:

cd scripts
./tune_slave_ex1.sh 1
./tune_slave_ex1.sh 2
./tune_slave_ex1.sh 3

I then launch the chief:

./tune_master_ex1.sh 

Usually within 2 attempts, I get the exception. YMMV.
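The launch scripts follow the standard KerasTuner distributed-tuning setup, in which the chief and the workers run the same Python script and the role is selected via environment variables. The sketch below is an assumption about what tune_master_ex1.sh / tune_slave_ex1.sh do before running the script; the address, port, and naming are illustrative:

import os
import sys

# Role passed on the command line: "chief" for the master, "1"/"2"/"3" for workers.
role = sys.argv[1] if len(sys.argv) > 1 else "chief"

os.environ["KERASTUNER_ORACLE_IP"] = "127.0.0.1"  # address of the chief (assumed)
os.environ["KERASTUNER_ORACLE_PORT"] = "8000"     # port of the Oracle gRPC server (assumed)
os.environ["KERASTUNER_TUNER_ID"] = "chief" if role == "chief" else f"tuner{role}"

# Every process then runs the same tuning script; KerasTuner reads the
# variables above to decide whether to start the Oracle server (chief) or
# to request trials from it over gRPC (workers).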

keras_tuner_ex1_1.zip

I did the experiments in a VM with Linux (Ubuntu).

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

I use:

$ python --version
Python 3.10.6

I use VSCode to set up the container and have the following Python packages installed:

Successfully installed MarkupSafe-2.1.3 PyWavelets-1.4.1 absl-py-1.4.0 astunparse-1.6.3 cachetools-5.3.1 certifi-2023.5.7 charset-normalizer-3.2.0 contourpy-1.1.0 cycler-0.11.0 flatbuffers-23.5.26 fonttools-4.40.0 gast-0.4.0 google-auth-2.22.0 google-auth-oauthlib-1.0.0 google-pasta-0.2.0 grpcio-1.56.0 gviz-api-1.10.0 h5py-3.9.0 idna-3.4 imageio-2.31.1 joblib-1.3.1  keras-2.13.1 keras-tuner-1.3.5 kiwisolver-1.4.4 kt-legacy-1.0.5 lazy_loader-0.3 libclang-16.0.0 markdown-3.4.3 matplotlib-3.7.2 networkx-3.1 numpy-1.24.3 oauthlib-3.2.2 opencv-contrib-python-headless-4.8.0.74 opencv-python-headless-4.8.0.74 opt-einsum-3.3.0 packaging-23.1 pillow-10.0.0 protobuf-4.23.4 pyasn1-0.5.0 pyasn1-modules-0.3.0 pyparsing-3.0.9 python-dateutil-2.8.2 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 scikit-image-0.21.0 scikit-learn-1.3.0 scipy-1.11.1 six-1.16.0 tensorboard-2.13.0 tensorboard-data-server-0.7.1 tensorboard_plugin_profile-2.13.0 tensorflow-2.13.0 tensorflow-estimator-2.13.0 tensorflow-io-gcs-filesystem-0.32.0 termcolor-2.3.0 threadpoolctl-3.1.0 tifffile-2023.7.10 tqdm-4.65.0 typing-extensions-4.5.0 urllib3-1.26.16 werkzeug-2.3.6 wrapt-1.15.0

Expected behavior

I expect the chief to terminate and no exceptions to occur in the workers.

Additional context

Besides fixing this issue, I would like to know whether there is some workaround I can use to avoid it. I cannot tell exactly when the workers fail, but I have noticed a drop in CPU usage. The goal here is to run the experiments in as short a time as possible and, as soon as the search terminates, to run some additional evaluation code in the chief (see the sketch below). Because of this issue, I cannot do either automatically. I am also unsure whether trials are lost.
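For context, the post-search step on the chief is essentially the following sketch (the actual evaluation code is more involved and not shown here):

# Intended to run on the chief right after tuner.search(...) returns.
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
best_model = tuner.get_best_models(num_models=1)[0]
print("Best units:", best_hps.get("units"))
# ... further evaluation of best_model would follow here, but the chief
# never reaches this point when it stalls.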

Would you like to help us fix it?

Sure. Might need some guidance.

hmf commented 1 year ago

Based on the error stack trace, I assumed the error applied to all tuners. However, just to check, I repeated the tests with the random and Hyperband tuners. Try as I might, I cannot get the random tuner to fail; maybe I am not trying hard enough. However, I was able to get the Hyperband tuner to fail. Unfortunately, this seems to be another error. Here is the trace:

Traceback (most recent call last):
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 270, in _try_run_and_update_trial
    self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 235, in _run_and_update_trial
    results = self.run_trial(trial, *fit_args, **fit_kwargs)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/tuners/hyperband.py", line 425, in run_trial
    return super().run_trial(trial, *fit_args, **fit_kwargs)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 287, in run_trial
    obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 213, in _build_and_fit_model
    model = self._try_build(hp)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 155, in _try_build
    model = self._build_hypermodel(hp)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/tuners/hyperband.py", line 432, in _build_hypermodel
    model.load_weights(self._get_checkpoint_fname(trial_id))
  File "/home/vscode/.local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/vscode/.local/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ../results/keras_tuner_ex1/project_keras_tuner_ex1/trial_0005/checkpoint
Traceback (most recent call last):
  File "/workspaces/Unsupervised-Anomaly-Detection-with-SSIM-AE/KerasTunerEx2.py", line 129, in <module>
    tuner.search(
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 231, in search
    self.on_trial_end(trial)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 335, in on_trial_end
    self.oracle.end_trial(trial)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/distribute/oracle_client.py", line 90, in end_trial
    self.stub.EndTrial(
  File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: can only concatenate str (not "NoneType") to str"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Exception calling application: can only concatenate str (not \"NoneType\") to str", grpc_status:2, created_time:"2023-07-13T09:17:41.3629395+00:00"}"
>

I am wondering whether the underlying cause might nevertheless be the same. I used the same code but added the Hyperband tuner. The error, however, seems to manifest itself only when I first launch the slaves and then the master. What is also interesting is that the worker actually starts working and reports something (note that overwrite is set to True):

Search: Running Trial #14

Value             |Best Value So Far |Hyperparameter
64                |40                |units
4                 |2                 |tuner/epochs
0008              |None              |tuner/trial_id
2                 |2                 |tuner/bracket
1                 |0                 |tuner/round
2                 |0                 |tuner/initial_epoch

Traceback (most recent call last):
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 270, in _try_run_and_update_trial
    self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 235, in _run_and_update_trial
    results = self.run_trial(trial, *fit_args, **fit_kwargs)
  File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/tuners/hyperband.py", line 425, in run_trial
    return super().run_trial(trial, *fit_args, **fit_kwargs)
  File

I would have expected the worker to poll the chief first and only then start its work. Restarting the workers always produces the same result. As in the case above, once a worker fails, the chief stalls and needs to be killed. To get this to work again, I first launch the workers and then the master.
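For reference, the Hyperband variant in KerasTunerEx2.py is constructed roughly as below. This is a sketch: apart from overwrite=True, the arguments are assumptions (max_epochs is chosen only to be consistent with the small tuner/epochs values in the trial summary above):

tuner = keras_tuner.Hyperband(
    hypermodel=build_model,  # same hypermodel as in the first example
    objective="val_loss",
    max_epochs=4,            # assumption, consistent with tuner/epochs above
    overwrite=True,
    directory="../results/keras_tuner_ex1",
    project_name="project_keras_tuner_ex1",
)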

keras_tuner_ex2_1.zip

EDIT: this error also occurs with the Bayes tuner 8-(