keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

Chief process does not resume tuning after restart due to missing `chief.json` file #976

Closed sfo closed 1 year ago

sfo commented 1 year ago

Describe the bug

In a distributed tuning setting, if the chief process is stopped and then restarted, it starts with trial #1 again instead of resuming at the last unfinished trial.

After some investigation, I found that the BaseTuner class checks whether a tuner-specific file exists (in this case, chief.json): https://github.com/keras-team/keras-tuner/blob/e935ac15d72a2b183b79e8c99bed49f91708681c/keras_tuner/engine/base_tuner.py#L126-L129

However, this file is only created for worker processes, not for the chief process. For the chief, the condition is therefore always false, so it never reloads its state and never resumes tuning.
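The check described above can be illustrated with a small standalone sketch. It does not use keras-tuner itself: the helper names `tuner_state_path` and `should_reload` are hypothetical, and the worker id `tuner0` is an assumption chosen for illustration. It only models the reported logic, namely that state is reloaded when `overwrite` is false and a per-tuner state file already exists.

```python
import json
import os
import tempfile


def tuner_state_path(project_dir, tuner_id):
    # Hypothetical helper: one state file per tuner id, e.g. chief.json.
    return os.path.join(project_dir, f"{tuner_id}.json")


def should_reload(project_dir, tuner_id, overwrite):
    # Resume only when overwriting is disabled and the state file exists.
    return (not overwrite) and os.path.exists(
        tuner_state_path(project_dir, tuner_id)
    )


with tempfile.TemporaryDirectory() as project_dir:
    # Simulate a previous run in which only the worker saved its state.
    with open(tuner_state_path(project_dir, "tuner0"), "w") as f:
        json.dump({}, f)

    worker_resumes = should_reload(project_dir, "tuner0", overwrite=False)
    chief_resumes = should_reload(project_dir, "chief", overwrite=False)

    # Creating chief.json by hand (the workaround) flips the check.
    with open(tuner_state_path(project_dir, "chief"), "w") as f:
        json.dump({}, f)
    chief_resumes_after_workaround = should_reload(
        project_dir, "chief", overwrite=False
    )

print(worker_resumes, chief_resumes, chief_resumes_after_workaround)
# prints: True False True
```

In this model the chief only passes the check once something has written its state file, which matches the behaviour seen in the report.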

Creating the file manually makes the chief process continue as expected.

To Reproduce

Use the code from the documentation, but with RandomSearch instead of Hyperband to ensure a long-running tuning job, and set the parameter overwrite=False.

In two terminals, run the chief and worker processes as described in the documentation, then after a while, kill both processes and run them again.

Expected behavior

The chief process should print `Reloading Tuner from results_dir/mnist/chief.json` and resume with the last running trial.

Additional context

Python 3.10.13, keras-tuner 1.4.6

Would you like to help us fix it?

I'd like to, but time is short, so I may need someone to point me to the right location in the code.

haifeng-jin commented 1 year ago

@sfo Thank you for the issue! Would you like to try removing this line and unindenting the following line? This may work.

sfo commented 1 year ago

Thanks for the hint. Unfortunately, this does not work (AttributeError: 'OracleClient' object has no attribute 'save'). This method is executed in worker processes only and throws an exception if changed accordingly. The chief only starts a server that serves RPC calls to the oracle. The save method of the tuner never gets triggered on the chief.
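A toy model may help show why the suggested change fails on workers. The classes below are hypothetical stand-ins, not keras-tuner code. The sketch assumes that the removed line was a guard that skips saving the oracle when a chief oracle is in use, which the thread itself does not show.

```python
class OracleClient:
    """Toy stand-in for the worker-side proxy; deliberately has no save()."""

    def get_best_trials(self):
        # A real client would forward this call to the chief over RPC.
        return []


class ToyTuner:
    """Hypothetical tuner that optionally guards the oracle save call."""

    def __init__(self, oracle, has_chief_oracle, guard):
        self.oracle = oracle
        self.has_chief_oracle = has_chief_oracle
        self.guard = guard

    def save(self):
        # With the guard, workers skip saving the oracle, because the
        # oracle state lives on the chief. Without it, they call save()
        # on the client proxy, which does not provide that method.
        if not (self.guard and self.has_chief_oracle):
            self.oracle.save()
        return "tuner state saved"


guarded = ToyTuner(OracleClient(), has_chief_oracle=True, guard=True)
guarded_result = guarded.save()

unguarded = ToyTuner(OracleClient(), has_chief_oracle=True, guard=False)
try:
    unguarded.save()
    error = None
except AttributeError as exc:
    error = str(exc)

print(guarded_result)
print(error)
```

In this model, dropping the guard affects only the worker path, while the chief still never calls `save` at all, so the missing state file is not addressed either way.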

I did some further inspection and see the following possibilities, listed in no particular order of preference, since I don't know which one might be best:

In #977, I decided to go for the second approach, since it only requires adding one method call.