Closed: sfo closed this issue 1 year ago
@sfo Thank you for the issue! Would you like to try removing this line and unindenting the following line? This may work.
Thanks for the hint. Unfortunately, this does not work (`AttributeError: 'OracleClient' object has no attribute 'save'`). This method is executed in worker processes only and will throw an exception if changed accordingly. The chief only starts a server that serves RPC calls to the oracle. The `save` method of the tuner never gets triggered.
I did some further inspection and see the following possibilities, in no particular order of preference, since I don't know which one is best:

1. Check whether the `oracle.json` file exists.
2. Create the `chief.json` file by calling the tuner's `save` method.

In #977, I decided to go for the second approach, since it only requires adding one method call.
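To make the two options concrete, here is a toy model of how I read them. It is a sketch only: it uses plain files in a temporary directory rather than keras-tuner itself, the function names are hypothetical, and it assumes the chief's oracle has already written `oracle.json` during the earlier run.

```python
import json
import tempfile
from pathlib import Path


def should_reload_option_1(project_dir: Path, overwrite: bool) -> bool:
    """Option 1 (hypothetical): decide based on oracle.json instead."""
    return not overwrite and (project_dir / "oracle.json").exists()


def should_reload_option_2(project_dir: Path, tuner_id: str, overwrite: bool) -> bool:
    """Option 2 keeps the existing style of check on <tuner_id>.json."""
    return not overwrite and (project_dir / f"{tuner_id}.json").exists()


def chief_save(project_dir: Path, tuner_id: str = "chief") -> None:
    """Option 2 (hypothetical): the chief also writes its own state file."""
    (project_dir / f"{tuner_id}.json").write_text(json.dumps({}))


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        project_dir = Path(tmp)
        # Assumption: state left behind by the oracle of a previous run.
        (project_dir / "oracle.json").write_text("{}")

        option_1 = should_reload_option_1(project_dir, overwrite=False)
        before_save = should_reload_option_2(project_dir, "chief", overwrite=False)
        chief_save(project_dir)
        after_save = should_reload_option_2(project_dir, "chief", overwrite=False)
    return option_1, before_save, after_save


print(demo())
```

In this model both options let a restarted chief detect the earlier run. The second one leaves the existing check untouched and only adds a write on the chief side.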
Describe the bug

In a distributed tuning setting, if the chief process is interrupted and then restarted, it starts with trial #1 again instead of resuming at the last unfinished trial.

After some investigation, I found out that the `BaseTuner` class checks whether a tuner-specific file exists (in this case, `chief.json`): https://github.com/keras-team/keras-tuner/blob/e935ac15d72a2b183b79e8c99bed49f91708681c/keras_tuner/engine/base_tuner.py#L126-L129

However, this file is only created for worker processes, not for the chief process, so for the latter the condition will always be `false`, and the chief will neither reload its state nor resume tuning.

Creating the file manually makes the chief process continue, as expected.
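The behaviour described above can be illustrated with a small stand-alone model. This is a simplified stand-in for the linked check, not the library's code: the helper names are hypothetical, and the only thing modelled is "reload if `overwrite` is off and `<tuner_id>.json` exists".

```python
import tempfile
from pathlib import Path


def tuner_fname(project_dir: Path, tuner_id: str) -> Path:
    """Path of the tuner-specific state file, e.g. chief.json or tuner0.json."""
    return project_dir / f"{tuner_id}.json"


def should_reload(project_dir: Path, tuner_id: str, overwrite: bool) -> bool:
    """Simplified stand-in for the existence check in BaseTuner."""
    return not overwrite and tuner_fname(project_dir, tuner_id).exists()


def simulate_first_run(project_dir: Path, tuner_ids: list) -> None:
    """Workers write their state file; per this report, the chief does not."""
    for tuner_id in tuner_ids:
        if tuner_id != "chief":
            tuner_fname(project_dir, tuner_id).write_text("{}")


def restart_report(project_dir: Path, tuner_ids: list) -> dict:
    """Which processes would reload their state after a restart?"""
    return {tid: should_reload(project_dir, tid, overwrite=False) for tid in tuner_ids}


def demo() -> tuple:
    tuner_ids = ["chief", "tuner0"]
    with tempfile.TemporaryDirectory() as tmp:
        project_dir = Path(tmp)
        simulate_first_run(project_dir, tuner_ids)
        before = restart_report(project_dir, tuner_ids)
        # Manual workaround from the report: create chief.json by hand.
        tuner_fname(project_dir, "chief").touch()
        after = restart_report(project_dir, tuner_ids)
    return before, after


print(demo())
```

In this model the worker reloads after a restart while the chief does not, until `chief.json` is created by hand.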
To Reproduce

Use the code from the documentation, but with `RandomSearch` instead of `Hyperband` to ensure long-running tuning, and set the parameter `overwrite=False`.

In two terminals, run the chief and worker processes as described in the documentation, then after a while, kill both processes and run them again.
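For convenience, the two terminals can be replaced by a small launcher. This is only a sketch: `run_tuning.py` is a hypothetical name for the script containing the documentation code, the IP and port are example values, and the environment variable names are the ones used in the Keras Tuner distributed tuning guide.

```python
import os
import subprocess
import sys


def tuner_env(tuner_id: str, ip: str = "127.0.0.1", port: int = 8000) -> dict:
    """Environment for one process; tuner_id is "chief" or e.g. "tuner0"."""
    env = dict(os.environ)
    env.update(
        {
            "KERASTUNER_TUNER_ID": tuner_id,
            "KERASTUNER_ORACLE_IP": ip,
            "KERASTUNER_ORACLE_PORT": str(port),
        }
    )
    return env


def launch(script: str, tuner_id: str) -> subprocess.Popen:
    """Start the tuning script as a chief or worker process."""
    return subprocess.Popen([sys.executable, script], env=tuner_env(tuner_id))


# Example usage (commented out, since it starts long-running processes):
# chief = launch("run_tuning.py", "chief")    # hypothetical script name
# worker = launch("run_tuning.py", "tuner0")
# ... let a few trials run, then:
# worker.terminate(); chief.terminate()
# ... and launch both again to observe whether the chief resumes.
```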
Expected behavior

The chief process should print `Reloading Tuner from results_dir/mnist/chief.json` and resume with the last running trial.

Additional context

Python 3.10.13
keras-tuner 1.4.6
Would you like to help us fix it?

I'd like to, but time is short, so I may need someone to point me at the right location in the code.