Open q-55555 opened 4 years ago
I was just investigating this. It appears that the model saved in epoch_0 is actually the best model and that best_step is simply not being recorded properly. To be clear, I believe tuner.get_best_models(1)[0] returns the best, fully trained model from the sweep.
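For anyone landing here, a minimal sketch of what I mean (the toy data and search space are made up, and the import path may be `kerastuner` on older versions):

```python
import numpy as np
import keras_tuner as kt
from tensorflow import keras

# Hypothetical toy data just so the sketch runs end to end.
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

def build_model(hp):
    model = keras.Sequential([
        keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_loss", max_trials=3)
tuner.search(x, y, epochs=5, validation_split=0.2)

# Reloads the single checkpoint kept for the best trial, i.e. what I believe
# is the best, fully trained model from the sweep.
best_model = tuner.get_best_models(num_models=1)[0]
```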
I'm fairly sure this is a bug. Checkpoints should be saved after every epoch, but for some reason the tuner's on_epoch_end is always called with epoch = 0, while other callbacks receive the correct epoch number. So the issue seems to be somewhere in the tuner code; I just couldn't figure out where.
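One quick way to see the mismatch (just a sketch; `EpochProbe` is a name I made up) is to pass your own callback to `tuner.search` and compare the epoch it reports against the checkpoint names the tuner writes:

```python
from tensorflow import keras

class EpochProbe(keras.callbacks.Callback):
    # Prints the epoch index Keras passes to user callbacks; compare it with
    # the "epoch_0"-only checkpoints the tuner saves.
    def on_epoch_end(self, epoch, logs=None):
        print(f"user callback sees epoch={epoch}")

# Callbacks given to search() are passed through to model.fit() for each trial:
# tuner.search(x, y, epochs=5, callbacks=[EpochProbe()])
```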
Actually, I see now that this is intentional: the BayesianOptimization tuner overrides the base tuner's checkpoint behavior and keeps only the best checkpoint.
Is there a way to override this so that a checkpoint is saved after every epoch? That seems like the expected behavior to me.
Currently, you cannot override to checkpoint every epoch. However, you can use TensorBoard to visualize the learning curve.
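For example (a sketch reusing the `tuner`/`x`/`y` names from the snippet above; the log directory is arbitrary), callbacks passed to `search()` are forwarded to `model.fit()` for every trial, so the per-epoch learning curves show up in TensorBoard:

```python
from tensorflow import keras

tensorboard_cb = keras.callbacks.TensorBoard(log_dir="tb_logs")
tuner.search(x, y, epochs=5, validation_split=0.2, callbacks=[tensorboard_cb])
# Then inspect the learning curves with: tensorboard --logdir tb_logs
```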
Hi @ysgit, I am running into the same issue. Do you happen to have a link to the source saying that the checkpoint is overwritten with the best model?
I have noticed that only "epoch_0" is saved in the checkpoints. Is this normal, or is there something I need to do to save the best epoch (and not only epoch_0)?
My point is that I would like to save the best epoch during training, but calling "trial.best_step" after running the trial always returns None.
Thank you!
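For reference, this is how I'm reading it (a sketch; assumes a finished `tuner` like the one above):

```python
# Inspect the finished trials via the oracle; best_step comes back as None here.
for trial in tuner.oracle.get_best_trials(num_trials=3):
    print(trial.trial_id, trial.score, trial.best_step)
```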