Open giulatona opened 5 months ago
@giulatona instead of using ReportCheckpointCallback, can you save the model checkpoint yourself and use ray.train.report to upload the checkpoint?
Here's the user guide: https://docs.ray.io/en/master/train/user-guides/checkpoints.html
Since I am using Keras model.fit, what you suggest would mean writing my own callback instead of using ReportCheckpointCallback. I do not think that would be useful. Also, this is a problem with the code in ReportCheckpointCallback, or rather with ray.train.TensorflowCheckpoint.
ReportCheckpointCallback is a default solution provided by the Ray team, and it should cover common cases. But if it requires customization, users need to write their own report callback (e.g. save the model in Keras native format in your case).
I do not want to customise it. It just does not work, and I tried to provide a reason for that in the issue. I do not care how the models are saved.
What happened + What you expected to happen
While tuning the hyperparameters of a custom Keras model (a subclass of keras.Model) and requesting a checkpoint at the end of each epoch through ReportCheckpointCallback, the process fails. In particular, ray.train.TensorflowCheckpoint tries to save the model using model.save(). This results in the following error:
This can be traced to the call to model.save in TensorflowCheckpoint that somehow results in save_format == h5.
Versions / Dependencies
Ray[tune] version 2.10.0
Keras version 2.15
Tensorflow version 2.15
Python version 3.11
OS: Ubuntu 22.04
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.