Open SCMusson opened 2 years ago
Got it. This is indeed a problem. I wonder what the best approach is from a UX perspective, which ultimately depends on what is to be achieved. By saving to exp/trial/checkpoint_xxxxx, one saves one copy each epoch. One can leverage Tune's checkpoint offering to configure how many checkpoints to keep at any time, what the scoring attribute is, etc. By using the trainer's native saving mechanism, you only get one checkpoint, which gets overwritten over time.
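For example, checkpoint retention can be configured through tune.run (a rough sketch using the legacy API; the trainable and metric names are placeholders):

```python
from ray import tune

# Keep only the two best checkpoints per trial, ranked by validation loss
# ("min-" prefix means lower is better). train_fn is a placeholder trainable.
analysis = tune.run(
    train_fn,
    keep_checkpoints_num=2,
    checkpoint_score_attr="min-val_loss",
)
```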
Maybe we should throw a warning suggesting that native checkpointing be disabled when the callback is used? Or, even better, disable it automatically?
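Something along these lines could detect the conflict (just a sketch, not the actual implementation; the callback name is illustrative):

```python
import warnings
from pytorch_lightning import Callback


class TuneCheckpointGuard(Callback):
    """Illustrative stand-in for the check the real callback could perform."""

    def setup(self, trainer, pl_module, stage=None):
        # trainer.checkpoint_callbacks lists the ModelCheckpoint instances that
        # Lightning registers when its native checkpointing is enabled.
        if trainer.checkpoint_callbacks:
            warnings.warn(
                "Native Lightning checkpointing is enabled alongside the Tune "
                "checkpoint callback; pass enable_checkpointing=False to Trainer "
                "to avoid saving each checkpoint twice."
            )
```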
What happened + What you expected to happen
tune.integration.pytorch_lightning.TuneReportCheckpointCallback needs to subclass pytorch_lightning.callbacks.Checkpoint; otherwise you need to pass enable_checkpointing=False to the trainer. Without this, I think two copies of every checkpoint are saved, because trainer.save_checkpoint is called twice.
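For reference, the workaround I have in mind looks roughly like this (metric names and settings are placeholders):

```python
from pytorch_lightning import Trainer
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback

trainer = Trainer(
    max_epochs=10,
    enable_checkpointing=False,  # turn off Lightning's own checkpoint saving
    callbacks=[
        TuneReportCheckpointCallback(
            metrics={"loss": "val_loss"},  # assumed metric reported to Tune
            filename="checkpoint",
            on="validation_end",
        )
    ],
)
```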
One checkpoint is correctly saved in the <experiment>/<trial> folder; another is saved in <experiment>/<trial>/checkpoints.
I would expect only one of them to be saved, to save space.
Versions / Dependencies
I'm using Ray 1.13.0
Reproduction script
The example in the pytorch_lightning population based training tutorial can be used to reproduce this.
Issue Severity
Low: It annoys or frustrates me.