
[Tune] TuneReportCheckpointCallback causes two checkpoints to be made every time it is called. #27524

Open SCMusson opened 2 years ago

SCMusson commented 2 years ago

What happened + What you expected to happen

tune.integration.pytorch_lightning.TuneReportCheckpointCallback needs to subclass pytorch_lightning.callbacks.Checkpoint; otherwise you need to pass enable_checkpointing=False to the trainer. Without this, I think two copies of every checkpoint are saved, because trainer.save_checkpoint is called twice.

One checkpoint is correctly saved in the <experiment>/<trial> folder; another is saved in <experiment>/<trial>/checkpoints. I would expect only one of them to be saved, to save disk space.
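
For context, a minimal sketch of the enable_checkpointing=False workaround mentioned above, so that only Tune's copy is written. MyLightningModule and the metric names are hypothetical placeholders, and the enable_checkpointing flag assumes PyTorch Lightning >= 1.5:

```python
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback


def train_fn(config):
    model = MyLightningModule(config)  # hypothetical LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        # Without this flag, Lightning's default ModelCheckpoint also runs,
        # so a second copy of each checkpoint lands in <trial>/checkpoints.
        enable_checkpointing=False,
        callbacks=[
            TuneReportCheckpointCallback(
                metrics={"loss": "val_loss"},
                filename="checkpoint",
                on="validation_end",
            )
        ],
    )
    trainer.fit(model)
```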

Versions / Dependencies

I'm using Ray 1.13.0

Reproduction script

The example in the pytorch_lightning population-based training tutorial reproduces this.

Issue Severity

Low: It annoys or frustrates me.

xwjiang2010 commented 2 years ago

Got it. This is indeed a problem. I wonder what the best approach is from a UX perspective, which ultimately depends on what is to be achieved. By saving to exp/trial/checkpoint_xxxxx, one copy is saved each epoch, and one can leverage Tune's checkpointing to configure how many checkpoints to keep at any time, what the scoring attribute is, etc. By using the torch trainer's native saving mechanism, you only get one checkpoint, which gets overwritten over time.
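
For reference, a hedged sketch of those Tune-side retention knobs, using the tune.run arguments available around Ray 1.13 (train_fn is the hypothetical trainable from the sketch above; the metric name is an assumption):

```python
from ray import tune

analysis = tune.run(
    train_fn,
    num_samples=4,
    # Keep only the two best checkpoints per trial...
    keep_checkpoints_num=2,
    # ...ranked by the reported "loss" metric, where lower is better.
    checkpoint_score_attr="min-loss",
)
```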

Maybe we should throw a warning suggesting that native checkpointing be disabled when the callback is used? Or, even better, disable it automatically?
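
Something along these lines, as a rough sketch (not the actual callback code) of where such a warning could hook in, assuming the callback can inspect the trainer's callback list:

```python
import warnings

from pytorch_lightning.callbacks import ModelCheckpoint
from ray.tune.integration.pytorch_lightning import TuneCallback


class TuneReportCheckpointCallback(TuneCallback):  # simplified sketch
    def setup(self, trainer, pl_module, stage=None):
        # Warn if Lightning's native checkpointing is still active, since it
        # would write a duplicate copy of every checkpoint.
        if any(isinstance(cb, ModelCheckpoint) for cb in trainer.callbacks):
            warnings.warn(
                "Native Lightning checkpointing is enabled alongside "
                "TuneReportCheckpointCallback; pass enable_checkpointing=False "
                "to the Trainer to avoid saving duplicate checkpoints."
            )
```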