
[Tune] TuneReportCheckpointCallback causes two checkpoints to be made every time it is called. #27524

Open SCMusson opened 2 years ago

SCMusson commented 2 years ago

What happened + What you expected to happen

tune.integration.pytorch_lightning.TuneReportCheckpointCallback needs to subclass pytorch_lightning.callbacks.Checkpoint; otherwise you need to pass enable_checkpointing=False to the trainer. Without this, I think two copies of every checkpoint are saved, because trainer.save_checkpoint is called twice.

One checkpoint is correctly saved in the <experiment>/<trial> folder; another is saved in <experiment>/<trial>/checkpoints. I would expect only one of them to be saved, to save disk space.
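
For context, a minimal sketch of the enable_checkpointing=False workaround mentioned above, so that only Tune's copy is written. MyLightningModule and the metric names are hypothetical placeholders, and the enable_checkpointing flag assumes PyTorch Lightning >= 1.5:

```python
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback


def train_fn(config):
    model = MyLightningModule(config)  # hypothetical LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        # Without this flag, Lightning's default ModelCheckpoint also runs,
        # so a second copy of each checkpoint lands in <trial>/checkpoints.
        enable_checkpointing=False,
        callbacks=[
            TuneReportCheckpointCallback(
                metrics={"loss": "val_loss"},
                filename="checkpoint",
                on="validation_end",
            )
        ],
    )
    trainer.fit(model)
```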

Versions / Dependencies

I'm using Ray 1.13.0

Reproduction script

The example in the pytorch_lightning population-based training tutorial reproduces this.

Issue Severity

Low: It annoys or frustrates me.

xwjiang2010 commented 2 years ago

Got it. This is indeed a problem. I wonder what the best approach is from a UX perspective, which ultimately depends on what is to be achieved. By saving to exp/trial/checkpoint_xxxxx, one copy is saved each epoch, and one can leverage Tune's checkpointing to configure how many checkpoints to keep at any time, what the scoring attribute is, etc. By using the torch trainer's native saving mechanism, you only get one checkpoint, which gets overwritten over time.
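
For reference, a hedged sketch of those Tune-side retention knobs, using the tune.run arguments available around Ray 1.13 (train_fn is the hypothetical trainable from the sketch above; the metric name is an assumption):

```python
from ray import tune

analysis = tune.run(
    train_fn,
    num_samples=4,
    # Keep only the two best checkpoints per trial...
    keep_checkpoints_num=2,
    # ...ranked by the reported "loss" metric, where lower is better.
    checkpoint_score_attr="min-loss",
)
```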

Maybe we should throw a warning suggesting that native checkpointing be disabled when the callback is used? Or, even better, disable it automatically?
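
Something along these lines, as a rough sketch (not the actual callback code) of where such a warning could hook in, assuming the callback can inspect the trainer's callback list:

```python
import warnings

from pytorch_lightning.callbacks import ModelCheckpoint
from ray.tune.integration.pytorch_lightning import TuneCallback


class TuneReportCheckpointCallback(TuneCallback):  # simplified sketch
    def setup(self, trainer, pl_module, stage=None):
        # Warn if Lightning's native checkpointing is still active, since it
        # would write a duplicate copy of every checkpoint.
        if any(isinstance(cb, ModelCheckpoint) for cb in trainer.callbacks):
            warnings.warn(
                "Native Lightning checkpointing is enabled alongside "
                "TuneReportCheckpointCallback; pass enable_checkpointing=False "
                "to the Trainer to avoid saving duplicate checkpoints."
            )
```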