Open · n30111 opened 2 years ago
Thanks for reporting the issue @n30111! Indeed, definitely something we should fix.
I think we should switch to using the Tune Session API internally instead of `tune.checkpoint_dir`, and then on the Tune side, it can fill in the checkpoint step (the `training_iteration`) in the corresponding metrics. cc @xwjiang2010 @Yard1
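For illustration, a minimal sketch of what that proposal could look like, assuming the AIR `session.report` API; the callback internals (`self._frequency`, the checkpoint dict keys) are hypothetical:

```python
# Hypothetical sketch: report through the AIR Session API instead of
# writing into tune.checkpoint_dir directly, so Tune can associate the
# checkpoint with the reported training_iteration.
from ray.air import session
from ray.air.checkpoint import Checkpoint

def on_epoch_end(self, epoch, logs=None):
    # Only checkpoint at the configured frequency (attribute name assumed).
    if self._frequency > 0 and (epoch + 1) % self._frequency == 0:
        checkpoint = Checkpoint.from_dict(
            {"model_weights": self.model.get_weights(), "epoch": epoch}
        )
        # Tune tracks training_iteration for each reported result, so the
        # checkpoint folder could be numbered from it rather than from an
        # independent counter.
        session.report({**(logs or {}), "epoch": epoch}, checkpoint=checkpoint)
```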
The same issue exists for HuggingfaceTrainer: when using steps as the saving frequency, e.g. every 1000 steps, the first checkpoint is `checkpoint_00000`, not `checkpoint_1000`.
How is this impacting workloads, aside from the Keras callback not saving the epoch? As far as I understand, the most important thing is that we have an incremental counter for checkpoints. The actual epoch/iteration number should be saved inside the checkpoint itself (which is indeed the case with Huggingface, but not with the Keras callback).
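To illustrate the point about storing the iteration inside the checkpoint, a minimal sketch using the AIR `Checkpoint` dict API (the key names are assumptions, not what the Keras callback currently writes):

```python
# Hypothetical illustration: the folder name stays an incremental counter,
# while the real epoch lives inside the checkpoint data itself.
from ray.air.checkpoint import Checkpoint

checkpoint = Checkpoint.from_dict({
    "model_weights": [],   # placeholder for real weights
    "epoch": 4,            # the actual training epoch at save time
})
print(checkpoint.to_dict()["epoch"])  # -> 4, regardless of folder numbering
```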
But keeping the checkpoint number consistent with the Huggingface checkpoint number would be more convenient for managing checkpoints.
@amogkam I'm not exactly sure I followed. How does the Tune Session know about the specific application details (frequency, etc.)?
I haven't set checkpoint_frequency in CheckpointConfig
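For context, this is the Tune-level setting being referred to; a minimal sketch, with placeholder values around the unset field:

```python
# Sketch of where checkpoint_frequency would go if it were used;
# the reporter notes it was left unset.
from ray.air.config import RunConfig, CheckpointConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        # checkpoint_frequency=...  # not set in the reported workload
    )
)
```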
@amogkam any update on this issue?
@justinvyu does #36220 resolve this?
What happened + What you expected to happen
While enabling the `frequency` parameter in the Keras Callback (`from ray.air.callbacks.keras import Callback`), the checkpoints folder does not include the correct training iteration number. If we set `frequency=1`, then the checkpoints follow the naming convention `checkpoint_{(iteration-1):06d}`, but if we set `frequency>1`, the saved checkpoint folders do not carry any info about the iteration number, and the checkpoints are saved with a consecutive folder naming convention. This is because of the way checkpoint folders are created here: https://github.com/ray-project/ray/blob/master/python/ray/train/_internal/checkpoint.py#L228, which simply increments `self._latest_checkpoint_id` without considering the `frequency` parameter.

While using `frequency=1`, the folder numbers line up with the iterations as described above. While using `frequency=2`, the folders are still numbered consecutively (`['checkpoint_000000', 'checkpoint_000001']`), but ideally the numbering should be `['checkpoint_000002', 'checkpoint_000004']`.
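A simplified sketch of the numbering behavior described above (not the actual Ray source; the attribute name mirrors the linked file):

```python
# Simplified illustration of the reported behavior: the folder index is a
# plain counter, independent of how many training iterations passed
# between saves.
class _CheckpointManagerSketch:
    def __init__(self):
        self._latest_checkpoint_id = 0

    def next_checkpoint_dir(self) -> str:
        path = f"checkpoint_{self._latest_checkpoint_id:06d}"
        self._latest_checkpoint_id += 1
        return path

mgr = _CheckpointManagerSketch()
# With frequency=2, saves happen at iterations 2 and 4, but the folders are:
print([mgr.next_checkpoint_dir() for _ in range(2)])
# -> ['checkpoint_000000', 'checkpoint_000001']
# whereas numbering by iteration would give
# ['checkpoint_000002', 'checkpoint_000004'].
```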
Versions / Dependencies
2.0.0
Reproduction script
The following script, which is a minor modification of the test https://github.com/ray-project/ray/blob/releases/2.0.0/python/ray/air/tests/test_keras_callback.py, can be used to reproduce the bug.
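The original script is not included here; below is a hypothetical minimal sketch along the same lines, assuming Ray 2.0.0 (the model, data, and epoch counts are placeholders):

```python
# Hypothetical reproduction sketch, loosely adapted from the linked test.
# Assumes Ray 2.0.0; class/argument names may differ in other versions.
import numpy as np
import tensorflow as tf

from ray.air.callbacks.keras import Callback
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def train_func(config: dict):
    # Trivial single-layer model built inside the worker.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.models.Sequential(
            [tf.keras.layers.Dense(1, input_shape=(1,))]
        )
        model.compile(optimizer="sgd", loss="mse")

    x = np.random.random((32, 1)).astype(np.float32)
    y = np.random.random((32, 1)).astype(np.float32)

    # frequency=2 should (ideally) yield checkpoint_000002 / checkpoint_000004,
    # but the folders come out as checkpoint_000000 / checkpoint_000001.
    model.fit(x, y, epochs=4, callbacks=[Callback(frequency=2)])


trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=1),
)
result = trainer.fit()
```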
Issue Severity
High: It blocks me from completing my task.