Open ciroaceto opened 5 months ago
I have also tried to increase `num_to_keep` to 4 and 5 (the code mentions that a low `num_to_keep` can cause a similar issue), but the error remains. I also realized that `checkpoint_frequency` seems not to be working as intended: in previous Ray versions checkpoints were only created every `checkpoint_frequency` iterations, but right now a checkpoint is created in each iteration.
@ciroaceto PBT with (very frequent) time-based checkpointing combined with a low `num_to_keep` is not very stable, because trial scheduling is nondeterministic. Here are a few tips to get this working:

* Use `training_iteration` as the perturbation interval unit instead of `time_total_s`, and keep `perturbation_interval` aligned with `checkpoint_frequency`:

```python
checkpoint_frequency = 2
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=checkpoint_frequency,
    ...,
)
tuner = Tuner(
    ...,
    checkpoint_config=train.CheckpointConfig(
        checkpoint_frequency=checkpoint_frequency,
    ),
)
```
* Another option is to set `synch=True` to make sure that all trials are in lock step, so the checkpoint assigned to a trial will never be missing. You should be able to set a lower `num_to_keep` in this scenario.
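For illustration, a minimal sketch of that synchronous setup, assuming Ray 2.x's `ray.tune` API; the trainable and the mutation space are placeholders, not part of the original report:

```python
from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining

# synch=True makes all trials pause at each perturbation boundary, so the
# checkpoint a trial is told to exploit should still exist when it restores.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,
    synch=True,
    hyperparam_mutations={"lr": tune.loguniform(1e-5, 1e-2)},
)

tuner = tune.Tuner(
    my_trainable,  # placeholder: your Trainable or registered algorithm name
    tune_config=tune.TuneConfig(scheduler=pbt, num_samples=4),
    run_config=train.RunConfig(
        # A low num_to_keep should be safe under synchronous PBT.
        checkpoint_config=train.CheckpointConfig(num_to_keep=2),
    ),
)
```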
------
> In previous ray versions the checkpoints were only created every checkpoint_frequency iterations. Right now a checkpoint is created in each iteration.
This may be a combination of a checkpoint folder naming change, as well as the time-based perturbation interval you have at the moment:
* Checkpoint folders are now named in terms of checkpoint index, rather than the `training_iteration`, starting from 0. It increments by 1 each time.
* A checkpoint is forced to happen on every perturbation interval for high performing trials, which may cause the checkpointing to become more frequent.
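As a concrete illustration of the naming change (the zero-padded `checkpoint_<index>` folder convention is an assumption based on Ray Tune's behavior, not taken from this thread):

```python
# Folders are named by a monotonically increasing checkpoint index,
# independent of the training_iteration at which they were taken.
folders = [f"checkpoint_{index:06d}" for index in range(3)]
print(folders)
```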
What happened + What you expected to happen
Checkpoints from an RLlib Tune experiment (PBT scheduler) are being deleted before another trial restores them. The reproduction script below produced two incomplete trials (status: ERROR) with the following error.txt:
ValueError: Could not recover from checkpoint as it does not exist on storage anymore. Got storage fs type 'local' and path: /home/.../PPO_2024-05-07_13-41-09/PPO_Pendulum-v1_b3dc9_00005
The previous error message is from trial PPO_Pendulum-v1_b3dc9_00004. I suppose the exploitation mechanism was being applied and the checkpoint to restore from wasn't updated.
Versions / Dependencies
ray: 2.10 / 2.20
gymnasium: 0.28.1
python: 3.10.13
OS: Ubuntu 22.04.4
Reproduction script
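The original reproduction script is not included in this excerpt. A minimal sketch of the setup described above (PPO on Pendulum-v1 under a PBT scheduler), assuming RLlib's `PPOConfig` API; all hyperparameter values here are illustrative, not the reporter's:

```python
from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining

from ray.rllib.algorithms.ppo import PPOConfig

# Time-based perturbations, as in the report; the interval is illustrative.
pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=120,
    hyperparam_mutations={"lr": tune.loguniform(1e-5, 1e-3)},
)

tuner = tune.Tuner(
    "PPO",
    param_space=PPOConfig().environment("Pendulum-v1").to_dict(),
    tune_config=tune.TuneConfig(scheduler=pbt, num_samples=8),
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(num_to_keep=4),
    ),
)
results = tuner.fit()
```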
Issue Severity
High: It blocks me from completing my task.