Search before asking

[X] I had searched in the issues and found no similar feature requirement.
Description
Steps to reproduce
There are examples that illustrate checkpointing and recovering from checkpoints in the Ray training frameworks. One such example shows how to configure checkpointing for a PyTorch training job.
1. Trigger the training RayJob
kubectl apply -f rayjob.yaml
2. Kill the head pod
Let the training job write a couple of checkpoints, then kill the head pod.

3. The new driver ignores the checkpoints

The current driver pod errors out and a new driver pod is created. The new driver pod runs the training job again from scratch, ignoring the checkpoints produced in the previous run.
Hacky Fix
To work around this, we have to write a function with tightly coupled logic. For example, see the function findLatestCheckpoint in this job definition.
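The workaround amounts to something like the sketch below: a findLatestCheckpoint-style helper that scans the run's storage directory for the highest-numbered checkpoint before restoring. The storage layout and the checkpoint_NNNNNN naming are assumptions based on Ray Train's default directory scheme, not code taken from the linked job definition:

```python
import os
import re


def find_latest_checkpoint(storage_path: str):
    """Return the path of the newest checkpoint directory under storage_path,
    or None if no checkpoint has been written yet.

    Assumes Ray Train's default naming scheme for checkpoint directories
    (checkpoint_000000, checkpoint_000001, ...) inside the run's storage path.
    """
    pattern = re.compile(r"checkpoint_(\d+)$")
    latest_path, latest_idx = None, -1
    if not os.path.isdir(storage_path):
        return None
    for name in os.listdir(storage_path):
        match = pattern.match(name)
        if match and int(match.group(1)) > latest_idx:
            latest_idx = int(match.group(1))
            latest_path = os.path.join(storage_path, name)
    return latest_path
```

Every job that wants to survive a head-pod restart has to carry a copy of this logic and keep it in sync with the checkpoint naming and storage configuration, which is exactly the coupling the requested API would remove.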
Use case
It would be great to have an API that we could call to get the latest checkpoint location from the previous iteration of a given run.
Related issues
No response
Are you willing to submit a PR?