FT GCS should handle draining of node where head pod is scheduled

abatilo commented 1 month ago

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

Description

If I have a TorchTrainer running, doing some work, and I drain the node where my head pod is running, nothing ever seems to actually recover. I've enabled FT GCS in KubeRay helm chart version v1.1.1 -- I have an external redis that has all the state in it, etc.

Is there truly no way to have my head pod, which is running on a spot node, survive being rescheduled? It doesn't seem like the head node can do any recovery whatsoever for jobs that were in the middle of training.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

kevin85421 commented 1 month ago

Hi @abatilo, thank you for opening the issue. You may have some misunderstanding for GCS FT. You can read https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-gcs-ft for more details. Currently, the only use case of GCS FT is for Ray Serve high availability.

abatilo commented 1 month ago

Thanks for responding @kevin85421. My original question still stands then. Is there no way to have the jobs running in progress survive after the Ray head gets restarted? I see that there's a way to resume a trainer from previous state but do I always have to re-submit that if the Ray head is gone?

kevin85421 commented 1 month ago

My current understanding is that Ray Train provides some degree of fault tolerance.

If a Ray worker Pod crashes, Ray Train will launch new Ray tasks or actors, allowing the job to continue running.
If the Ray head crashes, the driver process, which is typically running on the Ray head of the Ray job, will also crash. Therefore, the Ray tasks or actors will also be garbage collected automatically. In most cases, Ray tasks or actors are fate-sharing with the driver process. Only detached actors do not share this fate with the driver.
- In this case, users need to write the fault tolerance logic at the application level, specifically in their Ray Python script.
```
if (checkpoint exists):
read checkpoint
else:
start from scratch
```
train the model

The long-running Ray job is currently on our roadmap of the next release. We are currently working on:

Retry mechanism in both RayJob CRD / ray job submit.
Best practice for checkpointing.

If you are interested in this topic, you can reach out to me on Ray Slack (my handle is 'Kai-Hsun Chen (ray team)'). We can discuss your requirements and ensure there are no feature gaps for your use cases.

ray-project / kuberay