
FT GCS should handle draining of node where head pod is scheduled #2153

Open · abatilo opened 1 month ago

abatilo commented 1 month ago

Search before asking

Description

If I have a TorchTrainer running, doing some work, and I drain the node where my head pod is running, nothing ever seems to recover. I've enabled GCS FT in KubeRay Helm chart version v1.1.1, and I have an external Redis that holds all the state, etc.

Is there truly no way to have my head pod, which is running on a spot node, survive being rescheduled? It doesn't seem like the head node can do any recovery whatsoever for jobs that were in the middle of training.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

kevin85421 commented 1 month ago

Hi @abatilo, thank you for opening the issue. You may have some misunderstanding about GCS FT. You can read https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-gcs-ft for more details. Currently, the only use case for GCS FT is Ray Serve high availability.

abatilo commented 1 month ago

Thanks for responding, @kevin85421. My original question still stands, then: is there no way for jobs that are in progress to survive the Ray head being restarted? I see that there's a way to resume a trainer from its previous state, but do I always have to re-submit it if the Ray head is gone?
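For reference, a minimal sketch of what that re-submit-and-resume flow could look like, based on the `TorchTrainer.restore` / `can_restore` API in Ray 2.x. The storage path, run name, and `train_loop` function below are hypothetical placeholders, not the setup from this issue:

```python
# Minimal sketch (assumptions: a persistent storage_path such as S3 or NFS,
# and a hypothetical train_loop; not the exact setup from this issue).
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop(config):
    # Hypothetical training loop that periodically reports checkpoints
    # via ray.train.report(metrics, checkpoint=...).
    ...


experiment_path = "s3://my-bucket/experiments/my-run"  # hypothetical path

if TorchTrainer.can_restore(experiment_path):
    # Resume from the last checkpoint written before the head pod was lost.
    trainer = TorchTrainer.restore(experiment_path, train_loop_per_worker=train_loop)
else:
    # First submission: start a fresh run that writes state to persistent storage.
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        run_config=RunConfig(name="my-run", storage_path="s3://my-bucket/experiments"),
    )

result = trainer.fit()
```

In this sketch the restore still has to be driven by something outside the cluster (the submitter re-runs the script after the head comes back); it does not make the in-flight job itself survive head loss.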

kevin85421 commented 1 month ago

My current understanding is that Ray Train provides some degree of fault tolerance.
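As a rough illustration of the kind of fault tolerance Ray Train exposes within a single run (worker failures, not loss of the head node), here is a hedged sketch using `FailureConfig`; the run name and storage path are hypothetical:

```python
# Rough sketch: worker-level fault tolerance within one Ray Train run.
# This covers failed worker pods, not the Ray head itself.
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop(config):
    # Hypothetical training loop; real code should report checkpoints with
    # ray.train.report(metrics, checkpoint=...) so retries can resume from them.
    ...


trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=RunConfig(
        name="my-run",                                # hypothetical run name
        storage_path="s3://my-bucket/experiments",    # hypothetical persistent storage
        failure_config=FailureConfig(max_failures=3), # retry the run up to 3 times on worker failure
    ),
)
result = trainer.fit()
```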

Support for long-running Ray jobs is on our roadmap for the next release. We are currently working on:

If you are interested in this topic, you can reach out to me on Ray Slack (my handle is 'Kai-Hsun Chen (ray team)'). We can discuss your requirements and ensure there are no feature gaps for your use cases.