ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.94k stars 5.58k forks source link

[Ray Head] How to ensures the High reliability of ray head? #46354

Open xubo245 opened 2 months ago

xubo245 commented 2 months ago

What happened + What you expected to happen

How to ensures the high reliability of ray head? Even though enable gcs fault tolerance with https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-redis-cleanup-gate, ray head can not support create ray job, ray serve application when ray head crashed and didn't finish recover, so how to ensures the high reliability of ray head during full-lifecycle? Especially for commercial scenarios. For example , when we deploy many ray serve application on resident ray cluster, and provide inference service for customer, it will reduce the service availability if ray head pod crashed, ray need some time to recover

Versions / Dependencies

ray 2.10.0 +kuberay 1.1.0

Reproduction script

ray 2.10.0 +kuberay 1.1.0

Issue Severity

High: It blocks me from completing my task.

xubo245 commented 2 months ago

@anyscalesam @liuxsh9