Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
How to ensures the high reliability of ray head?
Even though enable gcs fault tolerance with https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-redis-cleanup-gate, ray head can not support create ray job, ray serve application when ray head crashed and didn't finish recover, so how to ensures the high reliability of ray head during full-lifecycle? Especially for commercial scenarios. For example , when we deploy many ray serve application on resident ray cluster, and provide inference service for customer, it will reduce the service availability if ray head pod crashed, ray need some time to recover
What happened + What you expected to happen
How to ensures the high reliability of ray head? Even though enable gcs fault tolerance with https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-redis-cleanup-gate, ray head can not support create ray job, ray serve application when ray head crashed and didn't finish recover, so how to ensures the high reliability of ray head during full-lifecycle? Especially for commercial scenarios. For example , when we deploy many ray serve application on resident ray cluster, and provide inference service for customer, it will reduce the service availability if ray head pod crashed, ray need some time to recover
Versions / Dependencies
ray 2.10.0 +kuberay 1.1.0
Reproduction script
ray 2.10.0 +kuberay 1.1.0
Issue Severity
High: It blocks me from completing my task.