[Ray Head] How to ensures the High reliability of ray head?

What happened + What you expected to happen

How to ensures the high reliability of ray head? Even though enable gcs fault tolerance with https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-redis-cleanup-gate, ray head can not support create ray job, ray serve application when ray head crashed and didn't finish recover, so how to ensures the high reliability of ray head during full-lifecycle? Especially for commercial scenarios. For example , when we deploy many ray serve application on resident ray cluster, and provide inference service for customer, it will reduce the service availability if ray head pod crashed, ray need some time to recover

Versions / Dependencies

ray 2.10.0 +kuberay 1.1.0

Reproduction script

ray 2.10.0 +kuberay 1.1.0

Issue Severity

High: It blocks me from completing my task.

ray-project / ray