Closed by wuisawesome 3 years ago
Does it only health check components in a head node?
For now, yes, since we only need to cover the ray client server, but in principle it could check anything that can connect to internal kv.
Does the head node currently expose a "metrics endpoint" so people can freely query it for different kinds of metrics?
E.g. this demonstrates what I have in mind: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/data-storage/content/using_jmx_for_accessing_hdfs_metrics.html
We have Prometheus metrics on the head node, which can help with the health of the cluster, but not as much with the peripheral components (like the ray client server or autoscaler).
LGTM, but can we keep it as a hidden API (ray _health_check or a hidden click flag) for now?
Can we close it?
yup
Why
When running a k8s cluster, advanced users need a way of health checking ray and its components. In particular, we want to be able to health check components of the cluster like the ray client server.
Ray almost exclusively uses grpc for its transport layer, and k8s doesn't officially support a grpc based health check.
Proposed API
The proposed API is a command that k8s can use as a liveness-command based health check.
By default, we assume there is one instance of ray running on the head node (we can support --address=... and --port=... if necessary).
Potential implementation
One potential implementation can rely on internal-kv.
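As a rough illustration of the internal-kv idea, the sketch below has a component write a timestamp under a well-known key, and the check treats a missing or stale timestamp as unhealthy. It is not Ray's implementation: a plain dict stands in for internal kv, and the helper names (put_heartbeat, is_healthy, health_check_exit_code) and the 30-second staleness threshold are assumptions made for the example. The exit-code mapping follows the k8s convention for command-based probes, where 0 means healthy and any nonzero value means unhealthy.

```python
import time

# Key prefix modeled on the "healthcheck:client_server" key named in this issue.
KEY_PREFIX = "healthcheck:"
# Assumed staleness threshold; the issue does not specify one.
STALE_AFTER_S = 30.0


def put_heartbeat(kv, component, now=None):
    """Record that `component` was alive at `now` (seconds since the epoch)."""
    now = time.time() if now is None else now
    kv[KEY_PREFIX + component] = str(now)


def is_healthy(kv, component, now=None, stale_after_s=STALE_AFTER_S):
    """Return True if the component's last heartbeat is recent enough."""
    now = time.time() if now is None else now
    raw = kv.get(KEY_PREFIX + component)
    if raw is None:
        return False  # the component never reported a heartbeat
    return (now - float(raw)) <= stale_after_s


def health_check_exit_code(kv, component, now=None):
    """Map health to a process exit code: 0 if healthy, 1 otherwise."""
    return 0 if is_healthy(kv, component, now) else 1
```

In this sketch the component would call put_heartbeat on a timer, and the health-check command would end with sys.exit(health_check_exit_code(kv, "client_server")) so that k8s sees the result as the command's exit status.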
ray health-check (no args) can simply try to connect to the GCS KV Service, which is sufficient proof that GCS is alive.
ray health-check --component=client_server can check some internal-kv key, healthcheck:client_server, for status information. The ray client server should periodically put some heartbeat in internal-kv.
The state of internal kv would look something like