ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rfc] `ray health-check` #15265

Closed wuisawesome closed 3 years ago

wuisawesome commented 3 years ago

Why

When running Ray on a k8s cluster, advanced users need a way to health check Ray and its components. In particular, we want to be able to health check components of the cluster such as the Ray client server.

Ray almost exclusively uses gRPC for its transport layer, and k8s doesn't natively support gRPC-based health checks.

Proposed API

The proposed API is a command that k8s can use as an exec-based liveness probe.

ray health-check # Returns 0 if it can connect to the GCS, else 1

ray health-check --component=client_server # Returns 0 if the component's heartbeat is sufficiently recent, else 1

By default, we assume there is one instance of Ray running on the head node (we can support --address=... and --port=... if necessary).
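For illustration, a k8s pod spec could wire this command in as an exec liveness probe. This is a hypothetical fragment; the probe timings are illustrative and not part of the RFC.

```yaml
# Hypothetical pod spec fragment for the Ray head container.
# Timing values are illustrative, not prescribed by this RFC.
livenessProbe:
  exec:
    command: ["ray", "health-check"]
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
```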

Potential implementation

One potential implementation can rely on internal-kv.

ray health-check (no args) can simply try to connect to the GCS KV service, which is sufficient proof that the GCS is alive.

ray health-check --component=client_server can check some internal-kv key healthcheck:client_server for status information. The ray client server should periodically put some heartbeat in internal-kv.

The state of internal kv would look something like

"healthcheck:client_server": "{'last_modified': 123455 # a unix timestamp}"
rkooo567 commented 3 years ago

Does it only health check components in a head node?

wuisawesome commented 3 years ago

For now, yes, since we only need to support the Ray client server. In principle, though, any component that can connect to the internal KV could be health checked this way.

zhe-thoughts commented 3 years ago

Does the head node currently expose a "metrics endpoint" so people can freely query it for different kinds of metrics?

E.g. this demonstrates what I have in mind: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/data-storage/content/using_jmx_for_accessing_hdfs_metrics.html

wuisawesome commented 3 years ago

We have Prometheus metrics on the head node, which can help with the health of the cluster, but not as much with peripheral components (like the Ray client server or the autoscaler).

ericl commented 3 years ago

LGTM, but can we keep it as a hidden API (ray _health_check or a hidden click flag) for now?

rkooo567 commented 3 years ago

Can we close it?

wuisawesome commented 3 years ago

yup