HTTP liveness probe - Githubissues

doup123 commented 1 month ago

Is your feature request related to a problem? Please describe.

Having an HTTP liveness probe provides a de facto standard way to identify the health status of an application (e.g. https://github.com/influxdata/telegraf/blob/master/plugins/outputs/health/README.md). In a cloud environment this would allow us:

1) To enable a liveness probe and restart any failed RisingWave containers.
2) To detect a malfunctioning RisingWave cluster and redirect traffic to an operating RisingWave cluster in high-availability scenarios (e.g. one cluster is upgraded to a newer version that hits a bug and stops the application from working).

Describe the solution you'd like

A simple HTTP endpoint that returns 200 if RisingWave works as expected, and 503 if it does not.
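To make the request concrete, here is a minimal sketch of such an endpoint, written as a standalone sidecar rather than as anything RisingWave ships. The `/healthz` path, the default port 8080, and the `is_healthy` callback are all hypothetical; what the callback should actually check is exactly the open question in this issue.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_health_handler(is_healthy: Callable[[], bool]):
    """Build a handler class that maps a health callback to 200 or 503."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # "/healthz" is a hypothetical path chosen for this sketch.
            if self.path != "/healthz":
                self.send_error(404)
                return
            try:
                ok = is_healthy()
            except Exception:
                # A check that raises is reported as unhealthy.
                ok = False
            body = b"ok\n" if ok else b"unavailable\n"
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            # Keep probe traffic out of stderr.
            pass

    return HealthHandler


def serve(is_healthy: Callable[[], bool], host: str = "127.0.0.1", port: int = 8080):
    """Serve the health endpoint until interrupted (port is an assumption)."""
    server = HTTPServer((host, port), make_health_handler(is_healthy))
    server.serve_forever()
```

Calling `serve(some_check)` blocks and answers `GET /healthz` with 200 or 503, which is the shape most load balancers and Kubernetes HTTP probes expect.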

Describe alternatives you've considered

The current alternative requires:

1) wget https://raw.githubusercontent.com/risingwavelabs/risingwave/main/proto/health.proto
2) Install grpcurl.
3) Execute grpcurl -plaintext -d '{}' -import-path . -proto health.proto localhost:5690 health.Health/Check, which returns: { "status": "SERVING" }
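The steps above could be wrapped into a single boolean check, for example to feed a probe script. This is only a sketch: it assumes `grpcurl` is on the `PATH` and `health.proto` is in the working directory, the function names are made up for illustration, and it inherits whatever limitations the gRPC health service itself has.

```python
import json
import subprocess

# The exact command from the issue description, split into argv form.
GRPCURL_CMD = [
    "grpcurl", "-plaintext", "-d", "{}",
    "-import-path", ".", "-proto", "health.proto",
    "localhost:5690", "health.Health/Check",
]


def is_serving(grpcurl_output: str) -> bool:
    """Return True only if the JSON reply has status == "SERVING"."""
    try:
        reply = json.loads(grpcurl_output)
    except json.JSONDecodeError:
        return False
    return isinstance(reply, dict) and reply.get("status") == "SERVING"


def check_via_grpcurl(timeout_s: float = 5.0) -> bool:
    """Run grpcurl and interpret its output; any failure counts as unhealthy."""
    try:
        result = subprocess.run(
            GRPCURL_CMD,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # grpcurl missing, not executable, or the call hung.
        return False
    return result.returncode == 0 and is_serving(result.stdout)
```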

Additional context

No response

BugenZhao commented 1 month ago

Hi, thanks for your feedback.

The currently recommended approach to health checking is through the SQL interface, using pg_is_in_recovery() or rw_recovery_status(), which will be available in the upcoming v1.11 release and is also adopted by RisingWave Cloud.

The gRPC health check is also a standard interface, but I'm afraid it's not correctly implemented yet.
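The SQL-based check could be sketched as follows. This assumes a DB-API 2.0 style driver that maps SQL booleans to Python `bool`, and it treats "not in recovery" as healthy, which is an interpretation rather than something the thread spells out. The function names and the `connect` factory are hypothetical; `rw_recovery_status()` is left out because its return values are not given here.

```python
def not_in_recovery(cursor) -> bool:
    """Return True if pg_is_in_recovery() reports false on this cursor."""
    cursor.execute("SELECT pg_is_in_recovery()")
    row = cursor.fetchone()
    # Assumption: the driver returns a Python bool for SQL booleans.
    return row is not None and row[0] is False


def sql_health_check(connect) -> bool:
    """Open a connection via `connect()` and run the recovery check.

    `connect` is a zero-argument factory supplied by the caller.
    Any connection or query error is reported as unhealthy.
    """
    try:
        conn = connect()
    except Exception:
        return False
    try:
        cur = conn.cursor()
        try:
            return not_in_recovery(cur)
        finally:
            cur.close()
    except Exception:
        return False
    finally:
        conn.close()
```

A function like `sql_health_check` can be passed a factory built from any PostgreSQL-compatible driver's connect call with your own connection parameters.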

fuyufjh commented 2 weeks ago

It's not trivial to define "liveness" here. For example:

If the cluster (i.e. Meta Service) is bootstrapping, what is the status of a compute node?
If the cluster (i.e. Meta Service) is under recovery, what is the status of a compute node? (Remember that the CN can serve batch queries now)
If the compute node lost the heartbeat with Meta service, what is the status of a compute node? (IIUC, CN can serve batch queries now)

doup123 commented 2 weeks ago

@fuyufjh totally agree with what you mention, but what would be the requirements that should be met for a cluster to be considered healthy? IMHO, if all of the components are healthy (this means that they are able to perform their tasks), then the cluster could be considered healthy. I am sure that you can define better than me the conditions that should be satisfied by each component.

fuyufjh commented 3 days ago

> @fuyufjh totally agree with what you mention, but what would be the requirements that should be met for a cluster to be considered healthy? IMHO, if all of the components are healthy (this means that they are able to perform their tasks), then the cluster could be considered healthy. I am sure that you can define better than me the conditions that should be satisfied by each component.

It's clear how to define whether a cluster is healthy. However, this issue is about how to identify whether a single component or Pod (e.g. compute node, frontend node, meta node, compactor node, etc.) is healthy, right? That will be ambiguous...

xxchan commented 2 days ago

May I ask whether you ran into any real problems that you want to solve with a liveness probe?

According to the use cases you mentioned:

> To enable a liveness probe and restart any failed risingwave containers

I think this can be handled by Kubernetes (risingwave-operator).

> To detect a malfunctioning RisingWave cluster and redirect traffic to an operating RisingWave cluster in high-availability scenarios (e.g. one cluster is upgraded to a newer version that hits a bug and stops the application from working).

May I ask how you want to do HA? Are you replicating data between 2 RisingWave clusters? A health check looks like the very last step...

BTW, for monitoring cluster health, perhaps you can also use the Grafana dashboard and Prometheus Alertmanager, which should provide more information about the cluster.

doup123 commented 2 days ago

@xxchan thank you for your responses.

> I think this can be handled by Kubernetes (risingwave-operator).

You are probably correct on this.

> May I ask how you want to do HA? Are you replicating data between 2 RisingWave clusters? A health check looks like the very last step...

Consider a scenario where multiple clusters run anycasted via K8s in different geolocations. IMHO there should be a way to check whether a cluster is operational, so that traffic can be sent to it or withdrawn from it accordingly.

> BTW, for monitoring cluster health, perhaps you can also use the Grafana dashboard and Prometheus Alertmanager, which should provide more information about the cluster.

The work that has been done with the exposed metrics via Prometheus and the corresponding dashboards is great and I will rely on it to check the "status" of the cluster. What I was trying to say is that it is a very common practice to have an HTTP endpoint with the health status of a service that can be directly used for monitoring/alerting.

risingwavelabs / risingwave

HTTP liveness probe #17771
