gabreal opened this issue 6 years ago
Thank you for implementing the /health endpoint; it is very useful.
When you are using automated deployment, you have limited options to check whether a node is ready, compared to a manual deployment where you can check logs, run RPC calls, etc. By automated deployment I mostly refer to Kubernetes, but Docker or AWS/GCP autoscaling groups have the same limited set of tools to check whether a node is healthy.
The most common way to check whether a node is healthy is an HTTP GET status code. Currently the Polkadot binary doesn't have a good candidate for it. The endpoint should return 200 if the node is healthy and ready to accept connections, and 5** in all other cases. The current endpoint always returns 200, even if the node is still syncing:
# polkadot 0.9.19
curl -v localhost:9933/health
< HTTP/1.1 200 OK
{"isSyncing":true,"peers":4,"shouldHavePeers":true}
In terms of Kubernetes, this endpoint can be considered a liveness probe, since it returns 200 when the node is in a valid state. But we also need a readiness probe to understand whether the node can be registered as a validator, or added behind a load balancer if it is an RPC node.
The proposed solution is to add two additional GET endpoints:
/health/liveness - should always return 200, unless the node needs a restart.
/health/readiness - should return 200 if the chain is synced, the node can connect to the rest of the network, and it can accept new connections/queries.
For more details, see how it is done in Spring Boot.
Note: the endpoints should be very lightweight, since they will be called every 5-10 seconds.
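To make the proposal concrete, here is a minimal sketch of how such endpoints could be wired into a Kubernetes pod spec. The paths follow the proposal above; the port 9933 and the timing values are assumptions for illustration, not anything Substrate ships:

```yaml
# Hypothetical container probe configuration, assuming the node
# exposes the proposed endpoints on the default RPC port 9933.
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 9933
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/readiness
    port: 9933
  initialDelaySeconds: 10
  periodSeconds: 5
```

With this config, Kubernetes restarts the container after three failed liveness checks and only routes traffic to it while the readiness check returns 200.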
/health/liveness - should always return 200, unless the node needs a restart.
A node never needs a restart. So, this would always return 200.
A node never needs a restart. So, this would always return 200.
That is correct; even if we have such a case (the node needs a restart), the node can just restart itself or fail by itself (example: a corrupted database).
The use case for liveness is to detect when the application has hung. Example: a bug in the application with an endless loop. Something similar happened recently. In this case, the liveness probe will be unresponsive (504 timeouts), the node will be considered dead, and it will be restarted. It is important to put the liveness check in the main thread so it reflects the state of the application.
One of the downsides of the liveness probe is that if the node is busy calculating something big and important for a long time, it may be restarted, since it is not responding to the probe.
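The "liveness in the main thread" idea can be sketched as a heartbeat: the main loop must tick it regularly, and the probe handler only reports alive if a tick arrived recently. This is a hypothetical illustration, not Substrate's actual internals; all names are invented:

```python
import time

class Heartbeat:
    """Liveness tracked from the main loop: the loop calls beat() on
    every iteration; the probe handler calls is_alive() to decide
    between 200 and 5xx. Hypothetical sketch for illustration."""

    def __init__(self, max_silence_secs):
        self.max_silence_secs = max_silence_secs
        self._last_beat = time.monotonic()

    def beat(self):
        # Called from the main event loop; if the loop is stuck in an
        # endless loop elsewhere, beats stop arriving.
        self._last_beat = time.monotonic()

    def is_alive(self, now=None):
        # The probe handler runs this; a long silence means the main
        # loop is hung and the node should be considered dead.
        now = time.monotonic() if now is None else now
        return now - self._last_beat <= self.max_silence_secs

hb = Heartbeat(max_silence_secs=10.0)
hb.beat()
print(hb.is_alive())                             # main loop just beat -> True
print(hb.is_alive(now=time.monotonic() + 30.0))  # simulated 30 s hang -> False
```

Note that this also illustrates the downside above: a main loop legitimately busy with heavy work for longer than the silence window looks identical to a hang.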
None of the named examples would have been detectable. The node runs multiple threads and could still respond to your liveness requests while it hangs in some other important component.
I like the idea of making Substrate more "deploy friendly".
Not sure on the solution though, maybe a PolkadotJS script within the docker image which exposes these endpoints would also work?
And what would this script do?
We already have such a script; we use it to implement a readiness probe for the ws endpoint. It returns 200 if the ws endpoint is available, the node is not syncing, and peers > 0. The script is deployed next to the node and used by the load balancer. But it would be very useful if such an endpoint were built into Substrate.
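The sidecar rule described here (200 if not syncing and peers > 0) can be sketched as a small decision function over the /health response body shown earlier. The field names come from that response; the function name and the 503 fallback are assumptions for illustration:

```python
def readiness_status(health: dict) -> int:
    """Map a /health response body to an HTTP status code:
    200 only if the node is synced and connected, 503 otherwise.
    Mirrors the sidecar rule: not syncing, and peers > 0."""
    syncing = health.get("isSyncing", True)
    peers = health.get("peers", 0)
    should_have_peers = health.get("shouldHavePeers", True)
    if syncing:
        return 503            # still doing a major sync
    if should_have_peers and peers == 0:
        return 503            # isolated from the network
    return 200

# Example bodies, matching the shape shown in the issue:
print(readiness_status({"isSyncing": True, "peers": 4, "shouldHavePeers": True}))   # 503
print(readiness_status({"isSyncing": False, "peers": 4, "shouldHavePeers": True}))  # 200
```

A load balancer or Kubernetes readiness probe would then only need the status code, not the JSON body.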
Especially the readiness endpoint would be tremendously useful for large-scale test networks, where nodes cannot be maintained one by one and automation is inevitable.
As @bkchr said, we cannot check all invariants and liveness assumptions of the node.
But could we add a progress tracker that checks if the node is making progress on importing blocks?
It would then either return an error or time out if the node did not import a block within the last x seconds.
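The suggested progress tracker can be sketched as follows: remember the best block height and when it last advanced, and report unhealthy once it has been stale for longer than the window. Names and structure are illustrative, not Substrate's API:

```python
import time

class ImportProgress:
    """Report unhealthy if no new block was imported within the
    last `window_secs` seconds. Hypothetical sketch."""

    def __init__(self, window_secs):
        self.window_secs = window_secs
        self._best_block = -1
        self._last_advance = time.monotonic()

    def observe(self, best_block, now=None):
        # Called whenever the node reports its current best block;
        # only a strictly higher height counts as progress.
        now = time.monotonic() if now is None else now
        if best_block > self._best_block:
            self._best_block = best_block
            self._last_advance = now

    def making_progress(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self._last_advance <= self.window_secs

t0 = time.monotonic()
p = ImportProgress(window_secs=60.0)
p.observe(100, now=t0)
print(p.making_progress(now=t0 + 30))  # block imported 30 s ago -> True
p.observe(100, now=t0 + 90)            # same height: no progress
print(p.making_progress(now=t0 + 90))  # stalled for 90 s -> False
```

The window would need to comfortably exceed the expected block time, or a slow but healthy chain would be flagged as stalled.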
As it becomes exponentially hard for a multi-component, multi-threaded application to track and fix latent issues at runtime without making real progress, it is sometimes better to just crash the application. Exceptions should be properly handled; unhandled exceptions become errors, which should lead to a crash. A supervisor process will take care of the restart.
A liveness probe is not required at all. The application knows its own health best, and we should not try to guess it by poking a black box with a stick. As @ggwpez mentioned, a liveness probe cannot cover all the invariants of an unhealthy application.
A readiness probe is useful to make sure the node is ready to accept traffic. What makes up a good readiness probe in our case? IDK. Probably a good first candidate would be that the node is synced and has non-zero peers, as proposed by @BulatSaif. Implementing this natively would allow ops teams to avoid running sidecars that periodically evaluate whether a node is ready to accept traffic.
@niklasad1 can you implement the readiness endpoint? It should check that we are not doing a major sync and that we have at least one peer. This should be fairly simple.
@bkchr sorry missed this, sure I can try to fix this one.
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/new-json-rpc-api-mega-q-a/3048/6
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/new-json-rpc-api-mega-q-a/3048/7
ah damn, I will add this to my priority list so it gets done
For integration and automated deployments, it would be convenient to have a (preferably) HTTP endpoint which can be used to monitor the status of a polkadot/substrate/... node.
This is especially useful for Kubernetes, where readiness feedback can ensure that only a limited number of nodes are drained (during an upgrade) before the new ones are back up again. Liveness probes are useful for ensuring that a node is still well connected, and for restarting it upon outage.
As an extension to that, it would be possible to use the same endpoint to provide metrics for Prometheus (which can be used to integrate with Grafana). These metrics are usually available at some /metrics HTTP path and are served as a text file containing three lines for every metric (one is the metric itself, another specifies the type, and a third gives a description).
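The three-line-per-metric text format described above looks like this; the metric name here is hypothetical, not one Substrate actually exports:

```
# HELP substrate_node_peers Number of connected peers (hypothetical metric name)
# TYPE substrate_node_peers gauge
substrate_node_peers 4
```

Prometheus scrapes this plain-text output directly, so a node only needs to render such lines on GET requests to the metrics path.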
It is sometimes seen that /-/ready and /-/healthy are used as readiness and liveness endpoints.
Alternatively, one could use the RPC endpoint for that. Then it would be helpful to add two API calls, for health and readiness: one for when the chain is fully synced and operational, and the other maybe for the number of connected peers.