gabreal opened this issue 6 years ago
Thank you for implementing the /health endpoint; it is very useful.
When you are using automated deployment, you have limited options to check whether a node is ready, compared to a manual deployment where you can check logs, run RPC calls, etc. By automated deployment I mostly refer to Kubernetes, but Docker or AWS/GCP autoscaling groups have the same limited set of tools to check whether a node is healthy.
The most common way to check whether a node is healthy is an HTTP GET status code. Currently the Polkadot binary doesn't have a good candidate for it. The endpoint should return 200 if the node is healthy and ready to accept connections, and 5** in all other cases. The current endpoint always returns 200, even if the node is still syncing:
# polkadot 0.9.19
curl -v localhost:9933/health
< HTTP/1.1 200 OK
{"isSyncing":true,"peers":4,"shouldHavePeers":true}
In terms of Kubernetes, this endpoint can be considered a liveness probe, since it returns 200 when the node is in a valid state. But we also need a readiness probe to understand whether the node can be registered as a validator, or added behind a load balancer if it is an RPC node.
The proposed solution is to add two additional GET endpoints:
/health/liveness - should always return 200, unless the node needs a restart.
/health/readiness - should return 200 if the chain is synced, the node can connect to the rest of the network, and it can accept new connections/queries.
For more details, see how it is done in Spring Boot.
Note: the endpoints should be very lightweight, since they will be called every 5-10 seconds.
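To make the proposal concrete, here is a minimal sketch of how such endpoints could be wired into a Kubernetes pod spec. The paths follow the proposal above; the port 9933 and the timing values are assumptions for illustration, not anything Substrate ships:

```yaml
# Hypothetical container probe configuration, assuming the node
# exposes the proposed endpoints on the default RPC port 9933.
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 9933
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/readiness
    port: 9933
  initialDelaySeconds: 10
  periodSeconds: 5
```

With this config, Kubernetes restarts the container after three failed liveness checks and only routes traffic to it while the readiness check returns 200.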
/health/liveness - should always return 200, unless the node needs a restart.
A node never needs a restart. So, this would always return 200.
A node never needs a restart. So, this would always return 200.
That is correct; even if we have such a case (the node needs a restart), the node can just restart itself or fail by itself (example: a corrupted database).
The use case for liveness is to detect when the application has hung. Example: a bug in the application with an endless loop. Something similar happened recently. In this case, the liveness probe will be unresponsive (504 timeouts), the node will be considered dead, and it will be restarted. It is important to put the liveness check in the main thread so it reflects the state of the application.
One of the downsides of the liveness probe is that if the node is busy calculating something big and important for a long time, it may be restarted, since it is not responding to the probe.
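The "liveness in the main thread" idea can be sketched as a heartbeat: the main loop must tick it regularly, and the probe handler only reports alive if a tick arrived recently. This is a hypothetical illustration, not Substrate's actual internals; all names are invented:

```python
import time

class Heartbeat:
    """Liveness tracked from the main loop: the loop calls beat() on
    every iteration; the probe handler calls is_alive() to decide
    between 200 and 5xx. Hypothetical sketch for illustration."""

    def __init__(self, max_silence_secs):
        self.max_silence_secs = max_silence_secs
        self._last_beat = time.monotonic()

    def beat(self):
        # Called from the main event loop; if the loop is stuck in an
        # endless loop elsewhere, beats stop arriving.
        self._last_beat = time.monotonic()

    def is_alive(self, now=None):
        # The probe handler runs this; a long silence means the main
        # loop is hung and the node should be considered dead.
        now = time.monotonic() if now is None else now
        return now - self._last_beat <= self.max_silence_secs

hb = Heartbeat(max_silence_secs=10.0)
hb.beat()
print(hb.is_alive())                             # main loop just beat -> True
print(hb.is_alive(now=time.monotonic() + 30.0))  # simulated 30 s hang -> False
```

Note that this also illustrates the downside above: a main loop legitimately busy with heavy work for longer than the silence window looks identical to a hang.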
None of the named examples would have been detectable. The node runs multiple threads and could still respond to your liveness requests while it hangs in some other important component.
I like the idea of making Substrate more "deploy friendly".
Not sure on the solution though, maybe a PolkadotJS script within the docker image which exposes these endpoints would also work?
And what would this script do?
We already have such a script; we use it to implement a readiness probe for the ws endpoint. It returns 200 if the ws endpoint is available, the node is not syncing, and peers > 0. The script is deployed next to the node and used by the load balancer. But it would be very useful if such an endpoint were built into Substrate.
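The sidecar rule described here (200 if not syncing and peers > 0) can be sketched as a small decision function over the /health response body shown earlier. The field names come from that response; the function name and the 503 fallback are assumptions for illustration:

```python
def readiness_status(health: dict) -> int:
    """Map a /health response body to an HTTP status code:
    200 only if the node is synced and connected, 503 otherwise.
    Mirrors the sidecar rule: not syncing, and peers > 0."""
    syncing = health.get("isSyncing", True)
    peers = health.get("peers", 0)
    should_have_peers = health.get("shouldHavePeers", True)
    if syncing:
        return 503            # still doing a major sync
    if should_have_peers and peers == 0:
        return 503            # isolated from the network
    return 200

# Example bodies, matching the shape shown in the issue:
print(readiness_status({"isSyncing": True, "peers": 4, "shouldHavePeers": True}))   # 503
print(readiness_status({"isSyncing": False, "peers": 4, "shouldHavePeers": True}))  # 200
```

A load balancer or Kubernetes readiness probe would then only need the status code, not the JSON body.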
Especially the readiness endpoint would be tremendously useful for large-scale test networks, where nodes cannot be maintained one by one and automation is inevitable.
As @bkchr said, we cannot check all invariants and liveness assumptions of the node.
But could we add a progress tracker that checks if the node is making progress on importing blocks?
It would then either return an error or time out if the node did not import a block within the last x seconds.
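The suggested progress tracker can be sketched as follows: remember the best block height and when it last advanced, and report unhealthy once it has been stale for longer than the window. Names and structure are illustrative, not Substrate's API:

```python
import time

class ImportProgress:
    """Report unhealthy if no new block was imported within the
    last `window_secs` seconds. Hypothetical sketch."""

    def __init__(self, window_secs):
        self.window_secs = window_secs
        self._best_block = -1
        self._last_advance = time.monotonic()

    def observe(self, best_block, now=None):
        # Called whenever the node reports its current best block;
        # only a strictly higher height counts as progress.
        now = time.monotonic() if now is None else now
        if best_block > self._best_block:
            self._best_block = best_block
            self._last_advance = now

    def making_progress(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self._last_advance <= self.window_secs

t0 = time.monotonic()
p = ImportProgress(window_secs=60.0)
p.observe(100, now=t0)
print(p.making_progress(now=t0 + 30))  # block imported 30 s ago -> True
p.observe(100, now=t0 + 90)            # same height: no progress
print(p.making_progress(now=t0 + 90))  # stalled for 90 s -> False
```

The window would need to comfortably exceed the expected block time, or a slow but healthy chain would be flagged as stalled.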
As it becomes exponentially hard for a multi-component, multi-threaded application to track and fix latent issues at runtime without making real progress, it is sometimes better to just crash the application. Exceptions should be properly handled; unhandled exceptions become errors, which should lead to a crash. A supervisor process will take care of the restart.
A liveness probe is not required at all. The application knows its own health best, and we should not try to guess it by poking a black box with a stick. As @ggwpez mentioned, a liveness probe cannot cover all the invariants of an unhealthy application.
A readiness probe is useful to make sure the node is ready to accept traffic. What makes up a good readiness probe in our case? IDK. Probably a good first candidate would be that the node is synced and has non-zero peers, as proposed by @BulatSaif. Implementing this natively would allow ops teams to avoid running sidecars that periodically evaluate whether a node is ready to accept traffic.
@niklasad1 can you implement the readiness endpoint? It should check that we are not doing a major sync and that we have at least one peer. This should be fairly simple.
@bkchr sorry missed this, sure I can try to fix this one.
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/new-json-rpc-api-mega-q-a/3048/6
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/new-json-rpc-api-mega-q-a/3048/7
ah damn, I will add this to my priority list so it gets done
For integration and automated deployments, it would be convenient to have a (preferably) HTTP endpoint which can be used to monitor the status of a polkadot/substrate/... node.
This is especially useful for Kubernetes, where readiness feedback can ensure that only a limited number of nodes are drained (during an upgrade) before the new ones are back up again. Liveness probes are useful for ensuring that a node is still well connected, and for restarting it upon outage.
As an extension to that, it would be possible to use the same endpoint to provide metrics for Prometheus (which can be used to integrate with Grafana). These metrics are usually available at some /metrics HTTP path and are served as a text file containing three lines for every metric (one is the metric itself, another specifies the type, and a third gives a description).
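The three-line-per-metric text format described above looks like this; the metric name here is hypothetical, not one Substrate actually exports:

```
# HELP substrate_node_peers Number of connected peers (hypothetical metric name)
# TYPE substrate_node_peers gauge
substrate_node_peers 4
```

Prometheus scrapes this plain-text output directly, so a node only needs to render such lines on GET requests to the metrics path.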
It is sometimes seen that /-/ready and /-/healthy are used as readiness and liveness endpoints.
Alternatively, one could use the RPC endpoint for that. Then it would be helpful to add two API calls, for health and readiness: one for when the chain is fully synced and operational, and the other maybe for the number of connected peers.