hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

[feature] easier method to understand node driver health #6102

Open jrasell opened 4 years ago

jrasell commented 4 years ago

As far as I know, the only way to programmatically check the status of a driver on a Nomad client is to process the response from the /v1/node/:node_id API endpoint. If a driver fails but the cluster has capacity to place the workload on another node, the driver failure could go unnoticed.
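For reference, the current approach can be sketched roughly as follows. This is a minimal illustration, not an official client: it assumes the node response carries a `Drivers` map whose entries have `Detected`, `Healthy`, and `HealthDescription` fields, and that the agent listens on the default `http://127.0.0.1:4646`. The function names `fetch_node` and `unhealthy_drivers` are made up for this example.

```python
import json
import urllib.request


def unhealthy_drivers(node):
    """Return {driver_name: health_description} for drivers that were
    detected on the node but are not reporting healthy."""
    drivers = node.get("Drivers") or {}
    return {
        name: info.get("HealthDescription", "")
        for name, info in drivers.items()
        if info.get("Detected") and not info.get("Healthy")
    }


def fetch_node(node_id, nomad_addr="http://127.0.0.1:4646", token=None):
    """Fetch and decode /v1/node/:node_id.

    The address default and the optional ACL token are assumptions
    about the local setup; adjust them for your cluster.
    """
    req = urllib.request.Request(f"{nomad_addr}/v1/node/{node_id}")
    if token:
        req.add_header("X-Nomad-Token", token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

An alerting job would call `unhealthy_drivers(fetch_node(node_id))` for every node and raise an alert whenever the result is non-empty, which is exactly the per-node polling this issue would like to avoid.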

It would be helpful to have an easier way to monitor the health of a Nomad client node's drivers, which could in turn be integrated into an alerting system. One possibility is to register the detected drivers in Consul as health checks under the Nomad client catalog entry. Each health check could be updated as the driver's health changes, making operation easier and cluster issues more observable.
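To make the Consul idea concrete, here is a rough sketch of how driver health could be mapped onto a Consul check update. Nothing here exists in Nomad today: the check ID scheme (`nomad-driver-<name>`), the function names, and the assumption that a TTL check with that ID was registered beforehand are all hypothetical. Only the Consul agent endpoint `PUT /v1/agent/check/update/:check_id` and its `Status`/`Output` body are taken from Consul's HTTP API.

```python
import json
import urllib.request


def driver_check_update(driver_name, info):
    """Translate one driver's health info into (check_id, payload).

    The "nomad-driver-<name>" check ID is a hypothetical naming scheme
    chosen for this example.
    """
    healthy = bool(info.get("Detected")) and bool(info.get("Healthy"))
    payload = {
        "Status": "passing" if healthy else "critical",
        "Output": info.get("HealthDescription", ""),
    }
    return f"nomad-driver-{driver_name}", payload


def push_check_update(check_id, payload, consul_addr="http://127.0.0.1:8500"):
    """Send the update to the local Consul agent.

    Assumes a TTL check with this ID is already registered on the agent.
    """
    req = urllib.request.Request(
        f"{consul_addr}/v1/agent/check/update/{check_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

With something like this in place, existing Consul-based alerting would pick up a failed driver as a critical check on the client node, without any Nomad-specific polling.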

cc @stevenscg

endocrimes commented 4 years ago

Good call. It could also be interesting to emit metrics based on driver/plugin health, for folks who run their alerting through metrics.

pznamensky commented 3 years ago

This would be very helpful for us too.