hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Feature request: Custom nomad agent health checks #3743

Open groggemans opened 6 years ago

groggemans commented 6 years ago

In some situations, the current heartbeat check is not sufficient to detect problems on nodes running the Nomad agent. It would be nice if we could extend the heartbeat check with custom checks.

I had a few cases where an application on a node misbehaved and my node checks in Consul went into a failed state. Nomad's heartbeat check didn't detect the problem and just kept scheduling tasks to the node. For now, the only workaround is to add a Consul watch/handler that starts draining the troublesome node.
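A minimal sketch of such a watch handler, assuming a Consul `checks` watch that passes a JSON array of check objects (with `Node`, `Name`, `Status`, and `ServiceID` fields) to the handler on stdin. The function name `failing_node_checks`, the `DRAIN_COMMAND` value, and the choice to drain only on node-level checks are illustrative assumptions, not something taken from this thread; the drain command in particular may differ between Nomad versions.

```python
import json
import subprocess
import sys

# Assumption: drains the local node. Older Nomad releases spelled this
# `nomad node-drain`; adjust to whatever your installed version accepts.
DRAIN_COMMAND = ["nomad", "node", "drain", "-enable", "-self", "-yes"]


def failing_node_checks(payload, node_name):
    """Return the names of critical node-level checks for node_name.

    payload is the raw JSON text Consul hands to the watch handler.
    Checks tied to a service (non-empty ServiceID) are ignored, since
    this sketch only reacts to checks on the node itself.
    """
    checks = json.loads(payload) if payload.strip() else []
    return [
        check.get("Name", "")
        for check in checks or []  # Consul may send `null` for "no checks"
        if check.get("Node") == node_name
        and not check.get("ServiceID")
        and check.get("Status") == "critical"
    ]


def main(node_name):
    failing = failing_node_checks(sys.stdin.read(), node_name)
    if failing:
        print(f"draining {node_name}: {', '.join(failing)}", file=sys.stderr)
        subprocess.run(DRAIN_COMMAND, check=True)


# Expects the local node name as the only argument (hypothetical convention).
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

A handler like this could be registered with something along the lines of `consul watch -type=checks -state=critical`, followed by the script path and node name. Note that it only enables draining; turning the drain off again once the checks recover is left to the operator.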

@schmichael indicated that there will be some improvements regarding draining and node health detection in v0.8, but no custom health checks.

shantanugadgil commented 6 years ago

I too am eagerly awaiting the enhancements around Nomad detecting a "driver failure", which seem to be part of version 0.8.

My problem is that the Docker daemon fails and is unable to start the assigned task, yet Nomad keeps scheduling the task onto the same node.

My specific problem also seems to involve a Docker bug that is being fixed in the upcoming 18.03 release of Docker, but overall, driver failure detection would be an awesome feature to have in Nomad itself.

chelseakomlo commented 6 years ago

This will be a feature in Nomad 0.8: if a Nomad client detects that Docker is unresponsive, tasks requiring Docker will be scheduled onto another node where Docker is healthy.