hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Driver health doesn't capture ability to start tasks #6599

Open notnoop opened 5 years ago

notnoop commented 5 years ago

Drivers' current fingerprint health checks are too simple for many production scenarios: a client can be healthy from Nomad's perspective yet be unable to start any tasks.

Consider the case where a client is very close to its file or process limits, where client networking is misconfigured such that Docker image pulls always fail, or where the Docker daemon hangs when starting containers. In these cases, the client will always report as healthy and will keep accepting jobs. Depending on the failure mode, allocations will get stuck (see https://github.com/hashicorp/nomad/issues/6598) and never run!

I can think of two general approaches to address this concern, though both have some downsides:

The first is to expand fingerprinting to account for these limits, and potentially to run sample tasks periodically so health is checked more realistically. Such health checks can be expensive, so they would run infrequently and delay detection of an unhealthy state. They can also be unrepresentative if the test jobs don't reflect the actual workload on the node, negating their value. A sketch of such a probe loop follows.
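To make the first approach concrete, here is a minimal sketch of a periodic probe loop. The `Driver` interface, `probeLoop`, and the canary task name are all assumptions for illustration and do not match Nomad's actual driver plugin API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Driver is a stand-in for a task driver; Nomad's real plugin interface differs.
type Driver interface {
	StartTask(ctx context.Context, name string) error
	StopTask(ctx context.Context, name string) error
}

// probeLoop periodically starts (and immediately stops) a trivial canary task,
// reporting on the healthy channel whether the driver could actually run it.
func probeLoop(ctx context.Context, d Driver, interval time.Duration, healthy chan<- bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			probeCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
			err := d.StartTask(probeCtx, "driver-health-probe")
			if err == nil {
				err = d.StopTask(probeCtx, "driver-health-probe")
			}
			cancel()
			healthy <- err == nil
		}
	}
}

// flakyDriver simulates a driver whose daemon rejects every task start.
type flakyDriver struct{ fail bool }

func (f *flakyDriver) StartTask(ctx context.Context, name string) error {
	if f.fail {
		return errors.New("cannot start container: resource limit reached")
	}
	return nil
}

func (f *flakyDriver) StopTask(ctx context.Context, name string) error { return nil }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	healthy := make(chan bool)
	go probeLoop(ctx, &flakyDriver{fail: true}, time.Second, healthy)
	for i := 0; i < 3; i++ {
		fmt.Println("driver healthy:", <-healthy)
	}
}
```

The timeout on each probe matters: a hung Docker daemon is exactly the case where the probe would otherwise block forever and never report anything.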

The second is to take the historical success of jobs into account and use a scheduling circuit breaker with throttling: e.g. if a client fails to start 5 tasks consecutively, avoid scheduling further tasks on it, aside from an infrequent probe job. It can be difficult, though, to discern whether a failure is truly a client failure (e.g. the client's iptables got messed up), a cluster-wide failure (e.g. the Docker registry is down or cluster networking is down), a user error, or job specific (e.g. an influx of jobs with a bad image reference). We must balance the risk of accidentally marking the entire cluster (or most of it) as unhealthy against the ability to detect truly failed and bad nodes. A sketch of such a breaker follows.
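Here is a minimal sketch of what such a breaker might look like, assuming a per-node failure counter kept by the scheduler; `nodeBreaker`, its thresholds, and the half-open probe behavior are all hypothetical, not existing Nomad code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeBreaker trips after `threshold` consecutive task-start failures, then
// admits only an occasional probe placement (a half-open state) until a
// placement succeeds again.
type nodeBreaker struct {
	mu          sync.Mutex
	consecutive int           // consecutive task-start failures on this node
	threshold   int           // failures before the breaker trips
	tripped     bool          // node is currently considered unhealthy
	lastProbe   time.Time     // last time a probe placement was admitted
	probeEvery  time.Duration // minimum spacing between probe placements
}

// RecordResult updates the breaker after a task-start attempt on the node.
func (b *nodeBreaker) RecordResult(ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ok {
		b.consecutive = 0
		b.tripped = false
		return
	}
	b.consecutive++
	if b.consecutive >= b.threshold {
		b.tripped = true
	}
}

// Allow reports whether the scheduler may place a task on this node. A
// tripped breaker still admits one probe placement per probeEvery window.
func (b *nodeBreaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.tripped {
		return true
	}
	if time.Since(b.lastProbe) >= b.probeEvery {
		b.lastProbe = time.Now()
		return true
	}
	return false
}

func main() {
	b := &nodeBreaker{threshold: 5, probeEvery: time.Minute}
	for i := 0; i < 5; i++ {
		b.RecordResult(false) // five consecutive failed starts trip the breaker
	}
	fmt.Println("placement allowed after trip:", b.Allow())  // true: probe admitted
	fmt.Println("placement allowed right after:", b.Allow()) // false: throttled
}
```

Resetting the counter on any success keeps a single flaky job from permanently poisoning a node, while the probe window bounds how long a recovered node stays ineligible. Distinguishing client failures from cluster-wide ones would still need something extra, e.g. comparing failure rates across nodes before tripping.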

Implementation-wise, we can introduce a concept of node health, in addition to driver health, and make it user configurable.
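As a rough illustration of what "user configurable" could mean, here is a hypothetical shape for a client-side node-health stanza; none of these fields exist in Nomad today:

```go
package config

import "time"

// NodeHealthConfig is a hypothetical client configuration block for
// node-level health tracking; field names and HCL tags are illustrative only.
type NodeHealthConfig struct {
	Enabled          bool          `hcl:"enabled"`           // opt in to node-level health tracking
	FailureThreshold int           `hcl:"failure_threshold"` // consecutive task-start failures before the node is marked unhealthy
	ProbeInterval    time.Duration `hcl:"probe_interval"`    // how often to re-test an unhealthy node
}
```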

stale[bot] commented 4 years ago

Hey there

Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look.

Thanks!