hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Driver health doesn't capture ability to start tasks #6599

Open notnoop opened 5 years ago

notnoop commented 5 years ago

Drivers' current fingerprint health checks are too simple for many production scenarios: a client can be healthy from Nomad's perspective yet be unable to start any tasks.

Consider the case where a client is very close to its file or process limits, where client networking is misconfigured such that Docker image pulls always fail, or where the Docker daemon hangs when starting containers. In these cases, the client will always report as healthy and will keep accepting jobs. Depending on the failure mode, allocations will get stuck (see https://github.com/hashicorp/nomad/issues/6598) and never run!

I can think of two general approaches to address this concern, though both have some downsides:

The first is to expand fingerprinting to account for these limits, and potentially to run sample tasks periodically so health is checked more realistically. Such health checks can be expensive, so they would run infrequently and delay detection of an unhealthy state. They can also be unrepresentative if the test jobs don't reflect the actual workload on the node, negating their value. A sketch of such a probe loop follows.
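To make the first approach concrete, here is a minimal sketch of a periodic probe loop. The `Driver` interface, `probeLoop`, and the canary task name are all assumptions for illustration and do not match Nomad's actual driver plugin API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Driver is a stand-in for a task driver; Nomad's real plugin interface differs.
type Driver interface {
	StartTask(ctx context.Context, name string) error
	StopTask(ctx context.Context, name string) error
}

// probeLoop periodically starts (and immediately stops) a trivial canary task,
// reporting on the healthy channel whether the driver could actually run it.
func probeLoop(ctx context.Context, d Driver, interval time.Duration, healthy chan<- bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			probeCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
			err := d.StartTask(probeCtx, "driver-health-probe")
			if err == nil {
				err = d.StopTask(probeCtx, "driver-health-probe")
			}
			cancel()
			healthy <- err == nil
		}
	}
}

// flakyDriver simulates a driver whose daemon rejects every task start.
type flakyDriver struct{ fail bool }

func (f *flakyDriver) StartTask(ctx context.Context, name string) error {
	if f.fail {
		return errors.New("cannot start container: resource limit reached")
	}
	return nil
}

func (f *flakyDriver) StopTask(ctx context.Context, name string) error { return nil }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	healthy := make(chan bool)
	go probeLoop(ctx, &flakyDriver{fail: true}, time.Second, healthy)
	for i := 0; i < 3; i++ {
		fmt.Println("driver healthy:", <-healthy)
	}
}
```

The timeout on each probe matters: a hung Docker daemon is exactly the case where the probe would otherwise block forever and never report anything.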

The second is to take the historical success of jobs into account and use a scheduling circuit breaker with throttling: e.g. if a client fails to start 5 tasks consecutively, avoid scheduling further tasks on it, aside from an infrequent probe job. It can be difficult, though, to discern whether a failure is truly a client failure (e.g. the client's iptables got messed up), a cluster-wide failure (e.g. the Docker registry is down or cluster networking is down), a user error, or job specific (e.g. an influx of jobs with a bad image reference). We must balance the risk of accidentally marking the entire cluster (or most of it) as unhealthy against the ability to detect truly failed and bad nodes. A sketch of such a breaker follows.
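Here is a minimal sketch of what such a breaker might look like, assuming a per-node failure counter kept by the scheduler; `nodeBreaker`, its thresholds, and the half-open probe behavior are all hypothetical, not existing Nomad code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeBreaker trips after `threshold` consecutive task-start failures, then
// admits only an occasional probe placement (a half-open state) until a
// placement succeeds again.
type nodeBreaker struct {
	mu          sync.Mutex
	consecutive int           // consecutive task-start failures on this node
	threshold   int           // failures before the breaker trips
	tripped     bool          // node is currently considered unhealthy
	lastProbe   time.Time     // last time a probe placement was admitted
	probeEvery  time.Duration // minimum spacing between probe placements
}

// RecordResult updates the breaker after a task-start attempt on the node.
func (b *nodeBreaker) RecordResult(ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ok {
		b.consecutive = 0
		b.tripped = false
		return
	}
	b.consecutive++
	if b.consecutive >= b.threshold {
		b.tripped = true
	}
}

// Allow reports whether the scheduler may place a task on this node. A
// tripped breaker still admits one probe placement per probeEvery window.
func (b *nodeBreaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.tripped {
		return true
	}
	if time.Since(b.lastProbe) >= b.probeEvery {
		b.lastProbe = time.Now()
		return true
	}
	return false
}

func main() {
	b := &nodeBreaker{threshold: 5, probeEvery: time.Minute}
	for i := 0; i < 5; i++ {
		b.RecordResult(false) // five consecutive failed starts trip the breaker
	}
	fmt.Println("placement allowed after trip:", b.Allow())  // true: probe admitted
	fmt.Println("placement allowed right after:", b.Allow()) // false: throttled
}
```

Resetting the counter on any success keeps a single flaky job from permanently poisoning a node, while the probe window bounds how long a recovered node stays ineligible. Distinguishing client failures from cluster-wide ones would still need something extra, e.g. comparing failure rates across nodes before tripping.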

Implementation-wise, we can introduce a concept of node health, in addition to driver health, and make it user configurable.
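As a rough illustration of what "user configurable" could mean, here is a hypothetical shape for a client-side node-health stanza; none of these fields exist in Nomad today:

```go
package config

import "time"

// NodeHealthConfig is a hypothetical client configuration block for
// node-level health tracking; field names and HCL tags are illustrative only.
type NodeHealthConfig struct {
	Enabled          bool          `hcl:"enabled"`           // opt in to node-level health tracking
	FailureThreshold int           `hcl:"failure_threshold"` // consecutive task-start failures before the node is marked unhealthy
	ProbeInterval    time.Duration `hcl:"probe_interval"`    // how often to re-test an unhealthy node
}
```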

stale[bot] commented 4 years ago

Hey there

Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look.

Thanks!