hashicorp / nomad


Metrics for `degraded` job #19346

Open marekhanzlik opened 10 months ago

marekhanzlik commented 10 months ago

Proposal

Hello, I'm looking for a way to monitor degraded jobs (or to know when not all allocations are in a healthy running state). Right now I haven't found a way to monitor degraded jobs. I can see the failed allocations, but they are unreliable after a Nomad cluster crash/outage.

Use-cases

Monitoring of long-running jobs: to know when certain components of a job have failed and redeployment is prohibited by the retry policy.

Attempted Solutions

tgross commented 10 months ago

Hi @marekhanzlik! The answer to that depends on whether you're using general allocation health, Nomad native service discovery, Consul services, or just want metrics.

If you want to continuously monitor the state of all allocations in the cluster and store the results over the long term, another option is to use the Event Stream and then stream that off to a database or log storage or whatever interface you'd like for querying.
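For illustration, here is a minimal sketch of tailing the event stream over the HTTP API and forwarding allocation events to whatever sink you use. The agent address, the lack of an ACL token, and the `print` stand-in for a real database/log sink are assumptions for the example, not requirements:

```python
# A minimal sketch (not an official integration): tail the Nomad event stream
# over HTTP and forward allocation events to a durable sink.
import json
import requests

NOMAD_ADDR = "http://localhost:4646"  # assumption: default local agent, no ACLs


def tail_allocation_events():
    # /v1/event/stream returns newline-delimited JSON frames; periodic
    # heartbeat frames carry no events and are skipped below.
    with requests.get(
        f"{NOMAD_ADDR}/v1/event/stream",
        params={"topic": "Allocation"},
        stream=True,
        timeout=(5, None),  # connect timeout only; the read side stays open
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            frame = json.loads(line)
            for event in frame.get("Events", []):
                # Ship the event to a database or log store here;
                # printing is a stand-in for that sink.
                print(event["Topic"], event["Type"], event.get("Key"))


if __name__ == "__main__":
    tail_allocation_events()
```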

marekhanzlik commented 10 months ago

I use Nomad native service discovery. I'll check the Event Stream, that looks interesting, but I don't think any of it solves my problem.

I'll try to describe it better: a few days ago we had a power outage in our dev Nomad environment which brought the Nomad cluster down. After the restart everything started OK, but a few jobs had a problem: one of the tasks in the job started failing and hit its restart limit. After that the allocation was dead and the job was reported as degraded, but the allocation metrics did NOT show any allocation as terminated (probably because GC ran? [and it is possible it was a manual one]).

And my question is: how do I catch jobs that got into the Degraded state, or is there any other metric that would indicate that? I don't think allocation metrics alone suit this well.

tgross commented 10 months ago

The reason I recommended the event stream is that it's an immutable record of what's happening in the Raft state, which includes what happens to allocations. As you've noted, allocations eventually get GC'd. What does seem to be missing is any kind of indication on the job itself that it no longer has the expected number of allocations. There are a few quirks to that, because the correct number of allocs for a system job varies depending on the number of nodes that fit, and batch/sysbatch jobs run to completion.

I'm going to mark this issue for further discussion and roadmapping. It seems like an obvious problem but we don't have a good first-class solution for it. (Although your own application metrics should also detect degraded states of the workload itself.)
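In the meantime, one way to approximate that check externally is to compare each service job's desired group counts against the running counts in its job summary. The sketch below does that against the HTTP API; the agent address, the missing ACL handling, and the restriction to running service jobs are assumptions for illustration, and this is not the logic the Nomad UI itself uses:

```python
# A rough, illustrative approximation of a "degraded" check: for each running
# service job, compare the desired task group count from the job spec with the
# running count from the job summary.
import requests

NOMAD_ADDR = "http://localhost:4646"  # assumption: default local agent, no ACLs


def degraded_service_jobs():
    degraded = []
    jobs = requests.get(f"{NOMAD_ADDR}/v1/jobs", timeout=10).json()
    for stub in jobs:
        # Simplification: skip batch/sysbatch/system jobs and anything not running.
        if stub.get("Type") != "service" or stub.get("Status") != "running":
            continue
        job = requests.get(f"{NOMAD_ADDR}/v1/job/{stub['ID']}", timeout=10).json()
        summary = (stub.get("JobSummary") or {}).get("Summary", {})
        for group in job.get("TaskGroups", []):
            desired = group.get("Count") or 0
            running = summary.get(group["Name"], {}).get("Running", 0)
            if running < desired:
                degraded.append((stub["ID"], group["Name"], running, desired))
    return degraded


if __name__ == "__main__":
    for job_id, group, running, desired in degraded_service_jobs():
        # A metrics exporter would emit a gauge per job/group here instead of printing.
        print(f"{job_id}/{group}: {running}/{desired} allocations running")
```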

marekhanzlik commented 10 months ago

After a brief look at the event stream, it seems that to solve my problem I would have to "reimplement" Nomad's deployment logic to know whether an allocation died or was stopped because of a new deployment. Am I right? That seems redundant and fragile.

On the solution: how does the Nomad UI check for and display the Degraded job status? Could that be easily exposed as a metric?

Or maybe don't GC allocations if the job is still running, but that would require setting some metadata on the allocation about the job's latest version (if it is the latest, don't GC; if it is not, GC).

PS: yes, application metrics should know, but that is problematic for stateless services that run with several instances; the count is set in Nomad, and the monitoring doesn't have a clue about that count.

tgross commented 10 months ago

@marekhanzlik the reason we GC terminal allocations for live jobs is that Nomad's state store is in-memory. So if we didn't do that, you could have a job version that lives for a very long time (even years!) and steadily uses up more and more memory as allocations are replaced due to draining.

What you might want to do for your case is to adjust the server#eval_gc_threshold value. We GC allocations at the same time we GC the evaluations that created those allocations (I should make this more obvious in the docs). So you could update to something like eval_gc_threshold = "72h" and then you'd still see several days worth of evaluations and allocations, even after they are terminal.
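For reference, a sketch of where that setting lives in the server agent configuration (the surrounding `server` block and the rest of your agent config are assumed; only `eval_gc_threshold` is the relevant change):

```hcl
server {
  enabled = true

  # Keep terminal evaluations (and the allocations they created) around
  # for 72 hours before they become eligible for garbage collection.
  eval_gc_threshold = "72h"
}
```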

Note that I would not do this if you have a large number of batch jobs, as all those completed batch job allocations will end up sitting around in memory rather than getting GC'd.

marekhanzlik commented 7 months ago

Thanks for the information. I was thinking about this a bit more: wouldn't it be a good idea to create a separate gc_threshold just for batch jobs? If GC (in some way) affects monitoring, and (at least by my thinking) it affects only long-running jobs, because I don't think a batch job can be in a degraded state, why not separate the GC for them?