hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Display allocation deployment health status in UI #12778

Open gmichalec-pandora opened 2 years ago

gmichalec-pandora commented 2 years ago

Proposal

When a deploy fails with an error such as 'Failed due to progress deadline', it would be hugely helpful to be able to inspect the allocation that caused the failure. Currently, as far as I can tell, nothing in the UI indicates whether Nomad considers an individual allocation healthy. An indicator for this status, and ideally a way to filter on it, would make debugging deploy failures much easier.

Use-cases

The end goal is to use the web UI to quickly identify unhealthy allocations, so that we can inspect their logs, etc. to diagnose deployment failures.

Attempted Solutions

I've been using the CLI to filter the allocation list on 'DeploymentStatus.Healthy':

$ nomad alloc status -json | jq '.[] | select(.DeploymentStatus.Healthy == false) | del(.TaskStates)'

lgfa29 commented 2 years ago

Thanks for the suggestion @gmichalec-pandora. I think this makes sense, but we will have to discuss it further to make sure we can present this information properly.

gmichalec-pandora commented 1 year ago

To anyone who stumbles on this, here's my revised command to get a tabular list of all unhealthy allocs in a cluster:

$ nomad alloc status -json | jq -r '.[] | select(.DeploymentStatus.Healthy == false) | [.JobID, .ID, .NodeName] | @tsv' | column -ts $'\t'
martech-hive-transfer-stage    593647db-1ab3-05d8-433d-9798d5cc23ff  sv7-corp-docker50
kafkadoc                       642903fd-d6a0-faca-b34a-f3a1c864bd35  sv7-corp-docker54
...
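
For anyone without jq, the same filter can be sketched in Python. This is only a rough equivalent, not something from the Nomad project: it assumes the output of `nomad alloc status -json` has been saved to a file, and the helper names `unhealthy_rows` and `format_table` and the `allocs.json` file name are made up for illustration. Like the jq filter, it matches only allocations whose 'DeploymentStatus.Healthy' is explicitly false, so allocations with no deployment status or an undetermined one are skipped.

```python
import json
import sys


def unhealthy_rows(allocs):
    """Return (JobID, ID, NodeName) for allocs whose deployment health is explicitly false."""
    rows = []
    for alloc in allocs:
        # DeploymentStatus can be null for allocs that are not part of a deployment.
        status = alloc.get("DeploymentStatus") or {}
        # Healthy can be missing or null while undetermined; match only an explicit false.
        if status.get("Healthy") is False:
            rows.append((
                alloc.get("JobID") or "",
                alloc.get("ID") or "",
                alloc.get("NodeName") or "",
            ))
    return rows


def format_table(rows):
    """Left-align the three columns, separated by two spaces."""
    if not rows:
        return ""
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python unhealthy.py allocs.json
    with open(sys.argv[1]) as handle:
        print(format_table(unhealthy_rows(json.load(handle))))
```

Something like `nomad alloc status -json > allocs.json` followed by `python unhealthy.py allocs.json` should then print the same three columns as the jq pipeline.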