Open gmichalec-pandora opened 2 years ago
Thanks for the suggestion @gmichalec-pandora, I think this makes sense but we will have to discuss it further to make sure we can present this information properly.
To anyone who stumbles on this, here's my revised command to get a tabular list of all unhealthy allocs in a cluster:
$ nomad alloc status -json | jq -r '.[] | select(.DeploymentStatus.Healthy == false) | [.JobID, .ID, .NodeName] | @tsv' | column -ts $'\t'
martech-hive-transfer-stage 593647db-1ab3-05d8-433d-9798d5cc23ff sv7-corp-docker50
kafkadoc 642903fd-d6a0-faca-b34a-f3a1c864bd35 sv7-corp-docker54
...
Proposal
When a deploy fails with an error such as 'Failed due to progress deadline', it would be hugely helpful to be able to inspect the allocation that caused the failure. As far as I can tell, the UI currently gives no indication of whether Nomad considers an individual allocation healthy. An indicator for this status, and ideally a way to filter on it, would make debugging deploy failures much easier.
Use-cases
The end goal is to be able to use the web UI to quickly identify unhealthy allocations, so that we can inspect their logs, etc. and diagnose deployment failures.
Attempted Solutions
I've been using the CLI to filter the allocation list on 'DeploymentStatus.Healthy':
nomad alloc status -json | jq '.[] | select(.DeploymentStatus.Healthy == false) | del(.TaskStates)'
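For anyone who would rather script this than chain jq filters, here is a rough Python sketch of the same 'DeploymentStatus.Healthy' filter. It assumes `nomad alloc status -json` returns a JSON array of allocations with the `JobID`, `ID`, `NodeName` and `DeploymentStatus.Healthy` fields used in the jq command; the function names (`unhealthy_allocs`, `format_table`, `main`) are made up for illustration and are not part of Nomad.

```python
import json
import subprocess


def unhealthy_allocs(allocs):
    """Return (JobID, ID, NodeName) for allocs whose Healthy flag is explicitly false.

    Like the jq filter, allocs with no DeploymentStatus (or a null one) are skipped.
    """
    rows = []
    for alloc in allocs:
        status = alloc.get("DeploymentStatus") or {}
        if status.get("Healthy") is False:
            rows.append((
                alloc.get("JobID") or "",
                alloc.get("ID") or "",
                alloc.get("NodeName") or "",
            ))
    return rows


def format_table(rows):
    """Left-align each column to its widest cell, two spaces between columns."""
    if not rows:
        return ""
    widths = [max(len(cell) for cell in col) for col in zip(*rows)]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


def main():
    # Assumption: the nomad binary is on PATH and NOMAD_ADDR etc. are already set.
    result = subprocess.run(
        ["nomad", "alloc", "status", "-json"],
        check=True,
        capture_output=True,
        text=True,
    )
    print(format_table(unhealthy_allocs(json.loads(result.stdout))))
```

Calling `main()` should print one row per unhealthy allocation. Allocations that have no deployment status at all are left out on purpose, since they are neither healthy nor unhealthy from the deployment's point of view.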