thetoothpick opened this issue 1 month ago
Hey there @thetoothpick, thanks for the report!
Seems like there are two (presumably related) topics here:
Before tackling the strange deployment behavior, I'd like to get a better sense of the state of these stuck, GC-resistant allocs, and their evaluations (allocs get GC'd with their evals - they're tightly connected).
Are you able to run these commands against one such stuck alloc, and provide the output (sanitized as needed)?
```
nomad alloc status -json <alloc id>
# get the EvalID from that ^
nomad eval status -json <eval id>
```
With that, we may be able to 1) understand how they got stuck and how to delete them, so the strange deployments don't occur, and 2) perhaps get a test environment into the same weird state to troubleshoot the deployment behavior.
If you can easily trigger one of the strange deployments, getting an `eval status -json` of the deployment evaluation would be great too.
In general, looking at `nomad eval list` and the full status of any evaluations that aren't complete might be interesting to see.
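One way to script that sweep is below (a sketch; it assumes `jq` is installed, and the filter reads the JSON on stdin so it can be tried offline):

```shell
# Filter `nomad eval list -json` output down to the IDs of evaluations
# that are not complete, so their full status can be inspected.
# In practice: nomad eval list -json | incomplete_evals
incomplete_evals() {
  jq -r '.[] | select(.Status != "complete") | .ID'
}

# Offline demo with data shaped like the CLI output:
printf '%s' '[{"ID":"aaa","Status":"complete"},{"ID":"bbb","Status":"pending"}]' \
  | incomplete_evals
# -> bbb
```

Each ID it prints can then be fed to `nomad eval status -json <eval id>`.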
Hello! I emailed some files to the nomad-oss-debug email address in the issue template.
There are three types of files, covering two different Nomad jobs.
- `failed-alloc`: the requested command output for the "failed" allocation
- `replacement-alloc`: the same command output, but for the currently running allocation
- `deploy`: command output from a deployment of each job, triggered by `nomad job inspect $job | nomad job run -json -`
(Something about the upgrade from Nomad 1.6 -> Nomad 1.8 triggered an allocation replacement with that command; I can send along a job plan if it's interesting, but I'm not too concerned.)
The two jobs (1 and 2) both had failed allocations while running Nomad 1.6. With job 1, I had explicitly triggered the failure (`kill -9`) and had run the `/evaluate` HTTP endpoint while on Nomad 1.6. I had not run `/evaluate` on job 2. The count for the single task group in each job was 1, with `canary = 1`.
After upgrading to Nomad 1.8, I gathered command output from job 1, and then tried a deploy using the above command. This deploy actually featured different behavior than I had previously seen - it still spawned two new allocations, but it listed both allocations as "placed," and was successful once both allocations went healthy. It then shut down one allocation.
I also gathered command output and tried to redeploy job 2. This exhibited the behavior I had originally seen, where the deployment thought it had only placed one allocation and hung once the second was stopped.
For both jobs, the failed allocation was only garbage collected after the deploy finished.
The main difference I found in the `nomad alloc status` output for the failed allocs in jobs 1 and 2 was this:
Job 1:
```
"DeploymentStatus": {
  "Canary": false,
  "Healthy": true,  // <-----
  "ModifyIndex": 5405158,
  "Timestamp": "2024-07-29T17:45:44.955436318Z"
},
"DesiredDescription": "",
"DesiredStatus": "run",
"DesiredTransition": {
  "Migrate": null,
  "Reschedule": null  // <-----
},
```
Job 2:
```
"DeploymentStatus": {
  "Canary": false,
  "Healthy": false,  // <-----
  "ModifyIndex": 5405704,
  "Timestamp": null
},
"DesiredDescription": "",
"DesiredStatus": "run",
"DesiredTransition": {
  "Migrate": null,
  "Reschedule": true  // <-----
},
```
Having found this new behavior, in hindsight I would have liked to run an evaluation on all the relevant jobs while the cluster was still on Nomad 1.6, but we've already upgraded it to 1.8 🙂. At least the deploy was able to finish in that situation.
As it stands, we're still planning on running evaluations on the problematic jobs. Please let us know if you see anything that can be used to clear the failed allocations first. Hopefully the output I sent via email will be helpful.
Thanks for looking into this!
@gulducat just a heads up, we went ahead with the upgrade on our prod cluster and updated all of our jobs. with manual intervention for jobs that we identified as having "stuck" failed allocations (i.e. forcing an evaluation to "replace" stuck failed allocs, and then another to clean up the extras), we were able to upgrade our services with minimal pain.
it would still be nice to look into this to ensure it won't happen again, but from our side the pain points are at least gone.
Hello! We're having an issue when upgrading from Nomad 1.6.9 to Nomad 1.8.1. It seems to be caused by failed allocations that were never garbage collected (sometimes from months ago). We think we have a workaround to make our upgrade less painful, but it's still very involved.
We would like to 1) submit a bug report for the behavior, and 2) check if there's a better workaround to remove the stuck "failed" allocations in Nomad 1.8.1.
Nomad version
Nomad 1.6.9 / 1.8.1
Operating system and Environment details
Linux
Issue
We were running on Nomad 1.6.9 for some time, and upgraded directly to 1.8.1 in development environments. We're still running 1.6.9 in production.
In 1.6.9, it seems like the cluster has collected "failed" allocations and would not garbage collect them. Sometimes the allocations stick around even after its client was gone from the cluster. When we updated to 1.8.1 for the first time in development, any new deploy of a job that had these phantom failed allocations would attempt to "replace" the allocation on deployment. It took a while (and a second cluster upgrade) to figure out the mechanics of what was going on, but I think it's as follows:
1. Failed allocations are never removed by `nomad system gc`, even though it has been weeks / months since they failed, and the client has been removed.
2. On a new deploy, Nomad attempts to "replace" these failed allocations; the extra allocations also resist `nomad system gc`.
3. `count` allocations are launched, and old allocations are stopped.
4. Since there are now more than `count` (or `canary`) allocations, all "extra" allocations are shut down, leaving the correct count.

It seems like the root issue is that Nomad didn't record that they were replaced already, and so it tries to replace them the next chance it gets.
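For what it's worth, enumerating the phantom allocs per job can be scripted (a sketch; assumes `jq`, and that `nomad job allocs -json <job>` emits an array of allocation stubs with `ID` and `ClientStatus` fields):

```shell
# Filter `nomad job allocs -json <job>` output down to the IDs of
# allocations whose ClientStatus is "failed". Reads JSON on stdin.
# In practice: nomad job allocs -json job2 | failed_alloc_ids
failed_alloc_ids() {
  jq -r '.[] | select(.ClientStatus == "failed") | .ID'
}

# Offline demo:
printf '%s' '[{"ID":"x","ClientStatus":"running"},{"ID":"y","ClientStatus":"failed"}]' \
  | failed_alloc_ids
# -> y
```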
Our system relies on updating a number of jobs in parallel / sequentially, so having to spawn tons of extra allocations and keep failing deployments / restarting the deploy controller is a huge pain.
Right now, our current plan for upgrading production is to upgrade to 1.8.1, trigger two evaluations for each job with failed allocations, and then run a `nomad system gc`.
We'd like to see if there is an easier way to get rid of these stuck failed allocations. Is there a way to forcibly remove the allocations?
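As a rough sketch of that two-evaluation plan (hedged: the job list on stdin is an assumption, and `nomad job eval -force-reschedule` is one way to trigger the "replace" evaluation; adjust to taste):

```shell
# Sketch of the planned workaround. For each affected job (one ID per
# line on stdin): force a reschedule evaluation so the stuck failed
# allocs are replaced, then run a second evaluation to clean up the
# extras; finish with a cluster-wide GC.
upgrade_sweep() {
  while read -r job; do
    nomad job eval -force-reschedule "$job"
    nomad job eval "$job"
  done
  nomad system gc
}

# e.g.: upgrade_sweep < jobs.txt   (jobs.txt: one job ID per line)
```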
I've pored over the documentation and tried everything GC-related I could find: any direct allocation GC fails because the clients are gone, and `nomad system gc` is ineffective.
Example
In this job, `count = 2` and `update.canary = 2`. There were 4 stuck failed allocations.
Screenshots
Nomad Server logs (if appropriate)
I watched the Nomad 1.6.9 server logs via the GUI monitor during a `nomad system gc`; nothing relevant was printed.
Nomad CLI logs