Nomad 1.8.1 deploy/eval spawns extra allocations for non-GC'd "failed" allocations

thetoothpick commented 1 month ago

Hello! We're having an issue when upgrading from Nomad 1.6.9 to Nomad 1.8.1. It seems to be caused by failed allocations that were never garbage collected (sometimes from months ago). We think we have a workaround to make our upgrade less painful, but it's still very involved.

We would like to 1) submit a bug report for the behavior, and 2) check if there's a better workaround to remove the stuck "failed" allocations in Nomad 1.8.1.

Nomad version

Nomad 1.6.9 / 1.8.1

Operating system and Environment details

Linux

Issue

We were running on Nomad 1.6.9 for some time, and upgraded directly to 1.8.1 in development environments. We're still running 1.6.9 in production.

In 1.6.9, it seems like the cluster has collected "failed" allocations and would not garbage collect them. Sometimes the allocations stick around even after its client was gone from the cluster. When we updated to 1.8.1 for the first time in development, any new deploy of a job that had these phantom failed allocations would attempt to "replace" the allocation on deployment. It took a while (and a second cluster upgrade) to figure out the mechanics of what was going on, but I think it's as follows:

Failed allocations exist in Nomad 1.6.9 that have not been garbage collected.
- These allocations have not been marked as "replaced," even though the job is currently running the desired number of allocations and they are healthy.
- These allocations survive a full nomad system gc, even though it has been weeks / months since they failed, and the client has been removed.
Upon upgrading to Nomad 1.8.1, these allocations still survive a nomad system gc.
When updating any job that has these failed allocations in 1.8.1, a deploy is triggered. This deploy behaves strangely:
- The correct number of canary allocations are launched for canary updates, or count allocations are launched, and old allocations are stopped.
- In addition, one allocation is started for each of the remaining failed allocations.
- Once the deployment health conditions are met for count (or canary) allocations, all "extra" allocations are shut down, leaving the correct count.
- However, the deploy still waits for the total number of new allocations created, creating a deadlock.
- This also happens if the job requires a CSI volume, even with "single writer" mode. This also creates a deadlock, since only one of these allocations can run at any time, but the deploy starts multiple and tries to wait for all of them.
The deploy can be explicitly marked as "failed".
After the deploy is complete, the previous "failed" allocations can be GC'd
In 1.8.1, triggering an evaluation on the job (without a deploy) will launch a replacement task for each failed allocation. Triggering a second evaluation will stop the extra allocations (earliest allocations first).

It seems like the root issue is that Nomad didn't record that they were replaced already, and so it tries to replace them the next chance it gets.

Our system relies on updating a number of jobs in parallel / sequentially, so having to spawn tons of extra allocations and keep failing deployments / restarting the deploy controller is a huge pain.

Right now, our current plan for upgrading production is to upgrade to 1.8.1, and trigger two evaluations for each job with failed allocations, then run a system gc.

We'd like to see if there is an easier way to get rid of these stuck failed allocations. Is there a way to forcibly remove the allocations?

I've poured over the documentation and tried everything GC-related I could find - any direct allocation GC fails because the clients are gone, and nomad system gc is ineffective.

Example

In this job, count = 2 and update.canary = 2. There were 4 stuck failed allocations.

Screenshots

Screenshot 2024-07-26 at 6 44 38 PM

Screenshot 2024-07-26 at 6 46 08 PM

Screenshot 2024-07-26 at 6 46 29 PM

Nomad Server logs (if appropriate)

2024-07-27T01:41:39.963Z [DEBUG] nomad.job.service_sched: reconciled current state with desired state: eval_id=36076b6d-1bfa-40c3-1944-823c940d768b job_id=starbase-sandbox-gateway-sb-fix-dropcopy namespace=default
  results=
  | Total changes: (place 6) (destructive 0) (inplace 0) (stop 4) (disconnect 0) (reconnect 0)
  | Created Deployment: "92177b32-3655-0010-67f4-c335ce81de85"
  | Desired Changes for "gateway-sb-fix-dropcopy": (place 4) (inplace 0) (destructive 0) (stop 4) (migrate 0) (ignore 2) (canary 2)

2024-07-27T01:41:39.964Z [DEBUG] nomad.job.service_sched: duplicate alloc index found and changed: eval_id=36076b6d-1bfa-40c3-1944-823c940d768b job_id=starbase-sandbox-gateway-sb-fix-dropcopy namespace=default old_alloc_name=starbase-sandbox-gateway-sb-fix-dropcopy.gateway-sb-fix-dropcopy[0] new_alloc_name=starbase-sandbox-gateway-sb-fix-dropcopy.gateway-sb-fix-dropcopy[0]
2024-07-27T01:41:39.964Z [DEBUG] nomad.job.service_sched: duplicate alloc index found and changed: eval_id=36076b6d-1bfa-40c3-1944-823c940d768b job_id=starbase-sandbox-gateway-sb-fix-dropcopy namespace=default old_alloc_name=starbase-sandbox-gateway-sb-fix-dropcopy.gateway-sb-fix-dropcopy[1] new_alloc_name=starbase-sandbox-gateway-sb-fix-dropcopy.gateway-sb-fix-dropcopy[0]
2024-07-27T01:41:39.964Z [DEBUG] nomad.job.service_sched: failed to place all allocations, blocked eval created: eval_id=36076b6d-1bfa-40c3-1944-823c940d768b job_id=starbase-sandbox-gateway-sb-fix-dropcopy namespace=default blocked_eval_id=3044ae64-b667-3f38-fe1f-ad2d78726d0a
2024-07-27T01:41:39.965Z [DEBUG] nomad.job.service_sched: setting eval status: eval_id=36076b6d-1bfa-40c3-1944-823c940d768b job_id=starbase-sandbox-gateway-sb-fix-dropcopy namespace=default status=complete

I watched the Nomad 1.6.9 server logs via the GUI monitor during a system gc, nothing relevant was printed.

Nomad CLI logs

==> 2024-07-26T18:41:42-07:00: Monitoring evaluation "77f88bee"
    2024-07-26T18:41:43-07:00: Evaluation triggered by job "starbase-sandbox-gateway-sb-fix-dropcopy"
    2024-07-26T18:41:43-07:00: Evaluation within deployment: "4dd03b4b"
    2024-07-26T18:41:43-07:00: Allocation "80bfd041" created: node "3deec5e4", group "gateway-sb-fix-dropcopy"
    2024-07-26T18:41:43-07:00: Allocation "82911693" created: node "bc1c531b", group "gateway-sb-fix-dropcopy"
    2024-07-26T18:41:43-07:00: Evaluation status changed: "pending" -> "complete"
==> 2024-07-26T18:41:43-07:00: Evaluation "77f88bee" finished with status "complete" but failed to place all allocations:
    2024-07-26T18:41:43-07:00: Task Group "gateway-sb-fix-dropcopy" (failed to place 4 allocations):
      * Class "apne1-oms-a-shared": 12 nodes excluded by filter
      * Constraint "computed class ineligible": 12 nodes excluded by filter
      * Resources exhausted on 4 nodes
      * Class "apne1-oms-a-shared" exhausted on 4 nodes
      * Dimension "network: reserved port collision gateway_log_relay=9096" exhausted on 4 nodes
    2024-07-26T18:41:43-07:00: Evaluation "622502cc" waiting for additional capacity to place remainder
==> 2024-07-26T18:41:43-07:00: Monitoring deployment "4dd03b4b"
  ⠹ Deployment "4dd03b4b" in progress...

    2024-07-26T18:41:43-07:00
    ID          = 4dd03b4b
    Job ID      = starbase-sandbox-gateway-sb-fix-dropcopy
    Job Version = 46
    Status      = running
    Description = Deployment is running pending automatic promotion

    Deployed
    Task Group               Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
    gateway-sb-fix-dropcopy  false     6        2         2       0        0          2024-07-27T02:42:42Z

=========

==> 2024-07-26T18:41:42-07:00: Monitoring evaluation "77f88bee"
    2024-07-26T18:41:43-07:00: Evaluation triggered by job "starbase-sandbox-gateway-sb-fix-dropcopy"
    2024-07-26T18:41:43-07:00: Evaluation within deployment: "4dd03b4b"
    2024-07-26T18:41:43-07:00: Allocation "80bfd041" created: node "3deec5e4", group "gateway-sb-fix-dropcopy"
    2024-07-26T18:41:43-07:00: Allocation "82911693" created: node "bc1c531b", group "gateway-sb-fix-dropcopy"
    2024-07-26T18:41:43-07:00: Evaluation status changed: "pending" -> "complete"
==> 2024-07-26T18:41:43-07:00: Evaluation "77f88bee" finished with status "complete" but failed to place all allocations:
    2024-07-26T18:41:43-07:00: Task Group "gateway-sb-fix-dropcopy" (failed to place 4 allocations):
      * Class "apne1-oms-a-shared": 12 nodes excluded by filter
      * Constraint "computed class ineligible": 12 nodes excluded by filter
      * Resources exhausted on 4 nodes
      * Class "apne1-oms-a-shared" exhausted on 4 nodes
      * Dimension "network: reserved port collision gateway_log_relay=9096" exhausted on 4 nodes
    2024-07-26T18:41:43-07:00: Evaluation "622502cc" waiting for additional capacity to place remainder
==> 2024-07-26T18:41:43-07:00: Monitoring deployment "4dd03b4b"
  ⠙ Deployment "4dd03b4b" in progress...

    2024-07-26T18:45:11-07:00
    ID          = 4dd03b4b
    Job ID      = starbase-sandbox-gateway-sb-fix-dropcopy
    Job Version = 46
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group               Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
    gateway-sb-fix-dropcopy  true      6        2         2       2        0          2024-07-27T02:46:10Z

gulducat commented 1 month ago

Hey there @thetoothpick, thanks for the report!

Seems like there are two (presumably related) topics here:

Stuck allocations
Weird deployments

Before tackling the strange deployment behavior, I'd like to get a better sense of the state of these stuck, GC-resistant allocs, and their evaluations (allocs get GC'd with their evals - they're tightly connected).

Are you able to run these commands against one such stuck alloc, and provide the output (sanitized as needed)?

nomad alloc status -json <alloc id>
# get the EvalID from that ^
nomad eval status -json <eval id>

With that, we may be able to 1) understand how they might've got stuck, and how to delete them, so the strange deployments don't occur, but also 2) perhaps I can get a test environment into the weird state to troubleshoot the deployment behavior.

If you can easily trigger one of the strange deployments, getting a eval status -json of the deployment evaluation would be great too.

In general, looking at nomad eval list and the full status of any that aren't complete might be interesting to see.

thetoothpick commented 1 month ago

Hello! I emailed some files to the nomad-oss-debug email address in the issue template.

There are three types of files, covering two different Nomad jobs.

failed-alloc: the requested command output for the "failed" allocation
replacement-alloc: the same command output, but for the currently running allocation
deploy: command output from a deployment of each job, triggered by the following:
```
nomad job inspect $job | nomad job run -json -
```
(something about the upgrade from Nomad 1.6 -> Nomad 1.8 triggered an allocation replacement with that command, I can send along a job plan if it's interesting, but I'm not too concerned)

The two jobs (1 and 2) both had failed allocations while running Nomad 1.6. With job 1, I had explicitly triggered the failure (kill -9), and had run the /evaluate HTTP endpoint while on Nomad 1.6. I had not run /evaluate on job 2. The count for the single task group in each job was 1, with a canary = 1.

After upgrading to Nomad 1.8, I gathered command output from job 1, and then tried a deploy using the above command. This deploy actually featured different behavior than I had previously seen - it still spawned two new allocations, but it listed both allocations as "placed," and was successful once both allocations went healthy. It then shut down one allocation.

I also gathered command output and tried to redeploy job 2. This exhibited the behavior I had originally seen, where the deployment thought it had only placed one allocation and hung once the second was stopped.

For both jobs, the failed allocation was only garbage collected after the deploy finished.

The main differences I found in the nomad alloc status output for the failed allocs in jobs 1 and 2 was this:

Job 1:

    "DeploymentStatus": {
        "Canary": false,
        "Healthy": true, // <-----
        "ModifyIndex": 5405158,
        "Timestamp": "2024-07-29T17:45:44.955436318Z"
    },
    "DesiredDescription": "",
    "DesiredStatus": "run",
    "DesiredTransition": {
        "Migrate": null,
        "Reschedule": null // <-----
    },

Job 2:

    "DeploymentStatus": {
        "Canary": false,
        "Healthy": false, // <-----
        "ModifyIndex": 5405704,
        "Timestamp": null
    },
    "DesiredDescription": "",
    "DesiredStatus": "run",
    "DesiredTransition": {
        "Migrate": null,
        "Reschedule": true // <-----
    },

Having found this new behavior, I would like to have gone and run an evaluation on all the relevant jobs while the cluster was on Nomad 1.6, but we've already upgraded it to 1.8 🙂. At least the deploy was able to finish in that situation.

As it stands, we're still planning on running evaluations on the problematic jobs. Please let us know if you see anything that can be used to clear the failed allocations first. Hopefully the output I sent via email will be helpful.

Thanks for looking into this!

Kevin

thetoothpick commented 4 weeks ago

@gulducat just a heads up, we went ahead with the upgrade on our prod cluster and updated all of our jobs. with manual intervention for jobs that we identified as having "stuck" failed allocations (i.e. forcing an evaluation to "replace" stuck failed allocs, and then another to clean up the extras), we were able to upgrade our services with minimal pain.

it would still be nice to look into this to ensure it won't happen again, but from our side the pain points are at least gone.

hashicorp / nomad