hashicorp / nomad


Batch job reschedule attempts sometimes not being respected #20462

Open sagebind opened 3 months ago

sagebind commented 3 months ago

Nomad version

Output from nomad version

Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Operating system and Environment details

Output from uname -a:

Linux ip-10-151-99-125 6.5.0-1017-aws #17~22.04.2-Ubuntu SMP Mon Mar 25 20:47:22 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Issue

It seems that the reschedule attempts set on a job are sometimes not respected. This job configuration definitely worked as expected when we were running Nomad 1.4 (we recently upgraded to catch up with the latest release). We run a lot of small, discrete operations as batch jobs, and we use the reschedule policy so that Nomad automatically retries an operation with a new allocation on a different node in case the underlying node had an issue, but we have attempts set to 2.

Normally, this does what we expect: if a task exits with a nonzero exit code or gets OOM killed, the allocation fails (because restart.attempts is set to zero and mode is "fail"), and up to 2 additional allocations may be created to attempt the task again. However, since upgrading to 1.7, we've noticed that this limit is sometimes not enforced, and Nomad behaves as if unlimited were set to true, which it is not. As a result, some failing tasks generate an endless stream of new allocations over time, filling up our clusters with wasted work and causing quite a few problems for us. We've had jobs with attempts set to 2 somehow end up with hundreds of failed allocations. Here's a screenshot of the dashboard for a job that had already reached 63 before we found and purged it manually:

[Screenshot 2024-04-18 at 11:18:49: Nomad dashboard showing the job's failed allocations]

So far I haven't found any rhyme or reason as to why this only happens sometimes, but over time the fraction of our cluster filled with these "zombie" jobs grows until most of our capacity is taken up by them, leaving no room for new work.

My current theory is that sometimes, when the task gets OOM killed under the docker driver, the exit code is reported as 0 and therefore doesn't increment whatever counter tracks the number of failed allocations so far. Normally, the event message is

Exit Code: 137, Exit Message: "OOM Killed"

But for these "zombie" jobs that reschedule infinitely, it seems like the event message is

Exit Code: 0, Exit Message: "OOM Killed"
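For what it's worth, a rough way to spot this pattern is to dump the task events for a failed allocation and compare the reported exit code with the OOM flag. This is just a sketch; I'm assuming the docker driver exposes exit_code and oom_killed keys in the event Details, which may not be exact:

$ nomad alloc status -json <alloc-id> | jq '.TaskStates[].Events[] | select(.Type == "Terminated") | {DisplayMessage, ExitCode: .Details.exit_code, OOMKilled: .Details.oom_killed}'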

Below are the allocations and events for the example zombie job from the reproduction steps section, which works correctly and stops after attempts reschedules (the JSON is too big to inline in a GitHub comment):

zombie-allocations.json

And here is the allocations response for a buggy job in production that spawned more allocations than its attempts limit allows:

buggy-allocations.json
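(These dumps can be pulled straight from the HTTP API; something along these lines should reproduce them, assuming NOMAD_ADDR and any ACL token are set:)

$ curl -s -H "X-Nomad-Token: $NOMAD_TOKEN" "$NOMAD_ADDR/v1/job/<job-id>/allocations" | jq . > buggy-allocations.json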

Reproduction steps

I haven't been able to consistently reproduce this issue yet, but the general idea would be to use a job like this one:

job "zombie" {
    type = "batch"
    datacenters = ["dc1"]
    group "zombie" {
        task "zombie" {
            driver = "docker"
            config {
                image = "alpine:3"
                entrypoint = ["sh"]
                args = ["-c", "sleep 60; head -c 64m /dev/zero | tail"]
                network_mode = "host"
                auth_soft_fail = true
                memory_hard_limit = 32
            }
            resources {
                cpu = 100
                memory = 16
            }
            restart {
                attempts = 0
                mode = "fail"
            }
        }
    }
    reschedule {
        attempts = 2
        interval = "168h"
        unlimited = false
    }
}
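Roughly speaking, I run the job and watch the allocation count (the file name here is arbitrary); if the bug triggers, the count keeps growing past the expected 3 total allocations (the original plus 2 reschedules), otherwise the job stops in a failed state:

$ nomad job run zombie.nomad.hcl
$ watch -n 30 'nomad job status zombie'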

Expected Result

The job should be rescheduled up to attempts times, and then stay in a failed state after that.

Actual Result

The job is rescheduled without any limit, creating far more than attempts allocations over time.

Job file (if appropriate)

We generate these jobs programmatically through the Nomad API, so we don't have HCL for them, but here is the job definition of an example job that hit this problem, as returned by the Nomad API (with some sensitive bits redacted).

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "25b3cee6-86f0-4641-b72b-7e014c4049fb",
  "ParentID": "",
  "Name": "{redacted}",
  "Type": "batch",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "NodePool": "default",
  "Constraints": [
    {
      "LTarget": "${attr.cpu.arch}",
      "RTarget": "amd64",
      "Operand": "set_contains_any"
    },
    {
      "LTarget": "${attr.driver.docker.os_type}",
      "RTarget": "linux",
      "Operand": "="
    },
    {
      "LTarget": "${meta.network_type}",
      "RTarget": "public",
      "Operand": "!="
    },
    {
      "LTarget": "${attr.unique.storage.volume}",
      "RTarget": "^/dev/nvme.+",
      "Operand": "regexp"
    }
  ],
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "executor",
      "Count": 1,
      "Update": null,
      "Migrate": null,
      "Constraints": null,
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 86400000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "Tasks": [
        {
          "Name": "executor",
          "Driver": "docker",
          "User": "",
          "Config": {
            "network_mode": "host",
            "memory_hard_limit": 8192,
            "mounts": [
              {
                "type": "bind",
                "source": "tmp",
                "target": "/tmp"
              },
              {
                "source": "tmp",
                "target": "/var/tmp",
                "type": "bind"
              }
            ],
            "work_dir": "/alloc/data",
            "image": "{redacted}"
          },
          "Env": {
            "_": "{redacted}"
          },
          "Services": null,
          "Vault": null,
          "Consul": null,
          "Templates": [
            {
              "SourcePath": "",
              "DestPath": "local/args.json",
              "EmbeddedTmpl": "{redacted}",
              "ChangeMode": "noop",
              "ChangeSignal": "",
              "ChangeScript": null,
              "Splay": 5000000000,
              "Perms": "0644",
              "Uid": null,
              "Gid": null,
              "LeftDelim": "\"\"\"",
              "RightDelim": "\"\"\"",
              "Envvars": false,
              "VaultGrace": 0,
              "Wait": null,
              "ErrMissingKey": false
            }
          ],
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 2000,
            "Cores": 0,
            "MemoryMB": 2048,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null,
            "NUMA": null
          },
          "RestartPolicy": {
            "Attempts": 0,
            "Interval": 86400000000000,
            "Delay": 15000000000,
            "Mode": "fail",
            "RenderTemplates": false
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 1,
            "MaxFileSizeMB": 10,
            "Disabled": false
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null,
          "Identity": {
            "Name": "default",
            "Audience": [
              "nomadproject.io"
            ],
            "ChangeMode": "",
            "ChangeSignal": "",
            "Env": false,
            "File": false,
            "ServiceName": "",
            "TTL": 0
          },
          "Identities": null,
          "Actions": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 50,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 2,
        "Interval": 604800000000000,
        "Delay": 5000000000,
        "DelayFunction": "constant",
        "MaxDelay": 0,
        "Unlimited": false
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Consul": {
        "Namespace": "",
        "Cluster": "default",
        "Partition": ""
      },
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null,
      "PreventRescheduleOnLost": false
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": {
    "_": "{redacted}"
  },
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1713431957190805800,
  "CreateIndex": 25929346,
  "ModifyIndex": 30491074,
  "JobModifyIndex": 25929346
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

sagebind commented 2 months ago

Digging into this a bit more: I'm not familiar with the codebase, but the fact that RescheduleTracker is getting reset to null on some of these reschedules seems dubious. Here are just the RescheduleTrackers of the allocations of the job above that exhibits the buggy behavior:

{
  "ID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
  "RescheduleTracker": null
}
{
  "ID": "b550b588-fcf3-8431-e00d-c1271e307dd0",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713540616182219192,
        "PrevAllocID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
        "PrevNodeID": "2d4576df-c1f6-8b8a-a0da-ca05123cf849",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
  "RescheduleTracker": null
}
{
  "ID": "46eb89fa-ef85-19cc-b79a-c60edf63ba01",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713540907034600526,
        "PrevAllocID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
        "PrevNodeID": "7349b23c-b638-15d2-ec50-f9ca5e4103f3",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "51147d13-cef8-a522-091e-8ede42887a17",
  "RescheduleTracker": null
}
{
  "ID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
  "RescheduleTracker": null
}
{
  "ID": "1cd8ae47-39d8-d5e1-7b34-a42678fcee66",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541254356376383,
        "PrevAllocID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
        "PrevNodeID": "5b244e29-d814-84dd-d83c-a235ada79a3e",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
  "RescheduleTracker": null
}
{
  "ID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541591399485131,
        "PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
        "PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "92022347-e05a-c3b2-877f-ba3458fb00fc",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541591399485131,
        "PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
        "PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
        "Delay": 5000000000
      },
      {
        "RescheduleTime": 1713541647009844424,
        "PrevAllocID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
        "PrevNodeID": "214d3a1c-3b69-8d67-c707-812bf8a943c6",
        "Delay": 5000000000
      }
    ]
  }
}

Normally, it seems, RescheduleTracker.Events should have an event appended for each prior allocation until the length of the list reaches the reschedule Attempts value. But because the tracker is getting reset to null for some reason, the list never reaches Attempts events, or only does so eventually (by chance?).
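(To reproduce the listing above from the attached dumps, a jq filter roughly like this is all it takes; the per-allocation event count is what should stay at or below Attempts:)

$ jq '.[] | {ID, RescheduleTracker}' buggy-allocations.json
$ jq '.[] | {ID, Events: (.RescheduleTracker.Events // [] | length)}' buggy-allocations.json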

tgross commented 2 months ago

Hi @sagebind! Any chance that these allocations failed a placement on reschedule, and then got placed? There's a known issue around that, which I have a test for in https://github.com/hashicorp/nomad/pull/12319 but we never got back to figuring out how to actually fix it.

You should be able to verify this by looking at the evaluation history for the job.
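For example, something along these lines should show it (the placeholders are of course your own job and evaluation IDs); a blocked evaluation with placement failures recorded would indicate that case:

$ nomad eval list -job <job-id>
$ nomad eval status <eval-id>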

sagebind commented 2 months ago

@tgross Here's the evaluation history for a couple of jobs that placed 7 allocations even though Attempts was set to 2, along with their allocations:

5df4fd90-d0b3-46e1-992d-45f61b08c537-allocations.json 5df4fd90-d0b3-46e1-992d-45f61b08c537-evaluations.json f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json f290af7c-5bb9-41da-b3db-d643f8e70600-evaluations.json

The only evaluation trigger I see is alloc-failure, and there are no placement failures. Everything looks pretty ordinary, other than the fact that there are too many of them.
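(A check along these lines over the attached evaluation dumps shows the same thing; TriggeredBy is alloc-failure throughout and FailedTGAllocs is never populated:)

$ jq '.[] | {ID, TriggeredBy, Status, FailedTGAllocs}' f290af7c-5bb9-41da-b3db-d643f8e70600-evaluations.json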

tgross commented 2 months ago

Just a heads up that I'm digging into this. The cause of the problem seems straightforward and tied up in the logic we use for marking allocations for reschedule, but the solution is going to take a bit of figuring out. I'll update here as I know more.

gibbonsjohnm commented 1 month ago

We are seeing something tangentially related to this. Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing. We haven't identified any rhyme or reason for why this occurs, but I'm wondering if we're dealing with the same issue identified above.

tgross commented 1 month ago

Just wanted to drop an update here. I've got a working solution to the problem described in #12147 and #17072, but @sagebind's report is slightly different because there's no failed placement visible. I strongly suspect the solution there will also fix this problem because of how we attach the reschedule tracker, but until I can reproduce this problem I don't feel confident in closing this out with it.

One of the issues I'm running into with the available evidence is that some of the evaluations in the "chain" of evaluations are missing. That shouldn't matter here, as all the evaluations that are missing should be the first eval after the allocation failure, which then creates a new eval to be run after the reschedule.delay.

For job f290af7c-5bb9-41da-b3db-d643f8e70600, the allocs 05cac123 and eb6eeeb9 have a chain that's intact but shows the dropped reschedule tracker.

example allocs

```
$ cat f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json | jq '.[] | select(.ID == "05cac123-d5e1-b767-8e13-32fca12970fa") | {ID: .ID, RescheduleTracker: .RescheduleTracker, FollowupEvalID: .FollowupEvalID, DesiredDescription: .DesiredDescription, EvalID: .EvalID, NextAllocation: .NextAllocation}'
{
  "ID": "05cac123-d5e1-b767-8e13-32fca12970fa",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1714060539543172000,
        "PrevAllocID": "dceaf833-8e6d-db00-9280-a2572e09fd4d",
        "PrevNodeID": "b3d352a2-8c4f-556c-df8c-8c70d5271779",
        "Delay": 15000000000
      }
    ]
  },
  "FollowupEvalID": "e218c7d6-3acf-7232-e8c0-65b9be87b2f4",
  "DesiredDescription": "alloc was rescheduled because it failed",
  "EvalID": "6c398c16-af9c-ad61-9f59-cd7af671f30a",
  "NextAllocation": ""
}

$ cat f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json | jq '.[] | select(.ID == "eb6eeeb9-e357-3f47-b447-aa3b8a4d52d7") | {ID: .ID, RescheduleTracker: .RescheduleTracker, FollowupEvalID: .FollowupEvalID, DesiredDescription: .DesiredDescription, EvalID: .EvalID, NextAllocation: .NextAllocation}'
{
  "ID": "eb6eeeb9-e357-3f47-b447-aa3b8a4d52d7",
  "RescheduleTracker": null,
  "FollowupEvalID": "00a7ca3b-8ff0-fa0a-2e37-7297769153c3",
  "DesiredDescription": "alloc was rescheduled because it failed",
  "EvalID": "e218c7d6-3acf-7232-e8c0-65b9be87b2f4",
  "NextAllocation": "819e549e-7b37-7329-46c1-1e0809c18f1b"
}
```

I'm still working on trying to figure out the code path that can reproduce this.

@gibbonsjohnm wrote:

Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing.

It might or might not be related. Just to clarify: you mean you've got a reschedule block like this (as documented), right?

reschedule {
  attempts  = 0
  unlimited = false
}

If so, it would be helpful if you could provide the allocations and evaluations list like @sagebind did above. Debug-level server log snippets from when the evaluations were processed would be helpful too.
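Something like the following would cover it; the curl calls are the same per-job dumps as above, and (if I recall the flags right) nomad operator debug can capture debug-level agent logs for a window even if the servers aren't configured at that level:

$ curl -s "$NOMAD_ADDR/v1/job/<job-id>/allocations" > allocations.json
$ curl -s "$NOMAD_ADDR/v1/job/<job-id>/evaluations" > evaluations.json
$ nomad operator debug -log-level=DEBUG -duration=2m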

tgross commented 1 month ago

I'm at the end of my week here, but wanted to leave one last update. It looks like in https://github.com/hashicorp/nomad/commit/e8efe2d251bf3628f13c7eb3ce2422eb7e5b85f6 we introduced a code path where "untainted" allocs that are disconnecting should pass through the reschedule check, but in practice untainted allocs that aren't disconnecting now also pass through the reschedule check. This could potentially cause a spurious reschedule, though I still need to figure out the exact circumstances under which that could happen. The fix for that would be:

diff --git a/scheduler/reconcile_util.go b/scheduler/reconcile_util.go
index 09cc0e42f5..661db532ec 100644
--- a/scheduler/reconcile_util.go
+++ b/scheduler/reconcile_util.go
@@ -414,6 +414,7 @@ func (a allocSet) filterByRescheduleable(isBatch, isDisconnecting bool, now time
                isUntainted, ignore := shouldFilter(alloc, isBatch)
                if isUntainted && !isDisconnecting {
                        untainted[alloc.ID] = alloc
+                       continue // TODO: right?
                }

                if ignore {

But right now that change doesn't cause any tests to fail, which at least suggests a gap in test coverage may be why we didn't see this earlier. The only way that explains the failed allocs not getting their reschedule tracker is if they don't get marked reschedulable in that code path and just get dropped into the "untainted" list. That would cause a later stage of the reconciler to determine we were short healthy allocs and backfill them. I'll pick this all back up on Monday.

tgross commented 1 month ago

Ok folks, I've got https://github.com/hashicorp/nomad/pull/12319 marked as ready for review to cover the similar-but-not-quite identical case, and that should ship in the next regular version of Nomad.

For the remaining cases described in this issue, I have a high degree of confidence at this point that we're not going anywhere near the reschedule logic. Instead, the scheduler is considering the job to be "underprovisioned", so it's creating allocations that it doesn't consider to be replacements for existing allocations at all. That is, the scheduler isn't saying "I have a count=3 but one alloc failed, so make a new alloc to replace it" and is instead saying "I have count=3 and 2 running allocs, so place a new alloc". I admit this sounds only subtly different but the internals of the scheduler treat those cases very differently.

Unfortunately I don't have a reproduction for that yet. What would help me at this point, if you're seeing allocations being replaced without a reschedule tracker and without a placement failure during rescheduling, or completed allocations being replaced, is to post the following:


@sagebind I did a little digging on the OOM exit codes being reported as 0, and that looks like a case of https://github.com/hashicorp/nomad/issues/13119, which we're aware of and are waiting on an upstream fix for.

tgross commented 1 month ago

If you're reading this to report a similar issue and haven't followed along with the above: the key factor in this issue is that the replacement happens without any apparent attempt to reschedule. If you have seen placement failures during rescheduling, then you're seeing #17072 / #12147, which we already have a patch for.

gibbonsjohnm commented 1 month ago

@tgross wrote:

Just to clarify: you mean you've got a reschedule block like this (as documented), right?

reschedule {
  attempts  = 0
  unlimited = false
}

If so, it would be helpful if you could provide the allocations and evaluations list like @sagebind did above. Debug-level server log snippets from when the evaluations were processed would be helpful too.

We do have that reschedule block, yes. If the issue happens again, I will provide the data requested. For what it's worth, these are batch jobs.