avandierast closed this issue 4 months ago.
Hey @avandierast
Thanks for bringing this to our attention! We'll look into this and let you know what we find 👍
Hi @avandierast! I just tested this myself and wasn't able to reproduce any unexpected behavior. There are two things that might be at play here:
The first is the easy one: failed dispatched jobs don't get GC'd until they reach the `job_gc_threshold` (see https://github.com/hashicorp/nomad/issues/11256). So when you're looking at the eval list in the UI, make sure those "extra" evals you're seeing are actually for the correct dispatched job. If you use `nomad eval list` it should make that fairly clear.
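For reference, `job_gc_threshold` lives in the server block of the agent configuration; here's a minimal sketch (the `4h` value shown is just the default, not something specific to this thread):

```hcl
server {
  enabled = true

  # Dead jobs (including failed dispatched jobs) and their evaluations are only
  # garbage-collected once they've been terminal for longer than this threshold.
  job_gc_threshold = "4h"
}
```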
The second is a little more complicated. There's a subtle difference between rescheduling and re-evaluating that isn't particularly clear in the docs. So I just want to make sure I understand which of the two you're seeing. Supposing I have clients with only 1000MB of memory available, if I try to dispatch the following job:
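(The exact jobspec from this comment isn't preserved here; the block below is a hypothetical reconstruction for illustration only, assuming a parameterized batch job whose memory request exceeds the 1000MB the clients have, so it can never be placed:)

```hcl
# Hypothetical reconstruction for illustration only.
job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {}

  group "example" {
    task "example" {
      driver = "docker"

      config {
        image   = "alpine"
        command = "false"
      }

      resources {
        # Deliberately larger than the 1000MB any client can offer,
        # so the scheduler can never place the allocation.
        memory = 2000
      }
    }
  }
}
```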
I'll end up with the following results:
$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1646687115-e773ac03
Evaluation ID = e09ee549
==> 2022-03-07T16:05:16-05:00: Monitoring evaluation "e09ee549"
2022-03-07T16:05:16-05:00: Evaluation triggered by job "example/dispatch-1646687115-e773ac03"
==> 2022-03-07T16:05:17-05:00: Monitoring evaluation "e09ee549"
2022-03-07T16:05:17-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-03-07T16:05:17-05:00: Evaluation "e09ee549" finished with status "complete" but failed to place all allocations:
2022-03-07T16:05:17-05:00: Task Group "example" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Class "vagrant" exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
2022-03-07T16:05:17-05:00: Evaluation "2e310f97" waiting for additional capacity to place remainder
This evaluation created a "blocked eval" (the eval `2e310f97`) and the scheduler workers will re-evaluate it whenever resources have changed on the cluster. This blocked eval is not subject to the `reschedule` limit because the allocation has never been placed!
The `reschedule` block only impacts allocations that have actually been placed on a host and then subsequently failed. So if we change the `resources` block to the following:
resources {
memory = 100
}
Then I get the following dispatched job:
$ nomad job status example/dispatch-1646687560-7c3ee652
ID = example/dispatch-1646687560-7c3ee652
Name = example/dispatch-1646687560-7c3ee652
Submit Date = 2022-03-07T16:12:40-05:00
Type = batch
Priority = 50
Datacenters = dc1
Namespace = default
Status = dead
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
example 0 0 0 2 0 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
1617cbdc ce37014d example 0 run failed 38s ago 37s ago
35b7559e a1ee6964 example 0 stop failed 43s ago 38s ago
So there's exactly one rescheduled allocation, and the allocation status for that final alloc looks like the following. Note the "Reschedule Attempts" line.
$ nomad alloc status 1617
ID = 1617cbdc-5cd9-9b55-c0c1-3ce3f48781ee
Eval ID = 54f5f9d3
Name = example/dispatch-1646687560-7c3ee652.example[0]
Node ID = ce37014d
Node Name = nomad-client1
Job ID = example/dispatch-1646687560-7c3ee652
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 1m10s ago
Modified = 1m9s ago
Reschedule Attempts = 1/1
Task "example" is "dead"
Task Resources
CPU Memory Disk Addresses
0/100 MHz 0 B/100 MiB 300 MiB
Task Events:
Started At = 2022-03-07T21:12:46Z
Finished At = 2022-03-07T21:12:46Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2022-03-07T16:12:46-05:00 Not Restarting Policy allows no restarts
2022-03-07T16:12:46-05:00 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-03-07T16:12:46-05:00 Started Task started by client
2022-03-07T16:12:46-05:00 Driver Downloading image
2022-03-07T16:12:46-05:00 Task Setup Building Task Directory
2022-03-07T16:12:46-05:00 Received Task received by client
My evaluations list looks like the following:
$ nomad eval list
ID Priority Triggered By Job ID Status Placement Failures
54f5f9d3 50 alloc-failure example/dispatch-1646687560-7c3ee652 complete false
929282a3 50 alloc-failure example/dispatch-1646687560-7c3ee652 complete false
4992c684 50 job-register example/dispatch-1646687560-7c3ee652 complete false
Hello @tgross, thanks for your help analysing this problem :)
The screen I've pasted at the end of my first post is the equivalent of `nomad eval list` for one of the dispatched jobs.
How I interpret the eval list is the following; tell me if I'm wrong:
- when there is triggered-by==alloc-failure and placement-failures==true, the alloc is queued and there is a re-evaluation.
- when there is triggered-by==alloc-failure and placement-failures==false, the job has failed and thus it will be re-scheduled.

If this is correct, we can see on my screenshot that there are 3 reschedules, but my job is configured to allow only one. And thus the underlying Docker container of the task is run 3 times. The first time we encountered this problem, on a more complex infrastructure, the Docker container was re-run indefinitely.
We see this behaviour when resources are scarce and the evaluation after a reschedule fails. If the placements are successful two times in a row, then the job stops correctly after two failures.
If my example with 4 simultaneous batch jobs is not enough to produce a failed evaluation after a reschedule, I think that running more than 4, maybe 10 or even more, should trigger the problem.
> How I interpret the eval list is the following; tell me if I'm wrong:
> - when there is triggered-by==alloc-failure and placement-failures==true, the alloc is queued and there is a re-evaluation.
> - when there is triggered-by==alloc-failure and placement-failures==false, the job has failed and thus it will be re-scheduled.
Pretty close! Any time an allocation fails and it's eligible for a reschedule (ref `node_endpoint.go#L1150` and `structs.go#L9580-L9600`), an evaluation is queued. The "eligible for reschedule" is doing a bit of work in that sentence, but it includes a job that hasn't been rescheduled more than its `reschedule.attempts` and isn't a system job.
So in both those cases there's been an allocation that failed and an evaluation created, which goes into the scheduler. The `placement-failures` value is probably an unfortunate name, because we often use "placement failure" to mean the placement on the client. But what that value means is the scheduler placement failure. As in, the scheduler tried to find a place for the allocation and could not. So it creates a new "blocked" eval, and those are the evaluations that are shown in that UI view as `queued-allocs`.
That means that a single allocation reschedule attempt could have 3 (or more!) evaluations, and they can transition between different states. The timeline might look like:

1. The failed allocation results in a `(*Node) UpdateAlloc` RPC to the server, which creates an evaluation triggered by `alloc-failure`. Its `placement-failures` is false, because it hasn't been processed yet!
2. The scheduler processes that evaluation and can't place the allocation, so `placement-failures` gets set to `true` and a blocked evaluation gets created (shown in the UI as `queued-allocs`).
3. That blocked evaluation's `placement-failures` will be false. If the scheduler processes it and still can't place the allocation, it ends up with `placement-failures` true, at which point another blocked evaluation will get created.

That's 3 evaluations for a single reschedule check. It isn't until the new allocation is created and placed that we treat it as one "attempt" for purposes of `reschedule.attempts`.
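A quick way to watch those state transitions from the CLI, sketched here with an eval ID from the output above as a placeholder:

```sh
# List evaluations with their trigger, status, and placement-failures flag.
nomad eval list

# Inspect a single evaluation in detail; substitute an ID from the list above.
nomad eval status -verbose e09ee549
```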
Thanks for the detailed explanation :)
Yes, I see there are 3 evaluations in the list for one real attempt. I didn't mention the queued-allocs since it doesn't give much more information beyond a placement-failure=true for my problem. If you look at my eval list in the first post, there are 10 evaluations: 3×3 + 1, that is 3 evaluations per attempt where there was a failing placement, plus the last failing attempt. This is the eval list for only one dispatched job. And I really need there to be only 2 attempts, not 4.
If this eval list isn't enough to show that there are more than 2 attempts, I can make another example where the container writes into a file and we can see that there are too many attempts.
The nomad configuration file for the agent:
data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"
bind_addr = "0.0.0.0"
datacenter = "dc1"
plugin "docker" {
config {
volumes {
enabled = true
}
}
}
server {
enabled = true
bootstrap_expect = 1
}
client {
enabled = true
host_volume "host_files" {
path = "/opt/nomad/host_files"
read_only = false
}
}
The modified job file to have a container that writes to a file on the host:
job "example" {
parameterized {
}
datacenters = ["dc1"]
type = "batch"
group "example" {
reschedule {
attempts = 1
interval = "24h"
unlimited = false
}
restart {
attempts = 0
mode = "fail"
}
volume "host_files" {
type = "host"
read_only = false
source = "host_files"
}
task "example" {
driver = "docker"
volume_mount {
volume = "host_files"
destination = "/host_files"
read_only = false
}
config {
image = "alpine"
entrypoint = ["/bin/sh", "-c", "echo ${NOMAD_JOB_ID} >> /host_files/data && sleep 10 && false"]
}
resources {
memory = 10000
}
}
}
}
My list of jobs:
╰─ nomad status
ID Type Priority Status Submit Date
example batch/parameterized 50 running 2022-03-16T11:19:42+01:00
example/dispatch-1647425989-04c89364 batch 50 pending 2022-03-16T11:19:49+01:00
example/dispatch-1647425989-4e26f991 batch 50 pending 2022-03-16T11:19:49+01:00
example/dispatch-1647425989-5cbba383 batch 50 running 2022-03-16T11:19:49+01:00
example/dispatch-1647425989-bd4607e2 batch 50 pending 2022-03-16T11:19:49+01:00
And after some time with the 4 dispatched jobs, the file on the host:
╰─ cat /opt/nomad/host_files/data
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
And I can let it run for a long time; it will continue to make attempts and write job IDs to the file...
> If this eval list isn't enough to show that there are more than 2 attempts, I can make another example where the container writes into a file and we can see that there are too many attempts.
It might be, but it's hard to map the screenshot to jobs without having the list of allocations for the job as well. So the example you've given is very helpful.
> And after some time with the 4 dispatched jobs, the file on the host:
Ok, that's very interesting, as I see you've got 7 allocation runs per job there in the output. To try to replicate, I took your jobspec and I modified it in three small ways:

- changed `volume.host_files.source` for my cluster.
- reduced `resources.memory` to fit on my client (otherwise dispatch always fails).
- kept the `restart.attempts = 0` (but hey, as long as we suspect there's buggy behavior anyways, we should make sure it's not there).

I register that job and dispatch it 4 times. I end up with the expected list of job status and evaluations:
And the contents of the data file are exactly what I'd expect. 2 allocations run for each job:
job example/dispatch-1647436692-731197a9 - alloc 3cdff6bf-181b-b914-558b-d6083d34c78d
job example/dispatch-1647436697-5400bcb9 - alloc 44e49152-57cf-75bd-ad0d-0855e560432c
job example/dispatch-1647436699-06971e3d - alloc d92df661-489f-2000-84d4-3280bf9d9a76
job example/dispatch-1647436701-3cd48692 - alloc 860a6808-10ed-04f5-9758-aba6f200b4e6
job example/dispatch-1647436692-731197a9 - alloc 09fbb62e-b2a1-5e22-eb9e-22174e9fc2d9
job example/dispatch-1647436697-5400bcb9 - alloc c04bdf71-d196-ce15-6f63-3f37c3e18390
job example/dispatch-1647436699-06971e3d - alloc 4dd0ac98-cb55-9cda-2570-41614bedbf71
job example/dispatch-1647436701-3cd48692 - alloc 65857c11-ebd6-7aa4-1969-f39ad86a0890
I've double-checked this with both Nomad 1.2.6 and the current `main` and get the same result reliably. I'm assuming your Nomad binary is the one packaged by HashiCorp, and not something from your distro, right? Although I have no reason to believe it makes a difference, I've also tried with a standalone single server, in `-dev` mode, and with a full 3 server cluster + client. Works every time. :confused:
But it sounds like you're seeing this problem consistently at your end too, which suggests it's not some weird concurrency problem.
The job never ends up getting marked as `dead`? Can you run with debug logs and then run a single dispatch, and watch the logs until it's done at least one attempt it shouldn't have? Maybe there's something in the logs that'll give us a clue here.
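For example, one way to get those debug logs (not the only way):

```sh
# Start a throwaway dev agent with debug logging enabled...
nomad agent -dev -log-level=DEBUG

# ...or stream debug-level logs from an agent that's already running.
nomad monitor -log-level=DEBUG
```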
I have also tried today on a MacBook Pro with the Nomad binary from HashiCorp, and it looped as expected :sweat_smile:
What strikes me about the result of your `nomad eval list` is that you never have any placement failures, and the bug only appears when there is a placement failure after a failing attempt.
The memory limit was set to 10GB on my job because it is more than half of what my computer has. Thus the memory is exhausted after the placement of the first dispatched job, and the evaluations for the three other dispatched jobs fail.
I've used your job entrypoint. I dispatched one job alone and it worked well. Then I dispatched 4 at the same time and got back the infinite loop of attempts: they are never dead.
We can see that it worked well when the job was dispatched once, but not when there were 4:
I've attached the logs of Nomad if you can find anything useful to understand what is happening: nomad.log
I'm sorry, I closed the Nomad agent and forgot to call `nomad eval list` :(
> What strikes me about the result of your `nomad eval list` is that you never have any placement failures, and the bug only appears when there is a placement failure after a failing attempt. The memory limit was set to 10GB on my job because it is more than half of what my computer has. Thus the memory is exhausted after the placement of the first dispatched job, and the evaluations for the three other dispatched jobs fail.
Ah, yes, that's what I've been missing! I was finally able to reproduce by setting `memory = 1200` (my node has 1983MB available) and then dispatching a handful in a row without waiting. I end up with lots and lots of evals, and jobs that get rescheduled multiple times:
$ nomad job status example/dispatch-1647539230-3e910a86
ID = example/dispatch-1647539230-3e910a86
Name = example/dispatch-1647539230-3e910a86
Submit Date = 2022-03-17T13:47:10-04:00
Type = batch
Priority = 50
Datacenters = dc1
Namespace = default
Status = pending
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
example 1 0 0 4 0 0
Placement Failure
Task Group "example":
* Resources exhausted on 1 nodes
* Dimension "memory" exhausted on 1 nodes
Allocations
ID Node ID Task Group Version Desired Status Created Modified
05cdd5cd 8b4cb8ac example 0 stop failed 23s ago 7s ago
1f785e6a 8b4cb8ac example 0 stop failed 1m17s ago 1m1s ago
fc650e09 8b4cb8ac example 0 stop failed 2m10s ago 1m55s ago
1b4b7597 8b4cb8ac example 0 stop failed 3m4s ago 2m49s ago
I went through the logs you provided with a fine-toothed comb and there's enough interleaving to make it hard to see where the issue is. Fortunately, I was able to reproduce with just two dispatches. I've broken that down by hand into a detailed walk through the life of the evaluations (which I've left at the bottom of the page in a `<details>` block because there's quite a bit of it).
The tl;dr of all that is that when we process an eval for a rescheduleable alloc we create a "follow-up" eval. If that eval ends up getting blocked, it looks like we're losing track of reschedules at that point. When that happens and we get the `(*Node) UpdateAlloc` for the failed allocation, the check for rescheduling eligibility (ref `node_endpoint.go#L1150` and `structs.go#L9580-L9600`) passes when it should not. This suggests to me that the `RestartTracker` that we attach to the allocation is getting wiped somehow. I haven't figured that out yet and I'm getting towards the end of my day, so I'll have to pick up the investigation from there tomorrow.
In any case, I suspect that the reason this is getting into a loop is that the two (or more) dispatched jobs just end up interleaving evaluations and so they keep knocking each other into the bad state.
@avandierast I think I've got enough here to try to write a unit test in the `scheduler` package that'll reproduce this. I'll make an attempt at that tomorrow (assuming no other interruptions, of course 😁).
Ok, I've got a failing unit test in https://github.com/hashicorp/nomad/pull/12319. While working on that I realized this bug is likely to impact a feature that my colleague @DerekStrickland is currently working on, so I'm going to pair up with him to work on a solution starting early next week.
Hello @tgross. Happy to hear that you were able to reproduce :)
I think I may have found a similar situation that leads to the opposite result. In my case it is also a batch job (meant to be run daily).
The job "hangs" at the templating step (in my case, because it can't find a Vault-based secret that isn't there) and never retries.
I mean:
Can this behaviour be another aspect of this issue?
The `restart` stanza from my job follows (while testing this I set the job to run every 5 minutes). The current restart mode is "fail", but I also tested with `mode = "delay"` with the same results:
periodic {
cron = "*/5 * * * *"
prohibit_overlap = "true"
}
(...)
restart {
interval = "5m"
attempts = 3
delay = "30s"
mode = "fail"
}
Hi @next-jesusmanuelnavarro! The `restart` block is all about restarting on the same client without going back to the scheduler. This issue is about the `reschedule` block, which is what happens after we give up on `restart` and have to go back to the scheduler. So that's unrelated, and I've split it out to https://github.com/hashicorp/nomad/issues/12478
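To make the distinction concrete, here's an illustrative sketch of a group carrying both blocks (the values are examples only): `restart` governs in-place retries on the same client, while `reschedule` governs placing a replacement allocation via the scheduler after restarts are exhausted.

```hcl
job "restart-vs-reschedule" {
  datacenters = ["dc1"]
  type        = "batch"

  group "example" {
    # restart: local retries on the same client, handled by the client itself,
    # without ever going back to the scheduler.
    restart {
      attempts = 0
      mode     = "fail"
    }

    # reschedule: once restarts are exhausted and the allocation fails, the
    # scheduler may place a replacement allocation, up to this many times
    # per interval.
    reschedule {
      attempts  = 1
      interval  = "24h"
      unlimited = false
    }

    task "example" {
      driver = "docker"
      config {
        image = "alpine"
      }
    }
  }
}
```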
@avandierast the bug didn't turn out to be related to Derek's work, which just got merged. But we had an internal discussion about the bug and I think there's a couple approaches we can try, so I'll try to circle back to this issue shortly.
Quick update on this. I've got a working patch at https://github.com/hashicorp/nomad/pull/12319, but there's a lingering issue described in https://github.com/hashicorp/nomad/issues/20462#issuecomment-2142738356 which may be interrelated and I want to make sure I've done my due diligence here before trying to close this out.
This issue should be resolved in https://github.com/hashicorp/nomad/pull/12319
Nomad version
Operating system and Environment details
Fedora 35 with `nomad agent --dev`. Also reproduced on Ubuntu 20.04 with the Nomad server and client on two separate VMs.
Issue
When there is a placement failure of an allocation after a reschedule, the reschedule attempts limit of the task group is not respected at all. We have this problem with parameterized batch jobs; we don't know if it's the same with other job types.
Reproduction
Run agent:
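Per the environment details above, a dev-mode agent is enough, for example:

```sh
nomad agent -dev
```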
Run a parameterized batch job with restart attempts = 0 and reschedule attempts > 0 that exhausts your resources and always fails. For example, with the following job definition (on a computer with less than 20GB of RAM, so the node is exhausted by the 10GB resource reservation):
Dispatch that job several times to make placement failures on reschedule.
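Concretely, assuming the job definition is saved as `example.nomad` (the filename is just an illustration), the steps above look something like:

```sh
# Register the parameterized job; it won't run anything until dispatched.
nomad job run example.nomad

# Dispatch it several times in quick succession so that later dispatches hit
# placement failures while earlier ones are still being rescheduled.
nomad job dispatch example
nomad job dispatch example
nomad job dispatch example
nomad job dispatch example
```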
Expected Result
The tasks are rescheduled the number of attempts defined and then die, regardless of whether there was a placement failure between the attempts. For our example, we expect the container to be run only twice per dispatch.
Actual Result
The tasks are rescheduled many times before dying, and sometimes never die.
We can see the problem easily in the Nomad UI in the job's evaluations tab. The job has a reschedule attempts limit of 1 per 24h, but the dispatched job has 4 failures (alloc-failure with placement failures set to False):