avandierast closed this issue 4 months ago.
Hey @avandierast
Thanks for bringing this to our attention! We'll look into this and let you know what we find 👍
Hi @avandierast! I just tested this myself and wasn't able to reproduce any unexpected behavior. There are two things that might be at play here:
The first is the easy one: failed dispatched jobs don't get GC'd until they reach the `job_gc_threshold` (see https://github.com/hashicorp/nomad/issues/11256). So when you're looking at the eval list in the UI, make sure those "extra" evals you're seeing are actually for the correct dispatched job. If you use `nomad eval list` it should make that fairly clear.
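For reference, `job_gc_threshold` lives in the server block of the agent configuration; here's a minimal sketch (the `4h` value shown is just the default, not something specific to this thread):

```hcl
server {
  enabled = true

  # Dead jobs (including failed dispatched jobs) and their evaluations are only
  # garbage-collected once they've been terminal for longer than this threshold.
  job_gc_threshold = "4h"
}
```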
The second is a little more complicated. There's a subtle difference between rescheduling and re-evaluating that isn't particularly clear in the docs. So I just want to make sure I understand which of the two you're seeing. Supposing I have clients with only 1000MB of memory available, if I try to dispatch the following job:
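(The exact jobspec from this comment isn't preserved here; the block below is a hypothetical reconstruction for illustration only, assuming a parameterized batch job whose memory request exceeds the 1000MB the clients have, so it can never be placed:)

```hcl
# Hypothetical reconstruction for illustration only.
job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {}

  group "example" {
    task "example" {
      driver = "docker"

      config {
        image   = "alpine"
        command = "false"
      }

      resources {
        # Deliberately larger than the 1000MB any client can offer,
        # so the scheduler can never place the allocation.
        memory = 2000
      }
    }
  }
}
```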
I'll end up with the following results:
$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1646687115-e773ac03
Evaluation ID = e09ee549
==> 2022-03-07T16:05:16-05:00: Monitoring evaluation "e09ee549"
2022-03-07T16:05:16-05:00: Evaluation triggered by job "example/dispatch-1646687115-e773ac03"
==> 2022-03-07T16:05:17-05:00: Monitoring evaluation "e09ee549"
2022-03-07T16:05:17-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-03-07T16:05:17-05:00: Evaluation "e09ee549" finished with status "complete" but failed to place all allocations:
2022-03-07T16:05:17-05:00: Task Group "example" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Class "vagrant" exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
2022-03-07T16:05:17-05:00: Evaluation "2e310f97" waiting for additional capacity to place remainder
This evaluation created a "blocked eval" (the eval `2e310f97`) and the scheduler workers will re-evaluate it whenever resources have changed on the cluster. This blocked eval is not subject to the `reschedule` limit because the allocation has never been placed!
The `reschedule` block only impacts allocations that have actually been placed on a host and then subsequently failed. So if we change the `resources` block to the following:
resources {
memory = 100
}
Then I get the following dispatched job:
$ nomad job status example/dispatch-1646687560-7c3ee652
ID = example/dispatch-1646687560-7c3ee652
Name = example/dispatch-1646687560-7c3ee652
Submit Date = 2022-03-07T16:12:40-05:00
Type = batch
Priority = 50
Datacenters = dc1
Namespace = default
Status = dead
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
example 0 0 0 2 0 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
1617cbdc ce37014d example 0 run failed 38s ago 37s ago
35b7559e a1ee6964 example 0 stop failed 43s ago 38s ago
So there's exactly one rescheduled allocation, and the allocation status for that final alloc looks like the following. Note the "Reschedule Attempts" line.
$ nomad alloc status 1617
ID = 1617cbdc-5cd9-9b55-c0c1-3ce3f48781ee
Eval ID = 54f5f9d3
Name = example/dispatch-1646687560-7c3ee652.example[0]
Node ID = ce37014d
Node Name = nomad-client1
Job ID = example/dispatch-1646687560-7c3ee652
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 1m10s ago
Modified = 1m9s ago
Reschedule Attempts = 1/1
Task "example" is "dead"
Task Resources
CPU Memory Disk Addresses
0/100 MHz 0 B/100 MiB 300 MiB
Task Events:
Started At = 2022-03-07T21:12:46Z
Finished At = 2022-03-07T21:12:46Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2022-03-07T16:12:46-05:00 Not Restarting Policy allows no restarts
2022-03-07T16:12:46-05:00 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-03-07T16:12:46-05:00 Started Task started by client
2022-03-07T16:12:46-05:00 Driver Downloading image
2022-03-07T16:12:46-05:00 Task Setup Building Task Directory
2022-03-07T16:12:46-05:00 Received Task received by client
My evaluations list looks like the following:
$ nomad eval list
ID Priority Triggered By Job ID Status Placement Failures
54f5f9d3 50 alloc-failure example/dispatch-1646687560-7c3ee652 complete false
929282a3 50 alloc-failure example/dispatch-1646687560-7c3ee652 complete false
4992c684 50 job-register example/dispatch-1646687560-7c3ee652 complete false
Hello @tgross, thanks for your help analysing this problem :)
The screen I've pasted at the end of my first post is the equivalent of `nomad eval list` for one of the dispatched jobs.
How I interpret the eval list is the following; tell me if I'm wrong:
- when there is triggered-by==alloc-failure and placement-failures==true, the alloc is queued and there is a re-evaluation.
- when there is triggered-by==alloc-failure and placement-failures==false, the job has failed and thus it will be re-scheduled.

If this is correct, we can see on my screenshot that there are 3 reschedules, but my job is configured to allow only one. And thus the underlying Docker container of the task is run 3 times. The first time we encountered this problem, on a more complex infrastructure, the Docker container was re-run indefinitely.
We see this behaviour when resources are scarce and the evaluation after a reschedule fails. If the placements are successful two times in a row, then the job stops correctly after two failures.
If my example with 4 simultaneous batch jobs is not enough to produce a failed evaluation after a reschedule, I think that running more than 4, maybe 10 or even more, should trigger the problem.
> How I interpret the eval list is the following; tell me if I'm wrong:
> - when there is triggered-by==alloc-failure and placement-failures==true, the alloc is queued and there is a re-evaluation.
> - when there is triggered-by==alloc-failure and placement-failures==false, the job has failed and thus it will be re-scheduled.
Pretty close! Any time an allocation fails and it's eligible for a reschedule (ref `node_endpoint.go#L1150` and `structs.go#L9580-L9600`), an evaluation is queued. The "eligible for reschedule" is doing a bit of work in that sentence, but it includes a job that hasn't been rescheduled more than its `reschedule.attempts` and isn't a system job.
So in both those cases there's been an allocation that failed and an evaluation created, which goes into the scheduler. The `placement-failures` value is probably an unfortunate name, because we often use "placement failure" to mean the placement on the client. But what that value means is the scheduler placement failure. As in, the scheduler tried to find a place for the allocation and could not. So it creates a new "blocked" eval, and those are the evaluations that are shown in that UI view as `queued-allocs`.
That means that a single allocation reschedule attempt could have 3 (or more!) evaluations, and they can transition between different states. The timeline might look like:

1. The failed allocation results in a `(*Node) UpdateAlloc` RPC to the server, which creates an evaluation triggered by `alloc-failure`. Its `placement-failures` is false, because it hasn't been processed yet!
2. The scheduler processes that evaluation and can't place the allocation, so `placement-failures` gets set to `true` and a blocked evaluation gets created (shown in the UI as `queued-allocs`).
3. That blocked evaluation's `placement-failures` will be false. If the scheduler processes it and still can't place the allocation, it ends up with `placement-failures` true, at which point another blocked evaluation will get created.

That's 3 evaluations for a single reschedule check. It isn't until the new allocation is created and placed that we treat it as one "attempt" for purposes of `reschedule.attempts`.
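A quick way to watch those state transitions from the CLI, sketched here with an eval ID from the output above as a placeholder:

```sh
# List evaluations with their trigger, status, and placement-failures flag.
nomad eval list

# Inspect a single evaluation in detail; substitute an ID from the list above.
nomad eval status -verbose e09ee549
```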
Thanks for the detailed explanation :)
Yes, I see there are 3 evaluations in the list for one real attempt. I didn't mention the queued-allocs since it doesn't give much more information beyond a placement-failure=true for my problem. If you look at my eval list in the first post, there are 10 evaluations: 3×3 + 1, that is 3 evaluations per attempt where there was a failing placement, plus the last failing attempt. This is the eval list for only one dispatched job. And I really need there to be only 2 attempts, not 4.
If this eval list isn't enough to show that there are more than 2 attempts, I can make another example where the container writes into a file and we can see that there are too many attempts.
The nomad configuration file for the agent:
data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"
bind_addr = "0.0.0.0"
datacenter = "dc1"
plugin "docker" {
config {
volumes {
enabled = true
}
}
}
server {
enabled = true
bootstrap_expect = 1
}
client {
enabled = true
host_volume "host_files" {
path = "/opt/nomad/host_files"
read_only = false
}
}
The modified job file to have a container that writes to a file on the host:
job "example" {
parameterized {
}
datacenters = ["dc1"]
type = "batch"
group "example" {
reschedule {
attempts = 1
interval = "24h"
unlimited = false
}
restart {
attempts = 0
mode = "fail"
}
volume "host_files" {
type = "host"
read_only = false
source = "host_files"
}
task "example" {
driver = "docker"
volume_mount {
volume = "host_files"
destination = "/host_files"
read_only = false
}
config {
image = "alpine"
entrypoint = ["/bin/sh", "-c", "echo ${NOMAD_JOB_ID} >> /host_files/data && sleep 10 && false"]
}
resources {
memory = 10000
}
}
}
}
My list of jobs:
╰─ nomad status
ID Type Priority Status Submit Date
example batch/parameterized 50 running 2022-03-16T11:19:42+01:00
example/dispatch-1647425989-04c89364 batch 50 pending 2022-03-16T11:19:49+01:00
example/dispatch-1647425989-4e26f991 batch 50 pending 2022-03-16T11:19:49+01:00
example/dispatch-1647425989-5cbba383 batch 50 running 2022-03-16T11:19:49+01:00
example/dispatch-1647425989-bd4607e2 batch 50 pending 2022-03-16T11:19:49+01:00
And after some time with the 4 dispatched jobs, the file on the host:
╰─ cat /opt/nomad/host_files/data
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
example/dispatch-1647425989-4e26f991
example/dispatch-1647425989-bd4607e2
example/dispatch-1647425989-5cbba383
example/dispatch-1647425989-04c89364
And I can let it run for a long time; it will continue to make attempts and write job IDs to the file...
> If this eval list isn't enough to show that there are more than 2 attempts, I can make another example where the container writes into a file and we can see that there are too many attempts.
It might be, but it's hard to map the screenshot to jobs without having the list of allocations for the job as well. So the example you've given is very helpful.
> And after some time with the 4 dispatched jobs, the file on the host:
Ok, that's very interesting, as I see you've got 7 allocation runs per job there in the output. To try to replicate, I took your jobspec and I modified it in three small ways:

- changed `volume.host_files.source` for my cluster.
- reduced `resources.memory` to fit on my client (otherwise dispatch always fails).
- kept the `restart.attempts = 0` (but hey, as long as we suspect there's buggy behavior anyways, we should make sure it's not there).

I register that job and dispatch it 4 times. I end up with the expected list of job status and evaluations:
And the contents of the data file are exactly what I'd expect. 2 allocations run for each job:
job example/dispatch-1647436692-731197a9 - alloc 3cdff6bf-181b-b914-558b-d6083d34c78d
job example/dispatch-1647436697-5400bcb9 - alloc 44e49152-57cf-75bd-ad0d-0855e560432c
job example/dispatch-1647436699-06971e3d - alloc d92df661-489f-2000-84d4-3280bf9d9a76
job example/dispatch-1647436701-3cd48692 - alloc 860a6808-10ed-04f5-9758-aba6f200b4e6
job example/dispatch-1647436692-731197a9 - alloc 09fbb62e-b2a1-5e22-eb9e-22174e9fc2d9
job example/dispatch-1647436697-5400bcb9 - alloc c04bdf71-d196-ce15-6f63-3f37c3e18390
job example/dispatch-1647436699-06971e3d - alloc 4dd0ac98-cb55-9cda-2570-41614bedbf71
job example/dispatch-1647436701-3cd48692 - alloc 65857c11-ebd6-7aa4-1969-f39ad86a0890
I've double-checked this with both Nomad 1.2.6 and the current `main` and get the same result reliably. I'm assuming your Nomad binary is the one packaged by HashiCorp, and not something from your distro, right? Although I have no reason to believe it makes a difference, I've also tried with a standalone single server, in `-dev` mode, and with a full 3 server cluster + client. Works every time. :confused:
But it sounds like you're seeing this problem consistently at your end too, which suggests it's not some weird concurrency problem.
The job never ends up getting marked as `dead`? Can you run with debug logs and then run a single dispatch, and watch the logs until it's done at least one attempt it shouldn't have? Maybe there's something in the logs that'll give us a clue here.
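For example, one way to get those debug logs (not the only way):

```sh
# Start a throwaway dev agent with debug logging enabled...
nomad agent -dev -log-level=DEBUG

# ...or stream debug-level logs from an agent that's already running.
nomad monitor -log-level=DEBUG
```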
I have also tried today on a MacBook Pro with the Nomad binary from HashiCorp, and it looped as expected :sweat_smile:
What strikes me about the result of your `nomad eval list` is that you never have any placement failures, and the bug only appears when there is a placement failure after a failing attempt.
The memory limit was set to 10GB on my job because it is more than half of what my computer has. Thus the memory is exhausted after the placement of the first dispatched job, and the evaluations for the three other dispatched jobs fail.
I've used your job entrypoint. I dispatched one job alone and it worked well. Then I dispatched 4 at the same time and got back the infinite loop of attempts: they are never dead.
We can see that it worked well when the job was dispatched once, but not when there were 4:
I've attached the logs of Nomad if you can find anything useful to understand what is happening: nomad.log
I'm sorry, I closed the Nomad agent and forgot to call `nomad eval list` :(
> What strikes me about the result of your `nomad eval list` is that you never have any placement failures, and the bug only appears when there is a placement failure after a failing attempt. The memory limit was set to 10GB on my job because it is more than half of what my computer has. Thus the memory is exhausted after the placement of the first dispatched job, and the evaluations for the three other dispatched jobs fail.
Ah, yes, that's what I've been missing! I was finally able to reproduce by setting `memory = 1200` (my node has 1983MB available) and then dispatching a handful in a row without waiting. I end up with lots and lots of evals, and jobs that get rescheduled multiple times:
$ nomad job status example/dispatch-1647539230-3e910a86
ID = example/dispatch-1647539230-3e910a86
Name = example/dispatch-1647539230-3e910a86
Submit Date = 2022-03-17T13:47:10-04:00
Type = batch
Priority = 50
Datacenters = dc1
Namespace = default
Status = pending
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
example 1 0 0 4 0 0
Placement Failure
Task Group "example":
* Resources exhausted on 1 nodes
* Dimension "memory" exhausted on 1 nodes
Allocations
ID Node ID Task Group Version Desired Status Created Modified
05cdd5cd 8b4cb8ac example 0 stop failed 23s ago 7s ago
1f785e6a 8b4cb8ac example 0 stop failed 1m17s ago 1m1s ago
fc650e09 8b4cb8ac example 0 stop failed 2m10s ago 1m55s ago
1b4b7597 8b4cb8ac example 0 stop failed 3m4s ago 2m49s ago
I went through the logs you provided with a fine-toothed comb and there's enough interleaving to make it hard to see where the issue is. Fortunately, I was able to reproduce with just two dispatches. I've broken that down by hand into a detailed walk through the life of the evaluations (which I've left at the bottom of the page in a `<details>` block because there's quite a bit of it).
The tl;dr of all that is that when we process an eval for a rescheduleable alloc we create a "follow-up" eval. If that eval ends up getting blocked, it looks like we're losing track of reschedules at that point. When that happens and we get the `(*Node) UpdateAlloc` for the failed allocation, the check for rescheduling eligibility (ref `node_endpoint.go#L1150` and `structs.go#L9580-L9600`) passes when it should not. This suggests to me that the `RestartTracker` that we attach to the allocation is getting wiped somehow. I haven't figured that out yet and I'm getting towards the end of my day, so I'll have to pick up the investigation from there tomorrow.
In any case, I suspect that the reason this is getting into a loop is that the two (or more) dispatched jobs just end up interleaving evaluations and so they keep knocking each other into the bad state.
@avandierast I think I've got enough here to try to write a unit test in the `scheduler` package that'll reproduce this. I'll make an attempt at that tomorrow (assuming no other interruptions, of course 😁).
Ok, I've got a failing unit test in https://github.com/hashicorp/nomad/pull/12319. While working on that I realized this bug is likely to impact a feature that my colleague @DerekStrickland is currently working on, so I'm going to pair up with him to work on a solution starting early next week.
Hello @tgross. Happy to hear that you were able to reproduce :)
I think I may have found a similar situation that leads to the opposite result. In my case it is also a batch job (meant to be run daily).
The job "hangs" at the templating step (in my case, because it can't find a Vault-based secret that isn't there) and never retries.
I mean:
Can this behaviour be another aspect of this issue?
The `restart` stanza from my job follows (while testing this I set the job to run every 5 minutes). The current restart mode is "fail", but I also tested with `mode = "delay"` with the same results:
periodic {
cron = "*/5 * * * *"
prohibit_overlap = "true"
}
(...)
restart {
interval = "5m"
attempts = 3
delay = "30s"
mode = "fail"
}
Hi @next-jesusmanuelnavarro! The `restart` block is all about restarting on the same client without going back to the scheduler. This issue is about the `reschedule` block, which is what happens after we give up on `restart` and have to go back to the scheduler. So that's unrelated, and I've split it out to https://github.com/hashicorp/nomad/issues/12478
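To make the distinction concrete, here's an illustrative sketch of a group carrying both blocks (the values are examples only): `restart` governs in-place retries on the same client, while `reschedule` governs placing a replacement allocation via the scheduler after restarts are exhausted.

```hcl
job "restart-vs-reschedule" {
  datacenters = ["dc1"]
  type        = "batch"

  group "example" {
    # restart: local retries on the same client, handled by the client itself,
    # without ever going back to the scheduler.
    restart {
      attempts = 0
      mode     = "fail"
    }

    # reschedule: once restarts are exhausted and the allocation fails, the
    # scheduler may place a replacement allocation, up to this many times
    # per interval.
    reschedule {
      attempts  = 1
      interval  = "24h"
      unlimited = false
    }

    task "example" {
      driver = "docker"
      config {
        image = "alpine"
      }
    }
  }
}
```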
@avandierast the bug didn't turn out to be related to Derek's work, which just got merged. But we had an internal discussion about the bug and I think there's a couple approaches we can try, so I'll try to circle back to this issue shortly.
Quick update on this. I've got a working patch at https://github.com/hashicorp/nomad/pull/12319, but there's a lingering issue described in https://github.com/hashicorp/nomad/issues/20462#issuecomment-2142738356 which may be interrelated and I want to make sure I've done my due diligence here before trying to close this out.
This issue should be resolved in https://github.com/hashicorp/nomad/pull/12319
Nomad version
Operating system and Environment details
Fedora 35 with `nomad agent --dev`. Also reproduced on Ubuntu 20.04 with the Nomad server and client on two separate VMs.
Issue
When there is a placement failure of an allocation after a reschedule, the reschedule attempts limit of the task group is not respected at all. We have this problem with parameterized batch jobs; we don't know if it's the same with other job types.
Reproduction
Run agent:
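Per the environment details above, a dev-mode agent is enough, for example:

```sh
nomad agent -dev
```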
Run a parameterized batch job with restart attempts = 0 and reschedule attempts > 0 that exhausts your resources and always fails. For example, with the following job definition (on a computer with less than 20GB of RAM, so the node is exhausted by the 10GB resource reservation):
Dispatch that job several times to make placement failures on reschedule.
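Concretely, assuming the job definition is saved as `example.nomad` (the filename is just an illustration), the steps above look something like:

```sh
# Register the parameterized job; it won't run anything until dispatched.
nomad job run example.nomad

# Dispatch it several times in quick succession so that later dispatches hit
# placement failures while earlier ones are still being rescheduled.
nomad job dispatch example
nomad job dispatch example
nomad job dispatch example
nomad job dispatch example
```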
Expected Result
The tasks are rescheduled the number of attempts defined and then die, regardless of whether there was a placement failure between the attempts. For our example, we expect the container to be run only twice per dispatch.
Actual Result
The tasks are rescheduled many times before dying, and sometimes never die.
We can see the problem easily in the Nomad UI in the job's evaluations tab. The job has a reschedule attempts limit of 1 per 24h, but the dispatched job has 4 failures (alloc-failure with placement failures set to False):