marvinchin opened this issue 1 year ago
Hi @marvinchin 👋
Thank you for the detailed report! I was able to easily reproduce the issue following the steps you provided. As you noted, the miscount only happens if the blocked eval is for a system job; for service jobs the accounting seems to be correct.
A few possible solutions come to mind, but I'm not 100% sure what the best approach is. Preventing multiple calls to Block()/Unblock() is the more direct approach, but potentially brittle, as it's easy to accidentally call them multiple times again. Making these functions somewhat idempotent by keeping a small dedup cache in BlockedStats could help as well, but that opens questions about cache size, eviction, etc.
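For illustration, here is a minimal sketch of the dedup-cache idea. The type and field names are made up (this is not Nomad's actual BlockedStats), and it ignores the cache-size/eviction questions mentioned above:

```go
package main

import "fmt"

// Hypothetical sketch: a small set of eval IDs inside the stats type makes
// Block/Unblock idempotent. Not Nomad's actual BlockedStats.
type dedupedStats struct {
	TotalBlocked int
	counted      map[string]struct{} // eval IDs currently accounted for
}

func newDedupedStats() *dedupedStats {
	return &dedupedStats{counted: make(map[string]struct{})}
}

func (s *dedupedStats) Block(evalID string) {
	if _, ok := s.counted[evalID]; ok {
		return // duplicate Block for the same eval: ignore
	}
	s.counted[evalID] = struct{}{}
	s.TotalBlocked++
}

func (s *dedupedStats) Unblock(evalID string) {
	if _, ok := s.counted[evalID]; !ok {
		return // never counted, or already unblocked: avoid a double decrement
	}
	delete(s.counted, evalID)
	s.TotalBlocked--
}

func main() {
	s := newDedupedStats()
	s.Block("30216c5a")   // eval becomes blocked
	s.Unblock("30216c5a") // e.g. unblocked by a node update
	s.Unblock("30216c5a") // e.g. unblocked again by a capacity change: no-op
	fmt.Println(s.TotalBlocked) // 0, not -1
}
```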
I added this to our board for further triaging and roadmapping. Thank you again for the report!
Thanks for looking into this!
I agree that ensuring this can never happen in all cases sounds more involved.
I suspect that in this specific case, if we call b.tracked.remove(e) in UnblockNode, that should fix the issue for now. If you agree, I'm happy to submit a PR that makes this change.
Hi @marvinchin, apologies for the delay here. I was familiarizing myself with this part of the code and also checking with the team for extra guidance.
I think your assertion is correct, we should be removing evals from b.captured on UnblockNode, but, more broadly, I was trying to understand the general relationship between b.jobs, b.captured, b.escaped, and b.system. Looking at the code, it seems like the relations are something like this:

Every blocked eval is tracked in b.jobs:
https://github.com/hashicorp/nomad/blob/b17c0f7ff9fea7bfc465e6e0ffa959ffa2662c89/nomad/blocked_evals.go#L199

It is also placed in either b.escaped or b.captured and, if the eval is for a system job, it will also be part of b.system:
https://github.com/hashicorp/nomad/blob/b17c0f7ff9fea7bfc465e6e0ffa959ffa2662c89/nomad/blocked_evals.go#L213-L231

So b.jobs is a superset of all blocked evals, b.captured and b.escaped are disjoint sets, and b.system is a subset of b.captured.
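To make those relationships concrete, here is a small self-contained toy model: plain sets of eval IDs, not Nomad's actual data structures (which are keyed differently), with a check for the invariants just described:

```go
package main

import "fmt"

// Toy model of the relationships described above; names follow the thread,
// the types are made up for illustration.
type blockedSets struct {
	jobs     map[string]struct{} // superset: every blocked eval
	captured map[string]struct{} // disjoint from escaped
	escaped  map[string]struct{} // disjoint from captured
	system   map[string]struct{} // subset of captured
}

// consistent checks the invariants: captured ∩ escaped = ∅,
// captured ∪ escaped ⊆ jobs, and system ⊆ captured.
func (s blockedSets) consistent() bool {
	for id := range s.captured {
		if _, dup := s.escaped[id]; dup {
			return false
		}
		if _, ok := s.jobs[id]; !ok {
			return false
		}
	}
	for id := range s.escaped {
		if _, ok := s.jobs[id]; !ok {
			return false
		}
	}
	for id := range s.system {
		if _, ok := s.captured[id]; !ok {
			return false
		}
	}
	return true
}

func main() {
	s := blockedSets{
		jobs:     map[string]struct{}{"eval-1": {}, "eval-2": {}},
		captured: map[string]struct{}{"eval-1": {}},
		escaped:  map[string]struct{}{"eval-2": {}},
		system:   map[string]struct{}{"eval-1": {}},
	}
	fmt.Println(s.consistent()) // true

	// Removing an eval from captured without touching system breaks the invariant.
	delete(s.captured, "eval-1")
	fmt.Println(s.consistent()) // false
}
```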
One thing the team noticed is that it would make more sense for b.system to overlap both b.captured and b.escaped, and so this block of code should come earlier:
https://github.com/hashicorp/nomad/blob/b17c0f7ff9fea7bfc465e6e0ffa959ffa2662c89/nomad/blocked_evals.go#L223-L227
But I think we can put a pin on that for now.
What these relationships illustrate is that, in order to achieve correctness, we must always keep these maps consistent: removing something from b.captured should always be followed by a removal from b.system and b.jobs. The same applies to b.system but, as you noticed, there are a few places where we don't do this 😅
So during the eval unblock something like this happens:
The negative metric reported here is caused by the overlap of the two hatched areas, where we decrement the metric twice, but there are also two other regions where we leave the maps in an inconsistent state because we only remove evals from b.captured or b.system:
In practice, I'm not sure if the rightmost region actually exists, because we call Unblock() with a computed class that the node ID used in UnblockNode() also has:
https://github.com/hashicorp/nomad/blob/b17c0f7ff9fea7bfc465e6e0ffa959ffa2662c89/nomad/fsm.go#L465-L466
But with concurrent operations we may end up with UnblockNode being called "first", in which case the leftmost region wouldn't exist.
All of this is to say that I think we would need to make sure a delete(b.captured, ...) is always followed by a b.system.Remove(...), and the other way around as well. For this specific bug, it would look something like this:
https://github.com/hashicorp/nomad/commit/a73cbafda1dbc34d7357e7842e033ee0013bd08d
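Purely as an illustration of that pairing (the linked commit is the authoritative change; the type and helper below are made up, not Nomad's real fields):

```go
package main

import "fmt"

// Toy sketch of the rule above: whenever an eval leaves captured it also
// leaves system and jobs, and vice versa.
type toyBlockedEvals struct {
	captured map[string]struct{} // eval ID -> captured blocked eval
	system   map[string]struct{} // eval IDs belonging to system jobs
	jobs     map[string]string   // job ID -> eval ID
}

// removeEval is the single place that touches all three maps, so no caller can
// leave them inconsistent by forgetting one of the deletions.
func (b *toyBlockedEvals) removeEval(jobID, evalID string) {
	delete(b.captured, evalID)
	delete(b.system, evalID)
	delete(b.jobs, jobID)
}

func main() {
	b := &toyBlockedEvals{
		captured: map[string]struct{}{"eval-1": {}},
		system:   map[string]struct{}{"eval-1": {}},
		jobs:     map[string]string{"example-job": "eval-1"},
	}
	b.removeEval("example-job", "eval-1") // UnblockNode-style removal
	fmt.Println(len(b.captured), len(b.system), len(b.jobs)) // 0 0 0
}
```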
The TotalQuotaLimit is unrelated but also seemed wrong to me; feel free to ignore it in your work 😄
Does this all make sense to you?
Sorry, I somehow missed this notification. Thanks for the explanation - I'll put up a PR for this soon.
Awesome, thank you!
Let us know if you have any other questions 🙂
While trying to implement this I came across something I didn't quite understand:

In L367, b.system.JobEvals returns a list of evals, which implies that a single system job might have multiple blocked evals. However, in L376, b.jobs is a map of job ID to eval, which implies that every job should have at most one blocked eval. I'm a little confused by this - do you happen to know what might be happening here?

Empirically, when I tried running a system job that had placement failures on two nodes, I only saw one blocked evaluation being created, attributed to only one of the two nodes. dlv also shows that b.system.byJob contains only one evaluation for the single node ID. I don't happen to have the output handy, but I'm happy to reproduce this again and share the output if you think it'll be helpful.
Hi @marvinchin, apologies for the delay here, Nomad 1.7 has kept me quite busy for the past weeks 😅
> In L367 b.system.JobEvals returns a list of evals, which implies that a single system job might have multiple blocked evals. However in L376 b.jobs is a map of job id to eval, which implies that every job should have at most one blocked eval. I'm a little confused by this - do you happen to know what might be happening here?
I believe both assertions are correct, but they apply to different job types. System jobs have one blocked eval per eligible node, so, in total, a system job may have multiple blocked evals. I believe that's why there's an early return there in L372. If the job is in b.system.JobEvals then it's a system job, so it's handled differently.
The reason why a system job needs one blocked eval per eligible node is so that the reconciler is triggered if something changes in any of the nodes.
Service jobs, however, can only have one blocked eval, because the Nomad scheduler reconciler always processes all allocs, so a single eval will do.
> Empirically, when I tried running a system job that had placement failures on two nodes, I only see one blocked evaluation being created which is attributed to only one of the two nodes.
Now, this is a bit surprising 🤔
Were both nodes configured the same? Meaning, were they both supposed to be eligible for placement of the system job?
Sorry for the delay here, it's been a busy couple of weeks!
> Were both nodes configured the same? Meaning, were they both supposed to be eligible for placement of the system job?

I recall that that's the case. Let me try to provide you with a repro that shows this in the next couple of days.
Here's a reproduction:
# Start a server
$ ./usr/bin/nomad agent -config server.conf
...
# Start two clients which are identical in configuration except for the ports/names
$ diff client-1.conf client-2.conf
1c1
< data_dir = "/tmp/nomad/data/client-1"
---
> data_dir = "/tmp/nomad/data/client-2"
16,18c16,18
< http = "9901"
< rpc = "9902"
< serf = "9903"
---
> http = "9801"
> rpc = "9802"
> serf = "9803"
$ ./usr/bin/nomad agent -config client-1.conf
...
$ ./usr/bin/nomad agent -config client-2.conf
...
# Submit a system job that requests a static port
$ cat blocker.nomad
job "blocker" {
datacenters = ["dc1"]
type = "system"
group "test-group" {
task "test-task" {
driver = "raw_exec"
config {
command = "/usr/bin/sleep"
args = ["inf"]
}
kill_timeout = "1h"
}
network {
port "test" {
static = 10000
}
}
}
}
$ ./usr/bin/nomad job run blocker.nomad
==> 2023-12-29T18:06:11+08:00: Monitoring evaluation "79aa44b8"
2023-12-29T18:06:11+08:00: Evaluation triggered by job "blocker"
2023-12-29T18:06:12+08:00: Allocation "c4bbab73" created: node "c11029b2", group "test-group"
2023-12-29T18:06:12+08:00: Allocation "e18bc370" created: node "90844561", group "test-group"
2023-12-29T18:06:12+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-12-29T18:06:12+08:00: Evaluation "79aa44b8" finished with status "complete"
# Run another system job that requests the same port (in our case, identical except for the job name), and see that it gets blocked
$ diff blocker.nomad blocked.nomad
1c1
< job "blocker" {
---
> job "blocked" {
$ ./usr/bin/nomad job run blocked.nomad
==> 2023-12-29T18:06:20+08:00: Monitoring evaluation "ad3046c2"
2023-12-29T18:06:20+08:00: Evaluation triggered by job "blocked"
2023-12-29T18:06:21+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-12-29T18:06:21+08:00: Evaluation "ad3046c2" finished with status "complete" but failed to place all allocations:
2023-12-29T18:06:21+08:00: Task Group "test-group" (failed to place 2 allocations):
# List the evaluations and see that there is only one blocked evaluation for one of the two nodes
$ ./usr/bin/nomad eval list
ID Priority Triggered By Job ID Namespace Node ID Status Placement Failures
1773aff2 50 queued-allocs blocked default 90844561 blocked N/A - In Progress
ad3046c2 50 job-register blocked default <none> complete true
79aa44b8 50 job-register blocker default <none> complete false
Then, I attached dlv to the server process and inspected the state of a goroutine inside of BlockedEvals:
(dlv) bt
0 0x0000000000442e4e in runtime.gopark
at runtime/proc.go:399
1 0x00000000004535c5 in runtime.selectgo
at runtime/select.go:327
2 0x0000000001db7890 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).GetDuplicates
at github.com/hashicorp/nomad/nomad/blocked_evals.go:679
3 0x0000000001e3cf05 in github.com/hashicorp/nomad/nomad.(*Server).reapDupBlockedEvaluations
at github.com/hashicorp/nomad/nomad/leader.go:1163
4 0x0000000001e37d25 in github.com/hashicorp/nomad/nomad.(*Server).establishLeadership.func6
at github.com/hashicorp/nomad/nomad/leader.go:386
5 0x0000000000476b21 in runtime.goexit
at runtime/asm_amd64.s:1650
(dlv) p b.system.byJob
map[github.com/hashicorp/nomad/nomad/structs.NamespacedID]map[string]string [
{ID: "blocked", Namespace: "default"}: [
"90844561-7e2c-ed7d-1a4f-d16f65606b5b": "1773aff2-0ccf-27ec-9547-9ea75739786a",
],
]
Again, we see only one evaluation.
Oh that's right. I think my mental model around this area was wrong. In the scenario you described only a single eval is created because the task group has already been marked as failed: https://github.com/hashicorp/nomad/blob/e7ca2b51ad1bde9d128d35108587e5eedd0b4458/scheduler/scheduler_system.go#L411-L416
But even if the job has multiple task groups, only a single blocked eval is present because of the processBlockJobDuplicate method.
https://github.com/hashicorp/nomad/blob/e7ca2b51ad1bde9d128d35108587e5eedd0b4458/nomad/blocked_evals.go#L234-L288
Evals added to b.duplicates are eventually cancelled by the leader in reapDupBlockedEvaluations.
https://github.com/hashicorp/nomad/blob/e7ca2b51ad1bde9d128d35108587e5eedd0b4458/nomad/leader.go#L1216-L1250
So I think there may be a small window of time where a job can have multiple blocked evals until the leader reaps the duplicated ones. b.system.JobEvals() also doesn't read directly from the state store, so there may be a short window of time where it's not consistent with global state. Since b.stats.Unblock() kind of operates on the same level as b.system.JobEvals() (a kind of "cache" on top of global state), it seems OK to keep the loop there, even though we usually only expect there to be one eval.
Nomad version
1.5.5
Operating system and Environment details
Unix
Issue
The reported value of the nomad.nomad.blocked_evals.total_blocked metric becomes incorrect when a blocked evaluation for a system job is unblocked due to a node update. In particular, we've observed cases where the metric reported a negative count of blocked evaluations on the cluster, which cannot be true.
The effect of this is that it's hard to rely on this metric to monitor the health/capacity of the cluster.
Reproduction steps
Suspected Root Cause
I attached delve to the server process to try and figure out where the accounting might have gone wrong:
Delve script and output
```
$ cat delve.script
trace blocked_evals_stats.go:42
on 1 print "block"
on 1 print eval.ID
on 1 bt
trace blocked_evals_stats.go:50
on 2 print "unblock"
on 2 print eval.ID
on 2 bt

$ dlv attach 50485 --init <(cat delve.script)
Type 'help' for list of commands.
Tracepoint 1 set at 0x1d6d0a6 for github.com/hashicorp/nomad/nomad.(*BlockedStats).Block() github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:42
Tracepoint 2 set at 0x1d6d146 for github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock() github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50
(dlv) continue
> goroutine(131): github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock(("*github.com/hashicorp/nomad/nomad.BlockedStats")(0xc000b54160), ("*github.com/hashicorp/nomad/nomad/structs.Evaluation")(0xc000003c80))
    "unblock": "unblock"
    eval.ID: "30216c5a-d608-d321-d41b-843defbb1281"
    Stack:
    0  0x0000000001d6d146 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock
       at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50
    1  0x0000000001d6a60b in github.com/hashicorp/nomad/nomad.(*BlockedEvals).UnblockNode
       at github.com/hashicorp/nomad/nomad/blocked_evals.go:515
    2  0x0000000001db51ba in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyNodeEligibilityUpdate
       at github.com/hashicorp/nomad/nomad/fsm.go:534
    3  0x0000000001db20a5 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply
       at github.com/hashicorp/nomad/nomad/fsm.go:276
    4  0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1
       at github.com/hashicorp/raft@v1.3.11/fsm.go:98
    5  0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2
       at github.com/hashicorp/raft@v1.3.11/fsm.go:121
    6  0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM
       at github.com/hashicorp/raft@v1.3.11/fsm.go:237
    7  0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm
       at
```

Interestingly, we see 30216c5a-d608-d321-d41b-843defbb1281 get unblocked twice, once via UnblockNode and once via watchCapacity.

Looking at processBlock, where b.stats.Block was invoked, we see that it tracks the blocked system evaluation in 3 places:

b.jobs: https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L196
b.system: https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L223
b.captured: https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L228

However, in UnblockNode, when we unblock the system evaluation we only remove it from b.system but not b.captured:
https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L513-L516

And in unblock, which is called by watchCapacity, it iterates over all the evaluations in b.captured, which would include the system evaluation that might have already been unblocked by UnblockNode as described above:
https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L566-L585

Which leads to b.stats.Unblock being called on the same evaluation twice.
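As a self-contained illustration of the effect (toy code, not Nomad's actual accounting), decrementing the same gauge once per code path is exactly what produces the negative value we observed:

```go
package main

import "fmt"

// Toy stand-in for the total_blocked accounting, for illustration only.
func main() {
	totalBlocked := 0

	block := func() { totalBlocked++ }
	unblock := func() { totalBlocked-- } // decrements unconditionally

	block()   // the system eval becomes blocked
	unblock() // unblocked via UnblockNode
	unblock() // unblocked again via watchCapacity for the same eval

	fmt.Println(totalBlocked) // -1: a negative count, as seen in the metric
}
```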
Other Potential Issues
We noticed a similar pattern happening in the following places:
Untrack removes evaluations from b.system but not b.tracked:
https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L364-L370

processBlockJobDuplicate calls b.stats.Unblock if the duplicated evaluation is in the captured set, but not if it is in the escaped set:
https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L248-L274

We're not certain if these other cases could also be triggered in a way that causes double unblock accounting, but we thought it was worth mentioning.
Expected Result
The number of blocked evals as reported by the nomad.nomad.blocked_evals.total_blocked metric should be correct.

Actual Result
The number of blocked evals reported is incorrect.