Open marvinchin opened 1 year ago
I believe that this same bug also affects nodes at the time of a node restart (either a machine restart or a nomad client restart). This behavior is quite worrisome for several reasons:
I have read through the code mentioned in the bug report and the related code. It seems to me that the intent of the code is to only iterate through non-terminal allocs and to only add evaluations for those allocs.
I may be misreading the intent here, but either way this behavior is causing me quite a bit of headache, and we'd love to understand whether this is intended behavior (a feature) or a bug.
Hi @marvinchin and @stswidwinski 👋
Thanks for the detailed investigation! Looking at the code history, this part was written by Armon himself over 8 years ago and has not changed much since then 😄
Part of the reason for that is that the overall design of Nomad is to rely on the scheduler to process evals correctly and figure out what the state of the cluster should be. This can indeed be wasteful in some cases, and we have identified some areas of improvement, but we usually lean towards emitting more evals than strictly necessary just in case there are edge cases that we haven't foreseen.
In general these extra evals should be no-ops, meaning that they should not alter the state of the cluster. You mentioned that:
This can lead to unexpected changes to the status of a job (e.g. a "dead" job can become "pending" due to the presence of this new evaluation).
Is this something that you have observed in your cluster? If so, that's the type of problem we definitely don't want to happen as it violates scheduler correctness. If you have more information about an instance of this problem we can investigate it further.
@stswidwinski
I believe that this same bug also affects nodes at the time of a node restart (either a machine restart or a nomad client restart).
That's correct. createNodeEvals is called in multiple RPC endpoints, including Node.Register and Node.Deregister.
The creation of new evaluations prolongs the lifetime of the job to which the evaluation belongs. If nodes restart more frequently than the GC limit, the objects will never be GCed even if they are in a terminal state.
I have not been able to observe this. Is this something that is happening with you? Would you be able to share more information and maybe even reproduction steps? I have detailed my test below.
Restarts of nodes that are prepared for the restart (that is: drained) may cause a large load on both the server and the node.
This is also true, but since these evals are supposed to be no-ops the scheduler should be able to process them quickly. We're also actively improving the scheduler and eval broker to make it even faster and more optimized. We actually just published a blog post about it 🙂
I have read through the code mentioned in the bug report and the related code. It seems to me that the intent of the code is to only iterate through non-terminal allocs and to only add evaluations for those allocs
That comment is a bit ambiguous, but I read "ignores the terminal status" as "don't take the allocation status into consideration and return all allocations". Looking at https://github.com/hashicorp/nomad/pull/822, AllocsByNodeTerminal was created at the same time this method was introduced, so I think the original intention was for AllocsByNode to return all allocations. Using node_prefix means that the conditional compound index for the allocation status is ignored, which is what the comment states.
this behavior is causing me quite a bit of headache
Would you mind detailing what the problem is? Is it the GC and heavy load after node restart?
Hey there!
Thank you for the reply! Let me do my best to answer your questions, but as always let me know if anything is unclear.
In general these extra evals should be no-ops, meaning that they should not alter the state of the cluster.
I don't have any particular philosophical objection for the cases in which that holds. However, as you hint, I have observed a few separate instances in which it does not. Perhaps the most important ask in your comment is for particular examples and repros, so let me do my best to walk through the cases I've observed and provide concrete repros where possible.
Alloc lifetime extension.
While not exactly related to the issue at hand, this is a nice example of extraneous work leading to bad results. A restart of a node (drained or undrained) will cause all allocations that ever belonged to that node to have their lifetimes extended (the GC clock is reset). I have found the following bug report, which describes this behavior in much detail: https://github.com/hashicorp/nomad/issues/16381
Job lifetime extension
A different angle on the same issue above is the extension of the lifetime of a batch job in case of a node restart. It seems that the batch part here is quite important (in fact, I must own up to the fact that I haven't even tried this with service jobs). I believe this is because the last evaluation for such a job is special and serves as a sentinel for not scheduling another run of the batch job if no restarts are requested; I believe that this is the salient difference. Here is a repro using nomad built from the v1.5.0 tag:
$ cat config.conf
{
  "server": {
    "enabled": true,
    "job_gc_interval": "1m",
    "job_gc_threshold": "1m",
    "eval_gc_threshold": "1m",
    "batch_eval_gc_threshold": "1m"
  }
}
# Simple batch job
$ cat job.hcl
job "test_job" {
region = "global"
datacenters = ["dc1"]
type = "batch"
group "one_and_only" {
# Specify the number of these tasks we want.
count = 1
task "test" {
driver = "raw_exec"
config {
command = "/usr/bin/true"
args = [ "" ]
}
}
}
}
# Start the nomad agent
$ ./nomad agent -dev -config config.conf
# Run and stop the batch job
$ ./nomad job run job.hcl
$ ./nomad job stop test_job
# Re-register the node in a loop
$ while true; do ./nomad node meta apply test=$(date +%s); sleep 3; done
# Read the state of the job in a loop
$ while true; do date; ./nomad job status; sleep 3; done
The expectation here is for the GC to occur within 1 minute. In reality, the GC occurs anywhere between 12 and 20 minutes later. Here is what is observed:

1. Every time the node re-registers, a new eval is created for the job, resetting the GC clock for the job and its evaluations.
2. The fsm TimeTable only has a coarse granularity, which further delays when objects become eligible for GC (see: https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/fsm.go#L25 and bug: https://github.com/hashicorp/nomad/issues/16359).
3. Once the alloc is GCed, there are no more new evals created, but the last eval is brand new at this time.
4. Only then may the eval be GCed, and then the job may be GCed.

This means that even though we expect the job to be GCed within max(job_gc_threshold, batch_eval_gc_threshold, 2 * min_time_table_granularity) = 5 minutes, we GC within that value + batch_eval_gc_threshold. The default for batch_eval_gc_threshold is 1 day, which means that we would GC the job an entire day later than expected!
Hope this helps show the behavior a little bit better.
Jobs moving from dead into pending
I believe that this is a direct consequence of the repro within this bug report. Here is a repro, analogous to the repro above, but adding another server into the mix to ensure that there is added latency in the evaluation dequeue (that is: ensuring that the evaluation stays pending for a longer period of time). The job spec is identical and the config for the server and client is the default:
# Start up one server and one client with the default config:
$ ./nomad agent -dev
# Start another server to introduce fake latency into the scheduling, dequeuing and processing of evaluations
$ cat config.conf
{
  "name": "server2",
  "addresses": {
    "http": "localhost",
    "rpc": "localhost",
    "serf": "localhost"
  },
  "advertise": {
    "http": "localhost",
    "rpc": "localhost",
    "serf": "localhost"
  },
  "ports": {
    "http": 12345,
    "rpc": 12346,
    "serf": 12347
  },
  "server": {
    "enabled": true,
    "server_join": {
      "retry_join": ["127.0.0.1:4648"],
      "retry_max": 0,
      "retry_interval": "1s"
    }
  }
}
$ ./nomad agent -server -config config.conf
# Start a job and stop it. Make sure that it is indeed stopped and dead.
$ ./nomad job run job.hcl
==> 2023-03-08T21:20:00-05:00: Monitoring evaluation "e949cb3c"
2023-03-08T21:20:00-05:00: Evaluation triggered by job "test_job"
2023-03-08T21:20:01-05:00: Allocation "015d3be2" created: node "5ae41c60", group "one_and_only"
2023-03-08T21:20:01-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-03-08T21:20:01-05:00: Evaluation "e949cb3c" finished with status "complete"
$ ./nomad job stop test_job
==> 2023-03-08T21:20:09-05:00: Monitoring evaluation "ad79bef2"
2023-03-08T21:20:09-05:00: Evaluation triggered by job "test_job"
2023-03-08T21:20:10-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-03-08T21:20:10-05:00: Evaluation "ad79bef2" finished with status "complete"
# Run a tight loop reading the status of the test job.
$ while true; do ./nomad operator api /v1/job/test_job | jq .Status; done
# Run a loop updating the state of the node (forcing re-registration)
$ while true; do ./nomad node meta apply test=$(date +%s); sleep .1; done
Now, what matters is the output of the tight loop reading the job state. Here is a sample output:
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
The job is bouncing between the dead and pending states. This gets worse when the scheduler responsible for processing the evaluation is stuck and unable to send a nack to the leader, which leaves the evaluation in the pending state (and hence the job is also pending) for a long period of time (I believe that 5 minutes is the hard-coded timeout for eval re-submission on lack of ack or nack receipt: https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/config.go#L489-L495).
Hope this helps show this behavior as well.
Is this something that you have observed in your cluster? If so, that's the type of problem we definitely don't want to happen as it violates scheduler correctness. If you have more information about an instance of this problem we can investigate it further.
Please see above. In general, I have seen that the added work which is considered to be a no-op is only a no-op assuming there are no delays caused by the network, disk, CPU scheduling, and the like. In practice, correctness is violated.
That's correct. createNodeEvals is called in multiple RPC endpoints, including Node.Register and Node.Deregister.
I think I've addressed this above.
This is also true, but since these evals are supposed to be no-ops the scheduler should be able to process them quickly. We're also actively improving the scheduler and eval broker to make it even faster and more optimized. We actually just published a blog post about it 🙂
I appreciate the transparency! While I am glad that the scheduling algorithm and architecture are becoming more efficient, my worry is tangential: I am observing that a single node failure, or even a small number of nodes restarting, may cause cluster-wide load and significantly impact the availability of the cluster. This bug goes into a lot more detail:
https://github.com/hashicorp/nomad/issues/16381
I have benchmarked the behavior on a modern Intel CPU. Even if the network round trip is entirely ignored, the node client takes over 2 minutes to process 1k allocations, resulting in CPU utilization of over 300%, disk utilization of 100% due to DB writes, and node unavailability (missed heartbeats).
This means that restarting the nomad client is a dangerous operation which may impact not only local state, but also the global availability of the cluster.
Would you mind detailing what the problem is? Is it the GC and heavy load after node restart?
Please see above.
@lgfa29 just checking in to see if my reply made sense. Please let me know, happy to provide more details!
Yes, that makes sense, thank you for the detailed answer. I've marked this for further roadmapping.
We experience the same issue with jobs that have failed multiple times and reached ReschedulePolicy.MaxDelay. We've tried nomad job eval -force-reschedule job-name, and tried to purge the job and then run the GC; nothing helps.
$ nomad job eval -force-reschedule job-name
==> 2024-08-22T15:35:39+04:00: Monitoring evaluation "641f029b"
2024-08-22T15:35:40+04:00: Evaluation triggered by job "job-name"
2024-08-22T15:35:40+04:00: Evaluation status changed: "pending" -> "complete"
==> 2024-08-22T15:35:40+04:00: Evaluation "641f029b" finished with status "complete"
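For context, the kind of job that ends up held at ReschedulePolicy.MaxDelay is a repeatedly failing job with an unlimited reschedule policy. A minimal sketch of such a job (illustrative names and values, not the actual job spec):
job "job-name" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    # Keep rescheduling forever, backing off exponentially up to max_delay.
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
      unlimited      = true
    }

    task "app" {
      driver = "raw_exec"
      config {
        # Fails on every attempt, so the reschedule delay keeps growing.
        command = "/bin/false"
      }
    }
  }
}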
Nomad version
Nomad v1.4.2
Operating system and Environment details
Unix
Issue
When a client node's eligibility is toggled, evaluations are created even for terminal allocations on the node. This can lead to unexpected changes to the status of a job (e.g. a "dead" job can become "pending" due to the presence of this new evaluation).
This behavior is a little unexpected to me, but I could also just be misunderstanding something and this might be intended behavior.
Relevant Code
The part of the code that creates evaluations for allocations on a node calls snap.AllocsByNode to get both terminal and non-terminal allocations on the node (instead of AllocsByNodeTerminal, which allows us to filter on the terminal status): https://github.com/hashicorp/nomad/blob/542b23e999b5338cdcfa56bfdae1db9aa834cb3c/nomad/node_endpoint.go#L1604-L1608
Reproduction steps
First, install and start the nomad agent in dev mode:
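A minimal sketch of this step (assuming the nomad binary is on PATH; dev mode is used as elsewhere in this thread):
# Start a combined server + client in dev mode
$ nomad agent -dev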
Next, we run a batch job which terminates immediately:
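The exact job spec is not shown here; a batch job like the test_job used in the repros above (raw_exec running /usr/bin/true) fits the description, so roughly:
# Run the short-lived batch job and confirm its allocation reaches a terminal state
$ nomad job run job.hcl
$ nomad job status test_job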
Then, we toggle node eligibility off and on again, and notice that an evaluation was created for the task group whose allocation had already completed on the node:
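A sketch of this step with the standard CLI (targeting the local node with -self is an assumption, not taken from the original):
# Toggle scheduling eligibility off and back on for the local node
$ nomad node eligibility -disable -self
$ nomad node eligibility -enable -self
# List evaluations: one triggered by the node update shows up even though
# the job's only allocation is already terminal
$ nomad operator api /v1/evaluations | jq '.[] | {JobID, TriggeredBy, Status}'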
Expected Result
When a node is updated, only allocations on the node which are non-terminal should have evaluations created for them.
Actual Result
Evaluations are also created for terminal allocations which were previously scheduled on the node which was updated.