hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Unexpected evaluations created when client nodes are updated #16283

Open marvinchin opened 1 year ago

marvinchin commented 1 year ago

Nomad version

Nomad v1.4.2

Operating system and Environment details

Unix

Issue

When a client node's scheduling eligibility is toggled, evaluations are created even for terminal allocations on the node. This can lead to unexpected changes to the status of a job (e.g. a "dead" job can become "pending" due to the presence of this new evaluation).

This behavior is a little unexpected to me, but I could also just be misunderstanding something and this might be intended behavior.

Relevant Code

The part of the code that creates evaluations for allocations on a node calls snap.AllocsByNode, which returns both terminal and non-terminal allocations on the node (instead of AllocsByNodeTerminal, which allows filtering on terminal status):

https://github.com/hashicorp/nomad/blob/542b23e999b5338cdcfa56bfdae1db9aa834cb3c/nomad/node_endpoint.go#L1604-L1608
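For illustration, here is a heavily simplified, hypothetical sketch of the filtering I would have expected. The `Allocation` struct and `evalJobIDsForNode` below are made-up stand-ins, not Nomad's actual types or the real createNodeEvals logic; the point is just that allocations already in a terminal state are skipped when deciding which jobs need a node-update evaluation:

```go
package main

import "fmt"

// Allocation is a simplified stand-in for Nomad's allocation struct; the
// fields and status values here are illustrative only.
type Allocation struct {
	ID           string
	JobID        string
	ClientStatus string // e.g. "running", "complete", "failed", "lost"
}

// TerminalStatus mimics the idea of an "is this alloc finished?" check.
func (a *Allocation) TerminalStatus() bool {
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	default:
		return false
	}
}

// evalJobIDsForNode is a hypothetical helper: given the allocations on an
// updated node, return the job IDs that still need a node-update evaluation,
// skipping allocations that have already reached a terminal state.
func evalJobIDsForNode(allocs []*Allocation) []string {
	seen := make(map[string]bool)
	var jobIDs []string
	for _, alloc := range allocs {
		if alloc.TerminalStatus() {
			continue // finished work should not trigger a new eval
		}
		if !seen[alloc.JobID] {
			seen[alloc.JobID] = true
			jobIDs = append(jobIDs, alloc.JobID)
		}
	}
	return jobIDs
}

func main() {
	allocs := []*Allocation{
		{ID: "24d06cad", JobID: "example", ClientStatus: "complete"},
		{ID: "99aa0b1c", JobID: "other-job", ClientStatus: "running"},
	}
	// Only the job with a non-terminal allocation gets an eval: [other-job]
	fmt.Println(evalJobIDsForNode(allocs))
}
```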

Reproduction steps

First, install and start the nomad agent in dev mode:

# Install nomad locally...
$ yumdownloader nomad-1.4.2-1
$ rpm2cpio  nomad-1.4.2-1.x86_64.rpm  | cpio -idmv --no-absolute-filenames

# Start the agent in dev mode with the default config
$ ./usr/bin/nomad agent -dev

Next, we run a batch job which terminates immediately:

# First show the job spec
$ cat example.nomad
job "example" {
  datacenters = ["dc1"]
  type = "batch"

  group "test-group" {
    task "test-task" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/true"
      }
    }

    reschedule {
      attempts  = 0
      unlimited = false
    }
  }
}

# Submit the job, and note that the scheduled allocation terminates immediately
$ ./usr/bin/nomad job run example.nomad
==> 2023-02-27T18:00:35+08:00: Monitoring evaluation "dcf3acfb"
    2023-02-27T18:00:35+08:00: Evaluation triggered by job "example"
    2023-02-27T18:00:35+08:00: Allocation "24d06cad" created: node "f1b2898d", group "test-group"
    2023-02-27T18:00:36+08:00: Allocation "24d06cad" status changed: "pending" -> "complete" (All tasks have completed)
    2023-02-27T18:00:36+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-02-27T18:00:36+08:00: Evaluation "dcf3acfb" finished with status "complete"

# For good measure, we inspect the node status to verify that the allocation has status "complete"
$ nomad node status f1b2898d
ID              = f1b2898d-550b-9e25-b306-a8ab4fa3bcb3
Name            = <hostname>
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 61h54m24s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = raw_exec

Node Events
Time                       Subsystem  Message
2023-02-27T18:00:28+08:00  Cluster    Node registered

Allocated Resources
CPU           Memory      Disk
0/127744 MHz  0 B/94 GiB  0 B/886 GiB

Allocation Resource Utilization
CPU           Memory
0/127744 MHz  0 B/94 GiB

Host Resource Utilization
CPU             Memory         Disk
935/127744 MHz  73 GiB/94 GiB  229 GiB/1.1 TiB

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
24d06cad  f1b2898d  test-group  0        run      complete  14s ago  14s ago

Then, we toggle node eligibility off and on again, and notice that an evaluation was created for the task group whose allocation had already completed on the node:

# First make the node ineligible
$ nomad node eligibility -disable f1b2898d
Node "f1b2898d-550b-9e25-b306-a8ab4fa3bcb3" scheduling eligibility set: ineligible for scheduling

# No new evaluations were created yet
$ nomad eval list
ID        Priority  Triggered By  Job ID   Namespace  Node ID  Status    Placement Failures
dcf3acfb  50        job-register  example  default    <none>   complete  false

# Then, mark the node eligible again
$ nomad node eligibility -enable f1b2898d
Node "f1b2898d-550b-9e25-b306-a8ab4fa3bcb3" scheduling eligibility set: eligible for scheduling

# We see that marking the node eligible resulted in a new evaluation for the job being created
$ nomad eval list
ID        Priority  Triggered By  Job ID   Namespace  Node ID   Status    Placement Failures
0028d2c5  50        node-update   example  default    f1b2898d  complete  false
dcf3acfb  50        job-register  example  default    <none>    complete  false

Expected Result

When a node is updated, only allocations on the node which are non-terminal should have evaluations created for them.

Actual Result

Evaluations are also created for terminal allocations that were previously scheduled on the updated node.

stswidwinski commented 1 year ago

I believe that this same bug also affects nodes at the time of a node restart (either a machine restart or a Nomad client restart). This behavior is quite worrisome for several reasons:

  1. The creation of new evaluations prolongs the lifetime of the job to which the evaluation belongs. If nodes restart more frequently than the GC limit, the objects will never be GCed even if they are in a terminal state.
  2. Restarts of nodes that are prepared for restart (that is, drained) may cause a large load on both the server and the node.

I have read through the code mentioned in the bug report and the related code. It seems to me that the intent of the code is to only iterate through non-terminal allocs and only add evaluations for those allocations:

https://github.com/hashicorp/nomad/blob/542b23e999b5338cdcfa56bfdae1db9aa834cb3c/nomad/state/state_store.go#L4032-L4033

I may be misreading the intent here, but either way this behavior is causing me quite a bit of headache, and we'd love to understand whether this is intended behavior (a feature) or a bug.

lgfa29 commented 1 year ago

Hi @marvinchin and @stswidwinski 👋

Thanks for the detailed investigation! Looking at the code history, this part was written by Armon himself over 8 years ago and has not changed much since then 😄

Part of the reason for that is that the overall design of Nomad relies on the scheduler to process evals correctly and figure out what the state of the cluster should be. This can indeed be wasteful in some cases, and we have identified some areas of improvement, but we usually lean towards emitting more evals than strictly necessary just in case there are edge cases we haven't foreseen.
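To make that reliance on the scheduler a bit more concrete, here is a heavily simplified, hypothetical sketch (made-up names, not the actual scheduler code) of the kind of reconciliation an eval kicks off: when the existing allocations already satisfy the job, it yields zero new placements, so processing a redundant eval leaves cluster state unchanged.

```go
package main

import "fmt"

// alloc is a made-up, minimal stand-in for an existing allocation.
type alloc struct {
	taskGroup string
	terminal  bool
}

// placementsNeeded is a hypothetical reconciliation step: compare the desired
// count for a task group with the allocations that are still live. A redundant
// eval re-runs this, gets zero, and therefore submits no plan.
func placementsNeeded(group string, desired int, existing []alloc) int {
	live := 0
	for _, a := range existing {
		if a.taskGroup == group && !a.terminal {
			live++
		}
	}
	if missing := desired - live; missing > 0 {
		return missing
	}
	return 0
}

func main() {
	existing := []alloc{{taskGroup: "web", terminal: false}}
	// Desired count already satisfied: the eval is a no-op.
	fmt.Println(placementsNeeded("web", 1, existing)) // 0
}
```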

In general these extra evals should be no-ops, meaning that they should not alter the state of the cluster. You mentioned that:

This can lead to unexpected changes to status of a job (e.g. a "dead" job can become "pending" due to the presence of this new evaluation).

Is this something that you have observed in your cluster? If so, that's the type of problem we definitely don't want to happen, as it violates scheduler correctness. If you have more information about an instance of this problem we can investigate it further.

@stswidwinski

I believe that this same bug also affects nodes at the time of node restart (this is either the machine restart or the nomad client restart).

That's correct. createNodeEvals is called in multiple RPC endpoints, including Node.Register and Node.Deregister.

The creation of new evaluations prolongs the lifetime of the job to which the evaluation belongs. If nodes restart more frequently than the GC limit, the objects will never be GCed even if they are in a terminal state.

I have not been able to observe this. Is this something that is happening in your cluster? Would you be able to share more information and maybe even reproduction steps? I have detailed my test below.

Reproduction:

```
$ nomad agent -dev
$ nomad job init
$ nomad run example.nomad.hcl
$ nomad stop example
$ while true; do nomad node meta apply test=$(date +%s); sleep 3; done
```

After a while the job and its allocations were GC'ed, even with the extra `node-update` evals.

Restarts of nodes that are prepared for the restart (that is: drained) may cause large load on both the server and the node.

This is also true, but since these evals are supposed to be no-ops the scheduler should be able to process them quickly. We're also actively improving the scheduler and eval broker to make them even faster and more optimized. We actually just published a blog post about it 🙂

I have read through the code mentioned in the bug report and the related code. It seems to me that the intent of the code is to only iterate through non-terminal allocs and only add evaluations for such nodes

That comment is a bit ambiguous, but I read "ignores the terminal status" as meaning "don't take the allocation status into consideration and return all allocations". Looking at https://github.com/hashicorp/nomad/pull/822, where this method was introduced, AllocsByNodeTerminal was created at the same time, so I think the original intention was for AllocsByNode to return all allocations. Using node_prefix means that the conditional compound index for the allocation status is ignored, which is what the comment states.
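To illustrate the index distinction, here is a small, self-contained go-memdb sketch. It is not Nomad's actual schema (the table, fields, and status condition are simplified stand-ins for my reading of the allocs table's "node" index); it just shows how a compound index with a conditional component can be queried either through the node_prefix form, which ignores the terminal component and returns every allocation on the node, or with the full compound key, which filters on it.

```go
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// Alloc is a simplified stand-in for an allocation row.
type Alloc struct {
	ID       string
	NodeID   string
	Terminal bool // stand-in for "has this alloc reached a terminal status?"
}

func main() {
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"allocs": {
				Name: "allocs",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "ID"},
					},
					// Compound index: node ID plus a conditional component,
					// loosely modeled on the allocs table's "node" index.
					"node": {
						Name: "node",
						Indexer: &memdb.CompoundIndex{
							Indexes: []memdb.Indexer{
								&memdb.StringFieldIndex{Field: "NodeID", Lowercase: true},
								&memdb.ConditionalIndex{
									Conditional: func(obj interface{}) (bool, error) {
										return obj.(*Alloc).Terminal, nil
									},
								},
							},
						},
					},
				},
			},
		},
	}

	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	txn := db.Txn(true)
	for _, a := range []*Alloc{
		{ID: "24d06cad", NodeID: "f1b2898d", Terminal: true},
		{ID: "99aa0b1c", NodeID: "f1b2898d", Terminal: false},
	} {
		if err := txn.Insert("allocs", a); err != nil {
			panic(err)
		}
	}
	txn.Commit()

	read := db.Txn(false)

	// "node_prefix" with only the node ID ignores the conditional component,
	// so both allocations come back (the AllocsByNode-style lookup).
	it, _ := read.Get("allocs", "node_prefix", "f1b2898d")
	for raw := it.Next(); raw != nil; raw = it.Next() {
		fmt.Println("by node prefix:", raw.(*Alloc).ID)
	}

	// The full compound key also filters on the terminal flag
	// (the AllocsByNodeTerminal-style lookup).
	it, _ = read.Get("allocs", "node", "f1b2898d", false)
	for raw := it.Next(); raw != nil; raw = it.Next() {
		fmt.Println("non-terminal only:", raw.(*Alloc).ID)
	}
}
```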

this behavior is causing me quite a bit of headache

Would you mind detailing what the problem is? Is it the GC and heavy load after node restart?

stswidwinski commented 1 year ago

Hey there!

Thank you for the reply! Let me do my best to reply to your questions, but as always let me know if anything is unclear.

In general this extra evals should be no-ops, meaning that they should not alter the state of the cluster.

I don't have a strong philosophical opinion about the cases where that holds. However, as you hint, I have observed a few separate instances in which it does not. Perhaps the most important ask in your comment is for particular examples and repros, so let me do my best to walk through the cases I've observed and provide concrete repros where possible.

Alloc lifetime extension.

While not exactly related to the issue at hand, this is a nice example of extraneous work leading to bad results. A restart of a node (drained or undrained) causes all allocations that ever belonged to that node to have their lifetimes extended (the GC clock is reset). The following bug report describes this behavior in much more detail: https://github.com/hashicorp/nomad/issues/16381

Job lifetime extension

A different angle on the same issue above is the extension of the lifetime of a batch job when a node restarts. It seems that the batch part here is quite important (in fact, I must own up to the fact that I haven't tried this with service jobs). I believe that this is because the last evaluation for such a job seems to be special: it serves as a sentinel for not scheduling another run of the batch job if no restarts are requested. I believe that this is the salient difference. Here is a repro using nomad built from the v1.5.0 tag:

$ cat config.conf 
{
  "server": {
    "enabled": true,
    "job_gc_interval": "1m",
    "job_gc_threshold": "1m",
    "eval_gc_threshold": "1m",
    "batch_eval_gc_threshold": "1m",
  }
}

# Simple batch job
$ cat job.hcl
job "test_job" {
  region = "global"
  datacenters = ["dc1"]
  type = "batch"

  group "one_and_only" {
    # Specify the number of these tasks we want.
    count = 1

    task "test" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/true"
        args = [ "" ]
      }
    }
  }
}

# Start the nomad agent
$ ./nomad agent -dev -config config.conf

# Run and stop the batch job
$ ./nomad job run job.hcl
$ ./nomad job stop test_job

# Re-register the node in a loop
$ while true; do ./nomad node meta apply test=$(date +%s); sleep 3; done

# Read the state of the job in a loop
$ while true; do date; ./nomad job status; sleep 3; done

The expectation here is for the GC to occur within 1 minute. In reality, the GC occurs anywhere between 12 and 20 minutes. Here is what is observed:

  1. Every time the node is re-registered a new eval is created for the job on the node
  2. The above behavior continues until the original alloc is GCed. This happens 10 minutes after the job is stopped due to the hard-coded granularity of the FSM TimeTable (see: https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/fsm.go#L25 and bug: https://github.com/hashicorp/nomad/issues/16359)
  3. After the initial alloc is GCed, there are no more new evals created, but the last eval is brand new at this time.
  4. Hence we need to wait another 1-2 minutes until the eval may be GCed and then the job may be GCed.

This means that even though we expect the job to be GCed within max(job_gc_threshold, batch_eval_gc_threshold, 2 * min_time_table_granularity) = 5 minutes, we actually GC within that value + batch_eval_gc_threshold. The default for batch_eval_gc_threshold is 1 day, which means that we would GC the job an entire day later than expected!

Hope this helps show the behavior a little bit better.

Jobs moving from dead into pending

I believe that this is a direct consequence of the behavior in this bug report. Here is a repro, analogous to the one above, but with another server added to the mix to ensure there is added latency in evaluation dequeueing (that is, ensuring that the evaluation stays pending for a longer period). The job spec is identical, and the config for the server and client is the default:

# Start up one server and one client with the default config:
$ ./nomad agent -dev 

# Start another server to introduce fake latency into the scheduling, dequeing and processing of evaluations
$ cat config.conf
{
  "name": "server2",
  "addresses": {
    "http": "localhost",
    "rpc": "localhost",
    "serf": "localhost"
  },
  "advertise": {
    "http": "localhost",
    "rpc": "localhost",
    "serf": "localhost"
  },
    "ports": {
      "http":12345,
      "rpc":12346,
      "serf":12347
    },
  "server": {
    "enabled": true,
    "server_join" : { 
      "retry_join" : ["127.0.0.1:4648"],
      "retry_max": 0,
      "retry_interval": "1s"
    }
  }
}
$ ./nomad agent -server -config config.conf

# Start a job and stop it. Make sure that it is indeed stopped and dead.
$ ./nomad job run job.hcl
==> 2023-03-08T21:20:00-05:00: Monitoring evaluation "e949cb3c"
    2023-03-08T21:20:00-05:00: Evaluation triggered by job "test_job"
    2023-03-08T21:20:01-05:00: Allocation "015d3be2" created: node "5ae41c60", group "one_and_only"
    2023-03-08T21:20:01-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-03-08T21:20:01-05:00: Evaluation "e949cb3c" finished with status "complete"
$ ./nomad job stop test_job
==> 2023-03-08T21:20:09-05:00: Monitoring evaluation "ad79bef2"
    2023-03-08T21:20:09-05:00: Evaluation triggered by job "test_job"
    2023-03-08T21:20:10-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-03-08T21:20:10-05:00: Evaluation "ad79bef2" finished with status "complete"

# Run a tight loop reading the status of the test job.
$ while true; do ./nomad operator api /v1/job/test_job | jq .Status; done

# Run a loop updating the state of the node (forcing re-registration)
$ while true; do  ./nomad node meta apply test=$(date +%s); sleep .1; done

Now, what matters is the output of the tight loop reading the job state. Here is a sample output:

"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"pending"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"
"dead"

The job bounces between the dead and pending states. This gets worse when the scheduler responsible for processing the evaluation is stuck and unable to send a nack to the leader, causing it to leave the evaluation in the pending state (hence the job is also pending) for a long period of time (I believe that 5 minutes is the hard-coded timeout for eval re-submission when no ack or nack is received: https://github.com/hashicorp/nomad/blob/v1.5.0/nomad/config.go#L489-L495).

Hope this helps show this behavior as well.

Is this something that you have observed in your cluster? If so, that's the type of problem we definitely don't want to happen, as it violates scheduler correctness. If you have more information about an instance of this problem we can investigate it further.

Please see above. In general, I have seen that the added work which is considered a no-op is only a no-op assuming no delays from the network, disk, CPU scheduling, and the like. In practice, correctness is violated.

That's correct. createNodeEvals is called in multiple RPC endpoints, including Node.Register and Node.Deregister.

I think I've addressed this above.

This is also true, but since these evals are supposed to be no-ops the scheduler should be able to process them quickly. We're also actively improving the scheduler and eval broker to make them even faster and more optimized. We actually just published a blog post about it 🙂

I appreciate the transparency! While I am glad that the scheduling algorithm and architecture are becoming more efficient, my worry is tangential: I am observing that a single node failure, or even a small number of nodes restarting, may cause cluster-wide load and a significant impact on the availability of the cluster. This bug report goes into a lot more detail:

https://github.com/hashicorp/nomad/issues/16381

I have benchmarked the behavior on a modern Intel CPU. Even if the network round trip is entirely ignored, the node client takes over 2 minutes to process 1k allocations, resulting in CPU utilization of over 300%, disk utilization of 100% due to DB writes, and node unavailability (missed heartbeats).

This means that restarting the Nomad client is a dangerous operation which may impact not only local state, but also the global availability of the cluster.

Would you mind detailing what the problem is? Is it the GC and heavy load after node restart?

Please see above.

stswidwinski commented 1 year ago

@lgfa29 just checking in to see if my reply made sense. Please let me know, happy to provide more details!

lgfa29 commented 1 year ago

Yes, that makes sense, thank you for the detailed answer. I've marked this for further roadmapping.

Himura2la commented 4 weeks ago

We experience the same issue with jobs that have failed multiple times and reached ReschedulePolicy.MaxDelay. We've tried nomad job eval -force-reschedule job-name, and we've tried purging the job and then running GC; nothing helps.

$ nomad job eval -force-reschedule job-name
==> 2024-08-22T15:35:39+04:00: Monitoring evaluation "641f029b"
    2024-08-22T15:35:40+04:00: Evaluation triggered by job "job-name"
    2024-08-22T15:35:40+04:00: Evaluation status changed: "pending" -> "complete"
==> 2024-08-22T15:35:40+04:00: Evaluation "641f029b" finished with status "complete"
