hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Periodic sysbatch jobs run much more frequently than the spec expresses. #17397

Open stswidwinski opened 1 year ago

stswidwinski commented 1 year ago

Nomad version

1.5.5, but also reproducible at the tip.

Issue

Due to the garbage collection logic applied to periodic sysbatch jobs (and sysbatch jobs in general), sysbatch jobs will run much more frequently than the job spec expresses. In particular, consider the following:

  1. Job GC threshold of X
  2. Eval GC threshold of Y

If Y < X (that is, evaluations become GC-eligible before the job itself does), then every periodic run of the sysbatch job will run on every node multiple times, as long as allocations for a sysbatch job do not all end at the exact same time. This may lead to unbounded accumulation of periodic jobs and an unbounded number of allocations run for each of them on every node. Please see the repro for details.
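The threshold relationship can be sketched numerically. This is a minimal illustration only; the variable names are mine, and the values mirror the server config used in the repro below:

```shell
# Hypothetical illustration of the problematic threshold relationship.
# Values mirror server_config.hcl from the repro:
#   job_gc_threshold  = "24h"
#   eval_gc_threshold = "1m"
job_gc_threshold_s=$((24 * 60 * 60))
eval_gc_threshold_s=60

# Evaluations (and their allocations) become GC-eligible long before the
# parent job does, so the scheduler re-evaluates a still-live job whose
# completed allocations have already been collected.
if [ "$eval_gc_threshold_s" -lt "$job_gc_threshold_s" ]; then
  echo "evals expire $((job_gc_threshold_s - eval_gc_threshold_s))s before the job"
fi
```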

Reproduction steps

Start a server and two client nodes

# Local server
$ cat server_config.hcl 
data_dir = "/tmp/nomad/server"
log_level = "TRACE"

advertise {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}

server {
  enabled = true
  bootstrap_expect = 1
  job_gc_interval = "1m"
  job_gc_threshold = "24h"
  eval_gc_threshold = "1m" 
}
$ ./nomad agent -config server_config.hcl 

# Local client no. 1
$ cat client_config.hcl
data_dir = "/tmp/nomad/client-1"
log_level = "debug"

advertise {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}

ports {
  http = "9876"
  rpc = "9875"
  serf = "9874"
}

client {
  enabled = true
  servers = ["127.0.0.1"]
  gc_max_allocs = 1
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}
$ ./nomad agent -config client_config.hcl 

$ cat client_config_2.hcl 
data_dir = "/tmp/nomad/client-2"
log_level = "debug"

advertise {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}

ports {
  http = "8876"
  rpc = "8875"
  serf = "8874"
}

client {
  enabled = true
  servers = ["127.0.0.1"]
  gc_max_allocs = 1
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

Please note that we will start client node no. 2 in a way that naturally keeps all of its allocations alive forever. This relies on https://github.com/hashicorp/nomad/issues/16381, and we use it simply because it is convenient. In production this is most easily emulated by sysbatch jobs whose runtimes are non-uniform and splay across a largish period (say, 10 minutes). All we want from this node is for it to retain its allocations and not let them be GCed.

$ while true; do timeout 45 ./nomad agent -config client_config_2.hcl; done

Now, let us start a periodic job:

# Job
$ cat job.hcl
job "example" {
  datacenters = ["dc1"]
  type = "sysbatch"

  periodic {
    cron = "*/10 * * * * *"
  }

  group "test-group" {
    task "test-task" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/echo"
        args = [ "I ran!" ]
      }
    }
  }
}

$ ./nomad job run job.hcl                                                                                                 
Job registration successful
Approximate next launch time: 2023-06-01T21:20:00Z (9m25s from now)

Now, all we need to do is wait. Evaluations are only ever GCed every 5 minutes, and the GC cutoff is approximated based on the raft index:

 [DEBUG] core.sched: eval GC found eligibile objects: evals=6 allocs=3 

I left this running overnight, but realistically one may also just add artificial activity to the node so that the raft index moves forward. If we now look at some of the periodic job runs, they have many more completed allocations than there are nodes (in fact, we can make them have an arbitrary number!):

$ ./nomad job status example/periodic-1685712600
ID            = example/periodic-1685712600
Name          = example/periodic-1685712600
Submit Date   = 2023-06-02T09:30:00-04:00
Type          = sysbatch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
test-group  0       0         0        0       8         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
7f95450b  274e6daa  test-group  0        run      complete  41s ago     41s ago
4b36a17c  75560bfc  test-group  0        run      complete  42m21s ago  42s ago

The record-holding job had >1000 completed allocs in a system with just 2 nodes, and on average one alloc ran every second for this sysbatch job, which is configured to run every 10 minutes.
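A quick way to spot affected periodic instances is to count completed allocations per instance and compare against the node count. A sketch; the sample below is inlined for illustration (its third row is hypothetical) -- in practice you would capture the real `nomad job status` output:

```shell
# Count completed allocations from captured `nomad job status` output.
# With only 2 client nodes, any count above 2 demonstrates the
# over-scheduling described in this issue.
status_output='7f95450b  274e6daa  test-group  0  run  complete  41s ago     41s ago
4b36a17c  75560bfc  test-group  0  run  complete  42m21s ago  42s ago
9c11aa02  274e6daa  test-group  0  run  complete  52m3s ago   12m ago'

completed=$(printf '%s\n' "$status_output" | grep -c ' complete ')
echo "completed allocations: $completed"
```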

I expect a periodic sysbatch run to only ever have number_of_nodes completed allocations. Since this happens for every periodic run, in addition to new runs, we end up with an unbounded number of sysbatch periodic runs on every node (these jobs are never garbage collected, so their number grows without bound).

Expected Result

Each periodic sysbatch job instance runs on every node in the system only once.

Actual Result

Each periodic sysbatch job instance runs a large number of times on every node.

Root cause

The root cause is a combination of the one described in https://github.com/hashicorp/nomad/issues/17395 and the fact that garbage collection for sysbatch jobs differs from that for batch jobs.

Batch jobs retain at least one allocation per task group that has run, so that they are not rescheduled when the exit code is 0 (that is -- they are expected not to run again):

https://github.com/hashicorp/nomad/blob/v1.5.5/nomad/core_sched.go#L306-L313

However, this logic does not exist for sysbatch jobs, which causes the behavior above.
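The batch-job safeguard linked above can be paraphrased as: during eval GC, keep at least one terminal allocation per task group so the scheduler can see the group already ran. A purely illustrative model of that filter (not Nomad's actual code, which lives in core_sched.go), using made-up "taskgroup status" pairs:

```shell
# Illustrative model of the batch-job GC safeguard: an allocation is
# GC-eligible only if another allocation for the same task group is
# already being kept. Sysbatch jobs lack this guard entirely.
allocs='test-group complete
test-group complete
other-group complete'

kept=''
gc=0
while read -r group status; do
  case " $kept " in
    *" $group "*) gc=$((gc + 1)) ;;  # group already has a kept alloc: GC-eligible
    *) kept="$kept $group" ;;        # keep the first terminal alloc for the group
  esac
done <<EOF
$allocs
EOF

echo "kept groups:$kept, GC-eligible: $gc"
```

Because one allocation per task group survives GC, the scheduler still sees that the group completed and does not place it again; without the guard, a re-evaluation after GC finds no allocations and schedules the group afresh on every node.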

jrasell commented 1 year ago

Hi @stswidwinski and thanks for raising this issue with great detail. I'll add this to our backlog for potential future roadmapping.

hxt365 commented 11 months ago

I'm also having this issue