Nomad version
1.5.5, but also the tip of main.
Issue
Due to the garbage collection logic applied to periodic sysbatch jobs (and sysbatch jobs in general), sysbatch jobs run much more frequently than the job spec expresses. In particular, consider the following:
Job GC period of X
Eval GC period of Y
If X < Y, then every periodic run of the sysbatch job will run on every node multiple times, as long as the allocations of a sysbatch job do not all end at exactly the same time. This can lead to unbounded accumulation of periodic job instances and an unbounded number of allocations run for each of them on every node. Please see the repro below for details.
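To make the X and Y above concrete, here is a minimal sketch of server settings that put a cluster into the X < Y regime. It assumes X maps to the job_gc_threshold server option and Y to the eval_gc_threshold option; the report does not name the exact settings, so both the mapping and the values are illustrative only.
server {
  enabled = true

  # X: minimum age a terminal job must reach before the job GC may collect it.
  job_gc_threshold = "5m"

  # Y: minimum age a terminal evaluation (and its allocations) must reach
  # before the eval GC may collect it. With these values, X < Y.
  eval_gc_threshold = "10m"
}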
Reproduction steps
Start a server and two client nodes.
Please note that we start client node no. 2 in a way that naturally keeps all of its allocations alive forever. This is due to https://github.com/hashicorp/nomad/issues/16381 and we use it only because it is convenient. In production this is most easily emulated by sysbatch jobs whose runtimes are non-uniform and simply splay across a large-ish period (say, 10 minutes). All we want from this node is for it to keep its allocations and not let them be GCed.
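The report does not include client_config_2.hcl, which the command below passes to the agent; a minimal, purely hypothetical client configuration along these lines would do (all values are illustrative):
# client_config_2.hcl (hypothetical): an ordinary minimal client config.
# Nothing in the file itself is special; what keeps this node's allocations
# alive forever is the restart loop used to launch the agent (see the note
# above and the command below).
data_dir = "/tmp/nomad-client-2"
name     = "client-2"
client {
  enabled = true
  servers = ["127.0.0.1:4647"]
}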
$ while true; do timeout 45 ./nomad agent -config client_config_2.hcl; done
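Now, let us start a periodic job. The job file is not included in the report; a minimal sketch consistent with the status output further below (job "example", task group "test-group", type sysbatch, datacenter dc1, a 10-minute period) might look like the following. The cron expression, the randomized sleep, and the file name are illustrative assumptions.
# Hypothetical periodic sysbatch job for the repro; assumes the raw_exec
# driver is enabled on the client nodes.
job "example" {
  datacenters = ["dc1"]
  type        = "sysbatch"

  # Fire every 10 minutes.
  periodic {
    cron = "*/10 * * * *"
  }

  group "test-group" {
    task "test" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        # Non-uniform runtime so allocations on different nodes do not all
        # finish at the same time (illustrative).
        args = ["-c", "sleep $((RANDOM % 60))"]
      }
    }
  }
}
$ ./nomad job run example.nomad.hcl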
Now, all we need to do is wait. Evaluations are only ever GCed every 5 minutes, and the GC is approximate, based on the raft index:
[DEBUG] core.sched: eval GC found eligibile objects: evals=6 allocs=3
I left this running overnight, but realistically one may also just add artificial activity on the cluster so that the raft index moves forward. If we now look at some of the periodic job runs, they have a lot of complete allocations (many more than there are nodes -- in fact, we can make them have an arbitrary number!):
$ ./nomad job status example/periodic-1685712600
ID            = example/periodic-1685712600
Name          = example/periodic-1685712600
Submit Date   = 2023-06-02T09:30:00-04:00
Type          = sysbatch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
test-group  0       0         0        0       8         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
7f95450b  274e6daa  test-group  0        run      complete  41s ago     41s ago
4b36a17c  75560bfc  test-group  0        run      complete  42m21s ago  42s ago
The record-holding job had >1000 completed allocs in a system with just 2 nodes; on average, one alloc was run every second for this sysbatch job, which is configured to run every 10 minutes.
I expect a periodic sysbatch run to only ever have number_of_nodes completed allocations. Since this happens for every periodic run, in addition to the new runs being created, we get an unbounded number of sysbatch periodic runs on every node (these jobs are never garbage collected and their number grows without bound).
Expected Result
Each periodic sysbatch job instance runs on every node in the system only once.
Actual Result
Each periodic sysbatch job instance runs a large number of times on every node.
Root cause
The root cause is a combination of the one described in https://github.com/hashicorp/nomad/issues/17395 and the fact that garbage collection for sysbatch jobs differs from that for batch jobs.
Batch jobs keep at least one allocation per task group that ran, so that they are not rescheduled when the exit code is 0 (that is, they are expected not to run again):
https://github.com/hashicorp/nomad/blob/v1.5.5/nomad/core_sched.go#L306-L313
However, this logic does not exist for sysbatch jobs, which causes the behavior described above.