Open johnzhanghua opened 3 years ago
thank you, @johnzhanghua , we'll look into it.
just to be sure, these nodes that you are restarting are just clients, not servers? and do you see this problem if you remove the gc_max_allocs config? (can you say why you're using that config?)
Each node runs both a client and a server. The restart is sudo reboot now, which reboots the whole node.
The gc_max_allocs config should not be related. I saw the problem in our environment with the default value, which is 50.
The issue is easier to reproduce with more system jobs. I duplicated the job file, changing only the job name, to create test1, test2, and test3. After several node reboots, the status ends up looking something like this:
nomad job status test2
ID = test2
Name = test2
Submit Date = 2021-01-20T06:43:33Z
Type = system
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
test2 0 1 1 0 3 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
147a9998 1d358dc0 test2 0 run pending 25m54s ago 1s ago
4e970e97 8af71708 test2 0 run running 27m46s ago 6m36s ago
65f0ea98 1d358dc0 test2 0 run complete 27m46s ago 6m54s ago
c9c75f7b 93eba225 test2 0 run complete 27m46s ago 6m41s ago
nomad job status test
ID = test
Name = test
Submit Date = 2021-01-18T23:29:56Z
Type = system
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
test 0 0 2 0 13 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
e3170151 93eba225 test 0 run complete 47m13s ago 9m34s ago
f17b4b18 93eba225 test 0 stop complete 48m48s ago 9m51s ago
4e0ac9a8 8af71708 test 0 run running 4h4m ago 9m34s ago
027c4532 1d358dc0 test 0 run running 1d5h ago 9m51s ago
91b19a4b 93eba225 test 0 stop complete 1d7h ago 9m51s ago
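The job duplication described in this comment (the same job file submitted as test1, test2, test3, with only the name changed) could be scripted along these lines. This is a minimal sketch, not the reporter's actual procedure: the helper name rename_job and the short template text are hypothetical, and the sketch assumes the job name appears in a single top-level job "<name>" header.

```python
import re


def rename_job(job_text: str, new_name: str) -> str:
    """Return job_text with the name in the first `job "<name>"` header replaced.

    Group and task names are left unchanged, matching the "only changing
    the name" approach described in the thread.
    """
    renamed, count = re.subn(
        r'^(\s*job\s+")[^"]+(")',
        lambda m: m.group(1) + new_name + m.group(2),
        job_text,
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("no job header found in job text")
    return renamed


# Hypothetical minimal job text, used only to demonstrate the rename.
template = 'job "test" {\n  datacenters = ["dc1"]\n  type = "system"\n}\n'

# One copy of the job text per name used in the thread.
copies = {name: rename_job(template, name) for name in ("test1", "test2", "test3")}
```

Each copy could then be written to its own file and submitted with nomad job run <job_file>.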
As an extra data point, I brought up a 1-server, 2-client VirtualBox cluster with https://github.com/krishicks/vagrant-nomad, deployed the above test job as test1, test2, etc., and tried rebooting the cluster repeatedly via sudo reboot. I rebooted just the clients, as well as the clients and the server, and was not able to reproduce the above issue.
@krishicks Thanks. It looks like you are using Nomad version 1.0.2.
We will try version 1.0.3. If you bring Nomad down to 0.12.0, can you reproduce it?
I can try doing this, but you can also do it if you want to try reproducing it: just download whichever Nomad binary you want and put it in the root folder before vagrant up; it will replace the Nomad binary with the given one.
It would be really great if you could find a reproduction in that environment!
Found a similar open issue: https://github.com/hashicorp/nomad/issues/2419
Nomad version
Nomad v0.12.0 (8f7fbc8e7b5a4ed0d0209968faf41b238e6d5817)
Operating system and Environment details
CentOS 7.5 VM on VirtualBox 6.1, 3-node cluster, server and client on the same node
Issue
Nomad schedules extra instances of a system task on a node after the node reboots
Reproduction steps
client {gc_max_allocs = 1}
nomad job run <job_file>
Summary
Task Group Queued Starting Running Failed Complete Lost
test 0 0 3 0 0 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
0a861162 1d358dc0 test 0 run running 26m24s ago 24m45s ago
3989bac4 93eba225 test 0 run running 26m24s ago 24m48s ago
d5f19cff 8af71708 test 0 run running 26m24s ago 24m44s ago
nomad job status test
ID = test
Name = test
Submit Date = 2021-01-18T23:29:56Z
Type = system
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
test 0 1 2 0 2 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
3cd842f1 1d358dc0 test 0 run running 4m54s ago 4m42s ago
0a861162 1d358dc0 test 0 run complete 32m54s ago 4m52s ago
3989bac4 93eba225 test 0 run running 32m54s ago 4m36s ago
d5f19cff 8af71708 test 0 run pending 32m54s ago 43s ago
nomad job status test
ID = test
Name = test
Submit Date = 2021-01-18T23:29:56Z
Type = system
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
test 0 2 3 0 5 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
05d00f19 8af71708 test 0 run running 27m26s ago 27m26s ago
ae08bf8e 1d358dc0 test 0 run pending 27m26s ago 27m26s ago
91b19a4b 93eba225 test 0 run running 30m10s ago 27m10s ago
3cd842f1 1d358dc0 test 0 run pending 36m26s ago 8s ago
0a861162 1d358dc0 test 0 run complete 1h4m ago 27m22s ago
3989bac4 93eba225 test 0 run complete 1h4m ago 27m22s ago
d5f19cff 8af71708 test 0 run running 1h4m ago 27m26s ago
job "test" {
  datacenters = ["dc1"]
  type = "system"

  group "test" {
    restart {
      interval = "6m"
      attempts = 10
      delay = "10s"
      mode = "delay"
    }
  }
}
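The symptom reported in this issue (a system job ending up with more than one live allocation on the same node) can be checked mechanically from the text of nomad job status. The following is a minimal sketch under stated assumptions, not part of Nomad: the helper names are hypothetical, "live" is assumed to mean a Desired value of run with a Status of pending or running, and the parser assumes whitespace-separated columns with no spaces inside the task group name. The sample rows are copied from the last status output in the reproduction steps.

```python
from collections import defaultdict

# Assumption: allocations in these client states count as "live".
LIVE_STATUSES = {"pending", "running"}


def parse_allocations(status_text):
    """Parse the rows under the 'Allocations' header of `nomad job status` text."""
    rows = []
    in_allocations = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Allocations":
            in_allocations = True
            continue
        # Skip everything before the table, the column header, and short lines.
        if not in_allocations or fields[0] == "ID" or len(fields) < 6:
            continue
        # Columns: ID, Node ID, Task Group, Version, Desired, Status, timestamps.
        rows.append({
            "id": fields[0],
            "node": fields[1],
            "desired": fields[4],
            "status": fields[5],
        })
    return rows


def nodes_with_duplicate_allocs(rows):
    """Return {node_id: [alloc_ids]} for nodes with more than one live allocation."""
    by_node = defaultdict(list)
    for row in rows:
        if row["desired"] == "run" and row["status"] in LIVE_STATUSES:
            by_node[row["node"]].append(row["id"])
    return {node: ids for node, ids in by_node.items() if len(ids) > 1}


# Rows taken from the last status output in the reproduction steps.
sample = """\
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
05d00f19  8af71708  test        0        run      running   27m26s ago  27m26s ago
ae08bf8e  1d358dc0  test        0        run      pending   27m26s ago  27m26s ago
91b19a4b  93eba225  test        0        run      running   30m10s ago  27m10s ago
3cd842f1  1d358dc0  test        0        run      pending   36m26s ago  8s ago
0a861162  1d358dc0  test        0        run      complete  1h4m ago    27m22s ago
3989bac4  93eba225  test        0        run      complete  1h4m ago    27m22s ago
d5f19cff  8af71708  test        0        run      running   1h4m ago    27m26s ago
"""

duplicates = nodes_with_duplicate_allocs(parse_allocations(sample))
```

For this sample, nodes 8af71708 and 1d358dc0 each carry two live allocations, while 93eba225 has one. Completed allocations are deliberately ignored, so the check does not flag a node merely because an older allocation was replaced after a reboot.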