Closed: tantra35 closed this issue 5 years ago
Perhaps one way to reproduce this is to drain a node while a rolling update is in progress. However, when we previously saw this behavior, we had not drained any node.
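For example, a minimal reproduction sketch using the commands that appear later in this thread (the job file name and node ID are placeholders):

nomad run -detach ./job.nomad              # submit a modified job to start a rolling update
nomad node drain -enable -yes <node-id>    # drain a node while the update is in flight
nomad node drain -disable -yes <node-id>   # optionally make the node eligible again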
@tantra35 system jobs can only have a count of 1. Do you have a full job spec you can post here?
@preetapan Sorry, I misspoke. I meant a job running on 3 instances. Here is the job spec:
job "msgserver"
{
# region = "global"
datacenters = ["aws"]
type = "system"
priority = 70
constraint
{
attribute = "${attr.kernel.name}"
value = "linux"
}
constraint
{
attribute = "${node.class}"
value = "msgserver"
}
update
{
stagger = "30m"
max_parallel = 1
}
group "msgserver"
{
task "msgserver"
{
leader = true
driver = "raw_exec"
kill_timeout = "1m"
artifact
{
source = "s3::https://s3.amazonaws.com/nomad-artifacts/msgserver/msgserver.14.tar.gz"
}
config
{
command = "start.sh"
}
service
{
name = "srvname"
port = "MSGSERVER"
}
logs
{
max_files = 3
max_file_size = 10
}
resources
{
memory = 1024
network
{
mbits = 10
port "appport"
{
static = 8192
}
port "MSGSERVER"
{
}
}
}
}
task "fluend"
{
driver = "raw_exec"
artifact
{
source = "s3::https://s3.amazonaws.com/nomad-artifacts/td-agent/td-agent.3.tar.gz"
}
config
{
command = "bin/td-agent"
args = ["--no-supervisor", "-c", "${NOMAD_TASK_DIR}/td-agent.conf"]
}
template {
data = <<EOH
<source>
@type tail
read_from_head true
path "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stdout.*"
pos_file "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stdout-log.pos"
tag "prod.msgserver.{{ env "APPNAME" }}.out"
format multiline
format_firstline /\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2} \d+ \|/
format1 /(?<time>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}) (?<pid>\d+) \| (?<message>.*)/
time_format "%Y.%m.%d %H:%M:%S"
</source>
<source>
@type tail
read_from_head true
path "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stderr.*"
pos_file "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stderr-log.pos"
tag "prod.msgserver.{{ env "APPNAME" }}.err"
multiline_flush_interval 3s
format multiline
format_firstline /^\S+/
format1 /(?<message>.*)/
</source>
<filter prod.msgserver.**>
@type record_transformer
<record>
appname "{{ env "APPNAME" }}"
hostname ${hostname}
</record>
</filter>
<match **>
@type forward
heartbeat_interval 20s
phi_threshold 16
send_timeout 5s
recover_wait 30s
flush_interval 5s
retry_wait 10s
max_retry_wait 5m
<server>
name graphite
host fluentd.service.consul
port 24224
</server>
</match>
EOH
destination = "local/td-agent.conf"
}
env
{
APPNAME = "appname"
}
logs
{
max_files = 3
max_file_size = 10
}
resources
{
memory = 150
cpu = 50
}
}
}
}
@preetapan @cgbaker We just checked this on a test stand, and we concluded that rolling updates are completely broken for all job types. We tested with the following job definition:
job "vault_debug-00"
{
datacenters = ["test"]
priority = 50
constraint
{
attribute = "${attr.kernel.name}"
value = "linux"
}
constraint
{
distinct_hosts = true
}
update
{
stagger = "1800s"
max_parallel = 1
}
meta
{
CONTAINER_VERSION = "02"
}
group test
{
count = 3
task "vault_debug_task"
{
driver = "docker"
kill_timeout = "1m"
artifact
{
source = "http://docker.service.consul/playrix-vault_debug-${NOMAD_META_CONTAINER_VERSION}.tar.gz"
options
{
archive=false
}
}
config
{
command = "/sbin/init_plrx"
args = ["-c", "/sbin/runsvdir -P /etc/service"]
}
vault
{
policies = ["bind_password","vault_debug-consul"]
change_mode = "noop"
}
template
{
data = <<EOH
CONSUL_TOKEN="{{with secret "secrets/consul/plr/global/creds/vault_debug-consul"}}{{.Data.token}}{{end}}"
EOH
destination = "secrets/consul_token.env"
env = true
}
logs
{
max_files = 3
max_file_size = 10
}
resources
{
memory = 500
cpu = 200
}
}
}
}
Then, to test rolling updates, we reduced memory step by step from 500 to 400, then to 300, then to 200, re-running the job after each change.
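Each update step was only an edit to the memory value in the resources stanza (illustrative, based on the spec above):

resources
{
  memory = 400 # was 500; the change creates a new job version and triggers a rolling update
  cpu = 200
}

After the third update we saw the following: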
vagrant@consulnomad-1:~$ nomad status vault_debug-00
ID            = vault_debug-00
Name          = vault_debug-00
Submit Date   = 2018-10-17T23:42:32+03:00
Type          = service
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test        0       0         3        0       9         0

Latest Deployment
ID          = dd1173e2
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
test        3        3       3        0          2018-10-17T23:53:32+03:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
4c29e949  79e25c0e  test        3        run      running   6m33s ago   6m15s ago
d248228b  51d21e3d  test        3        run      running   6m56s ago   6m34s ago
67b721a6  f6f07f1f  test        3        run      running   7m15s ago   6m57s ago
7c1f6d1d  a7e6f34b  test        2        stop     complete  9m7s ago    6m28s ago
8cbe7324  79e25c0e  test        2        stop     complete  9m26s ago   6m51s ago
8f6e8db7  51d21e3d  test        2        stop     complete  9m45s ago   7m11s ago
800287bf  f6f07f1f  test        1        stop     complete  12m27s ago  9m3s ago
8aa67916  a7e6f34b  test        1        stop     complete  12m46s ago  9m22s ago
171a7060  79e25c0e  test        1        stop     complete  13m9s ago   9m41s ago
25ace39b  f6f07f1f  test        0        stop     complete  49m13s ago  12m41s ago
cd147c95  51d21e3d  test        0        stop     complete  49m13s ago  12m21s ago
7fc23361  a7e6f34b  test        0        stop     complete  49m13s ago  13m4s ago
The distance between allocation launches is not 30 minutes; it is much less.
https://www.nomadproject.io/docs/job-specification/update.html
stagger (string: "30s") - Specifies the delay between migrating allocations off nodes marked for draining. This is specified using a label suffix like "30s" or "1h".
You probably care about min_healthy_time for your service jobs :)

min_healthy_time (string: "10s") - Specifies the minimum time the allocation must be in the healthy state before it is marked as healthy and unblocks further allocations from being updated. This is specified using a label suffix like "30s" or "15m".
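For illustration, a minimal sketch of an update stanza that uses min_healthy_time (the values here are illustrative, not a recommendation):

update
{
  max_parallel = 1
  min_healthy_time = "30s" # an allocation must stay healthy this long before the next one is updated
  healthy_deadline = "5m"  # deadline for an allocation to become healthy before it counts as failed
}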
@jippi yes, you are right: for the service job type the new update block works and min_healthy_time is respected (in my last post I drew an incorrect conclusion), but for the system job type these new parameters are ignored entirely and only stagger applies to it (unfortunately, the documentation does not indicate the true state of things).
On the test stand we can reproduce, 100% of the time, a situation where one of the nodes is draining and the system scheduler restarts all allocations within a short time, without waiting for the required timeouts. We also found that rolling updates for service job types behave incorrectly as well when one of the nodes is drained during the process.
To demonstrate what actually goes wrong, we launched Nomad on the test stand with debug logging. The test stand consists of 3 Nomad servers and 4 Nomad clients. We launched a system job with the following update stanza:
update
{
  stagger = "30m"
  max_parallel = 1
}
then made a change to it and ran nomad run -detach ./vault_debug.test-05.nomad to trigger a rolling update. After that, the logs on the Nomad server looked like this:
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.907400 [DEBUG] worker: dequeued evaluation e75af621-7ffc-5e35-239f-8623515bc939
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.921771 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 0) (update 4) (migrate 0) (stop 0) (ignore 0) (lost 0)
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.921881 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: 0 in-place updates of 4
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.923550 [DEBUG] worker: created evaluation <Eval "97ef3105-79dc-9249-a230-736e8783079b" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.923574 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: rolling update limit reached, next eval '97ef3105-79dc-9249-a230-736e8783079b' created
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.929813 [DEBUG] worker: submitted plan at index 83 for evaluation e75af621-7ffc-5e35-239f-8623515bc939
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.930056 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: setting status to complete
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.932338 [DEBUG] worker: updated evaluation <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.932584 [DEBUG] worker: ack for evaluation e75af621-7ffc-5e35-239f-8623515bc939
Then we drained one of the nodes (nomad node drain -enable -yes c45751e3), and additional log output appeared:
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476138 [DEBUG] worker: dequeued evaluation be59677d-5400-42f4-0ca0-b698d0002662
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476654 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 0) (update 2) (migrate 1) (stop 0) (ignore 1) (lost 0)
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476988 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: 0 in-place updates of 2
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.478065 [INFO] nomad.drain: node "c45751e3-1c5e-e529-7215-172d74db0063" completed draining at index 94
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.478721 [DEBUG] worker: created evaluation <Eval "22500713-3736-4df0-f619-17c544c75237" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.478966 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: rolling update limit reached, next eval '22500713-3736-4df0-f619-17c544c75237' created
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.490021 [DEBUG] worker: submitted plan at index 96 for evaluation be59677d-5400-42f4-0ca0-b698d0002662
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.490042 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: setting status to complete
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.492300 [DEBUG] worker: updated evaluation <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.492698 [DEBUG] worker: ack for evaluation be59677d-5400-42f4-0ca0-b698d0002662
And, as expected, one of the allocations was updated, but much too early (in this job, rolling updates are supposed to happen at 30m intervals, one allocation at a time).
Then we undrained the previously drained node (nomad node drain -disable -yes c45751e3), and again additional log output appeared:
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.332766 [DEBUG] worker: dequeued evaluation 73071556-fd12-0ede-f979-dbb376fa795f
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.333250 [DEBUG] sched: <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 1) (update 1) (migrate 0) (stop 0) (ignore 2) (lost 0)
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.333427 [DEBUG] sched: <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">: 0 in-place updates of 1
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.336965 [DEBUG] worker: submitted plan at index 110 for evaluation 73071556-fd12-0ede-f979-dbb376fa795f
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.336990 [DEBUG] sched: <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">: setting status to complete
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.339395 [DEBUG] worker: updated evaluation <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.339639 [DEBUG] worker: ack for evaluation 73071556-fd12-0ede-f979-dbb376fa795f
and again one of the allocations was updated too early.
So, in the end, we have the following picture:
vagrant@consulnomad-1:~$ nomad status vault_debug-05
ID            = vault_debug-05
Name          = vault_debug-05
Submit Date   = 2018-10-19T14:23:59+03:00
Type          = system
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test        0       0         4        0       4         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
8be8ad29  c45751e3  test        1        run      running   14m11s ago  14m8s ago
0c7d54de  e722631d  test        1        run      running   14m11s ago  14m2s ago
033d869d  b98fe747  test        1        run      running   14m31s ago  14m20s ago
6afae84f  c80b0cb2  test        1        run      running   15m37s ago  15m29s ago
286e7662  b98fe747  test        0        stop     complete  18m11s ago  14m24s ago
2dd59904  e722631d  test        0        stop     complete  18m11s ago  14m6s ago
36a4618f  c45751e3  test        0        stop     complete  18m11s ago  14m24s ago
a717ab2f  c80b0cb2  test        0        stop     complete  18m11s ago  15m32s ago
The distance between allocation creation times is very small (unexpected); it should be about 30m.
But I must say that in our production this situation happens without any node drain (I mention the drain here only as a way to reproduce).
The key reason why this happens is clear from this log line:
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476654 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 0) (update 2) (migrate 1) (stop 0) (ignore 1) (lost 0)
The Nomad server should ignore all remaining allocations and not update them when no update is due. But if any event happens in the cluster involving the nodes that host allocations still waiting for their rolling-update slot (for example, node flapping due to a bad network, which emits node-update evaluations and now happens often in AWS environments), those allocations will be relaunched too early.
So rolling updates partially work incorrectly (they behave unstably compared with previous versions of Nomad).
@tantra35 sorry for the late reply. This is a known issue with the system scheduler: the stagger parameter creates a follow-up evaluation for the scheduler at a later time, but any extraneous event, such as a node drain or a lost node, that causes the scheduler to look at that job again will still cause an alloc update.
We are tracking improvements to the system scheduler in #4740, where we will add all the recent enhancements to deployments for batch and service jobs to the system scheduler. I am closing this issue in favor of #4740.
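A simplified illustration of the mechanism described above (hypothetical pseudocode, not Nomad source): each evaluation independently picks up to max_parallel out-of-date allocations and defers the rest to a follow-up evaluation scheduled stagger later, but nothing records when the previous batch was rolled, so an event-triggered evaluation updates the next batch immediately.

package main

import "fmt"

// alloc is a stand-in for a scheduled allocation; illustrative only.
type alloc struct {
    id         string
    jobVersion uint64
}

// reconcile sketches the per-evaluation rolling-update decision: any
// out-of-date allocation is eligible, and nothing checks how recently the
// previous batch was updated, so a drain- or flap-triggered evaluation
// rolls the next batch right away. The stagger interval only delays the
// follow-up evaluation the scheduler creates for itself.
func reconcile(allocs []alloc, latest uint64, maxParallel int) (update, ignore []alloc) {
    for _, a := range allocs {
        if a.jobVersion < latest && len(update) < maxParallel {
            update = append(update, a)
        } else {
            ignore = append(ignore, a)
        }
    }
    return update, ignore
}

func main() {
    allocs := []alloc{{"a1", 0}, {"a2", 0}, {"a3", 0}, {"a4", 1}}
    upd, ign := reconcile(allocs, 1, 1)
    // Mirrors the "(update N) ... (ignore N)" counters in the eval log lines.
    fmt.Printf("(update %d) (ignore %d)\n", len(upd), len(ign))
}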
Nomad version: 0.8.6

We have a system job with the following rolling update config (the same update stanza as in the full job spec posted earlier in the thread):
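update
{
  stagger = "30m"
  max_parallel = 1
}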
The job has one task group with a count of 3. When we update it, sometimes Nomad doesn't wait 30m between task restarts and updates all tasks at the same time. In the output above we can see that the distance between task restarts is much less than it should be (30m).