jozef-slezak closed this issue 3 years ago
Root cause of this problem (see the following app logs): Nomad started the task, but we do not see it in the Nomad console:
Errno::EADDRINUSE: Address already in use - bind(2) for "0.0.0.0" port 3200
systemctl status nomad | grep 3200
├─8812 puma 3.12.0 (tcp://0.0.0.0:3200) [workflow worker:3200]
├─8813 /bin/bash -c cd /opt/web/current; exec &> >(tee -i -a /var/log/innrm/web_mq_workflow_worker-127.0.0.1-3200.log); exec /bin/bash ./bin/production_run_worker_mq.sh --type workflow
├─8814 tee -i -a /var/log/innrm/web_mq_workflow_worker-127.0.0.1-3200.log
The Nomad console displays a duplicate allocation that will never succeed and is therefore DEAD.
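A quick way to confirm which process is actually holding the colliding port is to check the listening sockets (a sketch, using port 3200 from the log above; the -p flags need root):

# Show the PID/program listening on port 3200:
ss -tlnp | grep ':3200'
# Alternative using lsof:
lsof -iTCP:3200 -sTCP:LISTEN
# Inspect the listener's parentage to see whether a Nomad executor still owns it
# (<listener-pid> is a placeholder for the PID found above):
ps -o pid,ppid,user,cmd -p <listener-pid>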
Nomad config:
addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}
bind_addr = "127.0.0.1"
client = {
  enabled = true
  network_interface = "lo"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join = false
}
server = {
  bootstrap_expect = 1
  enabled = true
  encrypt = "..."
  server_join {
    retry_join = [ "127.0.0.1" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}
Systemd nomad.service
systemctl show nomad.service | grep Kill
KillMode=control-group
KillSignal=15
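The KillMode here is worth noting: with KillMode=control-group, systemd signals every process left in the unit's cgroup when the unit stops, while KillMode=process (which the unit file later in this thread uses) signals only the main agent process and leaves task shutdown to Nomad's executors. A sketch of switching it with a drop-in, assuming the unit name above:

# Create a drop-in override (opens an editor):
sudo systemctl edit nomad.service
# In the editor, add:
#   [Service]
#   KillMode=process
#   KillSignal=SIGINT
# Reload and verify the effective settings:
sudo systemctl daemon-reload
systemctl show nomad.service | grep Kill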
Relates to #5919
Hi @jozef-slezak! From what I can see here, the sequence seems like it was:
- the web-worker workload restarted but was not healthy because of a port collision
- the web-delayed-job-web-cache and web-workflow-worker jobs are restarted but not healthy, but there's not enough information to tell me why they're unhealthy other than a broken connection. Do they require a connection to the web-worker job?
A few things we could use from you:
I needed to repeat the same test again and found #8899. I checked eligibility using the web console and reproduced the same behaviour: address already in use (wrong reconciliation after restart using VirtualBox power off). I also modified nomad.hcl and nomad.service.
Application log (different job, on port 3201):
Errno::EADDRINUSE: Address already in use - bind(2) for "0.0.0.0" port 3201
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:272:in `initialize'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:272:in `new'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:272:in `add_tcp_listener'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:105:in `block in parse'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:88:in `each'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:88:in `parse'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/runner.rb:151:in `load_and_bind'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/single.rb:96:in `run'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/launcher.rb:184:in `run'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/cli.rb:78:in `run'
/opt/web/web_core/lib/tasks/workers.rake:6:in `block (2 levels) in <top (required)>'
/opt/web/web_core/lib/tasks/workers.rake:14:in `block (2 levels) in <top (required)>'
/opt/web/web_core/lib/tasks/run.rake:25:in `block (3 levels) in <top (required)>'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/bin/ruby_executable_hooks:24:in `eval'
/opt/web/web_customization/vendor/bundle/ruby/2.6.0/bin/ruby_executable_hooks:24:in `<main>'
Tasks: TOP => workers:run
(See full trace by running task with --trace)
/usr/lib/systemd/system/nomad.service
[Unit]
Description=HashiCorp Nomad
After=network-online.target
After=consul.service
Wants=consul.service
Requires=network-online.target
[Service]
Type=simple
ExecStart=/usr/sbin/nomad agent -config=/etc/nomad.d
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
LimitNOFILE=65536
LimitNPROC=65536
KillMode=process
KillSignal=SIGINT
[Install]
WantedBy=multi-user.target
/etc/nomad.d
addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}
bind_addr = "127.0.0.1"
client = {
  enabled = true
  network_interface = "lo"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 0
  }
}
data_dir = "/var/lib/nomad"
datacenter = "dc1"
disable_update_check = true
log_level = "INFO"
consul {
  server_auto_join = false
}
server = {
  bootstrap_expect = 1
  enabled = true
  encrypt = "WvSG0VnCVqZ8j8CnZNYfpA=="
  server_join {
    retry_join = [ "127.0.0.1" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}
Client logs:
Server logs:
@jozef-slezak I've wrapped your logs in <details> blocks to make it a little easier to navigate the issue. I'm realizing when I did so that you've got a single node that's running both the server and the client, so there's no need to provide separate logs (they're identical).
Having only a single node means that it's not possible for the server to have marked the client as lost during power shutdown, as they would both be shut down.
I see the task is a raw_exec task, and I'm wondering if the workload got orphaned somehow. Can you post a process tree for the process (ex. ps afx)?
After you've got that, but before continuing to debug, I would recommend shutting down Nomad (not the host) and verifying that there are no tasks still running that could be binding to that port (check with netstat). At this point you already know the host is in a weird state, so we should make sure the previous problem isn't influencing any further tests.
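A sketch of that verification sequence, using the unit name and ports that appear elsewhere in this thread:

sudo systemctl stop nomad
# No executors or workloads should survive the stop:
ps afx | grep -E 'nomad|puma|IDispatcher' | grep -v grep
# Nothing should still be bound to the task ports:
netstat -tlnp | grep -E ':(3200|3201|3400)'
# (ss -tlnp is the modern equivalent if netstat is not installed)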
ps afx (reproduced again)
nomad alloc status ca0da6da-6856-f79a-6219-662bb1c130fb
ID = ca0da6da-6856-f79a-6219-662bb1c130fb
Eval ID = 66435672
Name = idispatcher.dispatcher[0]
Node ID = 9ee95205
Node Name = demo-inno77
Job ID = idispatcher
Job Version = 1
Client Status = complete
Client Description = All tasks have completed
Desired Status = run
Desired Description = <none>
Created = 1h9m ago
Modified = 8m49s ago
Deployment ID = cc9abc15
Deployment Health = healthy
Task "dispatcher" is "dead"
Task Resources
CPU Memory Disk Addresses
87/100 MHz 254 MiB/100 MiB 150 MiB DISPATCHER: 127.0.0.1:21233
Task Events:
Started At = 2020-09-16T13:53:24Z
Finished At = 2020-09-16T14:53:55Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2020-09-16T14:53:55Z Killed Task successfully killed
2020-09-16T14:53:55Z Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"
2020-09-16T14:53:55Z Killing Sent interrupt. Waiting 30s before force killing
2020-09-16T14:53:12Z Started Task started by client
2020-09-16T14:53:11Z Task Setup Building Task Directory
2020-09-16T14:20:38Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline
2020-09-16T14:15:42Z Started Task started by client
2020-09-16T14:15:38Z Task Setup Building Task Directory
2020-09-16T14:15:37Z Received Task received by client
2020-09-16T14:01:42Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline
[demo@demo-inno77 ~]$ ps aux | grep 21233
e 2112 0.0 0.0 115412 656 ? S 14:53 0:00 /bin/bash -c cd /opte/natives; exec &> >(tee -i -a /var/log/e/dispatcher-127.0.0.1-21233.log); exec ./IDispatcher -c /etcconf.d/idispatcher.xml -d 1
e 2117 0.0 0.0 108056 668 ? S 14:53 0:00 tee -i -a /var/log/e/dispatcher-127.0.0.1-21233.log
demo 4163 0.0 0.0 112816 972 pts/0 S+ 15:03 0:00 grep --color=auto 21233
[demo@demo-inno77 ~]$ ps afx 2117
PID TTY STAT TIME COMMAND
2 ? S 0:00 [kthreadd]
4 ? S< 0:00 \_ [kworker/0:0H]
6 ? S 0:00 \_ [ksoftirqd/0]
7 ? S 0:00 \_ [migration/0]
8 ? S 0:00 \_ [rcu_bh]
9 ? S 0:01 \_ [rcu_sched]
10 ? S< 0:00 \_ [lru-add-drain]
11 ? S 0:00 \_ [watchdog/0]
12 ? S 0:00 \_ [watchdog/1]
13 ? S 0:00 \_ [migration/1]
14 ? S 0:00 \_ [ksoftirqd/1]
16 ? S< 0:00 \_ [kworker/1:0H]
17 ? S 0:00 \_ [watchdog/2]
18 ? S 0:00 \_ [migration/2]
19 ? S 0:00 \_ [ksoftirqd/2]
21 ? S< 0:00 \_ [kworker/2:0H]
22 ? S 0:01 \_ [watchdog/3]
23 ? S 0:00 \_ [migration/3]
24 ? S 0:00 \_ [ksoftirqd/3]
26 ? S< 0:00 \_ [kworker/3:0H]
28 ? S 0:00 \_ [kdevtmpfs]
29 ? S< 0:00 \_ [netns]
30 ? S 0:00 \_ [khungtaskd]
31 ? S< 0:00 \_ [writeback]
32 ? S< 0:00 \_ [kintegrityd]
33 ? S< 0:00 \_ [bioset]
34 ? S< 0:00 \_ [bioset]
35 ? S< 0:00 \_ [bioset]
36 ? S< 0:00 \_ [kblockd]
37 ? S< 0:00 \_ [md]
38 ? S< 0:00 \_ [edac-poller]
39 ? S< 0:00 \_ [watchdogd]
40 ? S 0:00 \_ [kworker/3:1]
45 ? S 0:00 \_ [kswapd0]
46 ? SN 0:00 \_ [ksmd]
47 ? SN 0:00 \_ [khugepaged]
48 ? S< 0:00 \_ [crypto]
56 ? S< 0:00 \_ [kthrotld]
57 ? S 0:00 \_ [kworker/u8:1]
58 ? S< 0:00 \_ [kmpath_rdacd]
59 ? S< 0:00 \_ [kaluad]
61 ? S< 0:00 \_ [kpsmoused]
63 ? S< 0:00 \_ [ipv6_addrconf]
76 ? S< 0:00 \_ [deferwq]
113 ? S 0:00 \_ [kauditd]
229 ? S 0:00 \_ [kworker/1:1]
293 ? S< 0:00 \_ [ata_sff]
306 ? S 0:00 \_ [scsi_eh_0]
307 ? S< 0:00 \_ [scsi_tmf_0]
308 ? S 0:00 \_ [scsi_eh_1]
309 ? S< 0:00 \_ [scsi_tmf_1]
311 ? S 0:00 \_ [kworker/u8:3]
312 ? S 0:00 \_ [scsi_eh_2]
313 ? S< 0:00 \_ [scsi_tmf_2]
342 ? S< 0:00 \_ [kworker/0:1H]
346 ? S< 0:00 \_ [kworker/1:1H]
386 ? S< 0:00 \_ [kdmflush]
387 ? S< 0:00 \_ [bioset]
398 ? S< 0:00 \_ [kdmflush]
399 ? S< 0:00 \_ [bioset]
412 ? S< 0:00 \_ [bioset]
413 ? S< 0:00 \_ [xfsalloc]
414 ? S< 0:00 \_ [xfs_mru_cache]
415 ? S< 0:00 \_ [xfs-buf/dm-0]
416 ? S< 0:00 \_ [xfs-data/dm-0]
417 ? S< 0:00 \_ [xfs-conv/dm-0]
418 ? S< 0:00 \_ [xfs-cil/dm-0]
419 ? S< 0:00 \_ [xfs-reclaim/dm-]
420 ? S< 0:00 \_ [xfs-log/dm-0]
421 ? S< 0:00 \_ [xfs-eofblocks/d]
422 ? S 0:00 \_ [xfsaild/dm-0]
453 ? S 0:00 \_ [kworker/1:2]
612 ? S 0:00 \_ [kworker/0:3]
627 ? S< 0:00 \_ [xfs-buf/sda1]
628 ? S< 0:00 \_ [xfs-data/sda1]
629 ? S< 0:00 \_ [xfs-conv/sda1]
630 ? S< 0:00 \_ [xfs-cil/sda1]
631 ? S< 0:00 \_ [xfs-reclaim/sda]
632 ? S< 0:00 \_ [xfs-log/sda1]
633 ? S< 0:00 \_ [xfs-eofblocks/s]
634 ? S 0:00 \_ [xfsaild/sda1]
714 ? S< 0:00 \_ [kworker/2:1H]
1047 ? S 0:00 \_ [kworker/2:2]
1070 ? S< 0:00 \_ [kworker/3:1H]
3626 ? S 0:00 \_ [kworker/3:0]
3627 ? S 0:00 \_ [kworker/0:0]
3912 ? S 0:00 \_ [kworker/2:0]
4143 ? S 0:00 \_ [kworker/3:2]
4144 ? S 0:00 \_ [kworker/0:1]
1 ? Ss 0:01 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
503 ? Ss 0:00 /usr/lib/systemd/systemd-journald
528 ? Ss 0:00 /usr/sbin/lvmetad -f
539 ? Ss 0:00 /usr/lib/systemd/systemd-udevd
657 ? S<sl 0:00 /sbin/auditd
680 ? Ssl 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
683 ? Ss 0:00 /usr/sbin/irqbalance --foreground
685 ? Ss 0:00 /usr/lib/systemd/systemd-logind
692 ? Ssl 0:00 /usr/lib/polkit-1/polkitd --no-debug
693 ? Ss 0:00 /usr/sbin/abrtd -d -s
694 ? Ss 0:00 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption
699 ? Ss 0:00 /usr/sbin/crond -n
703 ? Ss 0:00 /usr/sbin/acpid
711 ? Ss 0:00 login -- demo
3756 tty1 Ss+ 0:00 \_ -bash
3795 tty1 S+ 0:00 \_ /bin/bash /home/demo/.menu.sh
713 ? Ss 0:00 /usr/sbin/ntpd -u ntp:ntp -g
733 ? Ssl 0:00 /usr/bin/python2 -Es /usr/sbin/firewalld --nofork --nopid
735 ? Ssl 0:00 /usr/sbin/NetworkManager --no-daemon
860 ? S 0:00 \_ /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-enp0s3.pid -lf /var/lib/NetworkManager/dhclient-95ec9
869 ? S 0:00 \_ /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/var/run/NetworkManager/dnsmasq.pid -
1041 ? Ssl 0:00 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
1044 ? Ss 0:00 /usr/sbin/sshd -D
4061 ? Ss 0:00 \_ sshd: demo [priv]
4065 ? S 0:00 \_ sshd: demo@pts/0
4066 pts/0 Ss 0:00 \_ -bash
4284 pts/0 R+ 0:00 \_ ps afx 2117
1045 ? Ssl 0:00 /usr/sbin/rsyslogd -n
1046 ? Ssl 0:00 /usr/bin/redis-server 127.0.0.1:6379
1048 ? Ssl 0:14 /usr/sbin/consul agent -config-dir=/etc/consul.d -ui-content-path=/consul/ui
1049 ? Ssl 0:02 /usr/sbin/traefik --consul --consul.endpoint=127.0.0.1:8500
1054 ? Ssl 0:24 /usr/sbin/nomad agent -config=/etc/nomad.d
1947 ? Sl 0:01 \_ /usr/sbin/nomad logmon
1970 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2000 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2035 ? Ssl 0:12 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/3954c06e-8fbc-c467-e2f0-c15fb0863f5a/jaeger-collector/exec
2063 ? Sl 0:00 | \_ /usr/bin/jaeger-all-in-one --memory.max-traces=10000
2069 ? S 0:00 | \_ /bin/bash -c exec &> >(tee -i -a /var/log/e/jaeger-collector-127.0.0.1-14267.log); exec /usr/bin/jaeger-all-in-one -
2070 ? S 0:00 | \_ tee -i -a /var/log/e/jaeger-collector-127.0.0.1-14267.log
2088 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/ca0da6da-6856-f79a-6219-662bb1c130fb/dispatcher/executor.o
2110 ? Sl 0:20 | \_ ./IDispatcher -c /etcconf.d/idispatcher.xml -d 1
2112 ? S 0:00 | \_ /bin/bash -c cd /opte/natives; exec &> >(tee -i -a /var/log/e/dispatcher-127.0.0.1-21233.log); exe
2117 ? S 0:00 | \_ tee -i -a /var/log/e/dispatcher-127.0.0.1-21233.log
2129 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/efbb3557-0148-9ddc-617b-8eb7eba053d6/web-webserver/exe
2158 ? Sl 0:44 | \_ puma 3.12.0 (tcp:0.0.0.0:3001) [webserver:3001]
2163 ? S 0:00 | \_ /bin/bash -c cd /optweb/current; exec &> >(tee -i -a /var/log/innrm/web_webserver-127.0.0.1-3001.lo
2167 ? S 0:00 | \_ tee -i -a /var/log/innrm/web_webserver-127.0.0.1-3001.log
2182 ? Sl 0:02 \_ /usr/sbin/nomad logmon
2191 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2197 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2218 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2228 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/010bca37-809b-4db6-057a-223bd9e0769e/web-mq-workflow-w
2246 ? Sl 0:32 | \_ puma 3.12.0 (tcp:0.0.0.0:3200) [workflow worker:3200]
2251 ? S 0:00 | \_ /bin/bash -c cd /optweb/current; exec &> >(tee -i -a /var/log/innrm/web_mq_workflow_worker-127.0.0.
2253 ? S 0:00 | \_ tee -i -a /var/log/innrm/web_mq_workflow_worker-127.0.0.1-3200.log
2236 ? Ssl 0:14 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/efb90a4d-ba33-3595-3680-f081bd428b54/web-hitlist-worke
2266 ? S 0:00 | \_ /bin/bash ./bin/production_run_webserver.sh
2293 ? S 0:00 | \_ /bin/bash -c cd /optweb/hitlist_worker; exec &> >(tee -i -a /var/log/innrm/web_hitlist_worker-127.0
2297 ? S 0:00 | | \_ tee -i -a /var/log/innrm/web_hitlist_worker-127.0.0.1-3400.log
2302 ? Sl 0:10 | \_ puma 3.12.0 (tcp:0.0.0.0:3400) [hitlist_worker]
2239 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2260 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/da054000-3d76-7eb1-8b57-5e580849e3de/web-delayed-job/e
2281 ? Sl 0:42 | \_ puma 3.12.0 (tcp:0.0.0.0:3300) [delayed_job:3300]
2284 ? S 0:00 | \_ /bin/bash -c cd /optweb/current; exec &> >(tee -i -a /var/log/innrm/web_delayed_job-127.0.0.1-3300.
2286 ? S 0:00 | \_ tee -i -a /var/log/innrm/web_delayed_job-127.0.0.1-3300.log
2287 ? Ssl 0:12 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/0027b62a-6483-5a64-d746-ceb8e8b13e86/web--worker/e
2315 ? Sl 0:32 | \_ puma 3.12.0 (tcp:0.0.0.0:3202) [ worker:3202]
2321 ? S 0:00 | \_ /bin/bash -c cd /optweb/current; exec &> >(tee -i -a /var/log/innrm/web__worker-127.0.0.1-3202.
2323 ? S 0:00 | \_ tee -i -a /var/log/innrm/web__worker-127.0.0.1-3202.log
2306 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/54c0d11a-06ea-46df-49c9-3487cd77452a/web-delayed-job-w
2334 ? Sl 0:41 | \_ puma 3.12.0 (tcp:0.0.0.0:3203) [delayed_job:3203]
2339 ? S 0:00 | \_ /bin/bash -c cd /optweb/web_customization && exec &> >(tee -i -a /var/log/innrm/WEB_DELAYED_JOB
2342 ? S 0:00 | \_ tee -i -a /var/log/innrm/WEB_DELAYED_JOB_WEB_CACHE-127.0.0.1-3203.log
2352 ? Sl 0:02 \_ /usr/sbin/nomad logmon
2356 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2376 ? Sl 0:01 \_ /usr/sbin/nomad logmon
2391 ? Sl 0:02 \_ /usr/sbin/nomad logmon
2423 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/19368a06-1e30-432b-3b25-4f8a0609b8fe/message-processor/exe
2471 ? Sl 1:04 | \_ java -Xmx1024m -Djava.net.preferIPv4Stack=true -XX:-OmitStackTraceInFastThrow -D.cluster_name=DEMO-CLUSTER -D.rest_h
2482 ? S 0:00 | \_ /bin/bash -c exec &> >(tee -i -a /var/log/e/e-service-127.0.0.1-8585.log); exec /bin/bash /opte
2488 ? S 0:00 | \_ tee -i -a /var/log/e/e-service-127.0.0.1-8585.log
2429 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/60c5e684-7314-5859-6ac1-2e66b4d2b943/fast-service/executor
2477 ? Sl 0:15 | \_ java -Xmx512m -Djava.net.preferIPv4Stack=true -XX:-OmitStackTraceInFastThrow -XX:+ExitOnOutOfMemoryError -Djava.awt.headless
2481 ? S 0:00 | \_ /bin/bash -c exec &> >(tee -i -a /var/log/e/fast-service-127.0.0.1-22907.log); exec /bin/bash /opteafi
2487 ? S 0:00 | \_ tee -i -a /var/log/e/fast-service-127.0.0.1-22907.log
2435 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/cabc6570-da2e-a842-d6de-e84874e5c6a4/matcher-numa0/executo
2480 ? Sl 0:00 | \_ ./INode -c /etcconf.d/inode0.xml -d 1
2493 ? S 0:00 | \_ /bin/bash -c cd /opte/natives; exec &> >(tee -i -a /var/log/e/matcher-127.0.0.1-25780.log); exec ./
2496 ? S 0:00 | \_ tee -i -a /var/log/e/matcher-127.0.0.1-25780.log
2458 ? Ssl 0:11 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/650c389a-686c-497c-f62b-80bd26c6a642/pre-extractor/executo
2489 ? Sl 0:15 | \_ java -Xmx512m -Djava.net.preferIPv4Stack=true -XX:-OmitStackTraceInFastThrow -XX:+ExitOnOutOfMemoryError -Djava.awt.headless
2497 ? S 0:00 | \_ /bin/bash -c exec &> >(tee -i -a /var/log/e/fast-service-pre-extractor-127.0.0.1-20189.log); exec /bin/bash /opt/in
2504 ? S 0:00 | \_ tee -i -a /var/log/e/fast-service-pre-extractor-127.0.0.1-20189.log
3055 ? Sl 0:01 \_ /usr/sbin/nomad logmon
3076 ? Ssl 0:10 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/bf0705b4-7ad5-3b35-8071-31642cfb21ac/activemq/executor.out
3085 ? Sl 0:28 | \_ /usr/bin/java -Xmx2048M -Dorg.apache.activemq.UseDedicatedTaskRunner=true -Djava.util.logging.config.file=logging.properties
3086 ? S 0:00 | \_ /bin/bash -c exec &> >(tee -i -a /var/log/activemq/activemq-127.0.0.1-61616.log); exec /bin/bash /optextern
3089 ? S 0:00 | \_ tee -i -a /var/log/activemq/activemq-127.0.0.1-61616.log
3094 ? Sl 0:01 \_ /usr/sbin/nomad logmon
3113 ? Ssl 0:10 \_ /usr/sbin/nomad executor {"LogFile":"/var/libnomad/alloc/b33dc2b6-4cdd-9c1a-09ca-9124ff41e52b/monitoring/execu
3126 ? Sl 0:07 \_ java -Xms64m -Xmx256m -Djava.net.preferIPv4Stack=true -XX:-OmitStackTraceInFastThrow -XX:+ExitOnOutOfMemoryError -Djava.awt.
3129 ? S 0:00 \_ /bin/bash -c exec &> >(tee -i -a /var/log/innrm/monitoring-127.0.0.1-8787.log); exec /bin/bash /optabi
3133 ? S 0:00 \_ tee -i -a /var/log/innrm/monitoring-127.0.0.1-8787.log
1096 ? S 0:00 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
1158 ? Ss 0:00 \_ postgres: logger process
1614 ? Ss 0:00 \_ postgres: checkpointer process
1615 ? Ss 0:00 \_ postgres: writer process
1616 ? Ss 0:00 \_ postgres: wal writer process
1617 ? Ss 0:00 \_ postgres: autovacuum launcher process
1618 ? Ss 0:00 \_ postgres: stats collector process
2702 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57638) idle
2703 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57640) idle
2704 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57642) idle
2706 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57644) idle
2707 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57646) idle
2708 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57648) idle
2714 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57650) idle
2717 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57652) idle
2965 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57892) idle
2966 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57894) idle
2967 ? Ss 0:00 \_ postgres: e e 127.0.0.1(57896) idle
3022 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58020) idle
3023 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58022) idle
3024 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58024) idle
3025 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58026) idle
3026 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58028) idle
3037 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58052) idle
3386 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58366) idle
3400 ? Ss 0:00 \_ postgres: e e 127.0.0.1(58368) idle
3937 ? Ss 0:00 \_ postgres: e e 127.0.0.1(60100) idle
4230 ? Ss 0:00 \_ postgres: e e 127.0.0.1(33442) idle
4231 ? Ss 0:00 \_ postgres: e e 127.0.0.1(33444) idle
1370 ? Ss 0:00 /usr/libexec/postfix/master -w
1394 ? S 0:00 \_ pickup -l -t unix -u
1395 ? S 0:00 \_ qmgr -l -t unix -u
The process is definitely orphaned. What about the wrapping shell script? Could it cause some trouble? Basically, every time I power off I can reproduce the problem, but with a different job.
I am going to reproduce that again using your advice: (a) shut down Nomad (not the host), (b) check netstat, (c) power off.
I stopped Nomad using systemctl and then killed all remaining nomad, java and puma processes. The process tree looks like this and the ports are not open anymore.
ps afx 2117
PID TTY STAT TIME COMMAND
2 ? S 0:00 [kthreadd]
4 ? S< 0:00 \_ [kworker/0:0H]
6 ? S 0:00 \_ [ksoftirqd/0]
7 ? S 0:00 \_ [migration/0]
8 ? S 0:00 \_ [rcu_bh]
9 ? S 0:03 \_ [rcu_sched]
10 ? S< 0:00 \_ [lru-add-drain]
11 ? S 0:00 \_ [watchdog/0]
12 ? S 0:00 \_ [watchdog/1]
13 ? S 0:00 \_ [migration/1]
14 ? S 0:00 \_ [ksoftirqd/1]
16 ? S< 0:00 \_ [kworker/1:0H]
17 ? S 0:00 \_ [watchdog/2]
18 ? S 0:00 \_ [migration/2]
19 ? S 0:00 \_ [ksoftirqd/2]
21 ? S< 0:00 \_ [kworker/2:0H]
22 ? S 0:01 \_ [watchdog/3]
23 ? S 0:00 \_ [migration/3]
24 ? S 0:00 \_ [ksoftirqd/3]
26 ? S< 0:00 \_ [kworker/3:0H]
28 ? S 0:00 \_ [kdevtmpfs]
29 ? S< 0:00 \_ [netns]
30 ? S 0:00 \_ [khungtaskd]
31 ? S< 0:00 \_ [writeback]
32 ? S< 0:00 \_ [kintegrityd]
33 ? S< 0:00 \_ [bioset]
34 ? S< 0:00 \_ [bioset]
35 ? S< 0:00 \_ [bioset]
36 ? S< 0:00 \_ [kblockd]
37 ? S< 0:00 \_ [md]
38 ? S< 0:00 \_ [edac-poller]
39 ? S< 0:00 \_ [watchdogd]
45 ? S 0:00 \_ [kswapd0]
46 ? SN 0:00 \_ [ksmd]
47 ? SN 0:00 \_ [khugepaged]
48 ? S< 0:00 \_ [crypto]
56 ? S< 0:00 \_ [kthrotld]
57 ? S 0:00 \_ [kworker/u8:1]
58 ? S< 0:00 \_ [kmpath_rdacd]
59 ? S< 0:00 \_ [kaluad]
61 ? S< 0:00 \_ [kpsmoused]
63 ? S< 0:00 \_ [ipv6_addrconf]
76 ? S< 0:00 \_ [deferwq]
113 ? S 0:00 \_ [kauditd]
293 ? S< 0:00 \_ [ata_sff]
306 ? S 0:00 \_ [scsi_eh_0]
307 ? S< 0:00 \_ [scsi_tmf_0]
308 ? S 0:00 \_ [scsi_eh_1]
309 ? S< 0:00 \_ [scsi_tmf_1]
311 ? S 0:00 \_ [kworker/u8:3]
312 ? S 0:00 \_ [scsi_eh_2]
313 ? S< 0:00 \_ [scsi_tmf_2]
342 ? S< 0:00 \_ [kworker/0:1H]
346 ? S< 0:00 \_ [kworker/1:1H]
386 ? S< 0:00 \_ [kdmflush]
387 ? S< 0:00 \_ [bioset]
398 ? S< 0:00 \_ [kdmflush]
399 ? S< 0:00 \_ [bioset]
412 ? S< 0:00 \_ [bioset]
413 ? S< 0:00 \_ [xfsalloc]
414 ? S< 0:00 \_ [xfs_mru_cache]
415 ? S< 0:00 \_ [xfs-buf/dm-0]
416 ? S< 0:00 \_ [xfs-data/dm-0]
417 ? S< 0:00 \_ [xfs-conv/dm-0]
418 ? S< 0:00 \_ [xfs-cil/dm-0]
419 ? S< 0:00 \_ [xfs-reclaim/dm-]
420 ? S< 0:00 \_ [xfs-log/dm-0]
421 ? S< 0:00 \_ [xfs-eofblocks/d]
422 ? S 0:00 \_ [xfsaild/dm-0]
453 ? S 0:00 \_ [kworker/1:2]
612 ? S 0:00 \_ [kworker/0:3]
627 ? S< 0:00 \_ [xfs-buf/sda1]
628 ? S< 0:00 \_ [xfs-data/sda1]
629 ? S< 0:00 \_ [xfs-conv/sda1]
630 ? S< 0:00 \_ [xfs-cil/sda1]
631 ? S< 0:00 \_ [xfs-reclaim/sda]
632 ? S< 0:00 \_ [xfs-log/sda1]
633 ? S< 0:00 \_ [xfs-eofblocks/s]
634 ? S 0:00 \_ [xfsaild/sda1]
714 ? S< 0:00 \_ [kworker/2:1H]
1047 ? S 0:00 \_ [kworker/2:2]
1070 ? S< 0:00 \_ [kworker/3:1H]
3912 ? S 0:00 \_ [kworker/2:0]
4353 ? S 0:00 \_ [kworker/3:1]
4393 ? S 0:00 \_ [kworker/1:0]
4494 ? S 0:00 \_ [kworker/0:1]
4540 ? S 0:00 \_ [kworker/3:0]
4611 ? S 0:00 \_ [kworker/0:0]
4653 ? S 0:00 \_ [kworker/3:2]
4663 ? S 0:00 \_ [kworker/2:1]
1 ? Ss 0:01 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
503 ? Ss 0:01 /usr/lib/systemd/systemd-journald
528 ? Ss 0:00 /usr/sbin/lvmetad -f
539 ? Ss 0:00 /usr/lib/systemd/systemd-udevd
657 ? S<sl 0:00 /sbin/auditd
680 ? Ssl 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
683 ? Ss 0:00 /usr/sbin/irqbalance --foreground
685 ? Ss 0:00 /usr/lib/systemd/systemd-logind
692 ? Ssl 0:00 /usr/lib/polkit-1/polkitd --no-debug
693 ? Ss 0:00 /usr/sbin/abrtd -d -s
694 ? Ss 0:00 /usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected ernel BUG at list_del corruption list_add corruption do_IRQ: stack overflow: ear stack overflow (cur: eneral p
699 ? Ss 0:00 /usr/sbin/crond -n
703 ? Ss 0:00 /usr/sbin/acpid
711 ? Ss 0:00 login -- demo
3756 tty1 Ss+ 0:00 \_ -bash
3795 tty1 S+ 0:00 \_ /bin/bash /home/demo/.menu.sh
713 ? Ss 0:00 /usr/sbin/ntpd -u ntp:ntp -g
733 ? Ssl 0:00 /usr/bin/python2 -Es /usr/sbin/firewalld --nofork --nopid
735 ? Ssl 0:00 /usr/sbin/NetworkManager --no-daemon
860 ? S 0:00 \_ /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-enp0s3.pid -lf /var/lib/NetworkManager/dhclient-95ec9a00-abd1-4be9-a1e7-6f6d34d6cce7-enp0s3.lease -cf /var/lib/NetworkManager/dhcli
869 ? S 0:00 \_ /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/var/run/NetworkManager/dnsmasq.pid --listen-address=127.0.0.1 --cache-size=400 --clear-on-reload --conf-file=/dev/
1041 ? Ssl 0:00 /usr/bin/python2 -Es /usr/sbin/tuned -l -P
1044 ? Ss 0:00 /usr/sbin/sshd -D
4061 ? Ss 0:00 \_ sshd: demo [priv]
4065 ? S 0:00 \_ sshd: demo@pts/0
4066 pts/0 Ss 0:00 \_ -bash
4676 pts/0 R+ 0:00 \_ ps afx 2117
1045 ? Ssl 0:00 /usr/sbin/rsyslogd -n
1046 ? Ssl 0:01 /usr/bin/redis-server 127.0.0.1:6379
1048 ? Ssl 0:20 /usr/sbin/consul agent -config-dir=/etc/consul.d -ui-content-path=/consul/ui
1049 ? Ssl 0:03 /usr/sbin/traefik --consul --consul.endpoint=127.0.0.1:8500
1096 ? S 0:00 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
1158 ? Ss 0:00 \_ postgres: logger process
1614 ? Ss 0:00 \_ postgres: checkpointer process
1615 ? Ss 0:00 \_ postgres: writer process
1616 ? Ss 0:00 \_ postgres: wal writer process
1617 ? Ss 0:00 \_ postgres: autovacuum launcher process
1618 ? Ss 0:00 \_ postgres: stats collector process
1370 ? Ss 0:00 /usr/libexec/postfix/master -w
1394 ? S 0:00 \_ pickup -l -t unix -u
1395 ? S 0:00 \_ qmgr -l -t unix -u
2063 ? Sl 0:01 /usr/bin/jaeger-all-in-one --memory.max-traces=10000
2069 ? S 0:00 \_ /bin/bash -c exec &> >(tee -i -a /var/log/e/jaeger-collector-127.0.0.1-14267.log); exec /usr/bin/jaeger-all-in-one --memory.max-traces=10000
2070 ? S 0:00 \_ tee -i -a /var/log/e/jaeger-collector-127.0.0.1-14267.log
Then I powered off again => same behaviour: an orphaned process from a different job.
Application log:
Puma starting in single mode...
* Version 3.12.0 (ruby 2.6.3-p62), codename: Llamas in Pajamas
* Min threads: 0, max threads: 16
* Environment: production
Warning: External plugins not supported in local idkit library
web_vars.yml: config/web_vars.yml
database.yml: config/database.yml
* Listening on tcp://0.0.0.0:3400
/opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:272:in `initialize': Address already in use - bind(2) for "0.0.0.0" port 3400 (Errno::EADDRINUSE)
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:272:in `new'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:272:in `add_tcp_listener'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:105:in `block in parse'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:88:in `each'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/binder.rb:88:in `parse'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/runner.rb:151:in `load_and_bind'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/single.rb:96:in `run'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/launcher.rb:184:in `run'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/lib/puma/cli.rb:78:in `run'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/gems/puma-3.12.0/bin/puma:10:in `<top (required)>'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/bin/puma:23:in `load'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/bin/puma:23:in `<main>'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/bin/ruby_executable_hooks:24:in `eval'
from /opt/web/hitlist_worker/vendor/bundle/ruby/2.6.0/bin/ruby_executable_hooks:24:in `<main>'
raw_exec starts processes using exec like this. Not sure if it is OK for Nomad.
"Tasks": [
  {
    "Name": "dispatcher",
    "Driver": "raw_exec",
    "User": "e",
    "Config": {
      "command": "/bin/bash",
      "args": [
        "-c",
        "cd /opt/e/natives; exec &> >(tee -i -a /var/log/e/dispatcher-${NOMAD_IP_DISPATCHER}-${NOMAD_PORT_DISPATCHER}.log); exec ./IDispatcher -c /etc//conf.d/idispatcher.xml -d 1"
      ]
    },
exec bundle exec rake production_run:delayed_job[${WEB_PORT}]
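For reference, the wrapping pattern used by these tasks looks like this in isolation (a sketch; the log path is illustrative):

#!/bin/bash
# Send all further stdout/stderr through an appending tee. The >(...)
# process substitution forks a separate tee process that the shell does
# not wait for; it exits only when its stdin pipe closes.
exec &> >(tee -i -a /var/log/example/dispatcher.log)
# Replace this shell with the real workload, keeping the same PID, so
# signals Nomad sends to the task land on the workload directly.
exec ./IDispatcher -c /etc/conf.d/idispatcher.xml -d 1

Because of the final exec, the wrapper itself should not block signal delivery; the tee child lingers by design, but it should exit once the workload dies and the pipe closes, so on its own it should not hold the TCP port.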
Today I was able to reproduce this behavior with the jaeger-collector job (raw_exec). I am intentionally not using any wrapping shell script (like above).
{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "jaeger-collector",
  "ParentID": "",
  "Name": "jaeger-collector",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": null,
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "jaeger-collector",
      "Count": 1,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": null,
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 2,
        "Interval": 1800000000000,
        "Delay": 15000000000,
        "Mode": "fail"
      },
      "Tasks": [
        {
          "Name": "jaeger-collector",
          "Driver": "raw_exec",
          "User": "jaeger",
          "Config": {
            "command": "/usr/bin/jaeger-all-in-one",
            "args": [
              "--memory.max-traces=10000"
            ]
          },
          "Env": {
            "SPAN_STORAGE_TYPE": "memory"
          },
          "Services": [
            {
              "Name": "jaeger-collector",
              "TaskName": "",
              "PortLabel": "JAEGER_COLLECTOR",
              "AddressMode": "auto",
              "EnableTagOverride": false,
              "Tags": null,
              "CanaryTags": null,
              "Checks": [
                {
                  "Name": "Jaeger Collector Check",
                  "Type": "tcp",
                  "Command": "",
                  "Args": null,
                  "Path": "",
                  "Protocol": "",
                  "PortLabel": "",
                  "Expose": false,
                  "AddressMode": "",
                  "Interval": 5000000000,
                  "Timeout": 3000000000,
                  "InitialStatus": "",
                  "TLSSkipVerify": false,
                  "Method": "",
                  "Header": null,
                  "CheckRestart": null,
                  "GRPCService": "",
                  "GRPCUseTLS": false,
                  "TaskName": "",
                  "SuccessBeforePassing": 0,
                  "FailuresBeforeCritical": 0
                }
              ],
              "Connect": null,
              "Meta": null,
              "CanaryMeta": null
            }
          ],
          "Vault": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "MemoryMB": 100,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": [
              {
                "Mode": "",
                "Device": "",
                "CIDR": "",
                "IP": "",
                "MBits": 10,
                "DNS": null,
                "ReservedPorts": [
                  {
                    "Label": "JAEGER_COLLECTOR",
                    "Value": 14267,
                    "To": 0,
                    "HostNetwork": "default"
                  }
                ],
                "DynamicPorts": null
              }
            ],
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 2,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 30000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 150,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "Meta": null,
  "ConsulToken": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "dead",
  "StatusDescription": "",
  "Stable": true,
  "Version": 1,
  "SubmitTime": 1600334945802876700,
  "CreateIndex": 73,
  "ModifyIndex": 3225,
  "JobModifyIndex": 2767
}
Logs:
After a few minutes I needed to force an evaluation:
curl -XPOST http://localhost:4646/v1/node/d130775b-36ae-57e4-000c-7c5761ed589e/evaluate
node not found
[demo@demo-inno-77 ~]$ curl -XPOST http://localhost:4646/v1/node/9ee95205-0945-8494-15b6-74400d24799e/evaluate
{"HeartbeatTTL":0,"EvalIDs":["d927e570-94ef-ffeb-68f0-593ee70df3d1","69c7b976-febf-0ea5-1d80-55743579b1df","80673e08-ef75-d300-157d-6fb67dd7fa2c","c2678871-ebd6-d9d6-2612-ee742b92ec42","d97aa742-c2c1-b41e-b83d-81b3eaa78488","657a3073-ea0e-2bfa-317c-6a6555e3b599","759ab869-8cdc-7625-823e-8cd044df5bef","8f021acc-80ee-b362-c841-9a0027823ae8","481771d2-3f5e-46da-dbd6-4ad87ed04e8c","9c864f9c-5a81-44f3-6af4-231bb84df7a3","378d915b-1b64-2947-125f-4ce419a3f79a","f6d8209a-83c1-98bd-7b82-abc53f23f337","b7f52c37-87ba-cf70-5ba4-0ae847d92ed3","4e3a909c-fe90-2d4f-97be-d9a242c8c4ac","573365a8-d601-91cf-80cc-5221accf49c6"],"EvalCreateIndex":3316,"NodeModifyIndex":0,"Features":0,"LeaderRPCAddr":"127.0.0.1:4647","NumNodes":1,"Servers":[{"RPCAdvertiseAddr":"127.0.0.1:4647","RPCMajorVersion":1,"RPCMinorVersion":1,"Datacenter":"dc1"}],"Index":3316,"LastContact":0,"KnownLeader":false}
I am going to reproduce it using the exec driver (instead of raw_exec).
Could high CPU load at startup cause problems for Nomad?
@jozef-slezak I've wrapped your logs in a <details> block again for legibility. You've been trying to change a bunch of different things with the task drivers, but I want to narrow down the behavior rather than changing it.
The behavior you should be seeing on a poweroff is as follows: when the host comes back up, the client restarts, fails to reattach to the task processes from before the poweroff, and marks those tasks dead so they can be rescheduled.
The behavior it seems you're seeing is all the same except that when Nomad reschedules the task, the task can't run because the previous task somehow still has a resource locked up after the poweroff. This shouldn't be possible unless the process has on-disk artifacts like a PID file.
So what we need to do here is:
- the nomad alloc status of the running job, so that we can see the existing allocation before the poweroff
- a process tree (ex. ps)
- the nomad alloc logs for the rescheduled but failed allocation, so we can see what resources it's trying to grab that it can't
Here's an example of what we'd expect to see in this scenario:
$ nomad job run ./example.nomad
==> Monitoring evaluation "b2a9ee29"
Evaluation triggered by job "example"
Evaluation within deployment: "e26b15de"
Allocation "5d7bd025" created: node "97386544", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "b2a9ee29" finished with status "complete"
$ nomad alloc status 5d7
...
Recent Events:
Time Type Description
2020-09-17T09:54:31-04:00 Started Task started by client
2020-09-17T09:54:31-04:00 Task Setup Building Task Directory
2020-09-17T09:54:31-04:00 Received Task received by client
Then the host is powered off (not gracefully shut down), and restarted:
$ nomad alloc status 5d7
...
Recent Events:
Time Type Description
2020-09-17T09:55:46-04:00 Started Task started by client
2020-09-17T09:55:29-04:00 Restarting Task restarting in 16.680143874s
2020-09-17T09:55:29-04:00 Terminated Exit Code: 255, Exit Message: "Docker container exited with non-zero exit code: 255"
2020-09-17T09:54:31-04:00 Started Task started by client
2020-09-17T09:54:31-04:00 Task Setup Building Task Directory
2020-09-17T09:54:31-04:00 Received Task received by client
Note from the logs the line around "Reattachment process not found": this is where the task is being marked as lost so it can be rescheduled, and what we'd expect to see in your case. You'll want to provide the logs around the restoration process so we can debug this further.
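A sketch for pulling those restore-time lines out of the journal, assuming Nomad runs under systemd as shown earlier:

# Reattach/restore messages from the current boot:
journalctl -u nomad -b | grep -iE 'reattach|restor'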
It looks like we never got a chance to follow up here. I'm going to close this issue for now, but please feel free to re-open if you can reproduce with the requested diagnostics.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)
Operating system and Environment details
CentOS 7
Issue
DEAD job/task after power-off restart
Reproduction steps
Repeatedly power off a VirtualBox VM with Nomad and running jobs
Alloc Status
Client logs
Server logs