hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Rogue Processes #6980

Closed vincenthuynh closed 6 months ago

vincenthuynh commented 4 years ago

Nomad version

Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)

Operating system and Environment details

Debian 9.11

Issue

We noticed that on some long-running clients there were rogue processes that had become disassociated from the Nomad process. We first discovered this when we had reports of inconsistency in our production application that could only be explained by an old version of an application running somewhere.

The rogue processes were all from service jobs using the Java driver.

Here is an example process tree from one of our Nomad clients (VM uptime: 59+ days; Nomad agent running since the end of Dec 2019).

The Nomad agent was not restarted or killed, which would otherwise have been our explanation for the runaway processes.

The rogue processes (5 of them) are at the very bottom of the list:

root       726  2.2  0.1 2324544 102664 ?      Ssl   2019 1884:38 /usr/bin/dockerd -H fd:// -H tcp://127.0.0.1:2376
root       834  0.6  0.0 979068 17808 ?        Ssl   2019 586:06  \_ docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/contain
root     30890  0.2  0.0 356032  3852 ?        Sl   Jan11  43:35  |   \_ docker-containerd-shim a35d82c4ad8e3f68f06afc4cbaa293041c2491661c51b70ed0d6e53e4985e11a /var/run/docker/libcontainerd/a35d82c4ad8e3f68f06afc4cbaa293041c2491661c51b70e
consul   30909  0.0  0.0   4332   716 ?        Ss   Jan11   0:00  |   |   \_ /bin/sh -c /entrypoint.sh
consul   30934  0.0  0.0  21772  3172 ?        S    Jan11   0:00  |   |       \_ /bin/bash -x /entrypoint.sh
consul   30981  3.6  6.8 10644212 3671764 ?    Sl   Jan11 642:56  |   |           \_ /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -server -Xms4g -Xmx4g -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshol
root      1195  0.0  0.0 413372  3820 ?        Sl   Jan11   0:41  |   \_ docker-containerd-shim 82877e0bd7ecee15d090a1d8f7e7ab255756c4403c2b51591d1612ce0a191a1d /var/run/docker/libcontainerd/82877e0bd7ecee15d090a1d8f7e7ab255756c4403c2b5159
root      1213  0.9  0.1 858964 67304 ?        Ssl  Jan11 174:22  |   |   \_ filebeat -e
root      1389  0.0  0.0 415292  3896 ?        Sl   Jan14   0:08  |   \_ docker-containerd-shim 194ff7254722d84d6330b5228695b16788504fb10719ee35ae3f8dd428f0181f /var/run/docker/libcontainerd/194ff7254722d84d6330b5228695b16788504fb10719ee35
mnguyen   1407  0.0  0.0  18052  2964 ?        Ss   Jan14   0:00  |       \_ /bin/bash /opt/bin/entry_point.sh
mnguyen   1452  0.0  0.0   4504  1640 ?        S    Jan14   0:00  |           \_ /bin/sh /usr/bin/xvfb-run -n 99 --server-args=-screen 0 1360x1020x24 -ac +extension RANDR java -jar /opt/selenium/selenium-server-standalone.jar -role node -h
mnguyen   1463  0.0  0.0 250068 42996 ?        Sl   Jan14   1:07  |               \_ Xvfb :99 -screen 0 1360x1020x24 -ac +extension RANDR -nolisten tcp -auth /tmp/xvfb-run.IvXjC8/Xauthority
mnguyen   1478  0.1  0.7 19186364 399060 ?     Sl   Jan14  24:29  |               \_ java -jar /opt/selenium/selenium-server-standalone.jar -role node -hub http://selenium-hub.service.consul:4444/grid/register -nodeConfig /opt/selenium/con
mnguyen   3122  0.0  0.0 106920  7020 ?        Sl   Jan17   0:00  |                   \_ /opt/selenium/chromedriver-2.29 --port=23631
root     30871  0.0  0.0 118340  3560 ?        Sl   Jan11   0:00  \_ /usr/bin/docker-proxy -proto tcp -host-ip 10.201.20.39 -host-port 8983 -container-ip 172.17.0.2 -container-port 8983
root     30883  0.0  0.0 190408  3392 ?        Sl   Jan11   0:00  \_ /usr/bin/docker-proxy -proto udp -host-ip 10.201.20.39 -host-port 8983 -container-ip 172.17.0.2 -container-port 8983
root      1369  0.0  0.0 108480  3536 ?        Sl   Jan14   0:00  \_ /usr/bin/docker-proxy -proto tcp -host-ip 10.201.20.39 -host-port 25597 -container-ip 172.17.0.3 -container-port 25597
root      1380  0.0  0.0 190408  3388 ?        Sl   Jan14   0:00  \_ /usr/bin/docker-proxy -proto udp -host-ip 10.201.20.39 -host-port 25597 -container-ip 172.17.0.3 -container-port 25597
root       732  0.0  0.0 118640 15560 ?        Ssl   2019  10:12 /usr/bin/google_osconfig_agent
ntp        787  0.0  0.0  97896  3684 ?        Ssl   2019   5:15 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -c /run/ntp.conf.dhcp -u 106:110
consul    1030  0.8  0.1 182772 61556 ?        Ssl   2019 729:04 /usr/local/bin/consul agent -config-dir=/etc/consul
root     12377  2.7  0.1 3922380 97740 ?       Ssl   2019 1674:46 /usr/local/bin/nomad agent -config=/etc/nomad/config.hcl
root      9011  0.3  0.0 665396 17472 ?        Sl   Jan11  63:06  \_ /usr/local/bin/nomad logmon
root      9025  0.2  0.0 1100776 27644 ?       Ssl  Jan11  42:36  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/959fd52e-5658-24af-13e6-f39229e54b00/statsite/executor.out","LogLevel":"debug","FSIsolation":true}
nobody    9041  0.1  0.0 207192  2508 ?        Ss   Jan11  19:45  |   \_ /local/statsite -f local/statsite.conf
root     10058  0.3  0.0 664884 17276 ?        Sl   Jan11  63:50  \_ /usr/local/bin/nomad logmon
root     10085  0.2  0.0 1362920 27676 ?       Ssl  Jan11  43:28  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/406d6e65-8241-b879-8b9e-0dc18f06970b/cloud-config/executor.out","LogLevel":"debug","FSIsolation":true}
nobody   10102  0.7  1.4 4705208 755628 ?      Ssl  Jan11 124:37  |   \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx512m -server -jar local/cloud-config-server-1.0.0-2a64a95.jar
root     30366  0.4  0.0 1141212 18164 ?       Sl   Jan11  78:23  \_ /usr/local/bin/nomad logmon
root     30938  0.1  0.0 581804 18548 ?        Sl   Jan11  21:03  \_ /usr/local/bin/nomad docker_logger
root      1129  0.3  0.0 729268 17688 ?        Sl   Jan11  55:08  \_ /usr/local/bin/nomad logmon
root      1241  0.0  0.0 507816 18136 ?        Sl   Jan11   0:10  \_ /usr/local/bin/nomad docker_logger
root     12209  0.3  0.0 598452 18128 ?        Sl   Jan11  62:31  \_ /usr/local/bin/nomad logmon
root     12443  0.2  0.0 1305580 28372 ?       Ssl  Jan11  44:02  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/e2eacc16-de4b-416d-95c6-b4af7644672d/cobwebs-operations/executor.out","LogLevel":"debug","FSIsolation":true
nobody   12460  0.3  1.9 10270904 1021288 ?    Ssl  Jan11  57:15  |   \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx4096m -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/cobwebs-operations-gateway-7.0.31
root       569  0.3  0.0 665140 18620 ?        Sl   Jan14  44:25  \_ /usr/local/bin/nomad logmon
root      1435  0.0  0.0 509736 18656 ?        Sl   Jan14   0:04  \_ /usr/local/bin/nomad docker_logger
root     15625  0.4  0.0 655792 17268 ?        Sl   Jan21  12:45  \_ /usr/local/bin/nomad logmon
root     15638  0.2  0.0 1231848 28036 ?       Ssl  Jan21   7:56  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/f64e8b49-df3d-bad2-fb82-a11bfb549f56/gcsproxy/executor.out","LogLevel":"debug","FSIsolation":true}
nobody   15653  0.0  0.0 481244 14596 ?        Ssl  Jan21   0:11  |   \_ /local/gcsproxy -b 127.0.0.1:38080
root      2300  0.4  0.0 663732 18188 ?        Sl   Jan22   7:53  \_ /usr/local/bin/nomad logmon
root     28687  0.2  0.0 1051088 27224 ?       Ssl  16:51   0:01  \_ /usr/local/bin/nomad executor {"LogFile":"/var/lib/nomad/alloc/921c59b3-d46a-6091-699f-6e36e3a7981d/report-web-9-avatica/executor.out","LogLevel":"debug","FSIsolation":tr
nobody   28703 21.0  1.3 9999776 716224 ?      Ssl  16:51   1:54      \_ /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx4g -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -jar local/report-web-9.2.1-440a00da2-b150.jar
nobody    4928  0.4  4.5 10318440 2417572 ?    Ssl   2019 209:13 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx4096m -server -Dnetworkaddress.cache.ttl=30 -Dnewrelic.config.max_stack_trace_lines=200 -javaagent:local/newrelic/newrelic.jar -jar loc
root     24087  0.0  0.0 776188  4524 ?        Ssl   2019   7:54 /usr/sbin/collectd
root     25192  0.1  0.1 1093904 56120 ?       Ssl   2019  78:59 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /etc/stackdriver/collectd.conf -P /var/run/stackdriver-agent.pid
nobody    7514  0.9  2.3 10324996 1259880 ?    Ssl   2019 492:31 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx4096m -server -Dnetworkaddress.cache.ttl=30 -Dnewrelic.config.max_stack_trace_lines=200 -javaagent:local/newrelic/newrelic.jar -jar loc
nobody   28793  3.2  1.8 8024688 982448 ?      Ssl  Jan11 580:25 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.76.0.jar
nobody    6505  0.0  1.3 6328908 698560 ?      Ssl  Jan11  15:39 /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Dnetworkaddress.cache.ttl=30 -Xmx1024m -jar local/search-service-gateway-1.75.0.jar
nobody   18169  0.2  1.3 7307912 706768 ?      Ssl  Jan17  19:14 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.76.0.jar
nobody    6299  0.2  1.2 7307912 686040 ?      Ssl  Jan11  18:58 /usr/lib/jvm/jdk1.8.0_161/bin/java -Xmx2048m -server -Dnetworkaddress.cache.ttl=30 -javaagent:local/newrelic/newrelic.jar -jar local/auth-service-gateway-1.75.0.jar

Reproduction steps

N/A

Job file (if appropriate)

Nomad Client logs (if appropriate)

Nomad Server logs (if appropriate)

tgross commented 4 years ago

Hi @vincenthuynh! Thanks for reporting this!

Nomad supervises exec and java tasks with an executor process (you can see this in the lines for /usr/local/bin/nomad executor). It looks like if this executor process dies unexpectedly, its child process -- the task workload -- is re-parented to PID 1 and not reaped. You can reproduce this pretty easily by running kill -9 against the executor.

This looks like a bug. If you were to do the same to the container-shim process running for a Docker task, you'd see that containerd remains running and becomes the parent of a new container-shim process when Nomad reconciles the state of the world and restarts the allocation. I'll mark this as a bug and make sure it gets into our schedule.
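
A toy Go sketch of that re-parenting (illustrative only, not Nomad code): the parent below stands in for the executor, and once it exits without reaping its child, the child is adopted by PID 1 (or the nearest subreaper) and keeps running, just like the processes at the bottom of the ps output above.

package main

import (
    "fmt"
    "os"
    "os/exec"
    "time"
)

func main() {
    if len(os.Args) > 1 && os.Args[1] == "child" {
        // Stands in for the task workload.
        for {
            fmt.Println("ppid:", os.Getppid()) // flips to 1 once the parent dies
            time.Sleep(3 * time.Second)
        }
    }
    // Stands in for the executor: start the child, then die without reaping it
    // (morally the same as running `kill -9` against the executor).
    cmd := exec.Command(os.Args[0], "child")
    cmd.Stdout = os.Stdout
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    fmt.Println("started child pid:", cmd.Process.Pid)
    os.Exit(1)
}

Build the binary and run it: the parent exits immediately, and the orphaned child keeps printing ppid: 1.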

tgross commented 4 years ago

Oh, also noting here that this is reproducible on any recent version of Nomad 0.10.x as well!

notnoop commented 4 years ago

Wanted to follow up here with some info. We don't expect the executor process to die unexpectedly in normal operations. And if the executor does die, I'd expect the java process's stdout/stderr to close, and the process to receive a SIGPIPE signal on its next write and die (unless it has a custom signal handler).
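
(A toy sketch of that expected failure mode, assuming default Linux signal dispositions; illustrative, not Nomad code. A child whose only stdout reader has gone away is killed by SIGPIPE on its next write.)

package main

import (
    "fmt"
    "os/exec"
    "syscall"
)

func main() {
    // The child writes to stdout after a delay; by then the only read end of
    // its stdout pipe is closed, standing in for the dead executor's log fifo.
    cmd := exec.Command("sh", "-c", "sleep 1; echo hello; sleep 60")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        panic(err)
    }
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    stdout.Close() // drop the only reader of the child's stdout

    _ = cmd.Wait()
    ws := cmd.ProcessState.Sys().(syscall.WaitStatus)
    fmt.Println(ws.Signaled(), ws.Signal()) // true "broken pipe" (SIGPIPE)
}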

We'd love some info about how the executor dies. Would you be able to dig into the Nomad logs for any indication of how it was killed, e.g. whether it panicked or was OOM-killed? The cause would help us target a remediation. Can you inspect where its stdout is pointing (e.g. sudo ls -lha /proc/18169/fd/)? Does it emit any logs to stdout/stderr?

This reminds me of https://github.com/hashicorp/nomad/pull/5434 - I'll resurrect the PR and aim to include it to address this bug. Thanks for raising it.

vincenthuynh commented 4 years ago

We went through and reaped any runaway processes this morning. I'll keep an eye on it again after the weekend and grab some logs if I find any.

vincenthuynh commented 4 years ago

I was able to find a runaway process on one of our other nodes:

$ ps
...
nobody   30815  0.1  1.0 6982276 554908 ?      Ssl  Jan11  45:51 /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx1024m -jar local/email-gateway-1.28.0.jar

$ ls -lha /proc/30815/fd/
lrwx------ 1 nobody nogroup 64 Jan 11 11:01 0 -> /null
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 1 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stdout.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 10 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 11 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 12 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 13 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 14 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 15 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunec.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 16 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunjce_provider.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 17 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/resources.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 18 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/charsets.jar (deleted)
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 2 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stderr.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 3 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/rt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 4 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/jfxrt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 5 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 6 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 7 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jce.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 8 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jsse.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 9 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 91 -> /local/email-gateway-1.28.0.jar (deleted)

I will dump the nomad agent logs (in debug level!) and send it over.

vincenthuynh commented 4 years ago

Hi again. We have some more information that might help with debugging the orphaned processes:

Below is a job that uses a static port. It unexpectedly terminates and then is unable to restart on the same client due to a port conflict. The original process is left running on the client despite the parent process being killed.

Alloc id: 5b3d3abb

Recent Events:
Time                       Type            Description
2020-02-27T11:01:46-05:00  Killing         Sent interrupt. Waiting 5s before force killing
2020-02-27T11:01:46-05:00  Not Restarting  Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2020-02-27T11:01:46-05:00  Terminated      Exit Code: 1
2020-02-27T11:01:25-05:00  Started         Task started by client
2020-02-27T11:00:59-05:00  Restarting      Task restarting in 18.599244521s
2020-02-27T11:00:59-05:00  Terminated      Exit Code: 1
2020-02-27T11:00:30-05:00  Started         Task started by client
2020-02-27T11:00:12-05:00  Restarting      Task restarting in 18.212077249s
2020-02-27T11:00:12-05:00  Terminated      Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"
2020-02-23T08:06:08-05:00  Started         Task started by client

Relevant client logs:

Feb 27 11:00:11 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:11.398-0500 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=319.024µs
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.160-0500 [WARN]  client.alloc_runner.task_runner.task_hook.script_checks: tasklet timed out: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app timeout=3s
Feb 27 11:00:12 nmd-z8g1 consul[1070]:     2020/02/27 11:00:12 [INFO] agent: Synced check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: panic: send on closed channel: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: : alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: goroutine 11935 [running]:: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).handleExecWait(0xc00014c620, 0xc000468ba0, 0xc00010c640): alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:549 +0x11c: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: created by github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).Exec: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:470 +0x2ae: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.635-0500 [ERROR] client.driver_mgr.java: error receiving stream from Stats executor RPC, closing stream: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app error="rpc error: code = Unavailable desc = transport is closing"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.635-0500 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /tmp/plugin732811969: connect: connection refused""
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.635-0500 [DEBUG] client.driver_mgr.java.executor: plugin process exited: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app path=/usr/local/bin/nomad pid=25671 error="exit status 2"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]:     2020-02-27T11:00:12.637-0500 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app reason="Restart within policy" delay=18.212077249s
Feb 27 11:00:12 nmd-z8g1 consul[1070]:     2020/02/27 11:00:12 [INFO] agent: Deregistered service "_nomad-task-5b3d3abb-4175-c539-7a6d-07d6c2713d3e-timespent-app-timespent-actuator-http"
Feb 27 11:00:12 nmd-z8g1 consul[1070]:     2020/02/27 11:00:12 [INFO] agent: Deregistered check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"

vincenthuynh commented 4 years ago

Hi @tgross @notnoop, could someone shed some light on what may be causing this error with the Java driver? We tend to see the rogue processes appear on a client when an allocation has this event message.

In the meantime, we are going to stop using the Java driver until this is resolved, as the unregistered services are draining resources from our clients.

Thanks!

2020-02-27T11:00:12-05:00  Terminated      Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"

danlsgiga commented 4 years ago

Just had the same issue for an exec task running on Nomad v0.12.5

manveru commented 3 years ago

In case anyone is interested... we're alleviating it with this script until we get a proper fix:

#!/usr/bin/env bash
set -exuo pipefail

while true; do
  # Processes re-parented to PID 1 and owned by uid 65534 (nobody) are the
  # orphaned Nomad task workloads on our hosts.
  for pid in $(ps -eo 'ppid= uid= pid=' | egrep '^\s*1 65534' | awk '{ print $3 }'); do
    echo "killing $pid"
    ps "$pid" || continue    # it may already be gone (set -e would abort otherwise)
    kill "$pid" || continue  # ask politely first
    sleep 30
    kill -9 "$pid" || true   # then force, ignoring already-exited processes
  done

  sleep 60
done

manveru commented 3 years ago

So after a lot of fighting this issue and cleaning up after Nomad, here's some code that helps prevent the issue when you're using bash for your command and have trouble getting all its child processes to die:

cleanup() {
    # kill all processes whose parent is this process
    pkill -P $$
}

# On each fatal signal: run cleanup, clear the traps, then re-send the signal
# to ourselves so the script still exits with the correct status.
for sig in INT QUIT HUP TERM; do
  trap "
    cleanup
    trap - $sig EXIT
    kill -s $sig "'"$$"' "$sig"
done
trap cleanup EXIT

credit: https://aweirdimagination.net/2020/06/28/kill-child-jobs-on-script-exit/

I still think this should be handled better by Nomad; the process tree under the nomad executor should be relatively easy to identify and force kill. A sketch of that idea follows.
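
For what it's worth, a rough Go sketch of that identify-and-kill idea (hypothetical, and racy against PID reuse and processes forking during the walk; a robust fix would lean on process groups or cgroups instead):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"
    "syscall"
)

// children returns the PIDs whose parent is ppid, by scanning /proc/*/stat.
func children(ppid int) []int {
    var out []int
    stats, _ := filepath.Glob("/proc/[0-9]*/stat")
    for _, stat := range stats {
        b, err := os.ReadFile(stat)
        if err != nil {
            continue // the process may have exited mid-scan
        }
        // /proc/<pid>/stat is "pid (comm) state ppid ..."; comm may contain
        // spaces, so split after the last ')'.
        s := string(b)
        rest := strings.Fields(s[strings.LastIndexByte(s, ')')+1:])
        if p, _ := strconv.Atoi(rest[1]); p == ppid {
            pid, _ := strconv.Atoi(strings.TrimSuffix(strings.TrimPrefix(stat, "/proc/"), "/stat"))
            out = append(out, pid)
        }
    }
    return out
}

// killTree force-kills everything below pid, then pid itself.
func killTree(pid int) {
    for _, c := range children(pid) {
        killTree(c)
    }
    _ = syscall.Kill(pid, syscall.SIGKILL)
}

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: killtree <pid>")
        os.Exit(1)
    }
    pid, _ := strconv.Atoi(os.Args[1])
    killTree(pid)
}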

martinmcnulty commented 3 years ago

Think I've seen the same issue on Nomad 1.1.2 with an exec task (on CentOS 7.4). I'll try to keep an eye out and get some more information if it happens again.

zemm commented 1 year ago

We've had this happen for a couple of years; we're currently on Nomad 1.4.2.

It happens fairly consistently with a few (maybe 5 out of 40) exec tasks that use a bash script, or tini running a bash script, as their command. tini as an entrypoint was a failed attempt to mitigate this problem before we understood it any better; even though it did not help, it was left in use. At least two of the affected tasks are always the same ones.

This only happens when our agent certificates get rotated by Ansible, which replaces the certificate files on disk (before the old ones become invalid) and restarts the Nomad agents whose certificates changed. The restart is done one agent at a time using service=name state=restarted, which I believe does a systemctl restart. The Nomad agents run as systemd services, mostly using the systemd unit from the Nomad Deployment Guide.

I have rarely had this happen when doing manual restarts, but for those few tasks it happens almost every time Ansible does the restart. Still, it is random and not 100% consistent.

Here's a pstree -p showing that the executor for PID 4199 has died; tini(4058) and its python(4199) child now hang off the root of the tree:

           ├─nomad(3762493)─┬─nomad(505783)─┬─{nomad}(505784)
           │                │               ├─{nomad}(505785)
           │                │               ├─{nomad}(505786)
...
           │                ├─nomad(505910)─┬─tini(505936)───.graphql-oidc-a(505943)
           │                │               ├─{nomad}(505911)
           │                │               ├─{nomad}(505912)
...
           │                ├─nomad(505956)─┬─tini(505969)───python(505975)
           │                │               ├─{nomad}(505957)
           │                │               ├─{nomad}(505958)
...
           |-tini(4058)---python(4199)
...

For now I've just cleaned up the current tasks, and I don't have timestamps on hand for the last time this happened (to check the logs), but if there's something I should check that would help, the next occurrence should be in 2-3 weeks.

Running on RHEL 8.7.

zemm commented 1 year ago

The Nomad task for the aforementioned process, the one that most often seems to get its executor killed, is:

    task "zeoserver" {
      driver = "exec"
      resources {
        cpu = 100
        memory = 512
        memory_max = 768
      }
      volume_mount {
        volume = "var"
        destination = "/var/lib/plone"
        read_only = false
      }
      env {
        PLONE_ZEOSERVER_ADDRESS = "${NOMAD_ADDR_zeo}"
        PLONE_VAR = "/var/lib/plone"
      }
      config {
        command = "/usr/local/bin/tini"
        args = [
          "--",
          "/usr/local/bin/sh",
          "-c",
          <<EOH
set -e

if [ ! -e /bin ]; then
  ln -s /usr/local/bin /bin
fi

exec plonectl-zeoserver
EOH
        ]
      }
      artifact {
        source = "${local.repository}/${var.artifact}/${var.artifact}-${var.version}.tar.gz"
        destination = "/"
      }
    }

The /bin symlinking is because we use an extremely stripped-down chroot on Nomad, since all of our artifacts are created with Nix and are mostly self-contained. As mentioned earlier, the runaway problem also happened before tini, when we used /usr/local/bin/sh as the command.

plonectl-zeoserver is itself a wrapper bash script that ends with an exec of the actual /nix/store/.../bin/python -m plonectl.cli ... command.

Other runaway tasks that have been caught include Nginx, Redis, and some Java apps run in the same fashion.

danlsgiga commented 1 year ago

@zemm give the -g flag in tini a try. It propagates signals to all processes in the same process group, which should help prevent these rogue processes from being orphaned and becoming zombies.

Ref: https://github.com/krallin/tini#process-group-killing

danlsgiga commented 1 year ago

I just realized you are using exec, which behaves differently from running the process in the background with &. Try adding the -g flag to tini and changing exec plonectl-zeoserver to:

plonectl-zeoserver &   # run in the background, in the same process group as the sh
wait                   # keep the shell alive until the child exits

This will ensure that the plonectl-zeoserver process stays in the same process group as your sh, so when tini propagates the Nomad signal, it is properly handled!

Anyway, my 2c!

rismoney commented 1 year ago

I see in 1.4.2 that Docker orphans its containers. I'm not sure whether it's related to a restart of the Nomad client service (not the node itself), but we definitely see this when containers use static ports, and then the port is obviously already in use.

mikenomitch commented 1 year ago

Note for internal HashiCorp people: see NOMAD-146 for more info from a customer.

For anybody following: we're likely picking this up next week.

lgfa29 commented 1 year ago

Noting what seems to be another report in https://github.com/hashicorp/nomad/issues/19062

angrycub commented 10 months ago

Further research

The syscall.SysProcAttr struct contains a Pdeathsig Signal field that asks the OS to signal any children spawned by the process when the parent dies, and specifies the signal to send them.

NOTE: There is a significant comment embedded in the struct:

// Pdeathsig, if non-zero, is a signal that the kernel will send to
// the child process when the creating thread dies. Note that the signal
// is sent on thread termination, which may happen before process termination.
// There are more details at https://go.dev/issue/27505.

However, this seems to be the correct way to have the OS watchdog the child workload. There is prior art in containerd/go-runc, containerd/containerd (in a test), and moby/moby.

os/exec

For os/exec.Start, the provided Cmd has a SysProcAttr field that takes a pointer to a syscall.SysProcAttr struct.
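
A minimal sketch of wiring that up (an illustration under the caveat quoted above, not Nomad's actual implementation; Linux-only):

package main

import (
    "os/exec"
    "runtime"
    "syscall"
)

func main() {
    // Pdeathsig fires when the creating *thread* dies, so pin this goroutine
    // to its OS thread for the lifetime of the child (see go.dev/issue/27505).
    runtime.LockOSThread()

    cmd := exec.Command("sleep", "300") // stands in for the task workload
    cmd.SysProcAttr = &syscall.SysProcAttr{
        // Ask the kernel to SIGKILL the child if the parent thread dies,
        // even when the parent is killed with `kill -9`.
        Pdeathsig: syscall.SIGKILL,
    }
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    _ = cmd.Wait()
}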

libcontainer

The libcontainer/configs.Config struct contains a ParentDeathSignal field that performs the same function. It appears that (l *LibcontainerExecutor) newLibcontainerConfig is where we build the Config that is ultimately run.
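
A minimal sketch of the libcontainer side (field names as they appear in runc's libcontainer; the rest of the container setup is elided, so this shows only the knob itself):

package main

import (
    "fmt"
    "syscall"

    "github.com/opencontainers/runc/libcontainer/configs"
)

func main() {
    // The libcontainer equivalent of Pdeathsig: ask the kernel to deliver
    // SIGKILL to the container's init process if its parent (the executor) dies.
    cfg := &configs.Config{
        Rootfs:            "/var/lib/example/rootfs", // placeholder path
        ParentDeathSignal: int(syscall.SIGKILL),
    }
    fmt.Println("parent death signal:", cfg.ParentDeathSignal)
}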

There are further comments of note in libcontainer in the newParentProcess function:

https://github.com/opencontainers/runc/blob/ab146f23357d867f3ed9a15c4e15507f78eba3b2/libcontainer/container_linux.go#L512-L514

echoed later in

https://github.com/opencontainers/runc/blob/ab146f23357d867f3ed9a15c4e15507f78eba3b2/libcontainer/container_linux.go#L521-L525

Unfortunately, I ran out of time to work further on this issue during this period. Hopefully these additional notes will help the next person who picks it up.