Hi @vincenthuynh! Thanks for reporting this!
Nomad supervises exec and java tasks with an executor process (you can see this in the lines for /usr/local/bin/nomad executor). It looks like if this executor process dies unexpectedly, its child process -- the task workload -- is re-parented to PID 1 and not reaped. You can reproduce this pretty easily by running kill -9 against the executor.
This looks like a bug. If you were to do the same to the containerd-shim process running for a Docker task, you'd see that containerd remains running and becomes the parent of a new containerd-shim process when Nomad reconciles the state of the world and restarts the allocation. I'll mark this as a bug and make sure it gets into our schedule.
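For illustration only (this is not Nomad code), here is a minimal Go sketch of the orphaning behaviour described above: the parent starts a long-running child and then exits without supervising it, and ps -o pid,ppid,cmd afterwards shows the child still running with PPID 1 (or a subreaper).

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// The child stands in for the task workload; this process stands in for
	// the executor that dies unexpectedly.
	child := exec.Command("sleep", "600")
	if err := child.Start(); err != nil {
		panic(err)
	}
	fmt.Println("child pid:", child.Process.Pid)
	// The parent exits here without calling Wait; the child keeps running and
	// is re-parented to PID 1, just like the orphaned task workload.
}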
Oh, also noting here that this is reproducible on any recent version of Nomad 0.10.x as well!
Wanted to follow up here with some info. We don't expect the executor process to die unexpectedly in normal operations. And if the executor dies, I'd expect the java process's stdout|stderr to close and the process to receive a SIGPIPE signal on the next write and die (unless it has a custom signal handler).
Would love some info about how the executor dies. Would you be able to dig into the Nomad logs to find any indication of how it was killed, e.g. whether it panicked vs. got killed by the OOM killer? The cause can help us come up with some targeted remediation. Can you inspect where stdout is pointing to (e.g. sudo ls -lha /proc/18169/fd/)? Does it not emit any logs to stdout|stderr?
This reminds me of https://github.com/hashicorp/nomad/pull/5434 - I'll resurrect the PR and aim to include it to address this bug. Thanks for raising it.
We went through and reaped any run-away processes this morning. I'll keep an eye on it again after the weekend and grab some logs if I find any.
I was able to find a run-away process on one of our other nodes:
$ ps
...
nobody 30815 0.1 1.0 6982276 554908 ? Ssl Jan11 45:51 /usr/lib/jvm/jdk1.8.0_161/bin/java -server -Xmx1024m -jar local/email-gateway-1.28.0.jar
$ ls -lha /proc/30815/fd/
lrwx------ 1 nobody nogroup 64 Jan 11 11:01 0 -> /null
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 1 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stdout.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 10 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 11 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 12 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 13 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 14 -> /urandom
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 15 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunec.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 16 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/sunjce_provider.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 17 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/resources.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 18 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/charsets.jar (deleted)
l-wx------ 1 nobody nogroup 64 Jan 11 11:01 2 -> /var/lib/nomad/alloc/6ec14a7a-bc14-7f6c-d9b2-7db650e8f85c/alloc/logs/.email-service.stderr.fifo (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 3 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/rt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 4 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/ext/jfxrt.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 5 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 6 -> /local/email-gateway-1.28.0.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 11 11:01 7 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jce.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 8 -> /usr/lib/jvm/jdk1.8.0_161/jre/lib/jsse.jar (deleted)
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 9 -> /random
lr-x------ 1 nobody nogroup 64 Jan 27 15:42 91 -> /local/email-gateway-1.28.0.jar (deleted)
I will dump the nomad agent logs (in debug level!) and send it over.
Hi again. We have some more information that might help with debugging the orphaned processes:
Below is a job that uses a static port. It unexpectedly terminates and then is unable to restart on the same client due to a port conflict. The original process is left running on the client despite the parent process being killed.
Alloc id: 5b3d3abb
Recent Events:
Time Type Description
2020-02-27T11:01:46-05:00 Killing Sent interrupt. Waiting 5s before force killing
2020-02-27T11:01:46-05:00 Not Restarting Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2020-02-27T11:01:46-05:00 Terminated Exit Code: 1
2020-02-27T11:01:25-05:00 Started Task started by client
2020-02-27T11:00:59-05:00 Restarting Task restarting in 18.599244521s
2020-02-27T11:00:59-05:00 Terminated Exit Code: 1
2020-02-27T11:00:30-05:00 Started Task started by client
2020-02-27T11:00:12-05:00 Restarting Task restarting in 18.212077249s
2020-02-27T11:00:12-05:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"
2020-02-23T08:06:08-05:00 Started Task started by client
Relevant client logs:
Feb 27 11:00:11 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:11.398-0500 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=319.024µs
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.160-0500 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: tasklet timed out: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app timeout=3s
Feb 27 11:00:12 nmd-z8g1 consul[1070]: 2020/02/27 11:00:12 [INFO] agent: Synced check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: panic: send on closed channel: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: : alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: goroutine 11935 [running]:: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).handleExecWait(0xc00014c620, 0xc000468ba0, 0xc00010c640): alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:549 +0x11c: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: created by github.com/hashicorp/nomad/drivers/shared/executor.(*LibcontainerExecutor).Exec: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.631-0500 [DEBUG] client.driver_mgr.java.executor.nomad: #011/opt/gopath/src/github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:470 +0x2ae: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.635-0500 [ERROR] client.driver_mgr.java: error receiving stream from Stats executor RPC, closing stream: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app error="rpc error: code = Unavailable desc = transport is closing"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.635-0500 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /tmp/plugin732811969: connect: connection refused""
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.635-0500 [DEBUG] client.driver_mgr.java.executor: plugin process exited: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e driver=java task_name=timespent-app path=/usr/local/bin/nomad pid=25671 error="exit status 2"
Feb 27 11:00:12 nmd-z8g1 nomad[29934]: 2020-02-27T11:00:12.637-0500 [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=5b3d3abb-4175-c539-7a6d-07d6c2713d3e task=timespent-app reason="Restart within policy" delay=18.212077249s
Feb 27 11:00:12 nmd-z8g1 consul[1070]: 2020/02/27 11:00:12 [INFO] agent: Deregistered service "_nomad-task-5b3d3abb-4175-c539-7a6d-07d6c2713d3e-timespent-app-timespent-actuator-http"
Feb 27 11:00:12 nmd-z8g1 consul[1070]: 2020/02/27 11:00:12 [INFO] agent: Deregistered check "_nomad-check-96ccbf0069efd2438c488867684b1e5ffb1af85d"
Hi @tgross @notnoop, could someone shed some light on what may be causing this error with the Java driver? We tend to see the rogue processes appear on a client when an allocation has this event message.
In the meantime, we are going to stop using the Java driver until this is resolved, as these deregistered-but-still-running services are draining resources from our clients.
Thanks!
2020-02-27T11:00:12-05:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = transport is closing"
Just had the same issue for an exec task running on Nomad v0.12.5.
In case anyone is interested... alleviating it using this script until we get a proper fix:
#!/usr/bin/env bash
set -exuo pipefail

while true; do
  # find processes owned by nobody (uid 65534) whose parent is PID 1, i.e. orphaned workloads
  for pid in $(ps -eo 'ppid= uid= pid=' | egrep '^\s*1 65534' | awk '{ print $3 }'); do
    echo "killing $pid"
    ps "$pid"
    kill "$pid"
    sleep 30
    kill -9 "$pid" || true
  done
  sleep 60
done
So after a lot of fighting this issue and cleaning up after Nomad, here's some code that helps prevent the issue when you're using bash for your command and have trouble getting all its child processes to die:
cleanup() {
    # kill all processes whose parent is this process
    pkill -P $$
}

# On each of these signals: run cleanup, clear the trap, then re-raise the
# signal against ourselves so the default disposition applies. The quoting
# passes a literal "$$" into the trap body so it expands when the trap fires.
for sig in INT QUIT HUP TERM; do
    trap "
        cleanup
        trap - $sig EXIT
        kill -s $sig "'"$$"' "$sig"
done
trap cleanup EXIT
credit: https://aweirdimagination.net/2020/06/28/kill-child-jobs-on-script-exit/
I still think this should be handled better by Nomad; the process tree under the Nomad executor should hopefully be relatively easy to identify and force-kill.
Think I've seen the same issue on Nomad 1.1.2 with an exec task (on CentOS 7.4). I'll try to keep an eye out and get some more information if it happens again.
We've had this happen for a couple of years; we're currently on Nomad 1.4.2.
It happens fairly consistently with a few (like 5 out of 40) exec tasks that use a bash script, or tini running a bash script, as the command. tini as an entrypoint was a failed attempt to mitigate this problem before understanding it any better, and even though it did not help, it was left in use. At least two of the affected tasks are always the same.
This only happens when our agent certificates get rotated by Ansible, which replaces the certificate files on disk (before the old ones become invalid) and restarts those Nomad agents whose certificates changed. The restart is done one agent at a time using the Ansible service module with state=restarted, which I believe does a systemctl restart. The Nomad agents run as systemd services, using mostly the systemd service unit from the Nomad Deployment Guide.
I have rarely had this happen when doing manual restarts, but for those few tasks it happens almost every time Ansible does the restart. It is still random, though, and not 100% consistent.
Here's a pstree -p showing that the executor for PID 4199 has died:
├─nomad(3762493)─┬─nomad(505783)─┬─{nomad}(505784)
│ │ ├─{nomad}(505785)
│ │ ├─{nomad}(505786)
...
│ ├─nomad(505910)─┬─tini(505936)───.graphql-oidc-a(505943)
│ │ ├─{nomad}(505911)
│ │ ├─{nomad}(505912)
...
│ ├─nomad(505956)─┬─tini(505969)───python(505975)
│ │ ├─{nomad}(505957)
│ │ ├─{nomad}(505958)
...
|-tini(4058)---python(4199)
...
For now I've just cleaned up the current tasks, and don't have timestamps on hand for the last time this happened (to check the logs), but if there's something I should check that would help, the next time this should happen is in 2-3 weeks.
Running on RHEL 8.7.
The Nomad task for the aforementioned process that most often seems to get its executor killed is:
task "zeoserver" {
driver = "exec"
resources {
cpu = 100
memory = 512
memory_max = 768
}
volume_mount {
volume = "var"
destination = "/var/lib/plone"
read_only = false
}
env {
PLONE_ZEOSERVER_ADDRESS = "${NOMAD_ADDR_zeo}"
PLONE_VAR = "/var/lib/plone"
}
config {
command = "/usr/local/bin/tini"
args = [
"--",
"/usr/local/bin/sh",
"-c",
<<EOH
set -e
if [ ! -e /bin ]; then
ln -s /usr/local/bin /bin
fi
exec plonectl-zeoserver
EOH
]
}
artifact {
source = "${local.repository}/${var.artifact}/${var.artifact}-${var.version}.tar.gz"
destination = "/"
}
}
The /bin linking is because we are using an extremely stripped-down chroot on Nomad, since all of our artifacts are created with Nix and are mostly self-contained. As said earlier, the run-away problem also happened before tini, when we used /usr/local/bin/sh as the command.
plonectl-zeoserver at the end is itself a wrapper bash script that runs the actual exec /nix/store/.../bin/python -m plonectl.cli ... at the end.
Other run-away tasks that have been caught include Nginx, Redis and some Java apps run in the same fashion.
@zemm give the -g flag in tini a try. This will propagate any signals to all processes in the same process group, which will potentially prevent these rogue processes from being orphaned and becoming zombies.
I just realized you are using exec, which has a different behaviour compared to running the process in the background with &. Try adding the -g flag to tini and changing

exec plonectl-zeoserver

to

plonectl-zeoserver &
wait

This will ensure that the plonectl-zeoserver process stays in the same process group as your sh, so when tini propagates the Nomad signal, it is properly handled!
Anyway, my 2c!
I see in 1.4.2 that Docker orphans its containers. I am not sure if it is related to a restart of the Nomad client service (not the node itself), but we definitely see this when containers use static ports, and obviously the port is then already in use.
Note for internal HashiCorp people: see NOMAD-146 for more info from a customer.
For anybody following: we're likely picking this up next week.
Noting what seems to be another report in https://github.com/hashicorp/nomad/issues/19062
I was able to verify that when the executor process is killed, the workload process continues running and is reparented to systemd. This issue does not occur on -dev nodes because they have special logic to completely kill workloads started by the node before exiting.
There appear to be configuration options that can be passed to both Go's os/exec and libcontainer's Run calls. The syscall.SysProcAttr struct contains a Pdeathsig Signal field that requests the OS to signal any children spawned by the process when the parent dies, and provides the Signal to send to them.
NOTE: There is a significant comment embedded in the struct:
// Pdeathsig, if non-zero, is a signal that the kernel will send to
// the child process when the creating thread dies. Note that the signal
// is sent on thread termination, which may happen before process termination.
// There are more details at https://go.dev/issue/27505.
However, this seems to be the correct way to watchdog the child workload using the OS. There is prior art in containerd/go-runc, containerd/containerd (a test), and moby/moby.
For os/exec.Start, the provided Cmd has a SysProcAttr field that takes a pointer to a syscall.SysProcAttr struct.
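As a rough, Linux-only sketch (not the actual Nomad change; the command and signal here are placeholders), this is what setting Pdeathsig on an os/exec command looks like:

package main

import (
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "600") // stands in for the task workload
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Ask the kernel to deliver SIGKILL to the child if the thread that
		// created it dies (note the thread-vs-process caveat quoted above).
		Pdeathsig: syscall.SIGKILL,
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	_ = cmd.Wait()
}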
The libcontainer/configs.Config struct contains a ParentDeathSignal field that performs the same function. It appears that (l *LibcontainerExecutor) newLibcontainerConfig is where we build the Config that is ultimately run.
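And a hedged sketch of where that could be wired in on the libcontainer side; the helper name and placement below are made up, only the ParentDeathSignal field itself comes from libcontainer's configs.Config:

package executor // illustrative placement only

import (
	"github.com/opencontainers/runc/libcontainer/configs"
	"golang.org/x/sys/unix"
)

// withParentDeathSignal asks the kernel to SIGKILL the container's init
// process if its parent (the Nomad executor) dies, so the workload cannot
// outlive the executor.
func withParentDeathSignal(cfg *configs.Config) *configs.Config {
	cfg.ParentDeathSignal = int(unix.SIGKILL)
	return cfg
}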
There are further comments of note in libcontainer's newParentProcess function, echoed again later on.
Unfortunately I ran out of time to work further on this issue during this period. Hopefully these additional notes might help the next person who picks up this issue.
Nomad version
Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)
Operating system and Environment details
Debian 9.11
Issue
We are noticing that on some long-running clients there are rogue processes which have become disassociated from the Nomad process. We first discovered this when we had reports of inconsistency in our production application which could only be explained by an old version of the application running somewhere.
The rogue processes were all from Service jobs, using the Java driver.
Here is an example with a process tree from one of our Nomad clients:
VM uptime: 59+ days
Nomad agent uptime: since end of Dec 2019
The Nomad agent was not restarted/killed, which we thought could otherwise have explained the run-away processes.
The rogue processes (5 of them) are at the very bottom of the list:
Reproduction steps
N/A
Job file (if appropriate)
Nomad Client logs (if appropriate)
Nomad Server logs (if appropriate)