[JENKINS-68626] BUILD_ID=dontKillMe does not preserve processes when job is aborted

Setting the BUILD_ID environment variable in a process to dontKillMe is not preventing it from being killed by ProcessTreeKiller as described in the link.

Given a freestyle project with the following build script:

#!/bin/bash

set -x

exec 2>/tmp/abort-test.debug
exec 1>&2

trap 'echo "Got INTR"; exit 1' SIGINT
trap 'echo "Got TERM"; exit 1' SIGTERM
trap 'echo "Got HUP"; exit 1' SIGHUP

BUILD_ID=dontKillMe /usr/bin/sleep 300 &
wait
sleep 5

does not make the /usr/bin/sleep immune from the ProcessTreeKiller when the job is aborted.

Here's my proof...

Start the job running and then observe which processes are part of the job by finding which processes have the /tmp/abort-test.debug file open:

# fuser /tmp/abort-test.debug 
/tmp/abort-test.debug: 37173 37177
# ps -p37173,37177 fw
  PID TTY      STAT   TIME COMMAND
37173 ?        S      0:00 /bin/bash /tmp/jenkins759647078472167760.sh
37177 ?        S      0:00  \_ /usr/bin/sleep 300

Now find the parent of the script to find the Jenkins executor process:

# ps -ef | grep 37173
lcl_bui+ 37173 12777  0 19:18 ?00:00:00 /bin/bash /tmp/jenkins759647078472167760.sh
lcl_bui+ 37177 37173  0 19:18 ?00:00:00 /usr/bin/sleep 300
# ps -p 12777
  PID TTY  TIME CMD
12777 ?08:57:24 java

Now attach strace to the executor:

strace -o /tmp/java.strace -f -p 12777 -e trace=\!futex,sched_yield

and then kill the job from Jenkins. Wait a few seconds and then observe what's in the file strace wrote and see that it did indeed kill the /usr/bin/sleep process:

...
37152 kill(37177, SIGTERM)      = 0
37152 stat("/proc/37177/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37177/status", 0x7f3af3bc8090) = -1 ENOENT (No such file or directory)
37152 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0
37152 write(24, "\v+", 2)       = 2
37152 write(24, "\254\355\0\5sr\0\33hudson.remoting.UserRequ"..., 2859) = 2859
12949 <... read resumed>"\7\363", 8192) = 2
12949 read(0, "\254\355\0\5sr\0\30hudson.remoting.Response"..., 8192) = 2035
12949 read(0,  
37152 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0

Followed by the main job script:

37152 kill(37173, SIGTERM)      = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
12838 mprotect(0x7f3d4a8f7000, 4096, PROT_READ) = 0
12838 mprotect(0x7f3d4a8f7000, 4096, PROT_READ|PROT_WRITE) = 0
12838 mprotect(0x7f3d4a8f8000, 4096, PROT_NONE) = 0
12838 mprotect(0x7f3d4a8f8000, 4096, PROT_READ) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
12949 <... read resumed>"\6\352", 8192) = 2
12949 read(0, "\254\355\0\5sr\0\33hudson.remoting.UserRequ"..., 8192) = 1770
12949 read(0,  
37234 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0
37234 write(24, "\6+", 2)       = 2
37234 write(24, "\254\355\0\5sr\0\30hudson.remoting.Response"..., 1579) = 1579
37234 prctl(PR_SET_NAME, "pool-1-thread-4"...) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 stat("/proc/37173/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
37152 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=37173, si_uid=1101, si_status=1, si_utime=0, si_stime=0} ---
37152 restart_syscall(<... resuming interrupted stat ...> 
37174 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 37173
37152 <... restart_syscall resumed>)    = -1 ETIMEDOUT (Connection timed out)

Which we can confirm by looking at the xtrace output from the job:

# cat /tmp/abort-test.debug 
+ exec
+ trap 'echo "Got INTR"; exit 1' SIGINT
+ trap 'echo "Got TERM"; exit 1' SIGTERM
+ trap 'echo "Got HUP"; exit 1' SIGHUP
+ wait
+ BUILD_ID=dontKillMe
+ /usr/bin/sleep 300
+ sleep 5
++ echo 'Got TERM'
Got TERM
++ exit 1

The important point here is that the /usr/bin/sleep process was killed by the executor even though it had it's BUILD_ID set to dontKillMe.

Originally reported by brianjmurrell, imported from: BUILD_ID=dontKillMe does not preserve processes when job is aborted

status: Open
priority: Critical
resolution: Unresolved
imported: 2022/01/10

brianjmurrell:

Here is a further example:

Given the script:

#!/bin/bash

set -x

exec 2>/tmp/abort-test.debug
exec 1>&2

trap 'echo "Got INTR"; exit 1' SIGINT
trap 'echo "Got TERM"; exit 1' SIGTERM
trap 'echo "Got HUP"; exit 1' SIGHUP

(trap 'echo "Got TERM in subprocess"' SIGTERM; echo $BUILD_ID; export BUILD_ID=dontKillMe; echo $BUILD_ID; /usr/bin/sleep 300; sleep 5) &
wait
sleep 5

Here is what it prints when you abort the job:

# cat /tmp/abort-test.debug 
+ exec
+ trap 'echo "Got INTR"; exit 1' SIGINT
+ trap 'echo "Got TERM"; exit 1' SIGTERM
+ trap 'echo "Got HUP"; exit 1' SIGHUP
+ wait
+ trap 'echo "Got TERM in subprocess"' SIGTERM
+ echo 21
21
+ export BUILD_ID=dontKillMe
+ BUILD_ID=dontKillMe
+ echo dontKillMe
dontKillMe
+ /usr/bin/sleep 300
Terminated
+ sleep 5
++ echo 'Got TERM in subprocess'
Got TERM in subprocess
+ sleep 5
++ echo 'Got TERM'
Got TERM
++ exit 1

As you can see, even though BUILD_ID was set to dontKillMe in the subshell, it still got the TERM signal.

markewaite:

I think that you're misunderstanding the expectation of the process tree killer. I believe the intent is to allow processes to continue running after the normal end of the Jenkins job. In your case, I believe you are canceling execution of the Jenkins job and the process tree killer attempts to kill all processes because the job was canceled, not completed successfully.

In my test, I created a freestyle job that runs the step

BUILD_ID=dontKillMe sleep 313 &
exit 0

When I run a second job on that same computer:

pgrep -f sleep.*313

It reports that the sleep 313 is still running even though the Jenkins build completed several minutes before.

brianjmurrell:

So your position is that the process tree killer is only avoidable (i.e. by using BUILD_ID=dontKillMe) for the case of normal job termination and that the process tree killer is always, without except applied when a job is aborted?

If so, that's rather nasty. The writer of a job really ought to have full control of how his job is wound down on ABORT if he wants. He should not be forced to have the rug pulled out from under him without any means of avoiding it and gracefully terminating his job if that's the responsibility he wants to take on. If the process tree killer behaves as you suggest, and only provides control to a job writer to wind down his job in the normal exit case then I would say that the process tree killer is falling short of it's job and that it really ought not to behave any differently in the ABORT case than the normal job exit case.

Ultimately, as a job writer of a potentially sophisticated job, I should not have be forced into accepting that my process tree is going to be killed off without any ability to wind down my job gracefully. This is a recipe for resource leaks and nasty left-over state for anything except the most simplest of stateless tasks.

As such, I would maintain my assertion that this is a Bug and a Critical one at that.

markewaite:

I don't object to you declaring this a bug, since I believe that a bug is something that bugs someone. I don't object to your declaring that it is critical for you. Those are both judgments that are yours to make.

The challenge I expect you will have will be to persuade others to accept a pull request that changes the long-standing behavior of process tree killer. I would not accept such a pull request without it being controlled by a switch that retained the previous behavior by default and only enabled the new behavior if a switch were enabled. I expect that there are many of the existing 270 000+ Jenkins installations that depend on the current behavior in some way.

I consider it unwise in general to launch persistent processes from Jenkins jobs without using operating system services that are designed to manage those services. The systemd process on Linux, the Windows Service Manager on Windows, the init system on BSD, and launchd on macOS all seem like better choices to manage persistent processes, rather than relying on a process that persists without the help of system services designed to manage persistent processes.

brianjmurrell:

I consider it unwise in general to launch persistent processes from Jenkins jobs without using operating system services that are designed to manage those services. The systemd process on Linux, the Windows Service Manager on Windows, the init system on BSD, and launchd on macOS all seem like better choices to manage persistent processes, rather than relying on a process that persists without the help of system services designed to manage persistent processes.

I completely agree, entirely! I cannot emphasize that enough.

But trying to leave a persistent process behind is not at all what I am trying to do. I am simply trying to gracefully wind down the state that my job creates (which will leak resources and interfere with the next run of the job if not wound down gracefully) when it's interrupted by an ABORT before it can get to it's designed wind-down-on-exit.

The problem here is that the process tree killer goes off killing all of the child processes of my job before those processes can wind themselves down. I don't mind (indeed, I want) that the main job process gets a SIGTERM when a job is aborted but that's where I want the killing to stop. My main job process will catch the SIGTERM and start it's own wind-down, taking care of any children it started – in a graceful and orderly fashion rather than those processes just being killed out from under the graceful wind-down.

It does not seem at all unreasonable that BUILD_ID=dontKillMe should effect that. Frankly I don't really see why the process tree killer should behave differently on ABORT vs. natural exit of the main job step. If the developer has specified BUILD_ID=dontKillMe he has acknowledged that he doesn't want child processes killed and takes on the responsibility of doing that himself, regardless of the reason the job is ending.

basil:

I read through the long discussion here but I must admit I am still confused about what exactly the problem is. Could you provide steps to reproduce the problem from scratch, expected results, and actual results, along with a justification about why the actual results are wrong and why the expected results are desirable? If there is a cogent argument here I could accept a change in behavior, but we need more justification in order to evaluate this in the context of all Jenkins users and not the particular use case described in this ticket.

timja / jenkins-gh-issues-poc-06-18

[JENKINS-68626] BUILD_ID=dontKillMe does not preserve processes when job is aborted #6028