I created this freestyle job, but the traps are never invoked when hitting [x] to "stop" the job.
#!/bin/bash
echo "Starting $0"
echo "Listing traps"
trap -p
echo "Setting trap"
trap 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
trap 'echo SIGINT; kill $pid; exit 2;' SIGINT
echo "Listing traps again"
trap -p
echo "Sleeping"
sleep 10 &
pid=$!
echo "Waiting"
wait $pid
echo "Exit status: $?"
echo "Ending"
It looks like Jenkins is using kill -9, but it is not, since the rest of the script still executes:
Listing traps
Setting trap
Listing traps again
trap -- 'echo SIGINT; kill $pid; exit 2;' SIGINT
trap -- 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
Sleeping
Waiting
Build was aborted
Aborted by d'Anjou, Martin
Build step 'Groovy Postbuild' marked build as failure
Recording test results
Exit status: 143
Ending
Is it possible that Jenkins disables the traps?
Making this a major issue because there is no way a freestyle job can clean up after itself.
I am struggling with this as well! There is documentation which states that Jenkins uses SIGTERM to kill processes, but I too am having a hard time trapping it. One of the problems I have is that even if my script might trap the TERM, Jenkins appears to not wait for termination of the process(es) it has started. It's a bit difficult, then, to know whether the traps work or not when I cannot see the output.
You should be aware that the bash build scripts are usually invoked with -e, which may "break" your error handling. Jenkins will list all of the processes you have started, including the sleep, and send a TERM to all of them. Your sleep then fails (before you can kill it), causing the rest of the script to fail. It looks like you may have worked around that to get the "Ending" text out, but it caught me and may confuse others trying to reproduce the problem.
The "list all of the processes" part involves an environment variable called BUILD_ID. See https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller
By using a set +e (and maybe BUILD_ID=ignore – so many experiments lately) I have managed to make my script ignore TERM, which can consistently lead to an orphaned bash. Jenkins is certain the build is aborted, but the script keeps running. I can kill the script (behind Jenkins) with -9, however.
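For reference, a rough sketch of that workaround (the BUILD_ID value and the command name are just examples); note that it only hides child processes from the ProcessTreeKiller, it does not make Jenkins wait for the trap to finish:

#!/bin/bash
set +e                       # keep running even if a terminated child returns non-zero
export BUILD_ID=ignore       # children started from here on no longer carry the real BUILD_ID
trap 'echo TERM >> trap.log' TERM
some_long_command &          # hypothetical build command
pid=$!
wait "$pid"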
When the shell script starts with the shebang:
#!/bin/bash
set -o
echo $-
I get:
allexport               off
braceexpand             on
emacs                   off
errexit                 off
errtrace                off
functrace               off
hashall                 on
histexpand              off
history                 off
ignoreeof               off
interactive-comments    on
keyword                 off
monitor                 off
noclobber               off
noexec                  off
noglob                  off
nolog                   off
notify                  off
nounset                 off
onecmd                  off
physical                off
pipefail                off
posix                   off
privileged              off
verbose                 off
vi                      off
xtrace                  off
hB
When the shell script does not start with the shebang:
set -o
echo $-
I get:
+ set -o
allexport               off
braceexpand             on
emacs                   off
errexit                 on
errtrace                off
functrace               off
hashall                 on
histexpand              off
history                 off
ignoreeof               off
interactive-comments    on
keyword                 off
monitor                 off
noclobber               off
noexec                  off
noglob                  off
nolog                   off
notify                  off
nounset                 off
onecmd                  off
physical                off
pipefail                off
posix                   on
privileged              off
verbose                 off
vi                      off
xtrace                  on
+ echo ehxB
ehxB
Conclusion: Jenkins forces -ex when there is no shebang (#!/bin/bash) line, so you can control at least that part.
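In other words, with an explicit shebang you get a plain bash with neither errexit nor xtrace, and you can opt back in yourself. A minimal sketch:

#!/bin/bash
# With an explicit shebang Jenkins runs the script as-is: errexit and xtrace
# stay off unless you enable them yourself.
set -x            # opt back in to command tracing
# set -e          # opt back in to errexit (left off here so traps can still clean up)
trap 'echo cleaning up' TERM
sleep 10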
First point: changing the value of the BUILD_ID variable to bypass the tree killer is a bad idea: it changes the meaning of BUILD_ID. It would have been better to use a different variable name to express the "don't kill me" idea (hint: if the user sets DONTKILLME=true, then don't kill it).
Second point: changing BUILD_ID has no effect on the example script shown in the first comment: it seems Jenkins disables the traps. I tried setting BUILD_ID as a job parameter and through the environment injection plugin, to no avail.
Here are 2 scenarios explaining why Jenkins must not intercept the signals and must let the freestyle jobs handle their own termination:
1) the freestyle job needs a way to remove temporary files it might have created
2) the freestyle job needs a way to kill remote processes it might have created
I feel scenario 2 needs an explanation: say the freestyle job spawned a process on a remote host and then disconnected from that host. There is no way for the process tree killer to find the connection between the freestyle job's bash script and the remote process; only the freestyle job script can kill the remote job. This is why signals must be propagated and not intercepted (a sketch follows below).
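A minimal sketch of scenario 2, assuming passwordless ssh to a hypothetical remotehost and a made-up long_running_tool; the point is that only this script knows the remote PID, so only its trap can kill the remote process:

#!/bin/bash
# Start a remote process and record its PID on the remote host.
ssh remotehost 'nohup long_running_tool >/dev/null 2>&1 & echo $! > /tmp/remote.pid'
cleanup() {
    # Only this trap can reach the remote process; the ProcessTreeKiller cannot.
    ssh remotehost 'kill "$(cat /tmp/remote.pid)"'
    rm -f ./*.tmp    # scenario 1: remove local temporary files as well
}
trap cleanup TERM INT
sleep 3600 &         # placeholder for waiting on the remote results
wait $!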
After experimenting some more, it seems Jenkins cuts the ties to the child process too soon after sending the TERM signal. Sometimes, when the job runs on the master, I do see the message from the SIGTERM trap, and a lot of the time I don't. This makes it hard to tell what really happens. It looks like Jenkins simply needs to wait for the job process to close its stdout/stderr before it stops listening to the job itself.
On IRC (May 8, 2013), there was a discussion on changing SIGTERM to SIGTERM -> wait 10 sec -> SIGKILL, but I would prefer if this delay was configurable or even optional, as the clean up done by a properly behaving job could take more than 10 seconds (and it does take a few minutes in my case due to a very large amount of small files to clean up on NFS).
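For illustration only (this is not Jenkins code), here is the sequence being discussed, with the delay as a parameter N rather than a fixed 10 seconds:

#!/bin/bash
# kill-gracefully.sh <pid> [N] - send SIGTERM, allow up to N seconds of cleanup, then SIGKILL.
pid=$1
N=${2:-10}
kill -TERM "$pid"
for _ in $(seq "$N"); do
    kill -0 "$pid" 2>/dev/null || exit 0    # process exited cleanly, nothing more to do
    sleep 1
done
kill -KILL "$pid"                           # still alive after N seconds: force it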
Here are loosely related but different requests:
JENKINS-11995
JENKINS-11996
This may explain a problem I've been seeing. When a user cancels a build while a Ruby 'bundle install' operation is happening, the job exits but the bundle process goes into a zombie-ish state (not literally a zombie process but it never exits), no longer a child of the Jenkins process. I have to kill it manually, and sometimes it freaks out and consumes a lot of resources on the box as well. I'm not sure if we need a bigger/different hammer here, or what.
Jenkins leaks processes when jobs are killed. I think this is related to this issue, so instead of creating a new bug report, I am adding this comment.
To reproduce the process leak, create a new freestyle job from a fresh install, and enter this script:
#!/usr/bin/python
import signal
import time

print "Main 1"

def handler(*ignored):
    print "Ignored 1"
    time.sleep(120)
    print "Ignored 2"

print "Main 2"
signal.signal(signal.SIGTERM, handler)
print "Main 3"
time.sleep(120)
print "Main 4"
Then execute the build, and after a few seconds once the build is running, hit the red [x] button to kill the job. After the job is killed and Jenkins is done, go to the terminal and look for the python process. You should find something like this:
$ ps -efH
...
mdanjou   2154  2150  0 08:22 pts/0    00:00:00 bash
mdanjou   2531  2154 16 08:24 pts/0    00:00:36   java -jar jenkins.war
mdanjou   2601  2531  0 08:25 pts/0    00:00:00     /usr/bin/python /tmp/hudson3048464595979281901.sh
The python script is still in memory, and still executing. However, Jenkins has cut the ties to the python script.
Jenkins must not cut the ties until the script is done.
In this comment, the script is a simple example, in real project scripts, the signal handler is used to clean up temporary files, and to terminate gracefully (e.g. killing other spawned processes).
I wrote a script that you can execute periodically from cron to clean up processes orphaned by Jenkins.
There is more to it than cleaning up the orphaned processes, which by the way should be done by Jenkins and not by an external process. The way this should work is that Jenkins should send the signal (SIGTERM or SIGINT) and wait for the sub-processes to do their own cleanup. This gives the sub-processes a chance to propagate the signal to sub-sub-processes of their own (which, when you use a grid engine, might be running on other remote machines that are not even Jenkins slaves).
I modified the first shell script to write to a file during the traps: Jenkins cuts the ties too early and no files show up anywhere.
#!/bin/bash echo "Starting $0" echo "Listing traps" trap -p echo "Setting trap" trap 'echo SIGTERM | tee trap.sigterm; kill $pid; exit 15;' SIGTERM trap 'echo SIGINT | tee trap.sigint; kill $pid; exit 2;' SIGINT echo "Listing traps again" trap -p echo "Sleeping" sleep 20 & pid=$! echo "Waiting" wait $pid echo "Exit status: $?" echo "Ending"
So the SIGINT -> wait N seconds for the build process to return -> SIGKILL (with a user configurable N) would be an acceptable solution. The value of N should be configurable for each job.
I'm also affected by this issue and would highly appreciate the solution proposed by Martin d'Anjou, in which Jenkins waits (a configurable amount of time) for its children to finish.
Will this be implemented in the near future?
I see the exact same issue as described in comment-182402.
I am utilizing the execute-python-script build step to invoke pretty long-lasting Python processes (parents) that also spawn multiple sub-processes on demand, which are managed by the parent. I've implemented proper signal handling in order to clean up child processes and threads whenever the parent gets terminated. Unfortunately it looks like - as described in comment-182402 - Jenkins notifies the parent but does not wait for the parent to clean up and terminate; instead it detaches from the process, leaving it in a zombie-like state. In my case I keep finding processes sitting in futex calls, waiting for a lock on a resource that never gets unlocked.
Clean-up bash scripts are not an option, as they do not prevent the process from holding its locks, so some of the external resources locked by my script will never get freed. I see the option of making Jenkins wait for the hudson<...>.py process to gracefully terminate, and optionally force termination in case the process's cleanup lasts too long.
I'd appreciate any clues on fixing this issue.
Thanks
Is there any progress on this issue?
We are using Jenkins to start a Java-based test framework. This tool has a couple of Java shutdown hooks defined that must be executed on termination of the Java process. Due to this problem, Jenkins does not wait for the proper termination of our Java process.
Hi all,
I have the same problem, and would appreciate the solution described by Martin.
Is anybody working on implementation?
(To clarify, this comment is about the issue as reported, not any other process killing issues discussed in comments.)
Jenkins preferentially uses the java.lang.UNIXProcess.destroy(...) method of the JRE running Jenkins.
In OpenJDK 7 and up it seems to send SIGTERM, which is consistent with my observations below.
http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/27e0909d3fa0/src/solaris/native/java/lang/UNIXProcess_md.c#l722 (parameter is "false")
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/solaris/native/java/lang/UNIXProcess_md.c#l720 (parameter is "false")
http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/solaris/native/java/lang/UNIXProcess_md.c#l947
The call from Jenkins:
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/util/ProcessTree.java#L580
A build output of a very simple shell script demonstrating that SIGTERM is handled:
Building on master in workspace /var/lib/jenkins/workspace/jobname
[jobname] $ /bin/sh -xe /tmp/hudson6478022098890718097.sh
+ trap 'echo TERM' TERM
+ sleep 50
Terminated
++ echo TERM
TERM
Build was aborted
Aborted by Daniel Beck
Finished: ABORTED
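Judging from the -x trace above, the build step was essentially just the following (reconstructed from the trace, not the verbatim original):

trap 'echo TERM' TERM
sleep 50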
So check your JRE's source code or documentation to see whether/how UNIXProcess is implemented. OpenJDK (in my case OpenJDK 1.7.0.45) seems to behave.
That said, the logging of hudson.util.ProcessTree might be interesting. Log on FINER or higher.
Looks like it's SIGTERM in jdk6 as well: http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/b2317f5542ce/src/solaris/native/java/lang/UNIXProcess_md.c#l684
Something is odd. The trap is working when Jenkins is on Ubuntu 12.10 but not on CentOS 6.3
This has gone from bad to worse. I have non-concurrent builds running back to back. When the first one is killed, it somehow keeps running in the background while the next one starts in the same workspace and fails when it should have passed.
Daniel Beck: how do I set the log to FINER or higher on the process tree, and where do I look up the log? Give me URLs please, I sometimes don't understand all the jargon.
This is the Java I am using:
/usr/java/jdk1.7/bin/java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
I run jenkins as /usr/java/jdk1.7/bin/java -jar jenkins.war
This has gone from bad to worse
Unhelpful statement without mentioning the involved Jenkins versions. Which were bad, which are worse?
how do I set the Log to FINER or higher on the process tree, and where do I look up the log?
Go to http://jenkins/log, create a new log recorder (use any name), add a logger named hudson.util.ProcessTree and set level to FINER. Save. Go to the log recorder's page occasionally when the issue occurs to see what it logs.
Sorry, I should have been more useful in my comment. By worse I meant that I have found that a killed job can corrupt the current job's workspace. I have found a way to reproduce this corruption 100% of the time.
I use Jenkins 1.578 and Java SE JRE 1.7.0_45-b18, Java HotSpot 64-bit Server VM (build 24.35-b08).
I launch Jenkins on Linux RHEL 6.4 (Santiago) with java -jar jenkins.war
The job needs to be configured with the following script (it is a variation on the python script above):
#!/usr/bin/python
import signal
import time
import os

def handler(*ignored):
    time.sleep(120)
    fh = open("a_file.txt","a")
    fh.write("Handler of Build number: "+os.environ['BUILD_NUMBER'])
    fh.close()

signal.signal(signal.SIGTERM, handler)
fh = open("a_file.txt","w")
fh.write("Main of Build number: "+os.environ['BUILD_NUMBER'])
fh.close()
time.sleep(120)
Then configure the job to archive the artifact named a_file.txt
Run two jobs back to back, kill the first one shortly after it started. Leave the second one to complete until it ends normally.
The log as configured in the above comment, shows:
killAll: process=java.lang.UNIXProcess@3d7c07c9 and envs={HUDSON_COOKIE=06668ba4-b481-4a17-86b3-5f4fbd4061b2}
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Recursively killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Recursively killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Killing pid=1840
The unix process table, after the kill, shows that both jobs are still running:
mdanjou   1251   953  0 10:07 pts/30   00:01:46 java -jar jenkins.war
mdanjou   1840  1251  0 15:30 pts/30   00:00:00   /usr/bin/python /tmp/hudson6469713064377741807.sh
mdanjou   1851  1251  0 15:30 pts/30   00:00:00   /usr/bin/python /tmp/hudson1969984296722384280.sh
Both jobs are still running.
When the second job completes, examine its artifact. It contains this:
Main of Build number: 18Handler of Build number: 17
So the killed build (#17) corrupts the workspace of the running build (#18).
Makes sense. I don't see how this could be circumvented. Maybe by waiting a bit to see whether SIGTERM worked, and if not, send SIGKILL? But Jenkins uses the JRE's abstraction of "kill a Unix process" and that behavior appears to be implementation dependent.
Should be possible to write a plugin that sends SIGKILL if configured (e.g. for specific jobs only). Would that help?
Maximum flexibility, as a plugin or built-in, in my view and without regards to feasibility, would be:
Regarding the last point, I am not sure whether Jenkins is supposed to perform the post-build steps when a build is killed by the user - but it is certainly something that would help me. Perhaps this is something that could be configured?
I do not know what would belong to a plugin vs. what should be built-in.
Code changed in jenkins
User: Øyvind Harboe
Path:
src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/hudsontrigger/GerritTrigger.java
http://jenkins-ci.org/commit/gerrit-trigger-plugin/0eff041d3388cc8a2dba3367f3f0b131d19c018c
Log:
adds workaround for JENKINS-17116
Code changed in jenkins
User: Robert Sandell
Path:
src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/hudsontrigger/GerritTrigger.java
http://jenkins-ci.org/commit/gerrit-trigger-plugin/a9de6534418bbeddf0ae449bae33b0a28b510ed5
Log:
Merge pull request #224 from zylin/jenkins-17116-workaround
adds workaround for JENKINS-17116
Compare: https://github.com/jenkinsci/gerrit-trigger-plugin/compare/afa1cff24324...a9de6534418b
I am wondering, because a universal solution might not be that easy: would it be possible to have a hook {{ gracefulShutdown }} where one can provide a custom implementation before the regular {{ kill -9 }} kicks in?
I'm also having problems with this.
In particular, our nodes are running Ubuntu 14.04. We are using Jenkins to run some tests as part of the build. There are a few steps where interruption will cause communication failures, leaked temporary files gigabytes in size, and locks that are never released. Orphaned processes are very bad as well, since they could lead to a new build starting communication on the same channel before the previous one has terminated.
Like deepchip indicated, we would need a timeout parameter, since the allowed timeout for SIGTERM may be about 300sec, which is probably a lot longer than someone implementing this fix may anticipate.
I have the same issue. When canceling a job, I am trying to handle the signal inside a Python script and clean up.
Is there any workaround for this issue?
kashierez, it is possible to:
1. remove the jenkins cookie environment variable
2. run your program in background (output still can go to stdout)
3. launch another background process to check original process PID, such that when it is gone, it would kill the other child gracefully
4. in main process, wait for the other two to complete (make sure second monitor process would exit if the first background process exits)
5. take care to report proper termination status of the program
Probably not very nice, but you can script arbitrary shell commands to run this way (a rough sketch follows below). It might not be worth the effort; it wasn't for me.
Forgot to mention: this would only work on UNIX derivatives IIRC.
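A rough sketch of those steps (command and variable names are examples, and whether the ProcessTreeKiller matches on BUILD_ID or another cookie variable depends on the Jenkins version):

#!/bin/bash
unset BUILD_ID JENKINS_NODE_COOKIE        # 1. hide new children from the ProcessTreeKiller
mybuildtool > build.log 2>&1 &            # 2. run the real work in the background
workpid=$!
# 3. watchdog: when this wrapper script disappears, terminate the worker gracefully
bash -c "while kill -0 $$ 2>/dev/null; do sleep 1; done; kill -TERM $workpid" &
watchpid=$!
wait "$workpid"                           # 4. wait for the worker to finish
status=$?
kill "$watchpid" 2>/dev/null              #    stop the watchdog once the worker is done
exit "$status"                            # 5. report the worker's exit status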
I have done more experiments, and I am still not seeing the signal being delivered the way danielbeck sees it.
I started with this Java version:
The Jenkins console shows:
[freestyle-kill] $ /bin/sh -xe /tmp/hudson3073245061937599649.sh
+ trap 'echo TERM >terminated.txt' TERM
+ sleep 120
Build was aborted
Aborted by martinda
Finished: ABORTED
Observe that the script is not printing "TERM" to the file, like it does in Daniel's environment.
I also tried these Java versions:
I captured some logs using OpenJDK 1.8.0_121 on Ubuntu 16.04. In the terminal running Jenkins:
INFO: jenkins-17116/freestyle #4 aborted
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
    at hudson.Proc$LocalProc.join(Proc.java:318)
    at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:135)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:95)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:64)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
    at hudson.model.Build$BuildExecution.build(Build.java:205)
    at hudson.model.Build$BuildExecution.doRun(Build.java:162)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
    at hudson.model.Run.execute(Run.java:1720)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:404)
In the jenkins log recorder:
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
killAll: process=java.lang.UNIXProcess@2a8af379 and envs={HUDSON_COOKIE=2d16a893-7e22-4360-aad0-0931104599a5}
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Recursively killing pid=25054
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Recursively killing pid=25055
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Killing pid=25055
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Killing pid=25054
None of them traps the signal.
I have 2 patches that add support for killing launched processes with specific signals.
I'll submit them in a PR asap.
This is unfortunately a *NIX-only solution where signals are supported. For windows I don’t know what to do.
Jenkins does send a SIGTERM, but when running scripts it is sent to /bin/sh, e.g.:
/bin/sh -xe /tmp/hudson013456789.sh
Most probably /bin/sh is bash. When bash receives a SIGTERM while executing a child process, it does not relay it to the child process.
The sh script does get terminated, but the child process keeps running in the background, re-parented to another process, and I guess Jenkins can't find it anymore.
A fix when you have a single command is to prefix it with exec. E.g. instead of:
somebuildtool
do:
exec somebuildtool
/bin/sh will be replaced by somebuildtool, which then directly receives the signal. The drawback is that you can't run any more commands after it. To do that you need to background each command, wait for it to terminate or for a signal, then resend the signal. Something like:
somebuildtool &
apid=$!
trap 'kill -SIGTERM $apid; wait $apid' SIGTERM
wait
Is this issue going to be solved?
I am unable to trap the signal. The pipeline progress log states "Sending interrupt signal to process", but although I am trapping SIGINT and also SIGTERM within my shell script, it's not working. Seems like it's sending a SIGKILL; could that be?
Julian, what is your script doing exactly? AFAIK when a build is aborted, Jenkins immediately closes the stdout/stderr connections, so your trap's messages would not be shown on the console. If you get your trap to redirect to a file, you can then tell whether it reacted properly by looking at the file on the slave.
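For example, a minimal variant of the scripts above that records the trap in a workspace file (file name is arbitrary) instead of relying on console output:

#!/bin/bash
trap 'date "+caught TERM %T" >> trap.log' TERM
trap 'date "+caught INT %T" >> trap.log' INT
sleep 120 &
pid=$!
wait "$pid"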
Docker is the reason I came here.
Note: there is a bug in Docker (as of 17.06, and present since October 2013): when you use docker run --tty, signals are not proxied to the daemon, so the signal is never forwarded and the container is left running. You can read about my findings at https://github.com/moby/moby/issues/9098#issuecomment-347536699
The way I went in Jenkins is to use:
exec docker run someimage # nothing will be run after that due to exec which replaces the shell
This way the shell script started by the Jenkins agent is replaced by the docker command (due to exec). When the agent kills the process, 'docker' receives the SIGTERM and forwards it to the daemon (note there is no --tty, which would disable that forwarding).
And in the container entry point you might need trap handlers for SIGTERM / SIGINT. A rough example is https://gerrit.wikimedia.org/r/#/c/389937/5/dockerfiles/tox/run.sh
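A rough sketch of such an entry point (the test command is a placeholder), in the same spirit as the linked run.sh:

#!/bin/bash
child=0
shutdown() {
    if [ "$child" -ne 0 ]; then
        kill -TERM "$child" 2>/dev/null    # forward the signal to the real workload
        wait "$child"
    fi
    exit 143                               # 128 + 15, conventional exit code after SIGTERM
}
trap shutdown TERM INT
run-the-tests "$@" &
child=$!
wait "$child"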
Some random mess at https://phabricator.wikimedia.org/T176747 , but I would not recommend reading it :]
An alternative is to keep the container id around, and when the build ends or gets aborted, find a way to 'docker stop' the container.
Something like:
docker run --cidfile container.pid someimage
And in a publisher (not sure it runs when a job is aborted though):
docker stop --time=5 $(cat container.pid) || /bin/true
Which would instruct Docker to stop the container.
I think one of the Jenkins Docker plugins does exactly that; the steps are executed by the Jenkins agent itself, so that is probably a bit more robust than defining those steps in a job.
Check out pull request https://github.com/jenkinsci/jenkins/pull/3414
I added code that will make Jenkins wait for process termination (for up to 30 seconds; this should be made configurable).
Behavior changes are as follows:
Note that Jenkins doesn't use SIGKILL! It uses SIGTERM, but doesn't give the process any time to handle it before closing stdin/stdout/stderr.
Hi,
Any update regarding this issue? I am facing the same issue: I am able to terminate gracefully from the command line, but when I run the job through Jenkins and abort it, it won't end gracefully.
It has to be noted that if I abort a running job from the command line, the graceful termination is visible in Jenkins as well, but not when we abort from Jenkins, which is quite weird.
It may be a good idea to create a plugin that remaps the abort button to run a script specified in each job. It could be optionally configured to run the script, then do the normal abort process. For example:
1. User abort triggered.
2. Job specific abort script runs without interrupting what the job is currently running. A timeout counter starts simultaneously.
3. At completion of the script or timeout, Jenkins checks the job status. If the job is ready to exit normally, it does so - this will allow final status other than abort. If the job is still running, the normal Jenkins abort procedure takes over.
Well, I am still pursuing the graceful-termination via SIGTERM. A stepping stone for this is ready to be merged into a library that is used by Jenkins for process management on Windows. After that has happened, a new version of that library needs to be bundled with Jenkins and then used in my pull request.
Hi Rahul,
The library alone is not enough - Jenkins needs to be recompiled to use it.
WinP is the library I was talking about and needs to get the following change applied: https://github.com/kohsuke/winp/pull/49
After that we can recompile Jenkins with the new WinP library and this change: https://github.com/jenkinsci/jenkins/pull/3414
I am really hoping that we can move faster with those changes. I am usually somewhat patient, but the perceived lack of interest from Jenkins maintainers is hard to understand.
Cool, Rahul. We've been working around it for 4 years. But now we got tired of the performance penalty the workarounds introduce for us, so I said: How hard can it be (to fix the problem)? Turns out, not hard. Just getting it done is. sigh
I believe it is going to land in the next weekly
The merge https://github.com/jenkinsci/jenkins/commit/d8eac92ee9a1c19bf145763589f1c152607bf3ed is in tag jenkins-2.141
With Jenkins 2.141, I ran the bash script and the python script and there is no change: Jenkins still leaks processes, and the signals are still not trapped by the user script. There is one difference though: the first click on the terminate button (the red [x]) does not kill the job immediately, but that seems to change nothing.
Using freestyle projects to execute bash shell scripts works fine, but cancelling a Jenkins job seems to use SIGKILL. This way the script cannot perform cleanup operations and free resources.
SIGKILL cannot be handled by shell
SIGINT/SIGTERM are not used by jenkins
Preferred: SIGINT -> wait 5 seconds -> SIGKILL
Originally reported by markusb, imported from: graceful job termination