I created this freestyle job, but the traps are never invoked when hitting [x] to "stop" the job.
#!/bin/bash
echo "Starting $0"
echo "Listing traps"
trap -p
echo "Setting trap"
trap 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
trap 'echo SIGINT; kill $pid; exit 2;' SIGINT
echo "Listing traps again"
trap -p
echo "Sleeping"
sleep 10 &
pid=$!
echo "Waiting"
wait $pid
echo "Exit status: $?"
echo "Ending"
It looks like Jenkins is using kill -9, but it is not, since the rest of the script still executes:
Listing traps
Setting trap
Listing traps again
trap -- 'echo SIGINT; kill $pid; exit 2;' SIGINT
trap -- 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
Sleeping
Waiting
Build was aborted
Aborted by d'Anjou, Martin
Build step 'Groovy Postbuild' marked build as failure
Recording test results
Exit status: 143
Ending
Is it possible that Jenkins disables the traps?
Making this a major issue because there is no way a freestyle job can clean up after itself.
I am struggling with this as well! There is documentation which states that Jenkins uses SIGTERM to kill processes, but I too am having a hard time trapping it. One of the problems I have is that even if my script might trap the TERM, Jenkins appears to not wait for termination of the process(es) it has started. It's a bit difficult, then, to know whether the traps work or not when I cannot see the output.
You should be aware that the bash build scripts are usually invoked with -e, which may "break" your error handling. Jenkins will list all of the processes you have started, including the sleep, and send a TERM to all of them. Your sleep then fails (before you can kill it), causing the rest of the script to fail. It looks like you may have worked around that to get the "Ending" text out, but it caught me and may confuse others trying to reproduce the problem.
The "list all of the processes" part involves an environment variable called BUILD_ID. See https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller
By using a set +e (and maybe BUILD_ID=ignore – so many experiments lately) I have managed to make my script ignore TERM, which can consistently lead to an orphaned bash. Jenkins is certain the build is aborted, but the script keeps running. I can kill the script (behind Jenkins) with -9, however.
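For reference, a rough sketch of that workaround (the BUILD_ID value and the command name are just examples); note that it only hides child processes from the ProcessTreeKiller, it does not make Jenkins wait for the trap to finish:

#!/bin/bash
set +e                       # keep running even if a terminated child returns non-zero
export BUILD_ID=ignore       # children started from here on no longer carry the real BUILD_ID
trap 'echo TERM >> trap.log' TERM
some_long_command &          # hypothetical build command
pid=$!
wait "$pid"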
When the shell script starts with the shebang:
#!/bin/bash
set -o
echo $-
I get:
allexport               off
braceexpand             on
emacs                   off
errexit                 off
errtrace                off
functrace               off
hashall                 on
histexpand              off
history                 off
ignoreeof               off
interactive-comments    on
keyword                 off
monitor                 off
noclobber               off
noexec                  off
noglob                  off
nolog                   off
notify                  off
nounset                 off
onecmd                  off
physical                off
pipefail                off
posix                   off
privileged              off
verbose                 off
vi                      off
xtrace                  off
hB
When the shell script does not start with the shebang:
set -o
echo $-
I get:
+ set -o
allexport               off
braceexpand             on
emacs                   off
errexit                 on
errtrace                off
functrace               off
hashall                 on
histexpand              off
history                 off
ignoreeof               off
interactive-comments    on
keyword                 off
monitor                 off
noclobber               off
noexec                  off
noglob                  off
nolog                   off
notify                  off
nounset                 off
onecmd                  off
physical                off
pipefail                off
posix                   on
privileged              off
verbose                 off
vi                      off
xtrace                  on
+ echo ehxB
ehxB
Conclusion: Jenkins forces -ex when there is no shebang (#!/bin/bash) line, so you can control at least that part.
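In other words, with an explicit shebang you get a plain bash with neither errexit nor xtrace, and you can opt back in yourself. A minimal sketch:

#!/bin/bash
# With an explicit shebang Jenkins runs the script as-is: errexit and xtrace
# stay off unless you enable them yourself.
set -x            # opt back in to command tracing
# set -e          # opt back in to errexit (left off here so traps can still clean up)
trap 'echo cleaning up' TERM
sleep 10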
First point: changing the value of the BUILD_ID variable to bypass the tree killer is a bad idea: it changes the meaning of BUILD_ID. It would have been better to use a different variable name to express the "don't kill me" idea (hint: if the user sets DONTKILLME=true, then don't kill it).
Second point: changing BUILD_ID has no effect on the example script shown in the first comment: it seems Jenkins disables the traps. I tried setting BUILD_ID as a job parameter and through the environment injection plugin, to no avail.
Here are 2 scenarios explaining why Jenkins must not intercept the signals and must let the freestyle jobs handle their own termination:
1) the freestyle job needs a way to remove temporary files it might have created
2) the freestyle job needs a way to kill remote processes it might have created
I feel scenario 2 needs an explanation: say the freestyle job spawned a process on a remote host and then disconnected from that host. There is no way for the process tree killer to find the connection between the freestyle job's bash script and the remote process; only the freestyle job script can kill the remote job. This is why signals must be propagated and not intercepted (a sketch follows below).
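A minimal sketch of scenario 2, assuming passwordless ssh to a hypothetical remotehost and a made-up long_running_tool; the point is that only this script knows the remote PID, so only its trap can kill the remote process:

#!/bin/bash
# Start a remote process and record its PID on the remote host.
ssh remotehost 'nohup long_running_tool >/dev/null 2>&1 & echo $! > /tmp/remote.pid'
cleanup() {
    # Only this trap can reach the remote process; the ProcessTreeKiller cannot.
    ssh remotehost 'kill "$(cat /tmp/remote.pid)"'
    rm -f ./*.tmp    # scenario 1: remove local temporary files as well
}
trap cleanup TERM INT
sleep 3600 &         # placeholder for waiting on the remote results
wait $!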
After experimenting some more, it seems Jenkins cuts the ties to the child process too soon after sending the TERM signal. Sometimes, when the job runs on the master, I do see the message from the SIGTERM trap, and a lot of the time I don't. This makes it hard to tell what really happens. It looks like Jenkins simply needs to wait for the job process to close its stdout/stderr before it stops listening to the job itself.
On IRC (May 8, 2013), there was a discussion on changing SIGTERM to SIGTERM -> wait 10 sec -> SIGKILL, but I would prefer if this delay was configurable or even optional, as the clean up done by a properly behaving job could take more than 10 seconds (and it does take a few minutes in my case due to a very large amount of small files to clean up on NFS).
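For illustration only (this is not Jenkins code), here is the sequence being discussed, with the delay as a parameter N rather than a fixed 10 seconds:

#!/bin/bash
# kill-gracefully.sh <pid> [N] - send SIGTERM, allow up to N seconds of cleanup, then SIGKILL.
pid=$1
N=${2:-10}
kill -TERM "$pid"
for _ in $(seq "$N"); do
    kill -0 "$pid" 2>/dev/null || exit 0    # process exited cleanly, nothing more to do
    sleep 1
done
kill -KILL "$pid"                           # still alive after N seconds: force it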
Here are loosely related but different requests:
JENKINS-11995
JENKINS-11996
This may explain a problem I've been seeing. When a user cancels a build while a Ruby 'bundle install' operation is happening, the job exits but the bundle process goes into a zombie-ish state (not literally a zombie process but it never exits), no longer a child of the Jenkins process. I have to kill it manually, and sometimes it freaks out and consumes a lot of resources on the box as well. I'm not sure if we need a bigger/different hammer here, or what.
Jenkins leaks processes when jobs are killed. I think this is related to this issue, so instead of creating a new bug report, I am adding this comment.
To reproduce the process leak, create a new freestyle job from a fresh install, and enter this script:
#!/usr/bin/python
import signal
import time

print "Main 1"

def handler(*ignored):
    print "Ignored 1"
    time.sleep(120)
    print "Ignored 2"

print "Main 2"
signal.signal(signal.SIGTERM, handler)
print "Main 3"
time.sleep(120)
print "Main 4"
Then execute the build, and after a few seconds once the build is running, hit the red [x] button to kill the job. After the job is killed and Jenkins is done, go to the terminal and look for the python process. You should find something like this:
$ ps -efH
...
mdanjou   2154  2150  0 08:22 pts/0    00:00:00 bash
mdanjou   2531  2154 16 08:24 pts/0    00:00:36   java -jar jenkins.war
mdanjou   2601  2531  0 08:25 pts/0    00:00:00     /usr/bin/python /tmp/hudson3048464595979281901.sh
The python script is still in memory, and still executing. However, Jenkins has cut the ties to the python script.
Jenkins must not cut the ties until the script is done.
In this comment, the script is a simple example, in real project scripts, the signal handler is used to clean up temporary files, and to terminate gracefully (e.g. killing other spawned processes).
I wrote a script that you can execute periodically from cron to clean up processes orphaned by Jenkins.
There is more to it than cleaning up the orphaned processes, which by the way should be done by Jenkins and not by an external process. The way this should work is that Jenkins should send the signal (SIGTERM or SIGINT) and wait for the sub-processes to do their own cleanup. This gives the sub-processes a chance to propagate the signal to sub-sub-processes of their own (which, when you use a grid engine, might be running on other remote machines that are not even Jenkins slaves).
I modified the first shell script to write to a file during the traps: Jenkins cuts the ties too early and no files show up anywhere.
#!/bin/bash echo "Starting $0" echo "Listing traps" trap -p echo "Setting trap" trap 'echo SIGTERM | tee trap.sigterm; kill $pid; exit 15;' SIGTERM trap 'echo SIGINT | tee trap.sigint; kill $pid; exit 2;' SIGINT echo "Listing traps again" trap -p echo "Sleeping" sleep 20 & pid=$! echo "Waiting" wait $pid echo "Exit status: $?" echo "Ending"
So the SIGINT -> wait N seconds for the build process to return -> SIGKILL (with a user configurable N) would be an acceptable solution. The value of N should be configurable for each job.
I'm also affected by this issue and would highly appreciate the solution proposed by Martin d'Anjou, in which Jenkins waits (a configurable amount of time) for its children to finish.
Will this be implemented in the near future?
I see the exact same issue as described in comment-182402.
I am utilizing the execute-python-script build step to invoke pretty long-lasting Python processes (parents) that also spawn multiple sub-processes on demand, which are managed by the parent. I've implemented proper signal handling in order to clean up child processes and threads whenever the parent gets terminated. Unfortunately it looks like - as described in comment-182402 - Jenkins notifies the parent but does not wait for the parent to clean up and terminate; instead it detaches from the process, leaving it in a zombie-like state. In my case I keep finding processes sitting in futex calls, waiting for a lock on a resource that never gets unlocked.
Clean-up bash scripts are not an option, as they do not prevent the process from holding its locks, so some of the external resources locked by my script will never get freed. I see the option of making Jenkins wait for the hudson<...>.py process to gracefully terminate, and optionally force termination in case the process's cleanup lasts too long.
I'd appreciate any clues on fixing this issue.
Thanks
Is there any progress on this issue?
We are using Jenkins to start a Java-based test framework. This tool has a couple of Java shutdown hooks defined that must be executed on termination of the Java process. Due to this problem, Jenkins does not wait for the proper termination of our Java process.
Hi all,
I have the same problem, and would appreciate the solution described by Martin.
Is anybody working on implementation?
(To clarify, this comment is about the issue as reported, not any other process killing issues discussed in comments.)
Jenkins preferentially uses the java.lang.UNIXProcess.destroy(...) method of the JRE running Jenkins.
In OpenJDK 7 and up it seems to send SIGTERM, which is consistent with my observations below.
http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/27e0909d3fa0/src/solaris/native/java/lang/UNIXProcess_md.c#l722 (parameter is "false")
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/solaris/native/java/lang/UNIXProcess_md.c#l720 (parameter is "false")
http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/solaris/native/java/lang/UNIXProcess_md.c#l947
The call from Jenkins:
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/util/ProcessTree.java#L580
A build output of a very simple shell script demonstrating that SIGTERM is handled:
Building on master in workspace /var/lib/jenkins/workspace/jobname
[jobname] $ /bin/sh -xe /tmp/hudson6478022098890718097.sh
+ trap 'echo TERM' TERM
+ sleep 50
Terminated
++ echo TERM
TERM
Build was aborted
Aborted by Daniel Beck
Finished: ABORTED
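Judging from the -x trace above, the build step was essentially just the following (reconstructed from the trace, not the verbatim original):

trap 'echo TERM' TERM
sleep 50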
So check your JRE's source code or documentation to see whether/how UNIXProcess is implemented. OpenJDK (in my case OpenJDK 1.7.0.45) seems to behave.
That said, the logging of hudson.util.ProcessTree might be interesting. Log on FINER or higher.
Looks like it's SIGTERM in jdk6 as well: http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/b2317f5542ce/src/solaris/native/java/lang/UNIXProcess_md.c#l684
Something is odd. The trap is working when Jenkins is on Ubuntu 12.10 but not on CentOS 6.3
This has gone from bad to worse. I have non-concurrent builds running back to back. When the first one is killed, it somehow keeps running in the background while the next one starts in the same workspace and fails when it should have passed.
Daniel Beck: how do I set the log to FINER or higher on the process tree, and where do I look up the log? Give me URLs please, I sometimes don't understand all the jargon.
This is the Java I am using:
/usr/java/jdk1.7/bin/java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
I run jenkins as /usr/java/jdk1.7/bin/java -jar jenkins.war
This has gone from bad to worse
Unhelpful statement without mentioning the involved Jenkins versions. Which were bad, which are worse?
how do I set the Log to FINER or higher on the process tree, and where do I look up the log?
Go to http://jenkins/log, create a new log recorder (use any name), add a logger named hudson.util.ProcessTree and set level to FINER. Save. Go to the log recorder's page occasionally when the issue occurs to see what it logs.
Sorry, I should have been more useful in my comment. By worse I meant that I have found that a killed job can corrupt the current job's workspace. I have found a way to reproduce this corruption 100% of the time.
I use Jenkins 1.578 and Java SE JRE 1.7.0_45-b18, Java HotSpot 64-bit Server VM (build 24.35-b08).
I launch Jenkins on Linux RHEL 6.4 (Santiago) with java -jar jenkins.war
The job needs to be configured with the following script (it is a variation on the python script above):
#!/usr/bin/python
import signal
import time
import os

def handler(*ignored):
    time.sleep(120)
    fh = open("a_file.txt","a")
    fh.write("Handler of Build number: "+os.environ['BUILD_NUMBER'])
    fh.close()

signal.signal(signal.SIGTERM, handler)
fh = open("a_file.txt","w")
fh.write("Main of Build number: "+os.environ['BUILD_NUMBER'])
fh.close()
time.sleep(120)
Then configure the job to archive the artifact named a_file.txt
Run two jobs back to back, kill the first one shortly after it started. Leave the second one to complete until it ends normally.
The log as configured in the above comment, shows:
killAll: process=java.lang.UNIXProcess@3d7c07c9 and envs={HUDSON_COOKIE=06668ba4-b481-4a17-86b3-5f4fbd4061b2}
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Recursively killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Recursively killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Killing pid=1840
The unix process table, after the kill, shows that both jobs are still running:
mdanjou   1251   953  0 10:07 pts/30   00:01:46 java -jar jenkins.war
mdanjou   1840  1251  0 15:30 pts/30   00:00:00   /usr/bin/python /tmp/hudson6469713064377741807.sh
mdanjou   1851  1251  0 15:30 pts/30   00:00:00   /usr/bin/python /tmp/hudson1969984296722384280.sh
Both jobs are still running.
When the second job completes, examine its artifact. It contains this:
Main of Build number: 18Handler of Build number: 17
So the killed build (#17) corrupts the workspace of the running build (#18).
Makes sense. I don't see how this could be circumvented. Maybe by waiting a bit to see whether SIGTERM worked, and if not, send SIGKILL? But Jenkins uses the JRE's abstraction of "kill a Unix process" and that behavior appears to be implementation dependent.
Should be possible to write a plugin that sends SIGKILL if configured (e.g. for specific jobs only). Would that help?
Maximum flexibility, as a plugin or built-in, in my view and without regards to feasibility, would be:
Regarding the last point, I am not sure whether Jenkins is supposed to perform the post-build steps when a build is killed by the user - but it is certainly something that would help me. Perhaps this is something that could be configured?
I do not know what would belong to a plugin vs. what should be built-in.
Code changed in jenkins
User: Øyvind Harboe
Path:
src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/hudsontrigger/GerritTrigger.java
http://jenkins-ci.org/commit/gerrit-trigger-plugin/0eff041d3388cc8a2dba3367f3f0b131d19c018c
Log:
adds workaround for JENKINS-17116
Code changed in jenkins
User: Robert Sandell
Path:
src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/hudsontrigger/GerritTrigger.java
http://jenkins-ci.org/commit/gerrit-trigger-plugin/a9de6534418bbeddf0ae449bae33b0a28b510ed5
Log:
Merge pull request #224 from zylin/jenkins-17116-workaround
adds workaround for JENKINS-17116
Compare: https://github.com/jenkinsci/gerrit-trigger-plugin/compare/afa1cff24324...a9de6534418b
I am wondering, because a universal solution might not be that easy: would it be possible to have a hook {{ gracefulShutdown }} where one can provide a custom implementation before the regular {{ kill -9 }} kicks in?
I'm also having problems with this.
In particular, our nodes are running Ubuntu 14.04. We are using Jenkins to run some tests as part of the build. There are a few steps where interruption will cause communication failures, leaked temporary files gigabytes in size, and locks that are never released. Orphaned processes are very bad as well, since they could lead to a new build starting communication on the same channel before the previous one has terminated.
Like deepchip indicated, we would need a timeout parameter, since the allowed timeout for SIGTERM may be about 300sec, which is probably a lot longer than someone implementing this fix may anticipate.
I have the same issue. When canceling a job, I am trying to handle the signal inside a Python script and clean up.
Is there any workaround for this issue?
kashierez, it is possible to:
1. remove the jenkins cookie environment variable
2. run your program in background (output still can go to stdout)
3. launch another background process to check original process PID, such that when it is gone, it would kill the other child gracefully
4. in main process, wait for the other two to complete (make sure second monitor process would exit if the first background process exits)
5. take care to report proper termination status of the program
Probably not very nice, but you can script arbitrary shell commands to run this way (a rough sketch follows below). It might not be worth the effort; it wasn't for me.
Forgot to mention: this would only work on UNIX derivatives IIRC.
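A rough sketch of those steps (command and variable names are examples, and whether the ProcessTreeKiller matches on BUILD_ID or another cookie variable depends on the Jenkins version):

#!/bin/bash
unset BUILD_ID JENKINS_NODE_COOKIE        # 1. hide new children from the ProcessTreeKiller
mybuildtool > build.log 2>&1 &            # 2. run the real work in the background
workpid=$!
# 3. watchdog: when this wrapper script disappears, terminate the worker gracefully
bash -c "while kill -0 $$ 2>/dev/null; do sleep 1; done; kill -TERM $workpid" &
watchpid=$!
wait "$workpid"                           # 4. wait for the worker to finish
status=$?
kill "$watchpid" 2>/dev/null              #    stop the watchdog once the worker is done
exit "$status"                            # 5. report the worker's exit status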
I have done more experiments, and I am still not seeing the signal being delivered the way danielbeck sees it.
I started with this Java version:
The Jenkins console shows:
[freestyle-kill] $ /bin/sh -xe /tmp/hudson3073245061937599649.sh
+ trap 'echo TERM >terminated.txt' TERM
+ sleep 120
Build was aborted
Aborted by martinda
Finished: ABORTED
Observe that the script is not printing "TERM" to the file, like it does in Daniel's environment.
I also tried these Java versions:
I captured some logs using OpenJDK 1.8.0_121 on Ubuntu 16.04. In the terminal running Jenkins:
INFO: jenkins-17116/freestyle #4 aborted
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
    at hudson.Proc$LocalProc.join(Proc.java:318)
    at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:135)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:95)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:64)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
    at hudson.model.Build$BuildExecution.build(Build.java:205)
    at hudson.model.Build$BuildExecution.doRun(Build.java:162)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
    at hudson.model.Run.execute(Run.java:1720)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:404)
In the jenkins log recorder:
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
killAll: process=java.lang.UNIXProcess@2a8af379 and envs={HUDSON_COOKIE=2d16a893-7e22-4360-aad0-0931104599a5}
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Recursively killing pid=25054
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Recursively killing pid=25055
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Killing pid=25055
Feb 07, 2017 9:42:11 AM FINE hudson.util.ProcessTree
Killing pid=25054
None of them traps the signal.
I have 2 patches that add support for killing launched processes with specific signals.
I'll submit them in a PR asap.
This is unfortunately a *NIX-only solution where signals are supported. For windows I don’t know what to do.
Jenkins does send a SIGTERM, but when running scripts it is sent to /bin/sh, e.g.:
/bin/sh -xe /tmp/hudson013456789.sh
Most probably /bin/sh is bash. When bash receives a SIGTERM while executing a child process, it does not relay it to the child process.
The sh script does get terminated, but the child process keeps running in the background, re-parented to another process, and I guess Jenkins can't find it anymore.
A fix when you have a single command is to prefix it with exec. E.g. instead of:
somebuildtool
do:
exec somebuildtool
/bin/sh will be replaced by somebuildtool, which then directly receives the signal. The drawback is that you can't run any more commands after it. To do that you need to background each command, wait for it to terminate or for a signal, then resend the signal. Something like:
somebuildtool &
apid=$!
trap 'kill -SIGTERM $apid; wait $apid' SIGTERM
wait
Is this issue going to be solved?
I am unable to trap the signal. The pipeline progress log states "Sending interrupt signal to process", but although I am trapping SIGINT and also SIGTERM within my shell script, it's not working. Seems like it's sending a SIGKILL; could that be?
Julian, what is your script doing exactly? AFAIK when a build is aborted, Jenkins immediately closes the stdout/stderr connections, so your trap's messages would not be shown on the console. If you get your trap to redirect to a file, you can then tell whether it reacted properly by looking at the file on the slave.
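For example, a minimal variant of the scripts above that records the trap in a workspace file (file name is arbitrary) instead of relying on console output:

#!/bin/bash
trap 'date "+caught TERM %T" >> trap.log' TERM
trap 'date "+caught INT %T" >> trap.log' INT
sleep 120 &
pid=$!
wait "$pid"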
Docker is the reason I came here.
Note: there is a bug in Docker (as of 17.06, and present since October 2013): when you use docker run --tty, signals are not proxied to the daemon, so the signal is never forwarded and the container is left running. You can read about my findings at https://github.com/moby/moby/issues/9098#issuecomment-347536699
The way I went in Jenkins is to use:
exec docker run someimage # nothing will be run after that due to exec which replaces the shell
This way the shell script started by the Jenkins agent is replaced by the docker command (due to exec). When the agent kills the process, 'docker' receives the SIGTERM and forwards it to the daemon (note there is no --tty, which would disable that forwarding).
And in the container entry point you might need trap handlers for SIGTERM / SIGINT. A rough example is https://gerrit.wikimedia.org/r/#/c/389937/5/dockerfiles/tox/run.sh
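A rough sketch of such an entry point (the test command is a placeholder), in the same spirit as the linked run.sh:

#!/bin/bash
child=0
shutdown() {
    if [ "$child" -ne 0 ]; then
        kill -TERM "$child" 2>/dev/null    # forward the signal to the real workload
        wait "$child"
    fi
    exit 143                               # 128 + 15, conventional exit code after SIGTERM
}
trap shutdown TERM INT
run-the-tests "$@" &
child=$!
wait "$child"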
Some random mess at https://phabricator.wikimedia.org/T176747 , but I would not recommend reading it :]
An alternative is to keep the container id around, and when the build ends or gets aborted, find a way to 'docker stop' the container.
Something like:
docker run --cidfile container.pid someimage
And in a publisher (not sure it runs when a job is aborted though):
docker stop --time=5 $(cat container.pid) || /bin/true
Which would instruct Docker to stop the container.
I think one of the Jenkins Docker plugins does exactly that; the steps are executed by the Jenkins agent itself, so that is probably a bit more robust than defining those steps in a job.
Check out pull request https://github.com/jenkinsci/jenkins/pull/3414
I added code that will make Jenkins wait for process termination (for up to 30 seconds; this should be made configurable).
Behavior changes are as follows:
Note that Jenkins doesn't use SIGKILL! It uses SIGTERM, but doesn't give the process any time to handle it before closing stdin/stdout/stderr.
Hi,
Any update regarding this issue? I am facing the same issue: I am able to terminate gracefully from the command line, but when I run the job through Jenkins and abort it, it won't end gracefully.
It has to be noted that if I abort a running job from the command line, the graceful termination is visible in Jenkins as well, but not when we abort from Jenkins, which is quite weird.
It may be a good idea to create a plugin that remaps the abort button to run a script specified in each job. It could be optionally configured to run the script, then do the normal abort process. For example:
1. User abort triggered.
2. Job specific abort script runs without interrupting what the job is currently running. A timeout counter starts simultaneously.
3. At completion of the script or timeout, Jenkins checks the job status. If the job is ready to exit normally, it does so - this will allow final status other than abort. If the job is still running, the normal Jenkins abort procedure takes over.
Well, I am still pursuing the graceful-termination via SIGTERM. A stepping stone for this is ready to be merged into a library that is used by Jenkins for process management on Windows. After that has happened, a new version of that library needs to be bundled with Jenkins and then used in my pull request.
Hi Rahul,
The library alone is not enough - Jenkins needs to be recompiled to use it.
WinP is the library I was talking about and needs to get the following change applied: https://github.com/kohsuke/winp/pull/49
After that we can recompile Jenkins with the new WinP library and this change: https://github.com/jenkinsci/jenkins/pull/3414
I am really hoping that we can move faster with those changes. I am usually somewhat patient, but the perceived lack of interest from Jenkins maintainers is hard to understand.
Cool, Rahul. We've been working around it for 4 years. But now we got tired of the performance penalty the workarounds introduce for us, so I said: How hard can it be (to fix the problem)? Turns out, not hard. Just getting it done is. sigh
I believe it is going to land in the next weekly
The merge https://github.com/jenkinsci/jenkins/commit/d8eac92ee9a1c19bf145763589f1c152607bf3ed is in tag jenkins-2.141
With Jenkins 2.141, I ran the bash script and the python script and there is no change: Jenkins still leaks processes, and the signals are still not trapped by the user script. There is one difference though: the first click on the terminate button (the red [x]) does not kill the job immediately, but that seems to change nothing.
Using freestyle projects to execute bash shell scripts works fine, but cancelling a Jenkins job seems to use SIGKILL. This way the script cannot perform cleanup operations and free resources.
SIGKILL cannot be handled by shell
SIGINT/SIGTERM are not used by jenkins
Preferred: SIGINT -> wait 5 seconds -> SIGKILL
Originally reported by markusb, imported from: graceful job termination