Open timja opened 8 years ago
It does. Jenkins has complex process-termination logic (the ProcessKiller and ProcessKillingVeto extension points), which requires a connection to the master in order to be invoked properly. From a user's perspective, I agree it's a serious UX bug
Not a regression
Jenkins master is on 2.32.1
Master and slaves are running Windows Server 2012
The symptoms sound very familiar to a problem we've had: a Jenkins slave is up, then we reboot the Windows server (the slave).
When the server returns and the slave is automatically started, it hangs around for about 30 seconds, then terminates the connection, which kills our job.
We've also witnessed the hosting Windows service (WinSW 1.17, which auto-upgrades to 1.18) bomb out but leave the java process running.
That java process keeps the slave active on the master for an indeterminate amount of time (anywhere between 20 seconds and 2 hours) before eventually dying of its own accord, with no fresh jobs sent and no interaction with the Windows service.
markjmanning Sounds like a different issue to me. Please file it and attach logs from both the master and the slave covering the moment of failure. The Windows event log would also be useful. And CC me on the ticket. Both remoting and WinSW are supposed to be maintained by me now, so it seems I am the person who has to triage it
I have just noticed that WinSW is now on 2.0.1
I will upgrade and see if the problem still exists; if it does, I will raise a separate bug
thx!
hi oleg...
so for the moment I have stayed on 2.32.2 and still see the problem: when the slave server is rebooted, the slave starts (via the service auto-start), then the service stops but leaves behind a java process which we can't kill.
i'm a C# developer, so i can read but not debug the java code... but it feels like the master can't handle the new process handshake when the slave machine reboots and tries to re-establish the connection, so one end terminates it (maybe the master, because the slave still tries to live on?)
to that end, I've even played with the master's polling interval to see if i can get the master to terminate the connection while the server is rebooting, but it feels like playing with fire on a global setting like that, given that the time a server spends disconnected during a reboot is minuscule compared to the time it spends online.
-Dhudson.remoting.Launcher.pingIntervalSec=55
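For anyone wanting to pass that property on the agent side, one place to set it is the WinSW configuration file (jenkins-slave.xml) next to the service executable. A minimal sketch, assuming the default layout produced by the Windows agent installer; the service id, install path, jnlpUrl, and secret below are placeholders for your actual install:

```xml
<service>
  <id>jenkinsslave-c__jenkins</id>
  <name>Jenkins agent</name>
  <executable>java</executable>
  <!-- -Xrs keeps the JVM alive across Windows logoff/shutdown signals;
       pingIntervalSec tunes how often the remoting ping thread runs -->
  <arguments>-Xrs -Dhudson.remoting.Launcher.pingIntervalSec=55 -jar "%BASE%\slave.jar" -jnlpUrl http://jenkins.example.com/computer/win2012-agent/slave-agent.jnlp -secret REDACTED</arguments>
</service>
```

%BASE% is expanded by WinSW to the directory containing the service executable. Restart the Windows service after editing the file for the change to take effect.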
There is minimal detail in the logs: nothing in the Windows event log, and only "slave connection terminated" messages in the slave log, if I am lucky.
I did try to update my service host to WinSW 2.0.1, but as soon as the service starts it checks in with Jenkins, which switches it back to 1.18. (not sure if there is a way of getting the jenkins slave to stop doing this, but i guess the master/slave is trying to maintain compatibility)
i am eagerly waiting for the 2.50+ LTS, which has a heap of your changes regarding the windows slaves and the service host.
/mm
> I did try to update my service host to WinSW 2.0.1, but as soon as the service starts it checks in with Jenkins, which switches it back to 1.18. (not sure if there is a way of getting the jenkins slave to stop doing this, but i guess the master/slave is trying to maintain compatibility)
It is a "self-upgrade" feature. I have added a flag for disabling this auto-update in Windows Agent Installer 1.9 (https://github.com/jenkinsci/windows-slave-installer-module/blob/master/CHANGELOG.md#19); see JENKINS-43603. But it has not been integrated into a Jenkins weekly release yet.
As a workaround, you can make the file read-only for the service account.
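One way to apply that workaround is to deny write/delete permissions on the WinSW executable for the account the service runs under. A sketch using icacls, assuming the service runs as LocalSystem and an install path of C:\jenkins\jenkins-slave.exe (adjust both to your setup):

```bat
:: Deny write and delete on the WinSW executable so the self-upgrade cannot replace it
icacls "C:\jenkins\jenkins-slave.exe" /deny "NT AUTHORITY\SYSTEM:(W,D)"

:: To undo later, remove the deny entry:
icacls "C:\jenkins\jenkins-slave.exe" /remove:d "NT AUTHORITY\SYSTEM"
```

Note that the deny entry will also block intentional manual upgrades until it is removed.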
> but it feels like the master can't handle the new process handshake when the slave machine reboots and tries to re-establish the connection, so one end terminates it (maybe the master, because the slave still tries to live on?)
> to that end, I've even played with the master's polling interval to see if i can get the master to terminate the connection while the server is rebooting, but it feels like playing with fire on a global setting like that, given that the time a server spends disconnected during a reboot is minuscule compared to the time it spends online.
One of the potential causes of a hanging agent is a non-released Channel object on the master. We have applied several fixes for it, but I am not 100% sure all potential causes are covered. Just in case, make sure you are running agents with the JNLP4 protocol; it seems to be much more reliable in terms of connection handling.
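To check which protocols the master will accept, one place to look is the agentProtocols section of the master's config.xml (a sketch of what it typically looks like on a 2.x master; the exact set of entries varies by version):

```xml
<agentProtocols>
  <string>JNLP4-connect</string>
  <string>Ping</string>
</agentProtocols>
```

The same setting can also be managed from the Agents section of Manage Jenkins > Configure Global Security, which avoids hand-editing the XML.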
krogan So 2.60.1 should be released this Thursday. Just FYI. You can try the release candidate from here: http://mirrors.jenkins.io/war-stable-rc/2.60.1/
[Originally related to: JENKINS-26048]
We have a build step that runs a TestNG suite, with the command looking something like this:
If the process is aborted in any way (manual intervention, Jenkins build timeout, etc.), OR if the agent loses its connection to the master long enough to fail the build, then a Java process is left behind.
This is particularly damaging for us, as we load a DLL in the Java process, which locks the file handle. If we attempt the job again, we cannot load the DLL, meaning that all future builds will fail until someone manually kills the leftover process.
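As a stopgap, the leftover process holding the DLL can be located and killed by hand; tasklist's /m switch lists the processes that have a given module loaded. A sketch, where mylib.dll and PID 1234 are placeholders for the actual DLL name and the PID taken from the tasklist output:

```bat
:: Find processes that have the DLL loaded
tasklist /m mylib.dll

:: Kill the leftover java process by PID (taken from the output above)
taskkill /pid 1234 /f
```

This could be scripted as a pre-build cleanup step on the agent, though killing by PID by hand is safer than blindly killing every java.exe.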
It is possible to reproduce with ANY java process executed on the Windows agent.
This bug seems similar to JENKINS-26048, but I did not understand from the title/description if it was the same problem or similar symptoms. Feel free to close as duplicate if it is.
Originally reported by gsfraley, imported from: Jenkins 2.7.4 seems to leave behind Java processes (on Windows agent) if the build is aborted/agent loses connection