Open timja opened 8 years ago
It does. Jenkins has complex process-termination logic (the ProcessKiller and ProcessKillingVeto extension points), which requires a connection to the master in order to be invoked properly. From a user's perspective, I agree it's a serious UX bug
Not a regression
Jenkins master is on 2.32.1
Master and slaves are running Windows Server 2012
The symptoms sound very familiar to a problem we've had: a Jenkins slave is up, then we reboot the Windows server (the slave).
When the server returns and the slave is automatically started, it hangs around for about 30 seconds, then terminates the connection, which kills our job.
We've also witnessed the hosting Windows service (WinSW 1.17, which auto-upgrades to 1.18) bomb out but leave the java process running.
That java process keeps the slave active on the master for an indeterminate amount of time (anywhere between 20 seconds and 2 hours) before eventually dying of its own accord, with no fresh jobs sent and no interaction with the Windows service.
markjmanning Sounds like a different issue to me. Please file it and attach logs from both the master and the slave covering the moment of failure. The Windows event log would also be useful. And CC me on the ticket. Both remoting and WinSW are supposed to be maintained by me now, so it seems I am the person who has to triage it
I have just noticed that WinSW is now on 2.0.1
I will upgrade and see if the problem still exists; if it does, I will raise a separate bug
thx!
hi oleg...
so for the moment I have stayed on 2.32.2 and still see the problem: when the slave server is rebooted, the slave starts (via the service auto-start), then the service stops but leaves behind a java process which we can't kill.
i'm a C# developer, so i can read but not debug the java code... but it feels like the master can't handle the new process handshake when the slave machine reboots and tries to re-establish the connection, so one end terminates it (maybe the master, because the slave still tries to live on?)
to that end, I've even played with the master's polling interval to see if i can get the master to terminate the connection while the server is rebooting, but it feels like playing with fire on a global setting like that, given that the time a server spends disconnected during a reboot is minuscule compared to the time it spends online.
-Dhudson.remoting.Launcher.pingIntervalSec=55
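For anyone wanting to pass that property on the agent side, one place to set it is the WinSW configuration file (jenkins-slave.xml) next to the service executable. A minimal sketch, assuming the default layout produced by the Windows agent installer; the service id, install path, jnlpUrl, and secret below are placeholders for your actual install:

```xml
<service>
  <id>jenkinsslave-c__jenkins</id>
  <name>Jenkins agent</name>
  <executable>java</executable>
  <!-- -Xrs keeps the JVM alive across Windows logoff/shutdown signals;
       pingIntervalSec tunes how often the remoting ping thread runs -->
  <arguments>-Xrs -Dhudson.remoting.Launcher.pingIntervalSec=55 -jar "%BASE%\slave.jar" -jnlpUrl http://jenkins.example.com/computer/win2012-agent/slave-agent.jnlp -secret REDACTED</arguments>
</service>
```

%BASE% is expanded by WinSW to the directory containing the service executable. Restart the Windows service after editing the file for the change to take effect.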
There is minimal detail in the logs: nothing in the Windows event log, and only "slave connection terminated" messages in the slave log, if I am lucky.
I did try to update my service host to WinSW 2.0.1, but as soon as the service starts it checks in with Jenkins, which switches it back to 1.18. (not sure if there is a way of getting the jenkins slave to stop doing this, but i guess the master/slave is trying to maintain compatibility)
i am eagerly waiting for the 2.50+ LTS, which has a heap of your changes regarding the windows slaves and the service host.
/mm
> I did try to update my service host to WinSW 2.0.1, but as soon as the service starts it checks in with Jenkins, which switches it back to 1.18. (not sure if there is a way of getting the jenkins slave to stop doing this, but i guess the master/slave is trying to maintain compatibility)
It is a "self-upgrade" feature. I have added a flag for disabling this auto-update in Windows Agent Installer 1.9 (https://github.com/jenkinsci/windows-slave-installer-module/blob/master/CHANGELOG.md#19); see JENKINS-43603. But it has not been integrated into a Jenkins weekly release yet.
As a workaround, you can make the file read-only for the service account.
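One way to apply that workaround is to deny write/delete permissions on the WinSW executable for the account the service runs under. A sketch using icacls, assuming the service runs as LocalSystem and an install path of C:\jenkins\jenkins-slave.exe (adjust both to your setup):

```bat
:: Deny write and delete on the WinSW executable so the self-upgrade cannot replace it
icacls "C:\jenkins\jenkins-slave.exe" /deny "NT AUTHORITY\SYSTEM:(W,D)"

:: To undo later, remove the deny entry:
icacls "C:\jenkins\jenkins-slave.exe" /remove:d "NT AUTHORITY\SYSTEM"
```

Note that the deny entry will also block intentional manual upgrades until it is removed.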
> but it feels like the master can't handle the new process handshake when the slave machine reboots and tries to re-establish the connection, so one end terminates it (maybe the master, because the slave still tries to live on?)
> to that end, I've even played with the master's polling interval to see if i can get the master to terminate the connection while the server is rebooting, but it feels like playing with fire on a global setting like that, given that the time a server spends disconnected during a reboot is minuscule compared to the time it spends online.
One of the potential causes of a hanging agent is a non-released Channel object on the master. We have applied several fixes for it, but I am not 100% sure all potential causes are covered. Just in case, make sure you are running agents with the JNLP4 protocol; it seems to be much more reliable in terms of connection handling.
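To check which protocols the master will accept, one place to look is the agentProtocols section of the master's config.xml (a sketch of what it typically looks like on a 2.x master; the exact set of entries varies by version):

```xml
<agentProtocols>
  <string>JNLP4-connect</string>
  <string>Ping</string>
</agentProtocols>
```

The same setting can also be managed from the Agents section of Manage Jenkins > Configure Global Security, which avoids hand-editing the XML.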
krogan So 2.60.1 should be released this Thursday. Just FYI. You can try the release candidate from here: http://mirrors.jenkins.io/war-stable-rc/2.60.1/
[Originally related to: JENKINS-26048]
We have a build step that runs a TestNG suite, with the command looking something like this:
If the process is aborted in any way (manual intervention, Jenkins build timeout, etc.), OR if the agent loses its connection to the master long enough to fail the build, then a Java process is left behind.
This is particularly damaging for us, as we load a DLL in the Java process, which locks the file handle. If we attempt the job again, we cannot load the DLL, meaning that all future builds will fail until someone manually kills the leftover process.
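As a stopgap, the leftover process holding the DLL can be located and killed by hand; tasklist's /m switch lists the processes that have a given module loaded. A sketch, where mylib.dll and PID 1234 are placeholders for the actual DLL name and the PID taken from the tasklist output:

```bat
:: Find processes that have the DLL loaded
tasklist /m mylib.dll

:: Kill the leftover java process by PID (taken from the output above)
taskkill /pid 1234 /f
```

This could be scripted as a pre-build cleanup step on the agent, though killing by PID by hand is safer than blindly killing every java.exe.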
It is possible to reproduce with ANY java process executed on the Windows agent.
This bug seems similar to JENKINS-26048, but I did not understand from the title/description if it was the same problem or similar symptoms. Feel free to close as duplicate if it is.
Originally reported by gsfraley, imported from: Jenkins 2.7.4 seems to leave behind Java processes (on Windows agent) if the build is aborted/agent loses connection