timja / jenkins-gh-issues-poc-06-18

[JENKINS-55106] Build stuck on final "exit 0" #4163

Open timja opened 5 years ago

timja commented 5 years ago

After the shell script executes successfully, the workers remain stuck for 10 minutes on the final "exit 0".

There hasn't been any other failure that I could find: all the jobs run exactly as planned, they just don't seem to exit.

The fact that the jobs remain stuck for exactly 600 seconds makes me think of a timeout of some sort.

Reverting to 2.138 fixed the issue, which is why I am marking it as a regression.


Originally reported by ippo343, imported from: Build stuck on final "exit 0"
  • status: Open
  • priority: Critical
  • resolution: Unresolved
  • imported: 2022/01/10

timja commented 5 years ago

guruvamsi:

We are seeing the same issue in regular builds and pull requests: the build gets stuck on exit 0 for more than 5 minutes before it reports the status.

timja commented 5 years ago

precisionsean:

We are seeing this issue as well on version 2.150.1 running on Windows Server 2012 R2. Builds that took 4 minutes prior to the upgrade were taking 18 minutes afterward. We have reverted to version 2.138.3, which resolved the issue.

If there's information that I can provide to help pin this down, please let me know.

timja commented 5 years ago

guruvamsi:

Reverting Jenkins to version 2.138.3 fixed the issue. I hope it is fixed in the next Jenkins LTS version.

Thank you, Sean.

timja commented 5 years ago

anatolys:

Confirmed. We see the same issue with 2.150.1 on Windows Server 2003, JDK 8. As you can see, there is a 10-minute delay before the build finishes:

 19:59:42 D:\Jenkins\jobs\product\workspace>echo done 
 19:59:42 done
 19:59:42 
 19:59:42 D:\Jenkins\jobs\product\workspace>exit 0 
 20:09:44 Finished: SUCCESS

We have reverted to the 2.138.2 version.
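
For anyone who wants to measure the stall from their own console output, here is a minimal sketch that finds the longest pause between consecutive timestamped lines. It assumes each line starts with an `HH:MM:SS` timestamp, as in the excerpt above; the function name `largest_gap` is made up for this illustration and is not part of Jenkins.

```python
from datetime import datetime, timedelta

# Console lines copied from the excerpt above.
LOG_LINES = [
    r"19:59:42 D:\Jenkins\jobs\product\workspace>echo done",
    r"19:59:42 done",
    r"19:59:42",
    r"19:59:42 D:\Jenkins\jobs\product\workspace>exit 0",
    r"20:09:44 Finished: SUCCESS",
]


def largest_gap(lines):
    """Return (seconds, text_before, text_after) for the longest pause."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the first space-separated field is an HH:MM:SS timestamp.
        stamp, _, text = line.partition(" ")
        entries.append((datetime.strptime(stamp, "%H:%M:%S"), text))

    best = (0.0, "", "")
    for (t0, before), (t1, after) in zip(entries, entries[1:]):
        delta = t1 - t0
        if delta < timedelta(0):
            # Assumption: a negative difference means the log crossed midnight.
            delta += timedelta(days=1)
        if delta.total_seconds() > best[0]:
            best = (delta.total_seconds(), before, after)
    return best


seconds, before, after = largest_gap(LOG_LINES)
print(f"{seconds:.0f} s between {before!r} and {after!r}")
```

For the excerpt above, the longest pause is 602 seconds, between the `exit 0` line and `Finished: SUCCESS`.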

timja commented 5 years ago

scb147:

I can also confirm this on Windows Server 2012 R2, JDK 8, with Jenkins 2.150.2. After the Build portion of the configuration has completed, there is a 10-minute delay before the Post-build Actions begin.

I reverted back to 2.138.4.

timja commented 5 years ago

ippo343:

I tried upgrading my instance to 2.164 and I can still reproduce it. I'll revert again to 2.138 for the moment.

I'll lose access to this instance soon (~3 weeks) so if anyone needs me to try stuff, now's the time.

timja commented 5 years ago

guruvamsi:

Added "-DSoftKillWaitSeconds=0" in jenkins.xml before the -jar option. Jobs now execute normally with version 2.150.2.
References:
https://stackoverflow.com/questions/54039226/jenkins-hangs-between-build-and-post-build/54072987#54072987
https://issues.jenkins-ci.org/browse/JENKINS-55422?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
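
The placement "before the -jar option" is the detail that is easiest to get wrong, so here is a small sketch of it. The helper `add_jvm_property` and the sample arguments string are assumptions made for illustration; your jenkins.xml arguments will likely differ, and this is not an official Jenkins tool.

```python
def add_jvm_property(arguments: str, prop: str) -> str:
    """Insert a -D option immediately before the first -jar token.

    Returns the string unchanged if the option is already present.
    """
    tokens = arguments.split(" ")
    if prop in tokens:
        return arguments
    if "-jar" not in tokens:
        raise ValueError("no -jar option found in arguments")
    tokens.insert(tokens.index("-jar"), prop)
    return " ".join(tokens)


# Hypothetical arguments line; replace with the one from your own jenkins.xml.
example = r'-Xrs -Xmx256m -jar "%BASE%\jenkins.war" --httpPort=8080'
print(add_jvm_property(example, "-DSoftKillWaitSeconds=0"))
```

Splitting and rejoining on single spaces leaves quoted paths untouched, which is why the sketch avoids a full shell-style parser.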

timja commented 5 years ago

precisionsean:

Thank you very much for pointing this out, Guru. We tried this and are now running 2.150.2 without the delay.

Have a great day!

timja commented 5 years ago

kredens:

Still doesn't work properly, had to roll back from 2.164.1 LTS to 2.138.4 LTS. Workaround mentioned above doesn't work for me either, builds are getting stuck.

timja commented 5 years ago

danielbeck:

> Workaround mentioned above doesn't work for me either

Are you sure you applied it correctly? Check /systemInfo to see whether the system property is defined in Jenkins.

timja commented 5 years ago

kredens:

danielbeck yes, it's applied properly and still various builds are randomly getting stuck. 

timja commented 5 years ago

danielbeck:

kredens While you're waiting for the build to finish, check what Jenkins is doing: https://wiki.jenkins.io/display/JENKINS/Obtaining+a+thread+dump

timja commented 5 years ago

joshschreuder:

Does this work for Jenkins slaves? This fixed our master instance, but it seems like a similar issue is present with jobs that run on slaves.

Adding `-DSoftKillWaitSeconds=0` to jenkins-slave.xml and restarting the service adds it to the command line, but it doesn't seem to have any effect. Any ideas?

timja commented 5 years ago

kredens:

I should add I also run the jobs on slaves, not on the master node.

timja commented 5 years ago

00bins:

We also experience this, but only on slave machines, and adding -DSoftKillWaitSeconds=0 has no effect on slave nodes.

timja commented 5 years ago

klamb:

We also experience this issue with jobs that run on slave machines. Adding -DSoftKillWaitSeconds did not affect the issue.

Tried rolling back to Jenkins version 2.150.3, but the bug was still there.

Then, rolled back to Jenkins version 2.138.4, and the bug is now gone.

We will have to stay on 2.138.4 until this bug is resolved.

timja commented 5 years ago

danielbeck:

To clarify, are you setting the system property on agent processes? I.e. as additional launch arguments to java -jar agent.jar?

timja commented 5 years ago

klamb:

I only set it on the master launch process. From what I have read, it has no effect on slaves.

timja commented 5 years ago

danielbeck:

Right, Josh wrote that. Would still like explicit confirmation from someone affected that setting it doesn't work, including confirmation that it appears correctly on the URL /computer/name_here/systemInfo in the list of system properties, since it's easy to get the Java invocation wrong.
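
One common way to get the Java invocation wrong is to put the `-D` option after `-jar agent.jar`, in which case Java hands it to the application as an ordinary argument and it is never set as a system property. The sketch below checks a command line for that mistake. The function name `property_reaches_jvm` is hypothetical, and checking the systemInfo page as described above remains the authoritative test.

```python
import shlex


def property_reaches_jvm(command_line: str, name: str) -> bool:
    """Return True if -D<name> appears before -jar, i.e. as a JVM option."""
    # posix=False keeps Windows backslashes intact while tokenizing.
    tokens = shlex.split(command_line, posix=False)
    for token in tokens:
        if token == "-jar":
            # Everything after -jar <file> is passed to the application.
            return False
        if token == f"-D{name}" or token.startswith(f"-D{name}="):
            return True
    return False


print(property_reaches_jvm(
    "java -DSoftKillWaitSeconds=0 -jar agent.jar", "SoftKillWaitSeconds"))
print(property_reaches_jvm(
    "java -jar agent.jar -DSoftKillWaitSeconds=0", "SoftKillWaitSeconds"))
```

The first call reports the property as a JVM option; the second does not, because the option comes after `-jar`.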

timja commented 5 years ago

joshschreuder:

danielbeck

Here's the slave command line:

And from jenkins-slave.xml

I'm pretty confident that this invocation is correct, as it's copied from our master agent where this parameter is working fine.

timja commented 5 years ago

scb147:

Any progress with this? I understand that there is a workaround, but shouldn't the commit that broke it be looked at to at least see why it's broken?

timja commented 5 years ago

pmascha:

Can you please fix this? 

timja commented 5 years ago

fsteff:

We've had this problem for 6 months or more, and have been searching high and low for a solution, without finding this issue.

Just applied the workaround on one of our agents, and immediately cut down the build-time of one of our jobs by 25 minutes!!!!!!!!!! 

I can't wait to see how much server time will be freed by this, but it looks like a LOT!

timja commented 4 years ago

loafloaf:

This issue is preventing me from upgrading my Jenkins, and the plugin to Jenkins version gap is getting harder and harder to deal with.

Is this issue going to be looked at? And has anyone had success with a workaround for a Jenkins instance that uses only slave machines?

timja commented 4 years ago

klamb:

Just wanted to add a "me too" to Andy Lin's comment.
I used to be very diligent about keeping my Jenkins and all the plugins up to date.
This bug, however, has everything stuck with what works using Jenkins 2.138.4.

timja commented 4 years ago

rocha_stratovan:

loafloaf, what scenario are you encountering this under? I had a similar problem when using MS Visual Studio on a slave.

In my case the problem is that the slave waits for remote processes to close, and has a timeout of ~2 minutes per process. I found that I had parallel compiles enabled and 6 remote VS compile sessions on the slave. When it finished, those VS processes would not go away, and every 2 minutes Jenkins would kill one of them.

I learned that MS causes compile processes to stick around once they are started. The idea being that when a new compile is needed it can grab one of the idle processes. However, in my case Jenkins doesn't need/want any more compiles and is stuck waiting for the VS processes to go away.

There is a flag that can be used at the command line that tells VS not to keep the processes alive. Details can be found in a similar issue I logged, JENKINS-59400.
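
As a rough back-of-the-envelope check of the delay described here, assuming the lingering processes are killed one after another and each one uses up the full ~2 minute timeout (both are assumptions based on the description above, not on the Jenkins source):

```python
def expected_wait_minutes(lingering_processes: int,
                          minutes_per_process: float = 2.0) -> float:
    """Estimate the total stall if each lingering process times out in turn."""
    return lingering_processes * minutes_per_process


# 6 leftover VS compile processes at ~2 minutes each.
print(expected_wait_minutes(6))  # 12.0
```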

timja commented 4 years ago

loafloaf:

rocha_stratovan, I do use MS Visual Studio on some of my slave machines, but I don't think it does parallel compilation. I'll be sure to try out what you suggested. Do you experience the issue if you don't do parallel compiles?

I have Mac slave machines for the other half. The solution you had might apply in some way so I'll have to investigate if xcodebuild also does something similar with lingering processes. Thanks!

timja commented 4 years ago

rocha_stratovan:

loafloaf, I didn't seem to notice it when I did simple compiles without parallel compilation. Although I honestly would expect there to be at least a 2 minute delay even if there is just one compile process. But I don't know.

Good luck.

timja commented 4 years ago

kredens:

Well, time passes, and some jobs still get stuck on the FINAL stage for about two minutes before finally letting go. How hard can it be to fix this?

timja commented 4 years ago

danielbeck:

kredens If it's so easy, submit a PR that does it.

timja commented 4 years ago

kredens:

danielbeck I'm not a developer on this project, neither am I using it by choice. Also - it used to work fine until someone changed something and can't be bothered to fix it.

timja commented 2 years ago

[Originally related to: JENKINS-17116]