[JENKINS-62605] Sometimes JVM with Jenkins just dies within a minute after starting with a new version

timja commented 4 years ago

For a number of weekly updates (I can't say exactly when this started; roughly most of 2020 so far, and most recently with the current 2.239 weekly) I have seen during upgrades that the JVM sometimes quietly dies after starting a new WAR version. For example, the last jenkins.log lines would be:

2020-06-05 12:36:32.701+0000 [id=1]     INFO    hudson.WebAppMain#contextInitialized: Jenkins home directory: /var/lib/jenkins found at: SystemProperties.getProperty("JENKINS_HOME")
2020-06-05 12:36:32.897+0000 [id=1]     INFO    o.e.j.s.handler.ContextHandler#doStart: Started w.@596df867{Jenkins v2.239,/,file:///var/cache/jenkins/war/,AVAILABLE}{/var/cache/jenkins/war}
2020-06-05 12:36:33.078+0000 [id=1]     INFO    o.e.j.server.AbstractConnector#doStart: Started ServerConnector@376a0d86{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
2020-06-05 12:36:33.092+0000 [id=1]     INFO    org.eclipse.jetty.server.Server#doStart: Started @5878ms

...and then nothing, and no java in `ps` output. Detecting the situation was a problem for some time: we would tell the package manager to update, wander off because it takes tens of minutes for Jenkins to initialize and have a usable UI, and then find that it is dead and not even trying to boot.

Restarting the service in those cases helps, but it has to be done manually, since e.g. the RPM packages of Jenkins include an init script but not a real systemd service with child process monitoring (and not all distros/deployments favor systemd at all). Also, restarting is a nasty workaround for the original issue of the quiet failure.

We do not often restart without also updating, so I can't say if this situation happens randomly for any start or is linked to updates specifically.

Originally reported by jimklimov, imported from: Sometimes JVM with Jenkins just dies within a minute after starting with a new version

status: Open
priority: Major
resolution: Unresolved
imported: 2022/01/10

timja commented 4 years ago

oleg_nenashev:

Would it be possible to get a core dump?

timja commented 4 years ago

jimklimov:

Still looking for a core file; here is a bit more of the systems-side context however:

[root@jenkins2 ~]# systemctl status jenkins -l 
* jenkins.service - LSB: Jenkins Automation Server
   Loaded: loaded (/etc/rc.d/init.d/jenkins; bad; vendor preset: disabled)
   Active: active (exited) since Mon 2020-06-22 22:53:17 UTC; 37s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 7612 ExecStart=/etc/rc.d/init.d/jenkins start (code=exited, status=0/SUCCESS)

Jun 22 22:53:14 jenkins2 systemd[1]: Starting LSB: Jenkins Automation Server...
Jun 22 22:53:14 jenkins2 runuser[7616]: pam_unix(runuser:session): session opened for user jenkins by (uid=0)
Jun 22 22:53:14 jenkins2 jenkins[7612]: Starting Jenkins OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=4096m; support was removed in 8.0
Jun 22 22:53:16 jenkins2 jenkins[7612]: OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=4096m; support was removed in 8.0
Jun 22 22:53:16 jenkins2 runuser[7616]: pam_unix(runuser:session): session closed for user jenkins
Jun 22 22:53:17 jenkins2 jenkins[7612]: [  OK  ]
Jun 22 22:53:17 jenkins2 systemd[1]: Started LSB: Jenkins Automation Server.
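A minimal sketch of how this state could be detected automatically, assuming that the `active (exited)` sub-state shown above corresponds to the case where no JVM is left running. The function names and the embedded sample are illustrative and not part of Jenkins or systemd; in practice the text would come from running `systemctl status jenkins`.

```python
import re

# Shortened copy of the status output quoted above, used only as sample input.
SAMPLE = """\
* jenkins.service - LSB: Jenkins Automation Server
   Loaded: loaded (/etc/rc.d/init.d/jenkins; bad; vendor preset: disabled)
   Active: active (exited) since Mon 2020-06-22 22:53:17 UTC; 37s ago
"""


def parse_active_state(status_text):
    """Return (state, substate) from the 'Active:' line, or None if absent."""
    match = re.search(r"^\s*Active:\s+(\S+)\s+\(([^)]+)\)", status_text, re.MULTILINE)
    if match is None:
        return None
    return match.group(1), match.group(2)


def looks_dead_but_active(status_text):
    """True when systemd reports the unit as active although nothing is running."""
    return parse_active_state(status_text) == ("active", "exited")


if __name__ == "__main__":
    print(looks_dead_but_active(SAMPLE))
```

A monitoring job could run such a check a minute or so after an upgrade and raise an alert, instead of someone discovering the dead instance much later.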

timja commented 4 years ago

jimklimov:

I have a new theory now, which I will try to check during the next restarts: this is not a Jenkins (Java) issue per se. With the init script wrapped as a systemd unit for our jenkins-core delivered as an RPM on CentOS, a stop of jenkins.service usually only has the JVM to terminate, but sometimes there are also a number of its child processes assigned to the same context; when our networking lags, these can be dozens of git/ssh pollers waiting for data and/or processing it.

The new theory is that while a "systemctl stop" ends quickly (it dispatches the kill signal and perhaps waits for the immediate JVM child to exit), other processes in the context may linger. So in these not-always-reproducible cases, during a package update and/or another service-driven restart, I guess the following chain of events can happen:

1. The package update (or another restart request) tells systemctl to stop the unit.
2. systemctl says it did (but something lingers).
3. The new instance of the JVM is started.
4. systemd finds that something lingered in the old context, and either kills those processes or finds they died and recycles the unit... killing the new JVM in the process.
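One way to test this theory would be to look at the unit's control group between the stop and the next start. The sketch below is only an illustration under assumptions: the helper names are made up, and the `cgroup.procs` path in the comment assumes the cgroup v1 layout typically found on CentOS 7, which may differ on other systems.

```python
import time
from pathlib import Path

# Assumed location for a cgroup v1 host; verify on the actual system:
#   /sys/fs/cgroup/systemd/system.slice/jenkins.service/cgroup.procs


def lingering_pids(procs_file):
    """Return the PIDs listed in a cgroup.procs file (empty if the file is gone)."""
    path = Path(procs_file)
    if not path.exists():
        return []
    return [int(token) for token in path.read_text().split()]


def wait_until_empty(procs_file, timeout=60.0, interval=1.0):
    """Poll until no PIDs remain; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while True:
        if not lingering_pids(procs_file):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Logging the output of `lingering_pids` right after "systemctl stop" returns would show whether git/ssh pollers are still present when the new JVM is started.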

If this assessment is correct, the root cause is the sysv-init script being auto-wrapped as a systemd service, instead of defining a real unit that systemd can track natively (the wrappers are constrained in many ways, so even a simple unit whose ExecStart and ExecStop call the same init script is usually more useful).
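For illustration, a native unit could look roughly like the sketch below. This is not a unit shipped by the Jenkins RPM discussed here: the file location, the java and WAR paths, and the choice of options are assumptions, while JENKINS_HOME, the webroot, and the port are taken from the log lines quoted earlier.

```ini
# /etc/systemd/system/jenkins.service (hypothetical example)
[Unit]
Description=Jenkins Automation Server
After=network.target

[Service]
Type=simple
User=jenkins
Environment=JENKINS_HOME=/var/lib/jenkins
# Assumed java and WAR locations; adjust to the actual installation.
ExecStart=/usr/bin/java -jar /usr/lib/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080
# Restart the JVM if it dies unexpectedly, but not after a clean exit.
Restart=on-failure
# A JVM stopped by SIGTERM commonly exits with 143; treat that as a clean stop.
SuccessExitStatus=143
# Kill everything left in the unit's control group on stop (the systemd default, stated explicitly).
KillMode=control-group

[Install]
WantedBy=multi-user.target
```

With a unit like this systemd tracks the JVM as the main process, so a silent death would show up as a failed unit and trigger a restart, and lingering children would be cleaned up as part of the stop rather than after the next start.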

Providing a real unit would also address the problem I complained about some years ago: a Jenkins JVM which died or exited is not resuscitated by systemd. (Note that for JENKINS_URL/exit and similar non-restarting requests, e.g. for updates or other maintenance, the unit should ideally stay down for real. libsystemd allows native programs to send bus signals; probably some touch-file based magic could be used to block a restart as well.)
