Open timja opened 4 years ago
I'll re-post the relevant part of IRC-woes log below:
[09:31:06]
[09:37:27]
[09:37:41]
[11:44:58]
[11:45:35]
[11:46:55]
[11:50:01]
[11:50:01]
[11:50:01]
[11:52:27]
[12:01:18]
[12:01:18]
[12:01:18]
[12:01:18]
[12:10:32]
[12:10:59]
[12:54:08]
[13:07:34]
[13:07:34]
[13:07:34]
[13:07:34]
[13:07:34]
[13:07:34]
[13:08:00]
[13:08:24]
[13:09:54]
[13:09:55]
[13:28:32]
[13:28:32]
[13:28:32]
[13:28:32]
[13:28:32]
[13:28:32]
[13:28:32]
[13:28:33]
[13:28:33]
[13:28:34]
[13:28:34]
[13:30:03]
[13:52:33]
[13:52:33]
[14:07:22]
[14:09:23]
[14:10:58]
[14:12:22]
[14:41:51]
I've had a similar issue with update to 2.240 now - it started but was not doing anything, waiting for executors that are there and not seeing SSH Agent workers.
While trying to restart with "least losses", ended up removing just org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml and org.jenkinsci.plugins.pipeline.milestone.MilestoneStep.xml (not queue.xml since it accumulated quite a few needed items), but during a restart there was a lot of spam in the log about unexpected CSRF postings - I suppose from still running browsers trying to get in.
Then I disabled httpd (apache rev proxy publishing Jenkins), restarted Jenkins, waited some minutes for reasonable stuff to appear in the logs (some jobs starting to run, agents connecting, etc.) and then re-enabled httpd. This approach went smoothly. I guess next time I'll try to not kill any XML files and just restart jenkins with the apache being down.
So by such evidence it seems something changed around handling CSRF secured requests coming from clients that may be invalid on one hand (from another jenkins instance lifetime) and may get processed too early for something (config not read? class not initialized?) and not ignored/delayed until they can be meaningfully processed, and in the end this somehow corrupts the server initialization.
Interesting events from log of failed startup include:
2020-06-09 18:07:58.070+0000 [id=27] WARNING h.i.i.InstallUncaughtExceptionHandler#handleException: Caught unhandled exception with ID fe811201-89c8-4a44-8f01-c079d c848c09 java.lang.AssertionError: class hudson.security.csrf.DefaultCrumbIssuer is missing its descriptor at jenkins.model.Jenkins.getDescriptorOrDie(Jenkins.java:1600) at hudson.security.csrf.CrumbIssuer.getDescriptor(CrumbIssuer.java:152) at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:131) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604) at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:169) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604) at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604) at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604) at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604) at jenkins.security.SuspiciousRequestFilter.doFilter(SuspiciousRequestFilter.java:36) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:566) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1610) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1580) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) at org.eclipse.jetty.server.Server.handle(Server.java:500) at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383) at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129) at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) at java.lang.Thread.run(Thread.java:748)
and
2020-06-09 18:08:32.048+0000 [id=78] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins 2020-06-09 18:08:32.993+0000 [id=113] WARNING h.util.RobustReflectionConverter#doUnmarshal: There was a problem reporting unmarshalling field errors java.lang.StackOverflowError at java.io.UnixFileSystem.resolve(UnixFileSystem.java:102) at java.io.File.(File.java:367) at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1279) at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1249) at sun.misc.URLClassPath.findResource(URLClassPath.java:215) at java.net.URLClassLoader$2.run(URLClassLoader.java:569) at java.net.URLClassLoader$2.run(URLClassLoader.java:567) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findResource(URLClassLoader.java:566) at org.eclipse.jetty.webapp.WebAppClassLoader.loadAsResource(WebAppClassLoader.java:617) at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:534) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at hudson.diagnosis.OldDataMonitor.referTo(OldDataMonitor.java:398) at hudson.diagnosis.OldDataMonitor.report(OldDataMonitor.java:224) at hudson.util.RobustReflectionConverter.doUnmarshal(RobustReflectionConverter.java:375) at hudson.util.RobustReflectionConverter.unmarshal(RobustReflectionConverter.java:269) at com.thoughtworks.xstream.core.TreeUnmarshaller.convert(TreeUnmarshaller.java:72) at com.thoughtworks.xstream.core.AbstractReferenceUnmarshaller.convert(AbstractReferenceUnmarshaller.java:65) at com.thoughtworks.xstream.core.TreeUnmarshaller.convertAnother(TreeUnmarshaller.java:66) at com.thoughtworks.xstream.core.TreeUnmarshaller.convertAnother(TreeUnmarshaller.java:50) at com.thoughtworks.xstream.core.TreeUnmarshaller.start(TreeUnmarshaller.java:134) at com.thoughtworks.xstream.core.AbstractTreeMarshallingStrategy.unmarshal(AbstractTreeMarshallingStrategy.java:32) at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1189) at hudson.util.XStream2.unmarshal(XStream2.java:161) at hudson.util.XStream2.unmarshal(XStream2.java:132) at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1173) at hudson.XmlFile.unmarshal(XmlFile.java:180) at hudson.XmlFile.unmarshal(XmlFile.java:163) at hudson.model.Descriptor.load(Descriptor.java:913) at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at hudson.ExtensionFinder$GuiceFinder$SezpozModule.onProvision(ExtensionFinder.java:596) at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:126) at com.google.inject.internal.ProvisionListenerStackCallback.provision(ProvisionListenerStackCallback.java:68) at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:87) at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:267) at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46) at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103) at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40) at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:145) at hudson.ExtensionFinder$GuiceFinder$FaultTolerantScope$1.get(ExtensionFinder.java:440) at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41) at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1016) at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103) at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1012) at hudson.ExtensionFinder$GuiceFinder._find(ExtensionFinder.java:402) at hudson.ExtensionFinder$GuiceFinder.find(ExtensionFinder.java:393) at hudson.ClassicPluginStrategy.findComponents(ClassicPluginStrategy.java:346) at hudson.ExtensionList.load(ExtensionList.java:381) at hudson.ExtensionList.ensureLoaded(ExtensionList.java:317) at hudson.ExtensionList.getComponents(ExtensionList.java:183) at hudson.DescriptorExtensionList.load(DescriptorExtensionList.java:213) at hudson.ExtensionList.ensureLoaded(ExtensionList.java:317) at hudson.ExtensionList.iterator(ExtensionList.java:172) at hudson.ExtensionList.getInstance(ExtensionList.java:162) at jenkins.security.QueueItemAuthenticatorConfiguration.get(QueueItemAuthenticatorConfiguration.java:61) at org.jenkinsci.plugins.authorizeproject.ProjectQueueItemAuthenticator.getConfigured(ProjectQueueItemAuthenticator.java:178) at org.jenkinsci.plugins.authorizeproject.ProjectQueueItemAuthenticator.isConfigured(ProjectQueueItemAuthenticator.java:190) at org.jenkinsci.plugins.authorizeproject.AuthorizeProjectStrategy.checkUnsecuredConfiguration(AuthorizeProjectStrategy.java:186) at org.jenkinsci.plugins.authorizeproject.AuthorizeProjectStrategy.readResolve(AuthorizeProjectStrategy.java:176) at sun.reflect.GeneratedMethodAccessor74.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.thoughtworks.xstream.converters.reflection.SerializationMethodInvoker.callReadResolve(SerializationMethodInvoker.java:66) at hudson.util.RobustReflectionConverter.unmarshal(RobustReflectionConverter.java:270) ... snip a thousand lines to fit into JIRA posting limits ... at hudson.ExtensionFinder$GuiceFinder$FaultTolerantScope$1.get(ExtensionFinder.java:440) at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41) at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1016) at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103) at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1012) at hudson.ExtensionFinder$GuiceFinder._find(ExtensionFinder.java:402) at hudson.ExtensionFinder$GuiceFinder.find(ExtensionFinder.java:393) ...
I think this was the single longest stack trace I've ever seen, or at least copied. Maybe this depth of calls and recursions is also why it failed?..
Per IRC discussion, one idea is that:
> (12:12:14) timja: ouch, appears to be because you use master => agent access control, and there's an issue there
> (12:12:30) timja: I assume a number of places don't use it which is why there's a low number of reports
Indeed, it is enabled (probably because at some point some documentation recommended it), but at the moment there are no further rules "tweaked".
It may help to highlight other settings:
For agents the protocol used is "Inbound TCP Agent Protocol/4 (TLS encryption) : A TLS secured connection between the master and the agent performed by TLS upgrade of the socket."
CSRF Protection => Default Crumb Issuer, with checked "Enable proxy compatibility" option
Users in local Jenkins DB, using "Matrix-based security" auth strategy
Duplicates JENKINS-61990?
I did link to it from the initial investigation, but am not sure if it is the same issue or not. By wording and discussion it seemed distinct, but the suggestions helped at least partially, and key words in stacktraces were how I found it
I guess it can be marked related at least, not sure still if a duplicate...
FYI : with today's update/restart of our Jenkins, with 2.242 weekly and most recent plugin revisions, the problem did not re-appear: even without wiping the queue files etc. it came up well, as it usually did before.
Also as I posted on IRC yesterday (though with no response), seems this was addressed in recent weeklies at least somehow, though I did not see direct indications in changelog. Or in some plugins, if they were at fault...
> hi, I'm back from vacation and during the time https://issues.jenkins-ci.org/browse/JENKINS-62604 bit our team, so I spent most of last night repairing their "fixes" back as they tried to reduce the scope of starting the Jenkins server (disabling plugins, losing their configs in job XMLs, etc)
> with 2.241 now installed I see there is a new jenkins.security.QueueItemAuthenticatorMonitor in config.xml - but see no apparent changelog entry for that... is that the core team aiding investigation of this regression somehow?
> I think restarts went smoother recently ("recently" being with 2.241, and guessuming it tried to address the issue), but there was a lot of reverting fixing of configs broken by the team so can't really say; did not do a clean restart yet
As I detailed on Jenkins IRC last Friday 2020-06-05 (log at https://echelog.com/logs/browse/jenkins/1591308000 - posts by "jimklimov"), I updated our instance from weekly 2.238 to 2.239 and hell broke loose with Jenkins not finding various classes (I think different ones at various restart retries), always mentioning a lot of problems with QueueItemAuthenticatorConfiguration - and not starting queued builds, trying for 10 minutes but not seeing SSH Agents (while Swarm ones did get connected - but not used) etc. Sometimes it had not loaded the base classes such as Configuration and FilePath and failed to start meaningfully. At some point during restarts, several tags from config.xml disappeared...
Reverting to 2.238 core that worked (albeit with the updated plugins from that week's brew) I could not restore the server - it failed with similar errors.
Ultimately, looking for other similarly worded issues I followed the ideas from JENKINS-61990 and moved away the saved queue and milestone XMLs, and starting from clean slate in this manner worked. However, a clean start in 2.238 WAR and restart into 2.239 failed like in original issue. Cleaning the configs and starting 2.239 worked well. Safe-restarting 2.239 also seems to succeed with the XMLs it had saved.
So initially I wanted to log this as a blocker but in the end a workaround exists so it is not exactly such (assuming lost queued builds can be tolerated). Also, in https://www.jenkins.io/changelog/ weather everyone else seemed satisfied with this release.
For the agents' connections, the fuller stack traces are here : https://pastebin.com/siJbMdWP and a lot of (probably from UI driven status refreshes?) same class in these combos of lines: https://pastebin.com/MFvQukr3
Originally reported by jimklimov, imported from: The class jenkins.security.QueueItemAuthenticatorConfiguration was not found, potentially not yet loaded