timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-62604] The class jenkins.security.QueueItemAuthenticatorConfiguration was not found, potentially not yet loaded #1069

Open timja opened 4 years ago

timja commented 4 years ago

As I detailed on Jenkins IRC last Friday 2020-06-05 (log at https://echelog.com/logs/browse/jenkins/1591308000 - posts by "jimklimov"), I updated our instance from weekly 2.238 to 2.239 and hell broke loose with Jenkins not finding various classes (I think different ones at various restart retries), always mentioning a lot of problems with QueueItemAuthenticatorConfiguration - and not starting queued builds, trying for 10 minutes but not seeing SSH Agents (while Swarm ones did get connected - but not used) etc. Sometimes it had not loaded the base classes such as Configuration and FilePath and failed to start meaningfully. At some point during restarts, several tags from config.xml disappeared...

Reverting to 2.238 core that worked (albeit with the updated plugins from that week's brew) I could not restore the server - it failed with similar errors.

Ultimately, looking for other similarly worded issues I followed the ideas from JENKINS-61990 and moved away the saved queue and milestone XMLs, and starting from clean slate in this manner worked. However, a clean start in 2.238 WAR and restart into 2.239 failed like in original issue. Cleaning the configs and starting 2.239 worked well. Safe-restarting 2.239 also seems to succeed with the XMLs it had saved.

So initially I wanted to log this as a blocker but in the end a workaround exists so it is not exactly such (assuming lost queued builds can be tolerated). Also, in https://www.jenkins.io/changelog/ weather everyone else seemed satisfied with this release.

For the agents' connections, the fuller stack traces are here : https://pastebin.com/siJbMdWP and a lot of (probably from UI driven status refreshes?) same class in these combos of lines: https://pastebin.com/MFvQukr3


Originally reported by jimklimov, imported from: The class jenkins.security.QueueItemAuthenticatorConfiguration was not found, potentially not yet loaded
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 4 years ago

jimklimov:

I'll re-post the relevant part of IRC-woes log below:

[09:31:06] huh, swarm agents say: "No valid crumb was included in the request" and refuse to start, what gives? :\ 2.239 weekly and 3.21 swarm
[09:37:27] uh, i'm relieved - after server restart completed, agents connected
[09:37:41] except those older ones not yet updated :\

[11:44:58] hm, after that upgrade there is another problem with agents: the static SSH agent lines blinked up and down in the JENKINS_URL/computer page and ultimately disappeared - both from the details listing (free space, clock, etc.) and from the left column (builds running or idling)
[11:45:35] If I ask to make a New Node as a copy of existing one, those AWOL SSH agents are listed as something to copy.
[11:46:55] however a JENKINS_URL/computer/AGENTNAME returns an HTTP-404
[11:50:01] in the jenkins.log I see several dozen of such entries for 10 minutes following Jenkins JVM startup:
[11:50:01] 2020-06-05 07:31:56.010+0000 [id=30] WARNING hudson.model.AbstractCIBase#updateComputer: Error updating node master-worker-old, continuing
[11:50:01] java.lang.IllegalStateException: The class jenkins.security.QueueItemAuthenticatorConfiguration was not found, potentially not yet loaded
[11:52:27] amounts vary, from about 40 to 200 per agent name that I checked
[12:01:18] this class is also mentioned in other cases:
[12:01:18] 2020-06-05 09:52:00.616+0000 [id=320664]WARNING j.util.AtmostOneTaskExecutor$1#call
[12:01:18] java.lang.IllegalStateException: The class jenkins.security.QueueItemAuthenticatorConfiguration was not found, potentially not yet loaded
[12:01:18] and a lot of (probably from UI driven status refreshes?) same class in these combos of lines: https://pastebin.com/MFvQukr3
[12:10:32] for the agents' connections, the fuller stack traces are here : https://pastebin.com/siJbMdWP
[12:10:59] but overall the common culprit seems to be "java.lang.IllegalStateException: The class jenkins.security.QueueItemAuthenticatorConfiguration was not found, potentially not yet loaded" in different cases

[12:54:08] back to (missing) jenkins.security.QueueItemAuthenticatorConfiguration issues above - I wonder if it is also linked to our detected workers still doing nothing, and queue growing? it says that required labels for jobs are not served by any agent... but the list in that comment does not include the actual workers that do serve the label. Some other jobs are "waiting for available executor on X" and X has dozens free executors. Actually all of them since nothing runs.

[13:07:34] and back to our usual programming: I restarted this instance with plugins again updated (RBAC) and see new logged errors I did not notice before: lots of the likes of:
[13:07:34] 2020-06-05 11:04:41.529+0000 [id=20] WARNING h.ExtensionFinder$GuiceFinder$FaultTolerantScope$1#error: Failed to instantiate Key[type=io.jenkins.plugins.analysis.warnings.Pmd$Descriptor, annotation=[none]]; skipping this component
[13:07:34] java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.pmd.lang.LanguageRegistry
[13:07:34] and
[13:07:34] 2020-06-05 11:04:41.526+0000 [id=20] WARNING h.ExtensionFinder$GuiceFinder$FaultTolerantScope$1#error: Failed to instantiate Key[type=jenkins.security.QueueItemAuthenticatorConfiguration, annotation=[none]]; skipping this component
[13:07:34] java.lang.IllegalStateException: Singleton is called recursively returning different results
[13:08:00] I wonder if the latter is why that class won't load
[13:08:24] wild guess: can it be that different plugins and/or core deliver actively conflicting versions of some libs?
[13:09:54] a discussion at https://issues.jenkins-ci.org/browse/JENKINS-61990 looks inconclusive, but relevant
[13:09:55] JENKINS-61990:StackOverflowError on boot related to QueueItemAuthenticatorConfiguration (Open) https://issues.jenkins-ci.org/browse/JENKINS-61990
[13:28:32] removing queue.xml and restarting... saw some newbies in the mile-long stacktraces...
[13:28:32] 2020-06-05 11:25:53.700+0000 [id=98] WARNING h.i.i.InstallUncaughtExceptionHandler#handleException: Caught unhandled exception with ID 3dffb911-e981-4a52-b2d0-be0cd7bf6c80
[13:28:32] java.lang.AssertionError: class hudson.security.csrf.DefaultCrumbIssuer is missing its descriptor
[13:28:32] and a more detailed
[13:28:32] at com.thoughtworks.xstream.core.TreeUnmarshaller.convert(TreeUnmarshaller.java:72)
[13:28:32] Caused: com.google.inject.ProvisionException: Unable to provision, see the following errors:
[13:28:32] 1) Error injecting constructor, java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.pmd.lang.LanguageRegistry
[13:28:33] at io.jenkins.plugins.analysis.warnings.Pmd$Descriptor.(Pmd.java:61)
[13:28:33] 1 error
[13:28:34] at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:52)
[13:28:34] at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:145)
[13:30:03] oh, wow: 2020-06-05 11:29:17.842+0000 [id=164] SEVERE h.triggers.SCMTrigger$Runner#runPolling: Failed to record SCM polling for org.jenkinsci.plugins.workflow.job.WorkflowJob at 110d8469[42ity-github/czmq/v3 dot 0.2-FTY] // java.lang.NoClassDefFoundError: Could not initialize class hudson.FilePath
[13:52:33] what the snap is happening there? :\
[13:52:33] java.lang.IllegalStateException: Jenkins.instance is missing. Read the documentation of Jenkins.getInstanceOrNull to see what you are doing wrong.

[14:07:22] another try goes by... moved away org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml.bak-20200605 org.jenkinsci.plugins.pipeline.milestone.MilestoneStep.xml.bak-20200605 queue.xml.bak-20200605 and shut down the reverse proxy that exposes Jenkins so it is not confused by the clients and agents until ready...
[14:09:23] also in jenkins.security.QueueItemAuthenticatorConfiguration.xml set the all the strategies to "true" (flipped org.jenkinsci.plugins.authorizeproject.strategy.AnonymousAuthorizationStrategy and org.jenkinsci.plugins.authorizeproject.strategy.SpecificUsersAuthorizationStrategy from false they had)
[14:10:58] and recovered a config.xml - something chopped a few XML tags from it between reboots
[14:12:22] and it's up, and a lot faster than usual

[14:41:51] so that good start was with a reverted 2.238 war... activated the 2.239 - and hell broke loose again, reverted to 2.238 - still broken. Removed the newly appeared queue and workflow xmls, and 2.239 seens to be starting well. So something is seriously borked with that queue persistence :\

timja commented 4 years ago

jimklimov:

I've had a similar issue with update to 2.240 now - it started but was not doing anything, waiting for executors that are there and not seeing SSH Agent workers.

While trying to restart with "least losses", ended up removing just org.jenkinsci.plugins.workflow.flow.FlowExecutionList.xml and org.jenkinsci.plugins.pipeline.milestone.MilestoneStep.xml (not queue.xml since it accumulated quite a few needed items), but during a restart there was a lot of spam in the log about unexpected CSRF postings - I suppose from still running browsers trying to get in.

Then I disabled httpd (apache rev proxy publishing Jenkins), restarted Jenkins, waited some minutes for reasonable stuff to appear in the logs (some jobs starting to run, agents connecting, etc.) and then re-enabled httpd. This approach went smoothly. I guess next time I'll try to not kill any XML files and just restart jenkins with the apache being down.

So by such evidence it seems something changed around handling CSRF secured requests coming from clients that may be invalid on one hand (from another jenkins instance lifetime) and may get processed too early for something (config not read? class not initialized?) and not ignored/delayed until they can be meaningfully processed, and in the end this somehow corrupts the server initialization.

timja commented 4 years ago

jimklimov:

Interesting events from log of failed startup include:

2020-06-09 18:07:58.070+0000 [id=27]    WARNING h.i.i.InstallUncaughtExceptionHandler#handleException: Caught unhandled exception with ID fe811201-89c8-4a44-8f01-c079d
c848c09
java.lang.AssertionError: class hudson.security.csrf.DefaultCrumbIssuer is missing its descriptor
at jenkins.model.Jenkins.getDescriptorOrDie(Jenkins.java:1600)
at hudson.security.csrf.CrumbIssuer.getDescriptor(CrumbIssuer.java:152)
at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:131)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)
at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:169)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)
at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)
at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)
at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)
at jenkins.security.SuspiciousRequestFilter.doFilter(SuspiciousRequestFilter.java:36)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:566)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1610)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1580)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:500)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
at java.lang.Thread.run(Thread.java:748)

and

2020-06-09 18:08:32.048+0000 [id=78]    INFO    jenkins.InitReactorRunner$1#onAttained: Prepared all plugins
2020-06-09 18:08:32.993+0000 [id=113]   WARNING h.util.RobustReflectionConverter#doUnmarshal: There was a problem reporting unmarshalling field errors
java.lang.StackOverflowError
at java.io.UnixFileSystem.resolve(UnixFileSystem.java:102)
at java.io.File.(File.java:367)
at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1279)
at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1249)
at sun.misc.URLClassPath.findResource(URLClassPath.java:215)
at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
at org.eclipse.jetty.webapp.WebAppClassLoader.loadAsResource(WebAppClassLoader.java:617)
at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:534)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at hudson.diagnosis.OldDataMonitor.referTo(OldDataMonitor.java:398)
at hudson.diagnosis.OldDataMonitor.report(OldDataMonitor.java:224)
at hudson.util.RobustReflectionConverter.doUnmarshal(RobustReflectionConverter.java:375)
at hudson.util.RobustReflectionConverter.unmarshal(RobustReflectionConverter.java:269)
at com.thoughtworks.xstream.core.TreeUnmarshaller.convert(TreeUnmarshaller.java:72)
at com.thoughtworks.xstream.core.AbstractReferenceUnmarshaller.convert(AbstractReferenceUnmarshaller.java:65)
at com.thoughtworks.xstream.core.TreeUnmarshaller.convertAnother(TreeUnmarshaller.java:66)
at com.thoughtworks.xstream.core.TreeUnmarshaller.convertAnother(TreeUnmarshaller.java:50)
at com.thoughtworks.xstream.core.TreeUnmarshaller.start(TreeUnmarshaller.java:134)
at com.thoughtworks.xstream.core.AbstractTreeMarshallingStrategy.unmarshal(AbstractTreeMarshallingStrategy.java:32)
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1189)
at hudson.util.XStream2.unmarshal(XStream2.java:161)
at hudson.util.XStream2.unmarshal(XStream2.java:132)
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1173)
at hudson.XmlFile.unmarshal(XmlFile.java:180)
at hudson.XmlFile.unmarshal(XmlFile.java:163)
at hudson.model.Descriptor.load(Descriptor.java:913)
at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at hudson.ExtensionFinder$GuiceFinder$SezpozModule.onProvision(ExtensionFinder.java:596)
at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:126)
at com.google.inject.internal.ProvisionListenerStackCallback.provision(ProvisionListenerStackCallback.java:68)
at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:87)
at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:267)
at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103)
at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:145)
at hudson.ExtensionFinder$GuiceFinder$FaultTolerantScope$1.get(ExtensionFinder.java:440)
at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41)
at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1016)
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103)
at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1012)
at hudson.ExtensionFinder$GuiceFinder._find(ExtensionFinder.java:402)
at hudson.ExtensionFinder$GuiceFinder.find(ExtensionFinder.java:393)
at hudson.ClassicPluginStrategy.findComponents(ClassicPluginStrategy.java:346)
at hudson.ExtensionList.load(ExtensionList.java:381)
at hudson.ExtensionList.ensureLoaded(ExtensionList.java:317)
at hudson.ExtensionList.getComponents(ExtensionList.java:183)
at hudson.DescriptorExtensionList.load(DescriptorExtensionList.java:213)
at hudson.ExtensionList.ensureLoaded(ExtensionList.java:317)
at hudson.ExtensionList.iterator(ExtensionList.java:172)
at hudson.ExtensionList.getInstance(ExtensionList.java:162)
at jenkins.security.QueueItemAuthenticatorConfiguration.get(QueueItemAuthenticatorConfiguration.java:61)
at org.jenkinsci.plugins.authorizeproject.ProjectQueueItemAuthenticator.getConfigured(ProjectQueueItemAuthenticator.java:178)
at org.jenkinsci.plugins.authorizeproject.ProjectQueueItemAuthenticator.isConfigured(ProjectQueueItemAuthenticator.java:190)
at org.jenkinsci.plugins.authorizeproject.AuthorizeProjectStrategy.checkUnsecuredConfiguration(AuthorizeProjectStrategy.java:186)
at org.jenkinsci.plugins.authorizeproject.AuthorizeProjectStrategy.readResolve(AuthorizeProjectStrategy.java:176)
at sun.reflect.GeneratedMethodAccessor74.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.thoughtworks.xstream.converters.reflection.SerializationMethodInvoker.callReadResolve(SerializationMethodInvoker.java:66)
at hudson.util.RobustReflectionConverter.unmarshal(RobustReflectionConverter.java:270)
... snip a thousand lines to fit into JIRA posting limits ...
at hudson.ExtensionFinder$GuiceFinder$FaultTolerantScope$1.get(ExtensionFinder.java:440)
at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41)
at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1016)
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103)
at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1012)
at hudson.ExtensionFinder$GuiceFinder._find(ExtensionFinder.java:402)
at hudson.ExtensionFinder$GuiceFinder.find(ExtensionFinder.java:393)
...

I think this was the single longest stack trace I've ever seen, or at least copied. Maybe this depth of calls and recursions is also why it failed?..

timja commented 4 years ago

jimklimov:

Per IRC discussion, one idea is that:

> (12:12:14) timja: ouch, appears to be because you use master => agent access control, and there's an issue there
> (12:12:30) timja: I assume a number of places don't use it which is why there's a low number of reports

Indeed, it is enabled (probably because at some point some documentation recommended it), but at the moment there are no further rules "tweaked".

It may help to highlight other settings:

For agents the protocol used is "Inbound TCP Agent Protocol/4 (TLS encryption) : A TLS secured connection between the master and the agent performed by TLS upgrade of the socket."

CSRF Protection => Default Crumb Issuer, with checked "Enable proxy compatibility" option

Users in local Jenkins DB, using "Matrix-based security" auth strategy

timja commented 4 years ago

danielbeck:

Duplicates JENKINS-61990?

timja commented 4 years ago

jimklimov:

I did link to it from the initial investigation, but am not sure if it is the same issue or not. By wording and discussion it seemed distinct, but the suggestions helped at least partially, and key words in stacktraces were how I found it

I guess it can be marked related at least, not sure still if a duplicate...

timja commented 4 years ago

jimklimov:

FYI : with today's update/restart of our Jenkins, with 2.242 weekly and most recent plugin revisions, the problem did not re-appear: even without wiping the queue files etc. it came up well, as it usually did before.

Also as I posted on IRC yesterday (though with no response), seems this was addressed in recent weeklies at least somehow, though I did not see direct indications in changelog. Or in some plugins, if they were at fault...

> hi, I'm back from vacation and during the time https://issues.jenkins-ci.org/browse/JENKINS-62604 bit our team, so I spent most of last night repairing their "fixes" back as they tried to reduce the scope of starting the Jenkins server (disabling plugins, losing their configs in job XMLs, etc)

> with 2.241 now installed I see there is a new jenkins.security.QueueItemAuthenticatorMonitor in config.xml - but see no apparent changelog entry for that... is that the core team aiding investigation of this regression somehow?

> I think restarts went smoother recently ("recently" being with 2.241, and guessuming it tried to address the issue), but there was a lot of reverting fixing of configs broken by the team so can't really say; did not do a clean restart yet