@Dohbedoh any ideas?
I don't get why it always happens somewhere else. I have multiple Docker-based Jenkins instances without ANY issues, and I've also tested it on a manual installation on Linux, no issues there either. What is different here?!
at org.jvnet.hudson.test.JenkinsRule.newHudson
I don't think that this is supposed to be here. Could the jenkins-test-harness dependency be loaded in this instance? Give me a sec; using those plugin deps, I will try to figure out which plugin might be including it.
Starting an instance with this list, I can't reproduce and can't figure out what includes that class. @jonesbusy could you please run something like the following from Manage Jenkins > Script Console:
String className = "org.jvnet.hudson.test.JenkinsRule"
def clazzResource = className.replaceAll("\\.", "/") + ".class"
println "Looking for: ${clazzResource}\n"
Jenkins.instance.pluginManager.activePlugins.forEach { PluginWrapper plugin ->
    def c = plugin.classLoader.getResources(clazzResource)
    if (c.hasMoreElements()) {
        println "Found in ${plugin}"
        println Collections.list(c).join("\n") + "\n"
    }
}
return
This should hopefully tell us what includes the jenkins-test-harness, which should not be present on a live Jenkins instance.
Note: you will need to have Jenkins started - so temporarily downgrade the prometheus plugin.
I also have the same issue: since upgrading to 778.ve1c932a_ff24f, my Jenkins no longer starts.
However, I don't get an error message; the Jenkins instance just gets stuck after logging:
[id=39] INFO jenkins.InitReactorRunner$1#onAttained: Started initialization
[id=38] INFO jenkins.InitReactorRunner$1#onAttained: Listed all plugins
[id=41] INFO jenkins.InitReactorRunner$1#onAttained: Prepared all plugins
[id=39] INFO jenkins.InitReactorRunner$1#onAttained: Started all plugins
[id=41] INFO jenkins.InitReactorRunner$1#onAttained: Augmented all extensions
[id=38] INFO jenkins.InitReactorRunner$1#onAttained: System config loaded
[id=39] INFO jenkins.InitReactorRunner$1#onAttained: System config adapted
[id=40] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs
@tt-kkaiser any chance to get a thread dump, or at least the stacktrace of the initialization thread that is stuck on startup?
@Dohbedoh At the moment that's not possible, because this issue only happens on my production Jenkins and not on my test Jenkins instance, even though both have the same data (Docker image, plugins, configuration).
Stacktraces of the deadlocked threads would be the quickest way to narrow this down. @tt-kkaiser is it possible to at least get the list of plugin:version in that instance?
We encountered the same issue as in issue 647. After removing Prometheus and enhanced metrics, which depends on Prometheus, it started successfully. I ran @Dohbedoh's script while Jenkins was running without the mentioned plugins; nothing was printed.
at org.jvnet.hudson.test.JenkinsRule.newHudson
I don't think that this is supposed to be here. Could the jenkins-test-harness dependency be loaded in this instance? Give me a sec; using those plugin deps, I will try to figure out which plugin might be including it.
On my side I'm using the test harness in a Maven project to validate plugin loading and DSL scripts.
I'm aware it's a bit different from using a real Jenkins instance, but it still highlights in a reproducible way that there is something wrong with the plugin lifecycle.
If I'm able to reproduce it in my test environment, it should be possible to reproduce it in the plugin tests, or at least in the Jenkins BOM that tests against other plugin dependencies.
To be honest, I'm also a bit lost on how the Initializer works, and I also saw some initialization issues with the opentelemetry and opentelemetry-api plugins.
Perhaps there should be some check to avoid reinitialization? I can see something like https://github.com/jenkinsci/opentelemetry-plugin/blob/42e64cb74a0e728cbf020333d77d4f5ecbbcc752/src/main/java/io/jenkins/plugins/opentelemetry/JenkinsOpenTelemetryPluginConfiguration.java#L231 in the opentelemetry plugin?
Not sure if it helps.
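For illustration, a minimal sketch of such a guard, assuming a hypothetical initialize() entry point rather than the plugin's actual code:

import java.util.concurrent.atomic.AtomicBoolean;

public class MetricsInitializer {

    private final AtomicBoolean initialized = new AtomicBoolean(false);

    public void initialize() {
        // compareAndSet flips false -> true exactly once, so repeated
        // calls (e.g. from multiple initializers) become no-ops
        if (!initialized.compareAndSet(false, true)) {
            return;
        }
        registerCollectors();
    }

    private void registerCollectors() {
        // the expensive one-time registration work would go here
    }
}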
@Dohbedoh here is a list of all the plugins in the jenkins instance:
active-directory:2.35
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
asm-api:9.7-33.v4d23ef79fcc8
authentication-tokens:1.119.v50285141b_7e1
blueocean-bitbucket-pipeline:1.27.14
blueocean-commons:1.27.14
blueocean-config:1.27.14
blueocean-core-js:1.27.14
blueocean-dashboard:1.27.14
blueocean-display-url:2.4.3
blueocean-events:1.27.14
blueocean-git-pipeline:1.27.14
blueocean-github-pipeline:1.27.14
blueocean-i18n:1.27.14
blueocean-jwt:1.27.14
blueocean-personalization:1.27.14
blueocean-pipeline-api-impl:1.27.14
blueocean-pipeline-editor:1.27.14
blueocean-pipeline-scm-api:1.27.14
blueocean-rest-impl:1.27.14
blueocean-rest:1.27.14
blueocean-web:1.27.14
blueocean:1.27.14
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.78.1-248.ve27176eb_46cb_
branch-api:2.1169.va_f810c56e895
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.2.0
cloudbees-bitbucket-branch-source:888.v8e6d479a_1730
cloudbees-folder:6.942.vb_43318a_156b_2
command-launcher:107.v773860566e2e
commons-compress-api:1.26.1-2
commons-lang3-api:3.14.0-76.vda_5591261cfe
commons-text-api:1.12.0-119.v73ef73f2345d
configuration-as-code:1836.vccda_4a_122a_a_e
credentials-binding:681.vf91669a_32e45
credentials:1371.vfee6b_095f0a_3
display-url-api:2.204.vf6fddd8a_8b_e9
docker-commons:439.va_3cb_0a_6a_fb_29
docker-workflow:580.vc0c340686b_54
durable-task:555.v6802fe0f0b_82
echarts-api:5.5.0-1
eddsa-api:0.3.0-4.v84c6f0f4969e
favorite:2.218.vd60382506538
font-awesome-api:6.5.2-1
git-client:5.0.0
git:5.2.2
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1790.v5a_7859812c8d
github:1.39.0
gson-api:2.11.0-41.v019fcf6125dc
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
htmlpublisher:1.36
http_request:1.18
instance-identity:185.v303dc7c645f9
ionicons-api:74.v93d5eb_813d5f
jackson2-api:2.17.0-379.v02de8ec9f64c
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javax-activation-api:1.2.0-7
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jenkins-design-language:1.27.14
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery3-api:3.7.1-2
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit:1265.v65b_14fa_f12f0
kubernetes-client-api:6.10.0-240.v57880ce8b_0b_2
kubernetes-credentials:174.va_36e093562d9
kubernetes:4253.v7700d91739e5
locale:519.v4e20f313cfa_f
mailer:472.vf7c289a_4b_420
matrix-auth:3.2.2
matrix-project:832.va_66e270d2946
metrics:4.2.21-451.vd51df8df52ec
mina-sshd-api-common:2.13.1-117.v2f1a_b_66ff91d
mina-sshd-api-core:2.13.1-117.v2f1a_b_66ff91d
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-groovy-lib:727.ve832a_9244dfa_
pipeline-input-step:495.ve9c153f6067b_
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2205.vc9522a_9d5711
pipeline-model-definition:2.2205.vc9522a_9d5711
pipeline-model-extensions:2.2205.vc9522a_9d5711
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2205.vc9522a_9d5711
pipeline-utility-steps:2.17.0
plain-credentials:183.va_de8f1dd5a_2b_
plugin-util-api:4.1.0
prism-api:1.29.0-15
prometheus:773.v3b_62d8178eec
pubsub-light:1.18
resource-disposer:0.23
scm-api:690.vfc8b_54395023
script-security:1341.va_2819b_414686
snakeyaml-api:2.2-111.vc6598e30cc65
sse-gateway:1.27
ssh-agent:367.vf9076cd4ee21
ssh-credentials:341.vf31377f30378
ssh-slaves:2.973.v0fa_8c0dea_f9f
stashNotifier:1.492.v1b_33f185ee18
structs:338.v848422169819
timestamper:1.27
token-macro:400.v35420b_922dcb_
trilead-api:2.147.vb_73cc728a_32e
variant:60.v7290fc0eb_b_cd
workflow-aggregator:600.vb_57cdd26fdd7
workflow-api:1322.v857eeeea_9902
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3908.vd6b_b_5a_a_54010
workflow-durable-task-step:1360.v82d13453da_a_f
workflow-job:1436.vfa_244484591f
workflow-multibranch:791.v28fb_f74dfca_e
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:678.v3ee58b_469476
workflow-support:920.v59f71ce16f04
ws-cleanup:0.46
zulip:2.1.1
@Dohbedoh @jonesbusy Do you think it's sufficient to call this.collectorRegistry.clear(); in the constructor of DefaultPrometheusMetrics? That would probably be the easiest workaround.
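For what it's worth, a hedged sketch of that workaround (field and constructor shape are assumed, not copied from the plugin):

import io.prometheus.client.CollectorRegistry;

public class DefaultPrometheusMetrics {

    private final CollectorRegistry collectorRegistry = CollectorRegistry.defaultRegistry;

    public DefaultPrometheusMetrics() {
        // drop any collectors left over from a previous initialization so a
        // second construction cannot fail with "already registered"
        this.collectorRegistry.clear();
    }
}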
As I cannot reproduce the issue, could any of the affected users test PR #684? The HPI will appear here: https://ci.jenkins.io/job/Plugins/job/prometheus-plugin/job/PR-684/1/
Does not fix the issue for me; startup stops at the same point as it did with v778:
2024-07-25 18:52:55.492+0000 [id=42] INFO jenkins.InitReactorRunner$1#onAttained: Loaded all jobs
2024-07-25 18:52:55.494+0000 [id=35] INFO o.j.p.l.NodesMirror#createNodeResources: lockable-resources-plugin: configure node resources
Ok thank you
I've created a second version using the lifecycle hooks. If anyone could test this again, that would be great: https://ci.jenkins.io/job/Plugins/job/prometheus-plugin/job/PR-684/lastSuccessfulBuild/artifact/org/jenkins-ci/plugins/prometheus/780.ve7641cfcb_594/prometheus-780.ve7641cfcb_594.hpi
On it!
@Waschndolos It's still not booting up.
@tt-kkaiser Is it still the same error message? I've rebooted like 20 times on my instance without any error. I don't get it...
@Waschndolos I never got any error messages. The Jenkins instance just stops booting, and after some time the Kubernetes startup probe kills the container; but even without the startup probe, Jenkins just keeps booting indefinitely.
Guess I'll need to set up a local cluster. Let me check..
I don't think you need a local cluster. The plugin worked for me when all my build jobs had no build data (the stuff stored in caches/, workspace/ and jobs/**/builds/), but now that I have copied everything from my live instance it stopped booting up, so it has something to do with that.
We don't run in containers either. I believe you need recently finished jobs or something like that.
Ok, I'll check that, although it's strange: I operate a Jenkins on Docker with > 3k jobs without any problems. I'll see if I can reproduce the issue.
I got this stacktrace whilst Jenkins was booting up:
"DefaultPrometheusMetrics.registerJobCollectors" #39 [2984] daemon prio=5 os_prio=0 cpu=10387.26ms elapsed=64.98s tid=0x00007faaac0039f0 nid=2984 runnable [0x00007fab10c47000]
java.lang.Thread.State: RUNNABLE
at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2692)
at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda/0x0000000100aeaa08.apply(Unknown Source)
at java.util.concurrent.ConcurrentHashMap.compute(java.base@21.0.4/Unknown Source)
- locked <0x00000000c88d44e0> (a java.util.concurrent.ConcurrentHashMap$Node)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2686)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2669)
at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:112)
at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:58)
at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:98)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1931)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1262)
at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165)
at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156)
at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.next(ForkScanner.java:603)
at org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.next(AbstractFlowScanner.java:211)
at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.next(ForkScanner.java:593)
at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.visitSimpleChunks(ForkScanner.java:796)
at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.visitSimpleChunks(ForkScanner.java:661)
at org.jenkinsci.plugins.prometheus.util.FlowNodes.getSortedStageNodes(FlowNodes.java:26)
at org.jenkinsci.plugins.prometheus.collectors.builds.StageSummary.processPipelineRunStages(StageSummary.java:63)
at org.jenkinsci.plugins.prometheus.collectors.builds.StageSummary.calculateMetric(StageSummary.java:57)
at org.jenkinsci.plugins.prometheus.collectors.builds.StageSummary.calculateMetric(StageSummary.java:20)
at org.jenkinsci.plugins.prometheus.JobCollector.processRun(JobCollector.java:261)
at org.jenkinsci.plugins.prometheus.JobCollector.appendJobMetrics(JobCollector.java:232)
at org.jenkinsci.plugins.prometheus.JobCollector.lambda$collect$0(JobCollector.java:162)
at org.jenkinsci.plugins.prometheus.JobCollector$$Lambda/0x0000000100b91650.accept(Unknown Source)
at org.jenkinsci.plugins.prometheus.util.Jobs.forEachJob(Jobs.java:19)
at org.jenkinsci.plugins.prometheus.JobCollector.collect(JobCollector.java:155)
at io.prometheus.client.CollectorRegistry.collectorNames(CollectorRegistry.java:113)
at io.prometheus.client.CollectorRegistry.register(CollectorRegistry.java:50)
at org.jenkinsci.plugins.prometheus.service.DefaultPrometheusMetrics.registerCollector(DefaultPrometheusMetrics.java:54)
at org.jenkinsci.plugins.prometheus.service.DefaultPrometheusMetrics.registerJobCollectors(DefaultPrometheusMetrics.java:85)
at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@21.0.4/Unknown Source)
at java.lang.invoke.LambdaForm$MH/0x0000000100342000.invoke(java.base@21.0.4/LambdaForm$MH)
at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@21.0.4/Unknown Source)
at jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(java.base@21.0.4/Unknown Source)
at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(java.base@21.0.4/Unknown Source)
at java.lang.reflect.Method.invoke(java.base@21.0.4/Unknown Source)
at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:109)
at hudson.init.TaskMethodFinder$TaskImpl.run(TaskMethodFinder.java:185)
at org.jvnet.hudson.reactor.Reactor.runTask(Reactor.java:305)
at jenkins.model.Jenkins$5.runTask(Jenkins.java:1175)
at org.jvnet.hudson.reactor.Reactor$2.run(Reactor.java:221)
at org.jvnet.hudson.reactor.Reactor$Node.run(Reactor.java:120)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.4/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.4/Unknown Source)
at java.lang.Thread.runWith(java.base@21.0.4/Unknown Source)
at java.lang.Thread.run(java.base@21.0.4/Unknown Source)
Okay, that's something :) This code has been there since 2020 - maybe the behavior changed now due to the lifecycles. Will it boot up if you disable this metric, like:
.*stage_duration_milliseconds_summary.*
Testing it
Adding that to the disabled metrics does not seem to help.
@Dohbedoh Do you have any additional ideas? If not, I'll probably revert https://github.com/jenkinsci/prometheus-plugin/pull/682 to bring it back to a working state.
Couldn't the JobCollector be initialized asynchronously? The Initializer synchronously executes registerCollectors, and the CollectorRegistry immediately launches a collection when doing so. It looks like the JobCollector may take a long time on some instances, depending on the number of items / builds.
As @tt-kkaiser proposes, I think that we must register the collectors independently / asynchronously.
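Roughly like this sketch, where the @Initializer returns immediately and the registration (which may walk every build of every job) runs on a background thread instead. Class and method names here are illustrative, not the plugin's actual ones:

import hudson.init.InitMilestone;
import hudson.init.Initializer;
import java.util.concurrent.CompletableFuture;

public class AsyncCollectorRegistration {

    @Initializer(after = InitMilestone.JOB_LOADED)
    public static void registerJobCollectors() {
        // do not block the reactor thread; Jenkins startup can proceed
        CompletableFuture.runAsync(AsyncCollectorRegistration::doRegister);
    }

    private static void doRegister() {
        // iterate jobs and register the collectors here
    }
}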
The difference that I see now is that the data for the /prometheus endpoint is loaded at startup, which causes a longer startup time. Before the #682 change it was loaded when the endpoint was first requested, which did not cause any issues: while the collectors were initializing, the endpoint just returned an empty 200 OK response, and as soon as all collectors were finished the endpoint responded with the correct data.
I think this behavior should be kept. The change so that only one instance of the PrometheusMetrics class exists is okay, but the collectors must be registered asynchronously. That could be done by adding a field called loading (with getter and private setter), so that as long as it indicates loading is still in progress, the /prometheus endpoint just returns an empty 200 OK response.
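A minimal sketch of that loading flag, with hypothetical names, could look like this:

import java.util.concurrent.atomic.AtomicBoolean;

public class PrometheusEndpoint {

    private final AtomicBoolean loading = new AtomicBoolean(true);

    public boolean isLoading() {
        return loading.get();
    }

    private void onCollectorsRegistered() {
        loading.set(false);
    }

    public String metrics() {
        if (isLoading()) {
            return ""; // empty 200 OK until registration has finished
        }
        return scrapeRegistry();
    }

    private String scrapeRegistry() {
        // serialize the registered collectors in the Prometheus text format
        return "...";
    }
}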
Ah, I didn't see your responses yet. I've added synchronized to registerCollector and cleanUpCollector with https://ci.jenkins.io/job/Plugins/job/prometheus-plugin/job/PR-684/3/artifact/org/jenkins-ci/plugins/prometheus/781.v3dcc3856e4b_6/prometheus-781.v3dcc3856e4b_6.hpi - maybe that already helps? Otherwise I'll check what you've written here.
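For reference, the synchronized variant amounts to something like this (signatures assumed for illustration):

import io.prometheus.client.Collector;
import io.prometheus.client.CollectorRegistry;

public class DefaultPrometheusMetrics {

    private final CollectorRegistry registry = CollectorRegistry.defaultRegistry;

    // synchronized so concurrent initializers cannot register the same
    // collector twice or race a cleanup in progress
    public synchronized void registerCollector(Collector collector) {
        registry.register(collector);
    }

    public synchronized void cleanUpCollector(Collector collector) {
        registry.unregister(collector);
    }
}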
The difference that I see now is that the data for the /prometheus endpoint is loaded at startup, which causes a longer startup time; before the https://github.com/jenkinsci/prometheus-plugin/pull/682 change it was loaded when the endpoint was first requested
That's actually not always the case. The Guice injector in some environments might initialize it earlier than that, as shown in the issue description: https://github.com/jenkinsci/prometheus-plugin/issues/635.
I have quickly proposed something: https://github.com/jenkinsci/prometheus-plugin/pull/685, and tested the HPI locally. People can test it once the CI produces the HPI. cc @Waschndolos
Testing it!
It's working!
Thanks for the effort and the quick implementation of the fix 👍
Well, thanks for your patience! And sorry for the fuss; trying to fix one scenario was breaking others... Hopefully this one solves the overall problem. @jonesbusy could you test this one too?
@Dohbedoh Doing it now. Just before, I was testing https://github.com/jenkinsci/prometheus-plugin/pull/684/checks?check_run_id=27949816152 (incrementals 781.v3dcc3856e4b_6) and didn't face the duplicate collector registration.
I will now test https://github.com/jenkinsci/prometheus-plugin/pull/685/checks?check_run_id=27950494840 (incrementals 779.vc06615c39172).
I will let you know my findings in a few minutes.
I confirm it also works on my side with 779.vc06615c39172.
Alrighty then. I'll merge it and perform a release. Thank you all for your support!
-> https://github.com/jenkinsci/prometheus-plugin/releases/tag/779.vb_59179a_27643 - AFK now, enjoying my day off :) Thank you all again!
Jenkins and plugins versions report
2.452.3, with all plugins up to date
What Operating System are you using (both controller, and any agents involved in the problem)?
All. Official Docker image deployed on K8s using the Jenkins Helm chart.
Reproduction steps
Restart Jenkins
Expected Results
Everything works as expected, like it did before on 773.v3b_62d8178eec
Actual Results
Anything else?
This is similar to https://github.com/jenkinsci/prometheus-plugin/issues/647, which was fixed in 2.5.3.
Are you interested in contributing a fix?
No response