jenkinsci / prometheus-plugin

Jenkins Prometheus Plugin
https://plugins.jenkins.io/prometheus/
Apache License 2.0
181 stars 151 forks source link

Prometheus Plugin Update from `773.v3b_62d8178eec` to `778.ve1c932a_ff24f` prevents Jenkins from startup #683

Closed jonesbusy closed 1 month ago

jonesbusy commented 1 month ago

Jenkins and plugins versions report

2.452.3 and all plugin up to date

active-directory:2.35
analysis-model-api:12.4.0
ansible:403.v8d0ca_dcb_b_502
ansicolor:1.0.4
ant:497.v94e7d9fffa_b_9
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.3.1-110.v77252fb_d4da_5
artifactory-artifact-manager:122.v0b_d7cb_89ca_34
artifactory-client-api:2.17.0-7.ve5ccd925a_234
asm-api:9.7-33.v4d23ef79fcc8
atlassian-bitbucket-server-integration:4.0.0
authentication-tokens:1.119.v50285141b_7e1
badge:1.13
basic-branch-build-strategies:81.v05e333931c7d
bitbucket-kubernetes-credentials:336.vc0a_911cde608
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.78.1-248.ve27176eb_46cb_
branch-api:2.1169.va_f810c56e895
build-blocker-plugin:166.vc82fc20b_a_ed6
build-discarder:139.v05696a_7fe240
build-name-setter:2.4.3
build-user-vars-plugin:166.v52976843b_435
build-with-parameters:76.v9382db_f78962
byte-buddy-api:1.14.18-35.v791203db_06c0
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.2.0
cloudbees-bitbucket-branch-source:888.v8e6d479a_1730
cloudbees-folder:6.928.v7c780211d66e
commons-compress-api:1.26.1-2
commons-lang3-api:3.14.0-76.vda_5591261cfe
commons-text-api:1.12.0-119.v73ef73f2345d
config-file-provider:973.vb_a_80ecb_9a_4d0
configuration-as-code:1836.vccda_4a_122a_a_e
coverage-badges-extension:47.v41e62ecf0928
coverage:1.16.1
credentials-binding:681.vf91669a_32e45
credentials:1371.vfee6b_095f0a_3
custom-folder-icon:2.12
customizable-header:124.vff1b_7602cc5a_
customize-build-now:17.ve5db_875e5343
dark-theme:439.vdef09f81f85e
data-tables-api:2.0.8-1
database-postgresql:100.v2418e0a_c6909
database:247.v244b_d85f086d
display-url-api:2.204.vf6fddd8a_8b_e9
docker-commons:439.va_3cb_0a_6a_fb_29
docker-workflow:580.vc0c340686b_54
dotnet-sdk:1.4.0
dtkit-api:3.0.2
durable-task:555.v6802fe0f0b_82
echarts-api:5.5.0-1
eddsa-api:0.3.0-4.v84c6f0f4969e
email-ext:1814.v404722f34263
embeddable-build-status:487.va_0ef04c898a_2
extended-read-permission:53.v6499940139e5
extension-filter:135.v703ff01ca_817
extra-tool-installers:139.v723fee51b_7f2
file-operations:266.v9d4e1eb_235b_a_
file-parameters:339.v4b_cc83e11455
flatpickr-api:4.6.13-5.v534d8025a_a_59
flyway-api:9.22.3-75.vfdfb_f75a_a_9b_e
flyway-runner:122.v004b_965a_cd08
font-awesome-api:6.5.2-1
forensics-api:2.4.0
generic-tool:1.1
git-client:5.0.0
git-forensics:2.1.0
git-parameter:0.9.19
git:5.2.2
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1790.v5a_7859812c8d
github-checks:581.va_9669c91d6cb_
github:1.39.0
gitlab-api:5.3.0-91.v1f9a_fda_d654f
gitlab-branch-source:704.vc7f1202d7e14
gitlab-kubernetes-credentials:259.v7c72898df530
gradle:2.12
gson-api:2.11.0-41.v019fcf6125dc
h2-api:11.1.4.199-30.v1c64e772f3a_c
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
hidden-parameter:237.v4b_df26c7a_f0e
htmlpublisher:1.36
http_request:1.18
implied-labels:295.v770c54615e6a_
inline-pipeline:1.0.3
instance-identity:185.v303dc7c645f9
ionicons-api:74.v93d5eb_813d5f
jackson2-api:2.17.0-379.v02de8ec9f64c
jacoco:3.3.6
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javadoc:243.vb_b_503b_b_45537
javax-activation-api:1.2.0-7
javax-mail-api:1.6.2-10
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jersey2-api:2.42-147.va_28a_44603b_d5
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
job-dsl:1.87
jobcacher-artifactory-storage:51.v06da_175e2655
jobcacher:481.v15f51ca_4c6b_7
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit-attachments:239.v9e003a_c80a_8c
junit-sql-storage:282.v0b_9d0f91b_cec
junit:1279.v72cf99b_25c43
kubernetes-client-api:6.10.0-240.v57880ce8b_0b_2
kubernetes-credentials-provider:1.262.v2670ef7ea_0c5
kubernetes-credentials:174.va_36e093562d9
kubernetes:4253.v7700d91739e5
ldap:725.v3cb_b_711b_1a_ef
locale:519.v4e20f313cfa_f
lockable-resources:1255.vf48745da_35d0
login-theme:146.v64a_da_cf70ea_6
mailer:472.vf7c289a_4b_420
mapdb-api:1.0.9-40.v58107308b_7a_7
markdown-formatter:201.v7057261b_8dff
mask-passwords:173.v6a_077a_291eb_5
matrix-auth:3.2.2
matrix-project:832.va_66e270d2946
metrics:4.2.21-451.vd51df8df52ec
mina-sshd-api-common:2.13.1-117.v2f1a_b_66ff91d
mina-sshd-api-core:2.13.1-117.v2f1a_b_66ff91d
mstest:1.0.5
next-build-number:1.8
next-executions:327.v136ff959e97b_
nodejs:1.6.1
nunit:485.ve8a_85357320d
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
openshift-client:1.1.0.424.v829cb_ccf8798
openshift-k8s-credentials:168.v79a_983191991
opentelemetry-api:1.40.0-15.vf8b_8cdb_d8ff7
pam-auth:1.11
parameter-separator:166.vd0120849b_386
parameterized-scheduler:277.v61a_4b_a_49a_c5c
parameterized-trigger:806.vf6fff3e28c3e
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-graph-view:313.v1322ce83d680
pipeline-groovy-lib:727.ve832a_9244dfa_
pipeline-input-step:495.ve9c153f6067b_
pipeline-maven-api:1421.v610fa_b_e2d60e
pipeline-maven-database:1421.v610fa_b_e2d60e
pipeline-maven:1421.v610fa_b_e2d60e
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2205.vc9522a_9d5711
pipeline-model-definition:2.2205.vc9522a_9d5711
pipeline-model-extensions:2.2205.vc9522a_9d5711
pipeline-npm:204.v4dc4c2202625
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2205.vc9522a_9d5711
pipeline-utility-steps:2.17.0
plain-credentials:183.va_de8f1dd5a_2b_
platformlabeler:2617.v5444054f5e35
plugin-util-api:4.1.0
postgresql-api:42.7.2-40.v76d376d65c77
postgresql-fingerprint-storage:274.v417e436b_1c5e
prism-api:1.29.0-15
prometheus:778.ve1c932a_ff24f
pubsub-light:1.18
purge-build-queue-plugin:88.v23b_97b_f2c7a_d
resource-disposer:0.23
scm-api:690.vfc8b_54395023
scm-filter-branch-pr:148.v0b_5f06e8b_c84
script-security:1341.va_2819b_414686
sidebar-link:2.4.1
simple-theme-plugin:191.vcd207ef9dd24
skip-notifications-trait:313.vd1337c8f8134
snakeyaml-api:2.2-111.vc6598e30cc65
sonar:2.17.2
ssh-agent:367.vf9076cd4ee21
ssh-credentials:342.ve5a_f1db_5a_132
ssh-slaves:2.973.v0fa_8c0dea_f9f
sshd:3.330.vc866a_8389b_58
strict-crumb-issuer:2.1.1
structs:338.v848422169819
theme-manager:262.vc57ee4a_eda_5d
timestamper:1.27
token-macro:400.v35420b_922dcb_
trilead-api:2.147.vb_73cc728a_32e
uno-choice:2.8.3
variant:60.v7290fc0eb_b_cd
versioncolumn:243.vda_c20eea_a_8a_f
warnings-ng:11.3.0
workflow-aggregator:600.vb_57cdd26fdd7
workflow-api:1322.v857eeeea_9902
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3908.vd6b_b_5a_a_54010
workflow-durable-task-step:1360.v82d13453da_a_f
workflow-job:1400.v7fd111b_ec82f
workflow-multibranch:783.va_6eb_ef636fb_d
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:678.v3ee58b_469476
workflow-support:920.v59f71ce16f04
ws-cleanup:0.46

What Operating System are you using (both controller, and any agents involved in the problem)?

All. Official docker image deployed on K8S using Jenkins Helm chart

Reproduction steps

Restart Jenkins

Expected Results

All work as expected like before on 773.v3b_62d8178eec

Actual Results

[05:40:14.108+02:00] - Caused by: org.jvnet.hudson.reactor.ReactorException: java.lang.Error: java.lang.reflect.InvocationTargetException
[05:40:14.108+02:00] -  at org.jvnet.hudson.reactor.Reactor.execute(Reactor.java:290)
[05:40:14.108+02:00] -  at jenkins.InitReactorRunner.run(InitReactorRunner.java:49)
[05:40:14.108+02:00] -  at jenkins.model.Jenkins.executeReactor(Jenkins.java:1210)
[05:40:14.108+02:00] -  at jenkins.model.Jenkins.<init>(Jenkins.java:997)
[05:40:14.108+02:00] -  at hudson.model.Hudson.<init>(Hudson.java:86)
[05:40:14.108+02:00] -  at org.jvnet.hudson.test.JenkinsRule.newHudson(JenkinsRule.java:746)
[05:40:14.108+02:00] -  at org.jvnet.hudson.test.JenkinsRule.before(JenkinsRule.java:408)
[05:40:14.108+02:00] -  at org.jvnet.hudson.test.junit.jupiter.JenkinsExtension.resolveParameter(JenkinsExtension.java:45)
[05:40:14.108+02:00] -  ... 2 more
[05:40:14.108+02:00] - Caused by: java.lang.Error: java.lang.reflect.InvocationTargetException
[05:40:14.108+02:00] -  at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:115)
[05:40:14.108+02:00] -  at hudson.init.TaskMethodFinder$TaskImpl.run(TaskMethodFinder.java:185)
[05:40:14.109+02:00] -  at org.jvnet.hudson.reactor.Reactor.runTask(Reactor.java:305)
[05:40:14.109+02:00] -  at jenkins.model.Jenkins$5.runTask(Jenkins.java:1175)
[05:40:14.109+02:00] -  at org.jvnet.hudson.reactor.Reactor$2.run(Reactor.java:221)
[05:40:14.109+02:00] -  at org.jvnet.hudson.reactor.Reactor$Node.run(Reactor.java:120)
[05:40:14.109+02:00] -  at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
[05:40:14.109+02:00] -  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
[05:40:14.109+02:00] -  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
[05:40:14.109+02:00] -  at java.base/java.lang.Thread.run(Thread.java:1583)
[05:40:14.109+02:00] - Caused by: java.lang.reflect.InvocationTargetException
[05:40:14.109+02:00] -  at java.base/java.lang.reflect.Method.invoke(Method.java:580)
[05:40:14.109+02:00] -  at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:109)
[05:40:14.109+02:00] -  ... 9 more
[05:40:14.109+02:00] - Caused by: java.lang.IllegalArgumentException: Failed to register Collector of type JenkinsStatusCollector: default_jenkins_version_info is already in use by another Collector of type JenkinsStatusCollector
[05:40:14.109+02:00] -  at io.prometheus.client.CollectorRegistry.register(CollectorRegistry.java:57)
[05:40:14.109+02:00] -  at org.jenkinsci.plugins.prometheus.service.DefaultPrometheusMetrics.registerCollector(DefaultPrometheusMetrics.java:54)
[05:40:14.109+02:00] -  at org.jenkinsci.plugins.prometheus.service.DefaultPrometheusMetrics.registerCollectors(DefaultPrometheusMetrics.java:62)
[05:40:14.109+02:00] -  ... 11 more

Anything else?

This is similar to https://github.com/jenkinsci/prometheus-plugin/issues/647 that was fixed on 2.5.3

Are you interested in contributing a fix?

No response

Waschndolos commented 1 month ago

@Dohbedoh any ideas?

Waschndolos commented 1 month ago

I don't get it why it always happens somewhere else. I've multiple docker based Jenkins instance without ANY issues and I've also tested it on a manual installation on linux, no issues as well. What is different here?!

Dohbedoh commented 1 month ago

at org.jvnet.hudson.test.JenkinsRule.newHudson

I don't think that this is supposed to be here. Could the jenkins test harness dependency be loaded in this instance ? Give me a sec, using those plugin deps, I will try to figure out what plugin might be including it.

Dohbedoh commented 1 month ago

Starting an instance with this list, I can't reproduce and can't figure out what includes that class. @jonesbusy could you please run something like the following from the Manage Jenkins > Script Console:

String className = "org.jvnet.hudson.test.JenkinsRule"

def clazzResource = className.replaceAll("\\.", "/") + ".class"
println "Looking for: ${clazzResource}\n"
Jenkins.instance.pluginManager.activePlugins.forEach { PluginWrapper plugin ->
    def c = plugin.classLoader.getResources(clazzResource)
    if (c.hasMoreElements()) {
        println "Found in ${plugin}"
        println Collections.list(c).join("\n") + "\n"
    }
}

return

This should hopefully tell us what include the jenkins-test-harness that should not be present on a live Jenkins instance.

Note: you will need to have Jenkins started - so downgrade temporarily the prometheus plugin.

tt-kkaiser commented 1 month ago

I also have the same issue, that since upgrading to 778.ve1c932a_ff24f my Jenkins no longer starts. However, I don't get an error message, the Jenkins instance just gets stuck after logging:

[id=39]    INFO    jenkins.InitReactorRunner$1#onAttained: Started initialization
[id=38]    INFO    jenkins.InitReactorRunner$1#onAttained: Listed all plugins
[id=41]    INFO    jenkins.InitReactorRunner$1#onAttained: Prepared all plugins
[id=39]    INFO    jenkins.InitReactorRunner$1#onAttained: Started all plugins
[id=41]    INFO    jenkins.InitReactorRunner$1#onAttained: Augmented all extensions
[id=38]    INFO    jenkins.InitReactorRunner$1#onAttained: System config loaded
[id=39]    INFO    jenkins.InitReactorRunner$1#onAttained: System config adapted
[id=40]    INFO    jenkins.InitReactorRunner$1#onAttained: Loaded all jobs
Dohbedoh commented 1 month ago

@tt-kkaiser any chance to get a thread dump or at least the initialization thread stacktrace stuck on startup ?

tt-kkaiser commented 1 month ago

@Dohbedoh At the moment thats not possible, because this Issue only happens on my Production Jenkins and not on my Test Jenkins instance, even though both have the same data (docker image, plugins, configurations)

Dohbedoh commented 1 month ago

Stacktrace of deadlock threads would be the quickest way to narrow this down. @tt-kkaiser is it possible to at least have the list of plugin:version in that instance ?

nilbek commented 1 month ago

We encountered the same issue as in issue 647. After removing Prometheus and enhanced metrics, which is dependent on Prometheus, it opened successfully. I run @Dohbedoh's script, while Jenkins started without mentioned plugins, nothing printed.

jonesbusy commented 1 month ago

at org.jvnet.hudson.test.JenkinsRule.newHudson

I don't think that this is supposed to be here. Could the jenkins test harness dependency be loaded in this instance ? Give me a sec, using those plugin deps, I will try to figure out what plugin might be including it.

On my side I'm using test-harness into a maven project to validate plugin loading and dsl script.

I'm aware it's a bit different from using a real jenkins instance, but it still highlight in a reproductible way there is something wrong with the plugin lifecycle

If I'm able to reproduce on my test environment it should be possible to reproduce it on the plugin test or at least on the jenkins bom that test againts other plugin dependencies

To be honnest I'm also a bit lost on how Initilizer works and I saw also some issues with the opentelemetry and opentelemetry-api plugin with initialization

Perhaps there should be some check to avoid reinitialization ?

I can see something like https://github.com/jenkinsci/opentelemetry-plugin/blob/42e64cb74a0e728cbf020333d77d4f5ecbbcc752/src/main/java/io/jenkins/plugins/opentelemetry/JenkinsOpenTelemetryPluginConfiguration.java#L231 on the opentelemetry plugin ?

Not sure if it helps

tt-kkaiser commented 1 month ago

@Dohbedoh here is a list of all the plugins in the jenkins instance:

active-directory:2.35
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
asm-api:9.7-33.v4d23ef79fcc8
authentication-tokens:1.119.v50285141b_7e1
blueocean-bitbucket-pipeline:1.27.14
blueocean-commons:1.27.14
blueocean-config:1.27.14
blueocean-core-js:1.27.14
blueocean-dashboard:1.27.14
blueocean-display-url:2.4.3
blueocean-events:1.27.14
blueocean-git-pipeline:1.27.14
blueocean-github-pipeline:1.27.14
blueocean-i18n:1.27.14
blueocean-jwt:1.27.14
blueocean-personalization:1.27.14
blueocean-pipeline-api-impl:1.27.14
blueocean-pipeline-editor:1.27.14
blueocean-pipeline-scm-api:1.27.14
blueocean-rest-impl:1.27.14
blueocean-rest:1.27.14
blueocean-web:1.27.14
blueocean:1.27.14
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.78.1-248.ve27176eb_46cb_
branch-api:2.1169.va_f810c56e895
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.2.0
cloudbees-bitbucket-branch-source:888.v8e6d479a_1730
cloudbees-folder:6.942.vb_43318a_156b_2
command-launcher:107.v773860566e2e
commons-compress-api:1.26.1-2
commons-lang3-api:3.14.0-76.vda_5591261cfe
commons-text-api:1.12.0-119.v73ef73f2345d
configuration-as-code:1836.vccda_4a_122a_a_e
credentials-binding:681.vf91669a_32e45
credentials:1371.vfee6b_095f0a_3
display-url-api:2.204.vf6fddd8a_8b_e9
docker-commons:439.va_3cb_0a_6a_fb_29
docker-workflow:580.vc0c340686b_54
durable-task:555.v6802fe0f0b_82
echarts-api:5.5.0-1
eddsa-api:0.3.0-4.v84c6f0f4969e
favorite:2.218.vd60382506538
font-awesome-api:6.5.2-1
git-client:5.0.0
git:5.2.2
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1790.v5a_7859812c8d
github:1.39.0
gson-api:2.11.0-41.v019fcf6125dc
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
htmlpublisher:1.36
http_request:1.18
instance-identity:185.v303dc7c645f9
ionicons-api:74.v93d5eb_813d5f
jackson2-api:2.17.0-379.v02de8ec9f64c
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javax-activation-api:1.2.0-7
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jenkins-design-language:1.27.14
jjwt-api:0.11.5-112.ve82dfb_224b_a_d
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery3-api:3.7.1-2
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit:1265.v65b_14fa_f12f0
kubernetes-client-api:6.10.0-240.v57880ce8b_0b_2
kubernetes-credentials:174.va_36e093562d9
kubernetes:4253.v7700d91739e5
locale:519.v4e20f313cfa_f
mailer:472.vf7c289a_4b_420
matrix-auth:3.2.2
matrix-project:832.va_66e270d2946
metrics:4.2.21-451.vd51df8df52ec
mina-sshd-api-common:2.13.1-117.v2f1a_b_66ff91d
mina-sshd-api-core:2.13.1-117.v2f1a_b_66ff91d
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-groovy-lib:727.ve832a_9244dfa_
pipeline-input-step:495.ve9c153f6067b_
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2205.vc9522a_9d5711
pipeline-model-definition:2.2205.vc9522a_9d5711
pipeline-model-extensions:2.2205.vc9522a_9d5711
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2205.vc9522a_9d5711
pipeline-utility-steps:2.17.0
plain-credentials:183.va_de8f1dd5a_2b_
plugin-util-api:4.1.0
prism-api:1.29.0-15
prometheus:773.v3b_62d8178eec
pubsub-light:1.18
resource-disposer:0.23
scm-api:690.vfc8b_54395023
script-security:1341.va_2819b_414686
snakeyaml-api:2.2-111.vc6598e30cc65
sse-gateway:1.27
ssh-agent:367.vf9076cd4ee21
ssh-credentials:341.vf31377f30378
ssh-slaves:2.973.v0fa_8c0dea_f9f
stashNotifier:1.492.v1b_33f185ee18
structs:338.v848422169819
timestamper:1.27
token-macro:400.v35420b_922dcb_
trilead-api:2.147.vb_73cc728a_32e
variant:60.v7290fc0eb_b_cd
workflow-aggregator:600.vb_57cdd26fdd7
workflow-api:1322.v857eeeea_9902
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3908.vd6b_b_5a_a_54010
workflow-durable-task-step:1360.v82d13453da_a_f
workflow-job:1436.vfa_244484591f
workflow-multibranch:791.v28fb_f74dfca_e
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:678.v3ee58b_469476
workflow-support:920.v59f71ce16f04
ws-cleanup:0.46
zulip:2.1.1
Waschndolos commented 1 month ago

@Dohbedoh @jonesbusy You guys think it's sufficient enough to call this.collectorRegistry.clear(); in the constructor of DefaultPrometheusMetrics? Would probably be the easiest workaround?

Waschndolos commented 1 month ago

As I cannot reproduce the issue, could anyone of the affected users test the PR #684 ? HPI will appear here: https://ci.jenkins.io/job/Plugins/job/prometheus-plugin/job/PR-684/1/

cmoran commented 1 month ago

Does not fix the issue for me, startup ends on the same part as using v778 did

2024-07-25 18:52:55.492+0000 [id=42]    INFO    jenkins.InitReactorRunner$1#onAttained: Loaded all jobs
2024-07-25 18:52:55.494+0000 [id=35]    INFO    o.j.p.l.NodesMirror#createNodeResources: lockable-resources-plugin: configure node resources
Waschndolos commented 1 month ago

Ok thank you

Waschndolos commented 1 month ago

I've created a second version using the lifecycle hooks. If anyone could test this again, that would be great: https://ci.jenkins.io/job/Plugins/job/prometheus-plugin/job/PR-684/lastSuccessfulBuild/artifact/org/jenkins-ci/plugins/prometheus/780.ve7641cfcb_594/prometheus-780.ve7641cfcb_594.hpi

tt-kkaiser commented 1 month ago

I've created a second version using the lifecycle hooks. If anyone could test this again, that would be great: https://ci.jenkins.io/job/Plugins/job/prometheus-plugin/job/PR-684/lastSuccessfulBuild/artifact/org/jenkins-ci/plugins/prometheus/780.ve7641cfcb_594/prometheus-780.ve7641cfcb_594.hpi

On it!

tt-kkaiser commented 1 month ago

@Waschndolos Its still not booting up

Waschndolos commented 1 month ago

@tt-kkaiser Is it still the same error message? I've rebooted like 20 times on my instance without any error. I don't get it...

tt-kkaiser commented 1 month ago

@Waschndolos I never got any error messages. The jenkins instance just stops booting and after some time the Kubernetes Startup Probe kills the Container, but even without the startup probe the jenkins just boots up indefinitely.

Waschndolos commented 1 month ago

Guess I'll need to setup a local cluster. Let me check..

tt-kkaiser commented 1 month ago

I dont think you need a local cluster, the plugin worked for me when all my build jobs had no build data (the stuff stored in caches/ workspace/ and jobs/**/builds/) but now that I have copied everything from my live instance it stopped booting up so it has something to do with that.

nilbek commented 1 month ago

We dont run on Container also. I believe you should have recently finished jobs or something like that.

Waschndolos commented 1 month ago

Ok I'll check that, although it's strange I operate a Jenkins on Docker with like > 3k Jobs without any problems.. I'll see if i can reproduce the issue

tt-kkaiser commented 1 month ago

I have got the stacktrace whilst the jenkins was booting up:

"DefaultPrometheusMetrics.registerJobCollectors" #39 [2984] daemon prio=5 os_prio=0 cpu=10387.26ms elapsed=64.98s tid=0x00007faaac0039f0 nid=2984 runnable  [0x00007fab10c47000]
   java.lang.Thread.State: RUNNABLE
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2692)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda/0x0000000100aeaa08.apply(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap.compute(java.base@21.0.4/Unknown Source)
        - locked <0x00000000c88d44e0> (a java.util.concurrent.ConcurrentHashMap$Node)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2686)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2669)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:112)
        at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:58)
        at org.jenkinsci.plugins.workflow.support.storage.SimpleXStreamFlowNodeStorage.getNode(SimpleXStreamFlowNodeStorage.java:98)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1931)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1262)
        at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:165)
        at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:156)
        at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.next(ForkScanner.java:603)
        at org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.next(AbstractFlowScanner.java:211)
        at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.next(ForkScanner.java:593)
        at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.visitSimpleChunks(ForkScanner.java:796)
        at org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner.visitSimpleChunks(ForkScanner.java:661)
        at org.jenkinsci.plugins.prometheus.util.FlowNodes.getSortedStageNodes(FlowNodes.java:26)
        at org.jenkinsci.plugins.prometheus.collectors.builds.StageSummary.processPipelineRunStages(StageSummary.java:63)
        at org.jenkinsci.plugins.prometheus.collectors.builds.StageSummary.calculateMetric(StageSummary.java:57)
        at org.jenkinsci.plugins.prometheus.collectors.builds.StageSummary.calculateMetric(StageSummary.java:20)
        at org.jenkinsci.plugins.prometheus.JobCollector.processRun(JobCollector.java:261)
        at org.jenkinsci.plugins.prometheus.JobCollector.appendJobMetrics(JobCollector.java:232)
        at org.jenkinsci.plugins.prometheus.JobCollector.lambda$collect$0(JobCollector.java:162)
        at org.jenkinsci.plugins.prometheus.JobCollector$$Lambda/0x0000000100b91650.accept(Unknown Source)
        at org.jenkinsci.plugins.prometheus.util.Jobs.forEachJob(Jobs.java:19)
        at org.jenkinsci.plugins.prometheus.JobCollector.collect(JobCollector.java:155)
        at io.prometheus.client.CollectorRegistry.collectorNames(CollectorRegistry.java:113)
        at io.prometheus.client.CollectorRegistry.register(CollectorRegistry.java:50)
        at org.jenkinsci.plugins.prometheus.service.DefaultPrometheusMetrics.registerCollector(DefaultPrometheusMetrics.java:54)
        at org.jenkinsci.plugins.prometheus.service.DefaultPrometheusMetrics.registerJobCollectors(DefaultPrometheusMetrics.java:85)
        at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@21.0.4/Unknown Source)
        at java.lang.invoke.LambdaForm$MH/0x0000000100342000.invoke(java.base@21.0.4/LambdaForm$MH)
        at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@21.0.4/Unknown Source)
        at jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(java.base@21.0.4/Unknown Source)
        at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(java.base@21.0.4/Unknown Source)
        at java.lang.reflect.Method.invoke(java.base@21.0.4/Unknown Source)
        at hudson.init.TaskMethodFinder.invoke(TaskMethodFinder.java:109)
        at hudson.init.TaskMethodFinder$TaskImpl.run(TaskMethodFinder.java:185)
        at org.jvnet.hudson.reactor.Reactor.runTask(Reactor.java:305)
        at jenkins.model.Jenkins$5.runTask(Jenkins.java:1175)
        at org.jvnet.hudson.reactor.Reactor$2.run(Reactor.java:221)
        at org.jvnet.hudson.reactor.Reactor$Node.run(Reactor.java:120)
        at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.4/Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.4/Unknown Source)
        at java.lang.Thread.runWith(java.base@21.0.4/Unknown Source)
        at java.lang.Thread.run(java.base@21.0.4/Unknown Source)
Waschndolos commented 1 month ago

Okay that's somehting :) this code is there since 2020 - maybe now the behavior changed due to the lifecycles. Will it boot up if you disable this metric like: grafik

.*stage_duration_milliseconds_summary.*

tt-kkaiser commented 1 month ago

Testing it

tt-kkaiser commented 1 month ago

Adding that to disabled metrics, does not seem to help

Waschndolos commented 1 month ago

@Dohbedoh Do you have any additional idea? If not I'll probably revert the https://github.com/jenkinsci/prometheus-plugin/pull/682 to bring it back to work

tt-kkaiser commented 1 month ago

Couldn't the JobCollector be initialized asynchronously?

Dohbedoh commented 1 month ago

The Initializer synchronously execute the registerCollectors and CollectorRegistry immediately launch the collection when doing so.. Looks like theJobCollector may take a long time in some instances, depending on item / build numbers. As @tt-kkaiser propose, I think that we must register the collectors independently / asynchronously.

tt-kkaiser commented 1 month ago

The difference that I see now, is that the data for the /prometheus endpoint is load at startup which causes a longer startup time, before the #682 change it was load when the endpoint was first requested, which did not cause any issues, because whilst the collectors were initialized, the endpoint just returned an empty 200 OK response. But as soon as all collectors were finished the endpoint would respond with the correct data.

I think this behavior should be kept. The change that only one instance exists for the PrometheusMetrics class is okay, but the collectors must be registered asynchronously, that can be done by adding a field called loading (with getter and private setter) so as long as this returns false the /prometheus endpoint may just return an empty 200 OK response.

Waschndolos commented 1 month ago

Ah didn't see your responses yet. I've added synchronized to the registerCollector and cleanUpCollector with https://ci.jenkins.io/job/Plugins/job/prometheus-plugin/job/PR-684/3/artifact/org/jenkins-ci/plugins/prometheus/781.v3dcc3856e4b_6/prometheus-781.v3dcc3856e4b_6.hpi maybe that already helps? Otherwise I'll check what you've written here

Dohbedoh commented 1 month ago

The difference that I see now, is that the data for the /prometheus endpoint is load at startup which causes a longer startup time, before the https://github.com/jenkinsci/prometheus-plugin/pull/682 change it was load when the endpoint was first requested

Actually not always the case. Guice injector in some environment might initialize earlier than this, as shown in the issue description: https://github.com/jenkinsci/prometheus-plugin/issues/635.

I have quickly propose something: https://github.com/jenkinsci/prometheus-plugin/pull/685. And tested the HPI locally. I people can test it once the CI produce the HPI. cc @Waschndolos

tt-kkaiser commented 1 month ago

Testing it!

tt-kkaiser commented 1 month ago

Its working!

Thanks for the effort and the quick implementation of the fix 👍

Dohbedoh commented 1 month ago

Well thanks for your patience ! And sorry for the fuss. Trying to fix one scenario was breaking others... Hopefully this one solve the overall problem.. @jonesbusy could you test this one too ?

jonesbusy commented 1 month ago

@Dohbedoh Doing it now. Just before I was testing https://github.com/jenkinsci/prometheus-plugin/pull/684/checks?check_run_id=27949816152 (incrementals 781.v3dcc3856e4b_6) and didn't faces the duplicate collector registration

I will now test the https://github.com/jenkinsci/prometheus-plugin/pull/685/checks?check_run_id=27950494840 (incrementals 779.vc06615c39172)

I will let you know in some minutes my findings

jonesbusy commented 1 month ago

I confirm it work also on my side with 779.vc06615c39172

Waschndolos commented 1 month ago

alrighty then. I'll merge it and perform a release. Thank you all for your support

Waschndolos commented 1 month ago

-> https://github.com/jenkinsci/prometheus-plugin/releases/tag/779.vb_59179a_27643 - Afk now enjoying my day off today :) Thank you all again!