jenkinsci / opentelemetry-plugin

Monitor and observe Jenkins with OpenTelemetry.
https://plugins.jenkins.io/opentelemetry/
Apache License 2.0
98 stars 50 forks source link

Plugin generates thousands of threads #484

Closed timmjd closed 2 years ago

timmjd commented 2 years ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.346.2 OS: Linux - 5.4.196-108.356.amzn2.x86_64 --- Parameterized-Remote-Trigger:3.1.6.1 ace-editor:1.1 ansicolor:1.0.1 antisamy-markup-formatter:2.7 anything-goes-formatter:1.0 apache-httpcomponents-client-4-api:4.5.13-1.0 authentication-tokens:1.4 badge:1.9.1 basic-branch-build-strategies:1.3.2 blueocean:1.25.5 blueocean-autofavorite:1.2.5 blueocean-bitbucket-pipeline:1.25.5 blueocean-commons:1.25.5 blueocean-config:1.25.5 blueocean-core-js:1.25.5 blueocean-dashboard:1.25.5 blueocean-display-url:2.4.1 blueocean-events:1.25.5 blueocean-git-pipeline:1.25.5 blueocean-github-pipeline:1.25.5 blueocean-i18n:1.25.5 blueocean-jwt:1.25.5 blueocean-personalization:1.25.5 blueocean-pipeline-api-impl:1.25.5 blueocean-pipeline-editor:1.25.5 blueocean-pipeline-scm-api:1.25.5 blueocean-rest:1.25.5 blueocean-rest-impl:1.25.5 blueocean-web:1.25.5 bootstrap5-api:5.1.3-7 bouncycastle-api:2.26 branch-api:2.1046.v0ca_37783ecc5 caffeine-api:2.9.3-65.v6a_47d0f4d1fe checks-api:1.7.4 cloudbees-bitbucket-branch-source:773.v4b_9b_005b_562b_ cloudbees-disk-usage-simple:0.10 cloudbees-folder:6.729.v2b_9d1a_74d673 command-launcher:1.2 commons-lang3-api:3.12.0.0 commons-text-api:1.9-9.v39a_53e2e0343 configuration-as-code:1466.v2d4119502006 copyartifact:1.46.4 credentials:1139.veb_9579fca_33b_ credentials-binding:523.vd859a_4b_122e6 display-url-api:2.3.6 docker-commons:1.19 docker-workflow:1.29 dtkit-api:3.0.1 durable-task:496.va67c6f9eefa7 echarts-api:5.3.3-1 email-ext:2.89 embeddable-build-status:237.v37e2c03964c9 extended-read-permission:3.2 favorite:2.4.1 font-awesome-api:6.1.1-1 git:4.11.3 git-client:3.11.0 github:1.34.4 github-api:1.303-400.v35c2d8258028 github-branch-source:1656.v77eddb_b_e95df groovy-postbuild:2.5 handlebars:3.0.8 handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953 hashicorp-vault-plugin:351.vdb_f83a_1c6a_9d htmlpublisher:1.30 http_request:1.15 jackson2-api:2.13.3-285.vc03c0256d517 javax-activation-api:1.2.0-3 javax-mail-api:1.6.2-6 jaxb:2.3.6-1 jdk-tool:1.0 jenkins-design-language:1.25.5 jjwt-api:0.11.5-77.v646c772fddb_0 job-dsl:1.80 jobConfigHistory:1156.v536a_97b_8d649 jquery-detached:1.2.1 jquery3-api:3.6.0-4 jsch:0.1.55.2 junit:1119.1121.vc43d0fc45561 kubernetes:3663.v1c1e0ec5b_650 kubernetes-cli:1.10.3 kubernetes-client-api:5.12.2-193.v26a_6078f65a_9 kubernetes-credentials:0.9.0 ldap:2.10 locale:180.v207501dff9b_a_ lockable-resources:2.15 mailer:414.vcc4c33714601 matrix-auth:2.6.11 matrix-project:772.v494f19991984 metrics:4.1.6.2 momentjs:1.1.1 monitoring:1.91.0 okhttp-api:4.9.3-105.vb96869f8ac3a opentelemetry:2.9.0-rc1 permissive-script-security:0.7 pipeline-build-step:2.18 pipeline-graph-analysis:195.v5812d95a_a_2f9 pipeline-groovy-lib:593.va_a_fc25d520e9 pipeline-input-step:449.v77f0e8b_845c4 pipeline-milestone-step:101.vd572fef9d926 pipeline-model-api:2.2097.v33db_b_de764b_e pipeline-model-definition:2.2097.v33db_b_de764b_e pipeline-model-extensions:2.2097.v33db_b_de764b_e pipeline-rest-api:2.24 pipeline-stage-step:293.v200037eefcd5 pipeline-stage-tags-metadata:2.2097.v33db_b_de764b_e pipeline-stage-view:2.24 pipeline-utility-steps:2.13.0 plain-credentials:1.8 plugin-util-api:2.17.0 popper2-api:2.11.5-2 prometheus:2.0.10 pubsub-light:1.16 robot:3.2.0 role-strategy:3.2.0 scm-api:608.vfa_f971c5a_a_e9 script-security:1175.v4b_d517d6db_f0 simple-theme-plugin:103.va_161d09c38c7 snakeyaml-api:1.30.2-76.vc104f7ce9870 sse-gateway:1.25 ssh-agent:295.v9ca_a_1c7cc3a_a_ ssh-credentials:291.v8211e4f8efb_c ssh-slaves:1.834.v622da_57f702c sshd:3.0.3 structs:318.va_f3ccb_729b_71 timestamper:1.18 token-macro:293.v283932a_0a_b_49 trilead-api:1.67.vc3938a_35172f variant:1.4 windows-slaves:1.8.1 workflow-aggregator:590.v6a_d052e5a_a_b_5 workflow-api:1188.v0016b_4f29881 workflow-basic-steps:969.vc4ec3e4854b_f workflow-cps:2746.v0da_83a_332669 workflow-durable-task-step:1174.v73a_9a_17edce0 workflow-job:1207.ve6191ff089f8 workflow-multibranch:716.vc692a_e52371b_ workflow-scm-step:400.v6b_89a_1317c9a_ workflow-step-api:625.vd896b_f445a_f8 workflow-support:833.va_1c71061486b_ xunit:3.1.0 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

The current implementation seems to spawn tons of threads over time

Reproduction steps

Run OpenTelemetry with logging forward for quiet a time

Expected Results

Threads will stay relativley constant

Actual Results

Threads are growing over time

Anything else?

JMX / Prometheus monitoring: image

Monitoring plugin

https://my_jenkins_name/monitoring?part=threads

I/O dispatcher threads image

pool threads image

timmjd commented 2 years ago

Digging a bit more into the issue.

Looks like this is only happens if the Store pipeline logs In Elastic and visualize logs both in Elastic and through Jenkins option is used.

For a given job:

podTemplate(showRawYaml: false) {
    node(POD_LABEL) {
        sh('pwd')
    }
}

the following timestamps represent the thread count:

  1. Jenkins freshly restarted
  2. With the option Don't store pipeline logs in Elastic, execute the job 10 times
  3. All Jobs finished: Thread count goes back to normal
  4. Switching the option to Store pipeline logs In Elastic and visualize logs exclusively in Elastic and execute the same job 10 times
  5. All jobs finished: Thread count goes back to normal
  6. Changing the option to Store pipeline logs In Elastic and visualize logs both in Elastic and through Jenkins and execute the same job 10 times
  7. All jobs finished: Thread count stays high 🔥
  8. Executing the same job additional 10 times
  9. All jobs finished: Thread count grows even higher - and stays 🔥

image

As you can see, some threads are not removed if the option is set to Store pipeline logs In Elastic and visualize logs both in Elastic and through Jenkins. With every new job executed, the threads will constantly increase. It's also not possible to reduce the threads by switching the option back to the Don't store pipeline logs in Elastic option. Only workaround is a restart. We'd seen the thread count will usually grow over time up to 2.000 - 3.000 before Jenkins will crash and perform a restart.

cyrille-leclerc commented 2 years ago

Sorry for the inconvenience and thank you for all the troubleshooting. I'm on holidays but I'm exploring the problem.

Cyrille's observations on Otel SDK 1.16.0

See

cyrille-leclerc commented 2 years ago

Potential http connection leak on the Elasticsearch access found.

I'm introducing ErrorProne checker to put it under control.

-- Update 2022/08/18

cyrille-leclerc commented 2 years ago

@timmjd can you please test https://github.com/jenkinsci/opentelemetry-plugin/releases/tag/opentelemetry-2.9.1-beta-1 and check if the thread leak is fixed?

timmjd commented 2 years ago

Niiiiice, looks like you fixed it:

Even with ~50 builds, the thread count goes back to normal after a few minutes. There is also no pool or I/O thread hanging anymore. Thank you very much!

image

cyrille-leclerc commented 2 years ago

Thank you so much for testing in depth and for sharing feedback on your Saturday