jenkinsci / opentelemetry-plugin

Monitor and observe Jenkins with OpenTelemetry.
https://plugins.jenkins.io/opentelemetry/
Apache License 2.0

Not seeing some APM data in Elastic (CPU, Java Heap, etc.) #584

Closed trudesea closed 1 year ago

trudesea commented 1 year ago

Jenkins and plugins versions report

Environment

```text
Jenkins: 2.375.1
OS: Linux - 5.10.133+
---
ace-editor:1.1
allure-jenkins-plugin:2.30.3
ansicolor:1.0.2
antisamy-markup-formatter:155.v795fb_8702324
apache-httpcomponents-client-4-api:4.5.13-138.v4e7d9a_7b_a_e61
atlassian-bitbucket-server-integration:3.3.2
authentication-tokens:1.4
authorize-project:1.4.0
bitbucket-kubernetes-credentials:88.vc8d98c56572c
bootstrap4-api:4.6.0-5
bootstrap5-api:5.2.1-3
bouncycastle-api:2.26
branch-api:2.1051.v9985666b_f6cc
build-name-setter:2.2.0
caffeine-api:2.9.3-65.v6a_47d0f4d1fe
checks-api:1.8.0
cloudbees-folder:6.800.v71307ca_b_986b
command-launcher:1.2
commons-lang3-api:3.12.0-36.vd97de6465d5b_
commons-text-api:1.10.0-27.vb_fa_3896786a_7
configuration-as-code:1569.vb_72405b_80249
credentials:1214.v1de940103927
credentials-binding:523.vd859a_4b_122e6
custom-checkbox-parameter:1.4
dark-theme:156.v6cf16af6f9ef
display-url-api:2.3.6
durable-task:503.v57154d18d478
echarts-api:5.4.0-1
email-ext:2.92
extended-choice-parameter:359.v35dcfdd0c20d
font-awesome-api:6.2.1-1
generic-webhook-trigger:1.85.2
git:4.14.1
git-client:3.13.1
git-server:1.11
github:1.36.0
github-api:1.303-400.v35c2d8258028
github-branch-source:1696.v3a_7603564d04
google-kubernetes-engine:0.8.7
google-metadata-plugin:0.4
google-oauth-plugin:1.0.7
google-storage-plugin:1.5.7
groovy:453.vcdb_a_c5c99890
handlebars:3.0.8
hashicorp-vault-pipeline:1.4
hashicorp-vault-plugin:359.v2da_3b_45f17d5
instance-identity:142.v04572ca_5b_265
ionicons-api:31.v4757b_6987003
jackson2-api:2.14.1-313.v504cdd45c18b
jakarta-activation-api:2.0.1-2
jakarta-mail-api:2.0.1-2
javadoc:226.v71211feb_e7e9
javax-activation-api:1.2.0-5
javax-mail-api:1.6.2-6
jaxb:2.3.7-1
jdk-tool:1.0
jjwt-api:0.11.5-77.v646c772fddb_0
job-dsl:1.81
jquery:1.12.4-1
jquery3-api:3.6.1-2
jsch:0.1.55.61.va_e9ee26616e7
junit:1166.va_436e268e972
kubernetes:3734.v562b_b_a_627ea_c
kubernetes-client-api:5.12.2-193.v26a_6078f65a_9
kubernetes-credentials:0.9.0
kubernetes-credentials-provider:1.206.v7ce2cf7b_0c8b
logstash:2.5.0205.vd05825ed46bd
mailer:438.v02c7f0a_12fa_4
matrix-auth:3.1.5
matrix-project:785.v06b_7f47b_c631
maven-plugin:3.20
metrics:4.2.13-420.vea_2f17932dd6
mina-sshd-api-common:2.9.1-44.v476733c11f82
mina-sshd-api-core:2.9.1-44.v476733c11f82
momentjs:1.1.1
oauth-credentials:0.5
okhttp-api:4.9.3-108.v0feda04578cf
opentelemetry:2.10.0
pam-auth:1.10
phabricator-k8s:1.0.0
phabricator-plugin:2.1.5
pipeline-build-step:2.18
pipeline-github-lib:38.v445716ea_edda_
pipeline-graph-analysis:195.v5812d95a_a_2f9
pipeline-groovy-lib:621.vb_44ce045b_582
pipeline-input-step:466.v6d0a_5df34f81
pipeline-milestone-step:101.vd572fef9d926
pipeline-model-api:2.2118.v31fd5b_9944b_5
pipeline-model-definition:2.2118.v31fd5b_9944b_5
pipeline-model-extensions:2.2118.v31fd5b_9944b_5
pipeline-rest-api:2.28
pipeline-stage-step:296.v5f6908f017a_5
pipeline-stage-tags-metadata:2.2118.v31fd5b_9944b_5
pipeline-stage-view:2.28
pipeline-utility-steps:2.14.0
plain-credentials:139.ved2b_9cf7587b
plugin-util-api:2.20.0
popper-api:1.16.1-3
popper2-api:2.11.6-2
saltstack:3.2.2
scm-api:621.vda_a_b_055e58f7
script-security:1218.v39ca_7f7ed0a_c
snakeyaml-api:1.33-90.v80dcb_3814d35
ssh-credentials:305.v8f4381501156
ssh-slaves:2.854.v7fd446b_337c9
ssh-steps:2.0.39.v831c5e6468b_c
sshd:3.242.va_db_9da_b_26a_c3
structs:324.va_f5d6774f3a_d
terraform:1.0.10
theme-manager:0.6
throttle-concurrents:2.10
token-macro:321.vd7cc1f2a_52c8
trilead-api:2.84.v72119de229b_7
uno-choice:2.6.4
variant:59.vf075fe829ccb
view-job-filters:2.3
workflow-aggregator:590.v6a_d052e5a_a_b_5
workflow-api:1200.v8005c684b_a_c6
workflow-basic-steps:994.vd57e3ca_46d24
workflow-cps:3536.vb_8a_6628079d5
workflow-cps-global-lib:588.v576c103a_ff86
workflow-durable-task-step:1217.v38306d8fa_b_5c
workflow-job:1254.v3f64639b_11dd
workflow-multibranch:716.vc692a_e52371b_
workflow-scm-step:400.v6b_89a_1317c9a_
workflow-step-api:639.v6eca_cd8c04a_a_
workflow-support:839.v35e2736cfd5c
```

What Operating System are you using (both controller, and any agents involved in the problem)?

Docker containers from Docker Hub: 2.375.1-lts-centos7 (Jenkins controller) and jenkins/inbound-agent:4.10-3 (agent).

All running in GKE; Elastic version 8.5.2.

Reproduction steps

1. Install the plugin.
2. Configure the plugin for use with Elasticsearch Observability.
3. View the APM dashboards. Some data, such as transactions, is there.

Expected Results

See APM data including CPU, RAM, Java Heap, etc.

Actual Results

No data.

[Screenshots: jenkins1; Screenshot 2023-01-26 124631]

Anything else?

I placed the following in the plugin's advanced configuration section, because I was getting 404 errors without it:

otel.exporter.otlp.protocol=http/protobuf
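
For context on the setting above: with `http/protobuf`, the OpenTelemetry exporter specification has the SDK derive one URL per signal from the base endpoint by appending `v1/traces`, `v1/metrics`, and `v1/logs`. The sketch below only illustrates that rule. The function name and the example host are hypothetical, and whether this path difference explains the 404 errors reported here is not confirmed in this thread.

```python
def otlp_http_signal_urls(base_endpoint: str) -> dict[str, str]:
    """Build per-signal OTLP/HTTP URLs from a base endpoint.

    Illustrative helper (hypothetical name): mirrors the spec rule that
    v1/<signal> is appended relative to the base URL.
    """
    # Make sure the base ends with exactly one slash before appending.
    base = base_endpoint if base_endpoint.endswith("/") else base_endpoint + "/"
    return {signal: f"{base}v1/{signal}" for signal in ("traces", "metrics", "logs")}


if __name__ == "__main__":
    # apm.example.com is a placeholder host, not the reporter's endpoint.
    for signal, url in otlp_http_signal_urls("https://apm.example.com").items():
        print(signal, url)
```

If traces arrive but metrics do not, checking that the backend accepts requests on the metrics URL is one place to start.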

trudesea commented 1 year ago

I imagine it has something to do with the last line? getFirstMetricsCapableObservabilityBackend: null

```text
OpenTelemetry SDK initialized: SDK [config: otel.traces.exporter=otlp, otel.metrics.exporter=otlp, otel.exporter.otlp.endpoint=https://xxx-apm.xxx.com, resource: service.name=jenkins-dev, service.namespace=jenkins, service.version=2.375.1]
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.OpenTelemetrySdkProvider
Initialize Otel SDK on components: io.jenkins.plugins.opentelemetry.init.OtelJulHandler, io.jenkins.plugins.opentelemetry.computer.MonitoringCloudListener, io.jenkins.plugins.opentelemetry.computer.MonitoringComputerListener, io.jenkins.plugins.opentelemetry.init.GitHubClientMonitoring, io.jenkins.plugins.opentelemetry.init.JvmMonitoringInitializer, io.jenkins.plugins.opentelemetry.init.SCMEventMonitoringInitializer, io.jenkins.plugins.opentelemetry.init.ServletFilterInitializer, io.jenkins.plugins.opentelemetry.job.MonitoringBuildStepListener, io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener, io.jenkins.plugins.opentelemetry.job.MonitoringRunListener, io.jenkins.plugins.opentelemetry.job.OtelTraceService, io.jenkins.plugins.opentelemetry.job.log.OtelLogStorageFactory, io.jenkins.plugins.opentelemetry.queue.MonitoringQueueListener, io.jenkins.plugins.opentelemetry.security.AuditingSecurityListener
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.computer.MonitoringCloudListener
Start monitoring Jenkins cloud agent provisioning...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.computer.MonitoringComputerListener
Start monitoring Jenkins agents management...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.init.GitHubClientMonitoring
Start monitoring Jenkins GitHub client...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.init.JvmMonitoringInitializer
Start monitoring Jenkins JVM...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.init.SCMEventMonitoringInitializer
Start monitoring Jenkins SCM events...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.init.ServletFilterInitializer
Jenkins Web instrumentation enabled
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener
Start monitoring Jenkins pipeline executions...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.job.MonitoringRunListener
Start monitoring Jenkins build executions...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.queue.MonitoringQueueListener
Start monitoring Jenkins queue...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.security.AuditingSecurityListener
Start monitoring Jenkins authentication events...
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.JenkinsOpenTelemetryPluginConfiguration
resolveStorageRetriever: CustomLogStorageRetriever{urlTemplate=groovy.text.GStringTemplateEngine$GStringTemplate@63fb79e}
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.JenkinsOpenTelemetryPluginConfiguration
Configured
Feb 03, 2023 4:31:22 PM FINE io.jenkins.plugins.opentelemetry.OpenTelemetryRootAction
getFirstMetricsCapableObservabilityBackend: null
Feb 03, 2023 4:31:25 PM FINE io.jenkins.plugins.opentelemetry.OpenTelemetryRootAction
getFirstMetricsCapableObservabilityBackend: null
```

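
One way to check what the SDK was actually configured with is to parse the `config:` part of the "OpenTelemetry SDK initialized" line from the log above. This is a minimal sketch: the function name is hypothetical, and the parsing assumes the exact log format shown in this comment, which may differ in other plugin versions.

```python
import re

# The SDK initialization line, copied from the log in this comment.
SDK_INIT_LINE = (
    "OpenTelemetry SDK initialized: SDK [config: otel.traces.exporter=otlp, "
    "otel.metrics.exporter=otlp, otel.exporter.otlp.endpoint=https://xxx-apm.xxx.com, "
    "resource: service.name=jenkins-dev, service.namespace=jenkins, service.version=2.375.1]"
)


def parse_sdk_config(line: str) -> dict[str, str]:
    """Return the key=value pairs listed between 'config:' and 'resource:'."""
    match = re.search(r"config:\s*(.*?),\s*resource:", line)
    if match is None:
        return {}
    config = {}
    for item in match.group(1).split(","):
        if "=" in item:
            # Split on the first '=' only, so values may contain '='.
            key, value = item.split("=", 1)
            config[key.strip()] = value.strip()
    return config


if __name__ == "__main__":
    config = parse_sdk_config(SDK_INIT_LINE)
    print(config.get("otel.metrics.exporter"))
```

In this log the metrics exporter is set to `otlp`, so metrics export is at least configured. That alone does not show where the backend stores the metrics.
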
trudesea commented 1 year ago

I opened a ticket with Elastic. Basically, they say that the plugin is putting the data in the wrong place for the APM dashboard. The data is there in Elastic, just not in the correct index pattern. Sorry, no more info than that.

cyrille-leclerc commented 1 year ago

Note that we are realigning with the official runtime metrics of the OpenTelemetry Java SDK by adopting the instrumentation:opentelemetry-runtime-metrics library.

cyrille-leclerc commented 1 year ago

Can you please test https://github.com/jenkinsci/opentelemetry-plugin/releases/tag/opentelemetry-2.12.0-rc1

It uses the instrumentation of the Otel Java Auto Instrumentation library to collect JVM / runtime metrics.

trudesea commented 1 year ago

> Can you please test https://github.com/jenkinsci/opentelemetry-plugin/releases/tag/opentelemetry-2.12.0-rc1
>
> It uses the instrumentation of the Otel Java Auto Instrumentation library to collect JVM / runtime metrics.

I can, but we did eventually run into another issue where the plugin was crashing Jenkins. We run 100-200 jobs constantly throughout the day in our production system; these jobs can be quick or run for up to 4 hours. My guess is that we were simply pumping too much data at once, and it was causing restarts of our Jenkins master container (GKE). We've disabled the plugin at this time.

cyrille-leclerc commented 1 year ago

I'm sorry to hear this, @trudesea. Could you by any chance capture details of the crash of the Jenkins controller so we can troubleshoot your problem? Was it an OutOfMemoryError problem?

trudesea commented 1 year ago

> I'm sorry to hear this @trudesea . Could you by any chance capture details of the crash of the Jenkins controller so we can troubleshoot your problem? Was it an OutOfMemoryError problem?

We looked through the logs after a Jenkins master container restart and could find nothing; it would just crash and GKE would restart it. RAM and CPU usage were within constraints. Because we couldn't find an obvious cause, we were instructed to disable the plugin, after which stability returned. I cannot reproduce the problem in our dev environment, which does not reach the number of jobs production does.

cyrille-leclerc commented 1 year ago

Thanks for the clarification, @trudesea. I'm going to close this ticket as we can't investigate further. Please feel free to open a new ticket if you can reproduce the scalability problems you encountered.

trudesea commented 1 year ago

Hi,

We are revisiting the plugin since the new updates. The issue remains. I've created 2 bug reports, with an additional bug discovered. I can work with anyone on this in our dev environment.

Thanks