jenkinsci / opentelemetry-plugin

Monitor and observe Jenkins with OpenTelemetry.
https://plugins.jenkins.io/opentelemetry/
Apache License 2.0
98 stars 52 forks

Support for Retrieving Jenkins Build Duration Metrics via OpenTelemetry Plugin #972

Open miraccan00 opened 5 days ago

miraccan00 commented 5 days ago

Hello OpenTelemetry Development Team,

I have been working on integrating Jenkins with OpenTelemetry to collect build metrics. While I've made some progress, I've encountered challenges in retrieving certain metrics and would like to seek your assistance or guidance.

Objectives:

Challenges:

Additional Goals:

Current Implementation:

I have created a repository with my current setup and attempts to achieve these objectives:

Request:

Thank you for your time and consideration. I look forward to your response and the possibility of enhancing the Jenkins OpenTelemetry integration together.

Best regards,

cyrille-leclerc commented 3 days ago

Great suggestion! For the build metrics, please see:

Longer term, you are absolutely right: the Jenkins otel plugin should provide all the metrics needed by Jenkins admins and users.

I'm on PTO at the moment, I'll follow up asap.

christophe-kamphaus-jemmic commented 2 days ago

The opentelemetry-plugin also supports sending build traces to a tracing backend (e.g. Elasticsearch or Jaeger). These traces can be queried to calculate metrics, which can then be displayed on a dashboard. Metrics derived from traces in this way are also called span metrics. This is already possible in the current version of the plugin: take the duration of the root span of each build and group it by the ci.pipeline.id attribute set on that span.
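To make the span-metrics idea concrete, here is a minimal Python sketch that derives per-pipeline duration metrics from root spans. It assumes the spans have already been exported from the tracing backend into plain dictionaries; the field names `parent_id`, `start`, `end` and `attributes` are a hypothetical shape chosen for illustration, not the format of any particular backend. Only the `ci.pipeline.id` attribute name comes from the plugin.

```python
from collections import defaultdict
from statistics import mean


def duration_by_pipeline(spans):
    """Aggregate root-span durations (in seconds) per ci.pipeline.id."""
    durations = defaultdict(list)
    for span in spans:
        # Only root spans (no parent) cover a whole build.
        if span.get("parent_id") is not None:
            continue
        pipeline_id = span["attributes"].get("ci.pipeline.id")
        if pipeline_id is None:
            continue
        durations[pipeline_id].append(span["end"] - span["start"])
    return {
        pid: {"count": len(d), "avg_s": mean(d), "max_s": max(d)}
        for pid, d in durations.items()
    }


# Hypothetical sample data: two builds of one pipeline, one of another,
# plus a child span (e.g. a stage) that must be ignored.
SAMPLE_SPANS = [
    {"parent_id": None, "start": 0.0, "end": 120.0,
     "attributes": {"ci.pipeline.id": "my-app/main"}},
    {"parent_id": None, "start": 200.0, "end": 260.0,
     "attributes": {"ci.pipeline.id": "my-app/main"}},
    {"parent_id": "abc", "start": 10.0, "end": 50.0, "attributes": {}},
    {"parent_id": None, "start": 0.0, "end": 30.0,
     "attributes": {"ci.pipeline.id": "docs"}},
]

RESULT = duration_by_pipeline(SAMPLE_SPANS)
```

In practice the same grouping would more likely be expressed as an aggregation query in the tracing backend itself; the sketch only shows the logic.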

If you also want duration metrics for individual stages per pipeline, that is possible by adding withSpanAttributes to your jobs; cf. https://github.com/jenkinsci/opentelemetry-plugin/issues/952#issuecomment-2388922816, https://github.com/jenkinsci/opentelemetry-plugin/issues/811#issuecomment-2116113648

In general, it's not a good idea to have very specific metrics (i.e. specific to a single job run) because of the cardinality issue some metric backends suffer from (e.g. Prometheus). Usually metrics are used to aggregate data (counts, histograms, …), while traces/logs record individual requests/events. For traces/logs it's possible to use sampling to reduce the amount of data that needs to be processed and stored. If a sampling rate of 100% is used, then any metric calculated from the traces should be accurate.
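The cardinality concern can be illustrated with a small Python sketch that counts how many distinct label combinations (and therefore time series) a metric backend would have to store. The build data and the label names `pipeline` and `run` are hypothetical, made up for this example.

```python
def series_count(builds, label_keys):
    """Count distinct label combinations, i.e. the number of time series."""
    return len({tuple(build[key] for key in label_keys) for build in builds})


# Hypothetical data: 3 pipelines with 100 runs each.
BUILDS = [
    {"pipeline": f"pipeline-{p}", "run": n}
    for p in range(3)
    for n in range(100)
]

# Aggregated per pipeline: the series count stays fixed as runs accumulate.
per_pipeline = series_count(BUILDS, ["pipeline"])

# Labelled per run: the series count grows with every new build.
per_run = series_count(BUILDS, ["pipeline", "run"])
```

With a per-pipeline label the series count is bounded by the number of pipelines, while a per-run label adds a new series for every build.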

I think https://github.com/jenkinsci/opentelemetry-plugin/pull/959 is a great addition to the opentelemetry-plugin. It is fine because it aggregates the individual job runs for a given pipeline and gives administrators control over which pipelines should be monitored specifically. What it does not allow is querying the exact build duration for a specific job run. Having metrics specific to a job run would be problematic. The prometheus-plugin has such a feature, which is thankfully guarded by a configuration option, but that option is global and does not allow filtering which jobs it applies to.

In my experience, if you want per-run metrics you are better off querying the traces.