Support for Retrieving Jenkins Build Duration Metrics via OpenTelemetry Plugin

miraccan00 commented 4 weeks ago

Hello OpenTelemetry Development Team,

I have been working on integrating Jenkins with OpenTelemetry to collect build metrics. While I've made some progress, I've encountered challenges in retrieving certain metrics and would like to seek your assistance or guidance.

Objectives:

Retrieve Build Duration Metrics for Jobs:

I need to collect the build duration metrics for individual Jenkins jobs. This includes capturing the time each build takes to complete.

Calculate Average Build Duration for Specific Jobs:

I aim to compute the average build duration over time for specific jobs to analyze performance trends.

Challenges:

Unified Data Collection:

I wish to obtain all the metrics that the Jenkins Prometheus plugin provides but exclusively using the OpenTelemetry plugin. My goal is to avoid using multiple collectors and centralize all data collection through OpenTelemetry.

Plugin Support and Roadmap:

It appears that the current OpenTelemetry Jenkins plugin may not support some of these metrics out of the box. If that's the case, I am willing to contribute to the plugin's development. I would greatly appreciate a roadmap, guidelines, or any documentation that could assist me in extending the plugin to support these metrics.

Additional Goals:

Grafana Dashboard Integration:

Ultimately, I aim to create a Grafana dashboard that visualizes all the collected Jenkins metrics in one place, leveraging the data from OpenTelemetry.

Current Implementation:

I have created a repository with my current setup and attempts to achieve these objectives:

GitHub Repository: https://github.com/miraccan00/Jenkins-Otel-Grafana

This repository includes the configurations and code I've been working with, which may help in understanding the current state and the issues I'm facing.

Request:

Support and Guidance:

Could you please advise on whether the OpenTelemetry Jenkins plugin currently supports these metrics? If not, what would be the recommended approach to implement this functionality?

Collaboration Opportunity:

If development is needed to add this feature, I am eager to contribute. Guidance on how to proceed or whom to collaborate with would be highly appreciated.

Thank you for your time and consideration. I look forward to your response and the possibility of enhancing the Jenkins OpenTelemetry integration together.

Best regards,

cyrille-leclerc commented 4 weeks ago

Great suggestion! For the build metrics, please see:

https://github.com/jenkinsci/opentelemetry-plugin/pull/959

Longer term you are absolutely right, the Jenkins otel plugin should provide all the metrics needed by Jenkins admins and users.

I'm on PTO at the moment, I'll follow up asap.

christophe-kamphaus-jemmic commented 3 weeks ago

The opentelemetry-plugin also supports sending build traces to a tracing backend (e.g. Elasticsearch or Jaeger). These traces can be queried to calculate metrics, often called span metrics, which can then be displayed on a dashboard. This is already possible in the current version of the plugin by using the span duration grouped by the ci.pipeline.id attribute set on the root span of the build.
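As a rough illustration of the span-metrics idea, here is a minimal Python sketch that averages root-span durations per pipeline. Only the ci.pipeline.id attribute name comes from the plugin; the record layout, the duration_s field, and the pipeline names are assumptions made for this example, and a real tracing backend would return spans in its own format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical root-span records, roughly as they might be exported from a
# tracing backend. The record shape and "duration_s" field are assumptions.
root_spans = [
    {"attributes": {"ci.pipeline.id": "team-a/build"}, "duration_s": 120.0},
    {"attributes": {"ci.pipeline.id": "team-a/build"}, "duration_s": 180.0},
    {"attributes": {"ci.pipeline.id": "team-b/deploy"}, "duration_s": 60.0},
]


def average_duration_by_pipeline(spans):
    """Group root-span durations by ci.pipeline.id and average each group."""
    durations = defaultdict(list)
    for span in spans:
        pipeline_id = span["attributes"]["ci.pipeline.id"]
        durations[pipeline_id].append(span["duration_s"])
    return {pipeline: mean(values) for pipeline, values in durations.items()}


averages = average_duration_by_pipeline(root_spans)
print(averages)
```

In practice the grouping and averaging would usually be done by the tracing backend's own query language rather than in client code.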

If you also want duration metrics for individual stages per pipeline, that is possible by adding withSpanAttributes to your jobs. cf. https://github.com/jenkinsci/opentelemetry-plugin/issues/952#issuecomment-2388922816, https://github.com/jenkinsci/opentelemetry-plugin/issues/811#issuecomment-2116113648

In general it's not a good idea to have very specific metrics (i.e. specific to a single job run) because of the cardinality issue some metric backends suffer from (e.g. Prometheus). Usually metrics are used to aggregate data (counts, histograms, …) while traces/logs consider individual requests/events. For traces/logs it's possible to use sampling to reduce the amount of data that needs to be processed and stored. If a sampling rate of 100% is used, then any metric calculated from the traces should be accurate.
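To make the cardinality point concrete, here is a small Python sketch that counts how many distinct time series a backend would have to store under two labelling schemes. The pipeline names, build counts, and label sets are invented for illustration and are not taken from either plugin.

```python
# Hypothetical build history: 3 pipelines with 100 runs each.
pipelines = ["team-a/build", "team-a/test", "team-b/deploy"]
runs = [
    (pipeline, build_number, "SUCCESS")
    for pipeline in pipelines
    for build_number in range(1, 101)
]

# Aggregated labelling: one series per (pipeline, result) combination.
aggregated_series = {(pipeline, result) for pipeline, _, result in runs}

# Per-run labelling: the build number becomes a label, so every run
# creates a brand-new series that never receives another data point.
per_run_series = {(pipeline, number, result) for pipeline, number, result in runs}

print(len(aggregated_series), len(per_run_series))
```

The aggregated scheme stays bounded by the number of pipelines and results, while the per-run scheme grows with every build.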

I think https://github.com/jenkinsci/opentelemetry-plugin/pull/959 is a great addition to the opentelemetry-plugin. It is fine since it aggregates the individual job runs for a given pipeline and gives administrators control over which pipelines should be monitored specifically. What it does not allow is querying the exact build duration for a specific job run. Having metrics specific to a job run would be problematic. The prometheus-plugin has such an option, which is thankfully guarded by a configuration option, but it is global and does not allow filtering which jobs it applies to.

In my experience, if you want per-run metrics you are better off querying the traces.

cyrille-leclerc commented 2 weeks ago

Please use the ci.pipeline.run.duration{ci.pipeline.id="<<pipeline full name>>", ci.pipeline.result="<<SUCCESS, UNSTABLE, FAILURE, NOT_BUILT, ABORTED>>"} histogram metric we have just released. ℹ Use otel.instrumentation.jenkins.run.metric.duration.allow_list and otel.instrumentation.jenkins.run.metric.duration.deny_list to specify the pipelines for which you want to capture the run duration. Other pipelines will be aggregated in the ci.pipeline.id="#other#" time series.

See documentation https://github.com/jenkinsci/opentelemetry-plugin/blob/main/docs/monitoring-metrics.md#build-duration
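The following Python sketch illustrates how an allow list and a deny list could route runs either to their own time series or to the shared "#other#" series, and how an average duration falls out of a histogram's sum and count. It is not the plugin's implementation. It assumes both settings are regular expressions matched against the pipeline full name and that the deny list takes precedence; the patterns and pipeline names are made up, so check the linked documentation for the actual semantics.

```python
import re
from collections import defaultdict

OTHER = "#other#"  # series id used for pipelines that are not tracked individually


def series_id(pipeline_name, allow_pattern, deny_pattern):
    """Return the ci.pipeline.id label value under the assumed matching rules."""
    allowed = re.fullmatch(allow_pattern, pipeline_name) is not None
    denied = re.fullmatch(deny_pattern, pipeline_name) is not None
    return pipeline_name if allowed and not denied else OTHER


def record(histogram, pipeline_name, result, duration_s, allow_pattern, deny_pattern):
    """Add one run to the (ci.pipeline.id, ci.pipeline.result) series."""
    key = (series_id(pipeline_name, allow_pattern, deny_pattern), result)
    histogram[key]["count"] += 1
    histogram[key]["sum"] += duration_s


# Hypothetical patterns and runs, for illustration only.
ALLOW = r"team-a/.*"
DENY = r"team-a/experimental-.*"
histogram = defaultdict(lambda: {"count": 0, "sum": 0.0})
for name, result, seconds in [
    ("team-a/build", "SUCCESS", 120.0),
    ("team-a/build", "SUCCESS", 180.0),
    ("team-a/experimental-x", "SUCCESS", 30.0),
    ("team-b/deploy", "FAILURE", 60.0),
]:
    record(histogram, name, result, seconds, ALLOW, DENY)

# Average duration of a series = sum of observations / number of observations.
series = histogram[("team-a/build", "SUCCESS")]
average_s = series["sum"] / series["count"]
print(average_s)
```

On a dashboard the same average would typically be computed by the metrics backend from the histogram's sum and count series rather than in application code.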

I'm marking your enhancement request as solved. Please open new enhancement requests if needed.

miraccan00 commented 2 weeks ago

Thanks for addressing my enhancement request and providing the solution. I appreciate the prompt response and detailed guidance.

cyrille-leclerc commented 2 weeks ago

You're welcome!

jenkinsci / opentelemetry-plugin

Support for Retrieving Jenkins Build Duration Metrics via OpenTelemetry Plugin #972