sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports them as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Add OpenTelemetry converter to export Prometheus metrics #97

Closed husky-parul closed 10 months ago

husky-parul commented 2 years ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

brunobat commented 2 years ago

I also believe that outputting data using OpenTelemetry with the right attribute semantics could later be used to correlate it with application data. Moreover, using OpenTelemetry across the board in the project would allow not only OTLP standard exporters but also Prometheus and many others.

marceloamaral commented 2 years ago

Typically we use tracing to measure the latency across function calls. So, I am not sure when/how/why we would use data from traces...

Could you please explain better why you need tracing?

brunobat commented 2 years ago

When I mention OpenTelemetry (OTel), it's not just about tracing. Metrics themselves can be produced in the OTel format. This seems to me much more flexible than outputting metrics as time series in the Prometheus format. The main advantage is to easily integrate and correlate with app-generated metrics. This can be used to identify the energy consumption hotspots inside the apps.

marceloamaral commented 2 years ago

It's actually on our roadmap to support exporting metrics in different formats, not just Prometheus...

We could discuss this in more detail; it would be nice if you could create a Google doc detailing the ideas so that everyone can give feedback.

bertysentry commented 1 year ago

I totally support the idea of replacing the Prometheus metrics with OpenTelemetry metrics. Then it can be exported anywhere (including Prometheus) through the OpenTelemetry Collector. It would make things more open and platform-agnostic.

SamYuan1990 commented 1 year ago

Wait a minute, why does Kepler need OpenTelemetry or, in general, distributed tracing? From my point of view:

It's good to support different output formats, but may I know what the difference is between OpenTelemetry and Prometheus? I hope this is the correct document.

If the document above is correct, can anyone help find a sample where Prometheus consumes the OpenMetrics format? Otherwise, it looks like a one-way path from Prometheus to OpenMetrics. Is there any sample/application that supports both OpenMetrics and Prometheus?

Hence, to avoid misunderstanding, I suppose we'd better rename this issue to "add OpenMetrics support"?

brunobat commented 1 year ago

OpenTelemetry (OTel) is not just about tracing. It includes metrics and logs... More to come in the future. Providing OTel metrics output would potentially allow cross-correlating the metrics generated here with application metrics. Mind that OTel supports multiple programming languages and is quickly becoming the de facto standard for telemetry. This would be super useful.

rootfs commented 1 year ago

@sallyom has some early PoC with this

SamYuan1990 commented 1 year ago

@sallyom has some early PoC with this

I would like to see the PoC and evaluate it further, to see whether it's ready for us at the implementation level or not. A while back, in https://github.com/open-telemetry/opentelemetry-go/discussions/2742, I asked about migrating from Jaeger to OTel, but at the time it did not seem ready.

Hence, if migrating to OTel means too much effort and too many dependencies, I would rather wait until OTel is ready with a UI/dashboard, to make sure Kepler's users get the same UX with Prometheus and OTel. Let me list some key features/points here for discussion.

bertysentry commented 1 year ago

@SamYuan1990 OTel won't natively have dashboards, a UI, etc. OTel defines data structures and protocols for metrics, logs and traces. It provides SDKs so that app developers can send metrics, logs and traces that can then be consumed in any OpenTelemetry-supported backend and UI: Prometheus + Grafana, Datadog, New Relic, Splunk, Dynatrace, etc.

OTel also provides a "collector", whose role is mostly to act as a proxy, relaying metrics, logs and traces from one place to another.

You can use OpenTelemetry in Kepler to export OTel metrics, which will be pushed to an OTel Collector running on the side (like a wagon), and the collector will export these metrics to Prometheus. This way, it's 100% compatible with the current architecture, and you don't need to rewrite your Grafana dashboards.

The benefit is that the user can easily configure the OpenTelemetry Collector to push metrics to other backends as well (Datadog, New Relic, etc.)

To answer your points:

Last but not least: it is important to follow semantic conventions. For example, you're currently exporting this metric: kepler_container_core_joules_total, which follows Prometheus conventions.

In OpenTelemetry, you would rather create a metric as:

  • type: Counter
  • name: kepler.container.core
  • unit: J (for joules)

When exported to Prometheus (using one of the OpenTelemetry Collector Contrib exporters for Prometheus), this metric will be converted to kepler_container_core_joules_total by the collector automatically (using the translator documented here).
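
For illustration, here is a minimal sketch of that with the OTel Go metrics API; the meter scope name, attribute key and value below are purely hypothetical examples, not Kepler's actual instrumentation:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()

	// Obtain a meter from the globally registered MeterProvider.
	// The instrumentation scope name is just an illustrative assumption.
	meter := otel.Meter("kepler")

	// Counter named following OTel semantic conventions: dots, no unit suffix.
	// The unit ("J") and description are carried as metadata, not in the name.
	coreEnergy, err := meter.Float64Counter(
		"kepler.container.core",
		metric.WithUnit("J"),
		metric.WithDescription("Energy consumed by the container on CPU cores"),
	)
	if err != nil {
		panic(err)
	}

	// Record a measurement; the attribute key and value are hypothetical.
	coreEnergy.Add(ctx, 0.42, metric.WithAttributes(
		attribute.String("container.name", "my-container"),
	))
}
```

Because the instrument is a Counter with unit J, it is the Prometheus translation step that appends the joules and _total parts of the name.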

Hope this helps understand OpenTelemetry!

SamYuan1990 commented 1 year ago

LGTM.

btw, do you know if Prometheus has any plans to consume OTel metrics directly?

SamYuan1990 commented 1 year ago

I tried https://github.com/open-telemetry/opentelemetry-go/blob/main/example/prometheus/main.go and https://github.com/open-telemetry/opentelemetry-go/blob/main/example/view/main.go, and it seems that if we use OTel, it's nearly the same as Prometheus? The output is served at http://localhost:2222/metrics or http://localhost:2223/metrics in the samples. Hence, does it mean we can use Prometheus to consume OTel directly, as long as we don't need any sum or count point-kind conversion?

ref https://opentelemetry.io/docs/reference/specification/metrics/data-model/#point-kinds
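
For reference, this is roughly what those two examples boil down to; a condensed sketch assuming the go.opentelemetry.io/otel/exporters/prometheus bridge, with the port taken from the linked sample and the metric/scope names being placeholders of mine:

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The Prometheus exporter is a metric.Reader: OTel instruments are
	// collected and re-exposed in the Prometheus text format on scrape.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))

	// Instruments created from this meter show up on /metrics below.
	meter := provider.Meter("kepler-otel-demo") // placeholder scope name
	counter, _ := meter.Float64Counter("kepler.demo.energy")
	counter.Add(context.Background(), 1.0)

	// Serve the default Prometheus registry, which the exporter registers with by default.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2222", nil)) // port taken from the linked example
}
```

So on the wire it is still a normal Prometheus scrape target; the difference is only in how the metrics are produced inside the process.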

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

marceloamaral commented 1 year ago

@SamYuan1990 OpenTelemetry can collect metrics from Prometheus directly https://uptrace.dev/opentelemetry/prometheus-metrics.html

So, we don't need to export OpenTelemetry metrics, right?

SamYuan1990 commented 1 year ago

@SamYuan1990 OpenTelemetry can collect metrics from Prometheus directly https://uptrace.dev/opentelemetry/prometheus-metrics.html

So, we don't need to export OpenTelemetry metrics, right?

Yeah... but as discussed offline with @rootfs in the past, if we are going to run Kepler on edge nodes (edge computing), we'd better support OpenTelemetry metrics, since for edge nodes it's better to use remote push.
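
To make the remote-push idea concrete, a minimal sketch with the OTel Go SDK and the OTLP gRPC exporter; the collector endpoint, interval, and metric/scope names are placeholder assumptions:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// OTLP/gRPC exporter pushing to a collector (or any OTLP endpoint).
	// Endpoint and insecure transport are placeholder assumptions.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// PeriodicReader pushes on an interval instead of waiting to be scraped,
	// which is the remote-push model that matters for edge nodes.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(30*time.Second))),
	)
	defer func() { _ = provider.Shutdown(ctx) }()

	// Instruments created from this provider are pushed every interval.
	meter := provider.Meter("kepler-edge-demo") // placeholder scope name
	counter, _ := meter.Float64Counter("kepler.demo.energy")
	counter.Add(ctx, 1.0)
}
```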

marceloamaral commented 1 year ago

Humm, I was not aware of this use case. Ok, let's discuss how to move forward with this.

bertysentry commented 1 year ago

@marceloamaral In general, we all agree it's better to use the open standard that most vendors agreed on, than just one specific technology. It will make the integration with the rest of the world much smoother, and it should not add any friction when interacting with the Prometheus world.

I understand that switching from a Prometheus-based code to OpenTelemetry is quite a challenge, though!

Trivia: Did you know that OpenTelemetry takes its roots in OpenMetrics (and others), which derives directly from Prometheus? 😉

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

brunobat commented 1 year ago

I don't think this is stale. Its scope needs to be clarified in light of https://github.com/sustainable-computing-io/kepler/issues/659

rootfs commented 1 year ago

thanks! @brunobat

frzifus commented 1 year ago

btw, do you know if prometheus has any idea to consume OTel metric directly?

Yes, in the latest release they added native OTLP ingestion: https://github.com/prometheus/prometheus/releases/tag/v2.47.0

simonpasquier commented 1 year ago

:wave: Prometheus team member here. For information, the Prometheus community agreed that the Prometheus client libraries will support exporting OTLP metrics directly.

From the Sep 30th 2023 Prometheus developer summit notes:

CONSENSUS: We want to support exporting OTLP protocol metrics in our client libraries. We want to maintain Prometheus SDKs as an opinionated and efficient way to do custom metrics.

marceloamaral commented 1 year ago

There was some discussion in the community meeting about the overhead of the Prometheus and OTLP clients. The Prometheus client has better scalability.

https://github.com/danielm0hr/edge-metrics-measurements/blob/main/talks/DanielMohr_PromAgentVsOtelCol.pdf

husky-parul commented 1 year ago

Some more experiments are indicating that Prom Agent + RW is less CPU-hungry than setting up OTel Collector + OTLP or OTel Collector + RW (results here), thanks to @danielm0hr.

@bertysentry has your team conducted similar benchmarks?

At the same time, I would like to add that OTel SDK instrumentation still supports Prometheus, and that the scope of this integration is not limited to setting up OTel Collector + RW but is about instrumenting Kepler using an open protocol (and not Prometheus metrics) that supports metric vendors other than Prometheus. Using Prometheus as a backend is not affected by this integration.

frzifus commented 1 year ago

I did some benchmarks in the past that show a CPU overhead on the OTel side when dealing with the OTel Prometheus receiver and exporter. But it did better on memory and network, while it also depends on the configuration of the collector.

I cannot confirm that the CPU usage was higher in an OTel in/out scenario than in Prometheus scrape + RW.

Unfortunately, I do not have much time to make the setup and the results available in a way that is as understandable as https://github.com/danielm0hr/edge-metrics-measurements.

simonpasquier commented 1 year ago

I think that we are talking about different use cases here:

  1. Export Kepler metrics in OTLP format.
  2. Instrument the Kepler exporter using the OTEL Go SDK.

IIUC the first use case can be accomplished today with the OTEL collector scraping metrics from the /metrics endpoint (and hopefully Prometheus should be able to support this natively in the future).

IMHO the second case would deserve careful evaluation because the Kepler exporter has some unique characteristics/challenges in terms of instrumentation (discussed in https://github.com/sustainable-computing-io/kepler/discussions/439 and https://github.com/sustainable-computing-io/kepler/issues/365#issuecomment-1327505536).

bertysentry commented 1 year ago

@simonpasquier The idea would be to use the OpenTelemetry SDK everywhere we can to produce OTLP metrics instead of Prometheus metrics.

Of course, one can use OpenTelemetry's receiver for Prometheus to export the metrics to another OpenTelemetry-supporting backend, but it's an added step in the pipeline that we can remove.

The Prometheus server can now ingest OTLP metrics natively. This means that Kepler can use OpenTelemetry to send OTLP metrics and still use Prometheus as a backend, without any extra step, no OpenTelemetry Collector required at all, and therefore no performance hit either.
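
As a rough sketch of that path, assuming Prometheus runs with the OTLP receiver feature enabled (the --enable-feature=otlp-write-receiver flag and the URL path below are my understanding of the 2.47 release and worth double-checking):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Push OTLP over HTTP straight to Prometheus' OTLP receiver.
	// Host and path are assumptions to adapt to the actual deployment.
	exporter, err := otlpmetrichttp.New(ctx,
		otlpmetrichttp.WithEndpoint("prometheus:9090"),
		otlpmetrichttp.WithURLPath("/api/v1/otlp/v1/metrics"),
		otlpmetrichttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Any instruments created from provider.Meter(...) are now pushed
	// periodically, with no collector in between.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(30*time.Second))),
	)
	defer func() { _ = provider.Shutdown(ctx) }()
}
```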

simonpasquier commented 1 year ago

The performance issue I'm referring to was with the Prometheus client_golang library, and one would need to verify that the OTEL SDK provides good performance given the very special nature of the Kepler exporter.

stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.