Closed by husky-parul 10 months ago
I also believe that outputting data using OpenTelemetry with the right attribute semantics could later be used to correlate it with application data. Moreover, using OpenTelemetry across the board in the project would allow not only OTLP standard exporters but also Prometheus and many others.
Typically we use tracing to measure the latency across function calls. So, I am not sure when/how/why we would use data from traces...
Could you please explain better why you need tracing?
When I mention OpenTelemetry (OTel), it's not just about tracing. Metrics themselves can be produced in the OTel format. This seems to me much more flexible than outputting metrics as time series in the Prometheus format. The main advantage is to easily integrate and correlate with app-generated metrics. This can be used to identify, inside the apps, what the energy consumption hotspots might be.
It's actually on our roadmap to support exporting metrics in different formats, not just Prometheus...
We could discuss this in more detail; it would be nice if you could create a Google Doc detailing the ideas, and then everyone can give feedback.
I totally support the idea of replacing the Prometheus metrics with OpenTelemetry metrics. Then it can be exported anywhere (including Prometheus) through the OpenTelemetry Collector. It would make things more open and platform-agnostic.
Wait a minute, why does Kepler need OpenTelemetry or, in general, distributed tracing? From my point of view,
it's good to support different formats as output, but may I know what the difference is between OpenTelemetry and Prometheus? I hope this is the correct document.
If the document above is correct, can anyone help find a sample where Prometheus consumes the OpenMetrics format? Otherwise, it looks like a one-way path from Prometheus to OpenMetrics. Is there any sample/application that supports both OpenMetrics and Prometheus?
Hence, to avoid misunderstanding, I suppose we'd better rename this issue to "add OpenMetrics support"?
OpenTelemetry (OTel) is not just about tracing. It includes metrics and logs... more to come in the future. Providing OTel metrics output would potentially allow cross-correlating the metrics generated here with application metrics. Mind that OTel supports multiple programming languages and is quickly becoming the de-facto standard for telemetry. This would be super useful.
@sallyom has some early PoC with this
I would like to see the PoC and evaluate it further, to see whether it's ready for us at the implementation level or not. A while ago, in https://github.com/open-telemetry/opentelemetry-go/discussions/2742, I asked about migrating from Jaeger to OTel, but at the time it did not seem ready.
Hence, if migrating to OTel means too much effort and too many dependencies, I would rather we wait until OTel is ready with a UI/dashboard, to make sure Kepler's users have the same UX with OTel as with Prometheus. Let me list some key features/points here for discussion.
@SamYuan1990 OTel won't natively have dashboards, a UI, etc. OTel defines data structures and protocols for metrics, logs and traces. They provide SDKs so that app developers can send metrics, logs and traces that can then be consumed in any OpenTelemetry-supported backend and UI: Prometheus + Grafana, or Datadog, or New Relic, or Splunk, or Dynatrace, etc.
OTel also provides a "collector", whose role is mostly to act as a proxy, relaying metrics, logs and traces from one place to another.
You can use OpenTelemetry in Kepler to export OTel metrics, that will be pushed to an OTel Collector running on the side (like a wagon), and that will export these metrics to Prometheus. This way, it's 100% compatible with the current architecture, and you don't need to rewrite your Grafana dashboards.
The benefit is that the user can easily configure the OpenTelemetry Collector to push metrics to other backends as well (Datadog, New Relic, etc.)
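For illustration, here is a minimal Go sketch (not Kepler's actual code) of wiring the OTel SDK to push metrics over OTLP/gRPC to such a side-car collector; the endpoint and the 15s interval are just assumptions:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// OTLP/gRPC exporter pointed at the side-car collector
	// (localhost:4317, the collector's default gRPC port, is assumed here).
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// A periodic reader pushes the accumulated metrics every 15s (interval assumed).
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithInterval(15*time.Second)),
		),
	)
	defer func() { _ = provider.Shutdown(ctx) }()

	// Register globally; the rest of the process creates instruments via otel.Meter(...).
	otel.SetMeterProvider(provider)
}
```

The collector then takes care of forwarding these metrics to Prometheus (or any other configured backend), so the existing dashboards keep working.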
To answer your points:
- OpenTelemetry also provides its own Kubernetes Operator (https://github.com/open-telemetry/opentelemetry-operator).
- The OTel SDK for Go is good, use the OTLP exporter though, and not the Prometheus one
- You will keep Grafana for dashboarding (Kepler --> OpenTelemetry Collector --> Prometheus --> Grafana)
- Full Prometheus compatibility (using either the classic scraping method or the remote write protocol)
Last but not least: it is important to follow semantic conventions. For example, you're currently exporting this metric: kepler_container_core_joules_total, which follows Prometheus conventions. In OpenTelemetry, you would rather create a metric as:

- type: Counter
- name: kepler.container.core
- unit: J (for joules)

When exported to Prometheus (using one of the OpenTelemetry Collector Contrib exporters for Prometheus), this metric will be converted to kepler_container_core_joules_total by the collector automatically (using the translator documented here).

Hope this helps understand OpenTelemetry!
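As a minimal Go sketch of what such an instrument could look like with the OTel metric API (the description and attribute names below are purely illustrative, not Kepler's actual labels):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	ctx := context.Background()

	// Meter from the globally registered MeterProvider (set up as in the sketch above).
	meter := otel.Meter("kepler")

	// A monotonic Counter named per OTel conventions; the unit is declared
	// separately instead of being baked into the name as "_joules_total".
	coreEnergy, err := meter.Float64Counter(
		"kepler.container.core",
		metric.WithUnit("J"),
		metric.WithDescription("Cumulative energy consumed on CPU cores per container"),
	)
	if err != nil {
		panic(err)
	}

	// Attribute names here are illustrative, not Kepler's actual labels.
	coreEnergy.Add(ctx, 0.42, metric.WithAttributes(
		attribute.String("container_name", "example"),
		attribute.String("pod_name", "example-pod"),
	))
}
```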
LGTM.
BTW, do you know if Prometheus has any plans to consume OTel metrics directly?
I tried https://github.com/open-telemetry/opentelemetry-go/blob/main/example/prometheus/main.go and https://github.com/open-telemetry/opentelemetry-go/blob/main/example/view/main.go, and it seems that if we use OTel, it's nearly the same as Prometheus? The output is served at http://localhost:2222/metrics or http://localhost:2223/metrics in the samples. Hence, does it mean we can use Prometheus to consume OTel directly, if we don't have sum or count type conversions?
ref https://opentelemetry.io/docs/reference/specification/metrics/data-model/#point-kinds
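For reference, a sketch of what those linked examples roughly do (assuming the go.opentelemetry.io/otel/exporters/prometheus exporter, which registers with the default Prometheus registry): metrics recorded through the OTel API are served on a plain /metrics endpoint that Prometheus can scrape as usual.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The OTel Prometheus exporter acts as a Reader: metrics recorded through
	// the OTel API end up in the default Prometheus registry.
	exporter, err := otelprom.New()
	if err != nil {
		log.Fatal(err)
	}
	otel.SetMeterProvider(sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)))

	// Expose the familiar pull-based /metrics endpoint for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2222", nil))
}
```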
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@SamYuan1990 OpenTelemetry can collect metrics from Prometheus directly https://uptrace.dev/opentelemetry/prometheus-metrics.html
So, we don't need to export OpenTelemetry metrics, right?
Yeah... but in the past, as discussed offline with @rootfs, if we are going to run Kepler on edge nodes (edge computing), we'd better support OpenTelemetry metrics, since for edge nodes it's better to use remote push.
Humm, I was not aware of this use case. Ok, let's discuss how to move forward with this.
@marceloamaral In general, we all agree it's better to use the open standard that most vendors agreed on, than just one specific technology. It will make the integration with the rest of the world much smoother, and it should not add any friction when interacting with the Prometheus world.
I understand that switching from a Prometheus-based code to OpenTelemetry is quite a challenge, though!
Trivia: Did you know that OpenTelemetry takes its roots in OpenMetrics (and others), which derives directly from Prometheus? 😉
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I don't think this is stale. Its scope needs to be clarified in the light of https://github.com/sustainable-computing-io/kepler/issues/659
thanks! @brunobat
BTW, do you know if Prometheus has any plans to consume OTel metrics directly?
Yes, in the latest release they added native OTLP ingestion: https://github.com/prometheus/prometheus/releases/tag/v2.47.0
:wave: Prometheus team member here. For information, the Prometheus community agreed that the Prometheus client libraries will support exporting OTLP metrics directly.
From the Sep 30th, 2023 Prometheus developers summit notes:
CONSENSUS: We want to support exporting OTLP protocol metrics in our client libraries. We want to maintain Prometheus SDKs as an opinionated and efficient way to do custom metrics.
There was some discussion in the community meeting about the overhead of the Prometheus and OTLP clients. The Prometheus client has better scalability.
Some more experiments indicate that PromAgent+RW is less CPU-hungry than setting up OTel Collector+OTLP or OTel Collector+RW (results here), thanks to @danielm0hr.
@bertysentry has your team conducted similar benchmarks?
At the same time, I would like to add that OTel SDK instrumentation still supports Prometheus, and that the scope of this integration is not limited to setting up OTel Collector+RW, but rather to instrument Kepler using an open protocol (and not Prom metrics) that supports metrics vendors other than Prometheus. Using Prometheus as a backend is not affected by this integration.
I did some benchmarks in the past that show a CPU overhead on the OTel side when dealing with the OTel Prometheus receiver and exporter. But it did better on memory and network, while it also depends on the configuration of the collector.
I cannot confirm that the CPU usage was higher in an OTel in/out scenario than in Prometheus scrape + RW.
Unfortunately, I do not have much time to make the setup and the results available in a way that is as understandable as https://github.com/danielm0hr/edge-metrics-measurements.
I think that we are talking about different use cases here:
IIUC the first use case can be accomplished today with the OTEL collector scraping metrics from the /metrics endpoint (and hopefully Prometheus should be able to support this natively in the future).
IMHO the second case would deserve careful evaluation because the Kepler exporter has some unique characteristics/challenges in terms of instrumentation (discussed in https://github.com/sustainable-computing-io/kepler/discussions/439 and https://github.com/sustainable-computing-io/kepler/issues/365#issuecomment-1327505536).
@simonpasquier The idea would be to use the OpenTelemetry SDK everywhere we can to produce OTLP metrics instead of Prometheus metrics.
Of course, one can use OpenTelemetry's receiver for Prometheus to export the metrics to another OpenTelemetry-supporting backend, but it's an added step along the way that we can remove.
The Prometheus server can now ingest OTLP metrics natively. This means that Kepler can use OpenTelemetry to send OTLP metrics and still use Prometheus as a backend, without any extra step, no OpenTelemetry Collector required at all, and therefore no performance hit either.
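For example, a minimal Go sketch of that direct path (assuming Prometheus >= 2.47 started with --enable-feature=otlp-write-receiver; the endpoint, URL path and interval below are assumptions to adapt):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Push OTLP over HTTP straight to Prometheus' native OTLP receiver
	// (Prometheus must be started with --enable-feature=otlp-write-receiver);
	// no collector in between.
	exporter, err := otlpmetrichttp.New(ctx,
		otlpmetrichttp.WithEndpoint("localhost:9090"),
		otlpmetrichttp.WithURLPath("/api/v1/otlp/v1/metrics"),
		otlpmetrichttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithInterval(30*time.Second)),
		),
	)
	defer func() { _ = provider.Shutdown(ctx) }()

	// Instruments created from otel.Meter(...) are now written straight into Prometheus.
	otel.SetMeterProvider(provider)
}
```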
The performance issue I'm referring to was with the Prometheus client_golang library, and one would need to verify that the OTel SDK provides good performance given the very special nature of the Kepler exporter.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.