Closed: dmitryax closed this issue 1 year ago.
Discussed this issue in the Prometheus WG meeting.
Q 1: @gouthamve mentioned that Prometheus is going to support dots in metric names going forward, so, potentially, we can restore OTel metric names in the OTel -> Prom -> OTel scenario by applying the full normalization. I'm still not sure it justifies changing all the Prometheus metrics in the Prometheus receiver, for the reasons mentioned above (`process_cpu_seconds_total` -> `process_cpu`). Maybe we can update the Prometheus exporter normalization and mark all the metrics coming from it (e.g., by adding a special note to the HELP section) so that the Prometheus receiver on the other side can read the hint and restore the original OTel metric names. If the hint contains the original OTel metric name, like `otel:system.cpu.time`, we don't even need to depend on the dots support. @open-telemetry/wg-prometheus WDYT?
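The HELP-hint idea could look something like the sketch below. This is purely illustrative: the `otel:` marker, the placement at the end of the HELP text, and the function names are assumptions for discussion, not an agreed-upon format.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical scheme: the Prometheus exporter appends a marker such as
// "otel:system.cpu.time" to the HELP text, and the Prometheus receiver
// strips it back out to recover the original OTel metric name.
const hintPrefix = "otel:"

// addHint appends the original OTel metric name to the HELP text.
func addHint(help, otelName string) string {
	return help + " " + hintPrefix + otelName
}

// extractHint returns the original OTel name if the HELP text carries a
// hint, plus the HELP text with the hint removed.
func extractHint(help string) (otelName, cleanHelp string, ok bool) {
	idx := strings.LastIndex(help, hintPrefix)
	if idx == -1 {
		return "", help, false
	}
	otelName = help[idx+len(hintPrefix):]
	cleanHelp = strings.TrimSpace(help[:idx])
	return otelName, cleanHelp, true
}

func main() {
	help := addHint("Total CPU time consumed", "system.cpu.time")
	name, clean, ok := extractHint(help)
	fmt.Println(help)
	fmt.Println(name, clean, ok)
}
```

With a scheme like this the receiver would not need dots support in Prometheus at all, since the dotted name travels out-of-band in the HELP text rather than in the metric name itself.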
Q 2: Looks like there are no objections to making the normalization on the Prometheus exporter side configurable and having it enabled by default.
Sorry I missed the discussion yesterday; I'm just getting back from leave and am catching up now. The response to enabling the feature gate makes me think our approach might have to change. If we tried to make the change at a future point, we would likely run into the same pushback, and I'm not sure the value we are getting from the change is worth the pain it would cause users right now.
My suggestion is that we enable normalization on prometheus exporters by default, but disable it on the prometheus receiver by default. It would essentially make only additive changes by default. Prom -> OTel -> Prom would still round trip successfully in that case, but OTel -> Prom -> OTel would end up with extra suffixes in addition to underscores. To round-trip OTel -> Prom -> OTel, a user would need to disable sanitization (`.` -> `_`) in their OTel SDK prometheus exporter; this knob would be introduced after prometheus allows dots. Since we want OTel SDK prometheus exporters to work with older prometheus servers, sanitization (`.` -> `_`) will have to stay the default behavior. That means regardless of whether we trim suffixes in the prometheus receiver by default, a user would have to change default behavior to be able to round-trip metric names.
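To make the round-trip problem concrete, here is a minimal sketch of the sanitization default described above. The `sanitize` function is illustrative, not the SDK's actual implementation; the point is that the mapping is lossy, so it cannot be reversed without extra information.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize models the default OTel SDK behavior: dots in metric names
// are replaced with underscores so that older Prometheus servers,
// which reject dots, accept the names. (Illustrative sketch only.)
func sanitize(name string) string {
	return strings.ReplaceAll(name, ".", "_")
}

func main() {
	// Two distinct OTel names collapse to the same Prometheus name,
	// which is why the receiver cannot recover the original dotted
	// name from the sanitized one.
	fmt.Println(sanitize("system.cpu.time")) // system_cpu_time
	fmt.Println(sanitize("system_cpu.time")) // system_cpu_time
}
```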
I'm not a fan of encoding/decoding information from/to the HELP text. It seems likely to break, and pollutes the help text with information users are unlikely to care about.
> My suggestion is that we enable normalization on prometheus exporters by default, but disable it on the prometheus receiver by default.
I like it. Echoes what I said here: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/21743#issuecomment-1602634542
> My suggestion is that we enable normalization on prometheus exporters by default, but disable it on the prometheus receiver by default.
@dashpole I agree with this in general. But WDYT about the suggestion from my previous comment about having the Prometheus exporter add an OTel metric hint to the HELP section? For example, if any metric has `otel:system.cpu.time` in the HELP section, the OTel Prometheus receiver would use that name instead. It'll make OTel -> Prom -> OTel work without dots support or any configuration option on the Prometheus receiver. Is that realistic?
> I'm not a fan of encoding/decoding information from/to the HELP text. It seems likely to break, and pollutes the help text with information users are unlikely to care about.
OK, let's proceed with the configuration options for the full normalization, which will be disabled on the receiver and enabled on the exporter. We probably don't need a feature gate for that, and the existing one can be deprecated.
This has been completed in the collector.
We recently attempted to activate the pkg.translator.prometheus.NormalizeName feature gate in the collector. This feature gate applies full metrics normalization by default in both the Prometheus receiver and exporter. However, the process did not go smoothly and was seen as an unnecessary disruptive change. Two issues were reported regarding this matter:
After discussing the topic in the Collector SIG on June 6th, a decision was made to revert the feature gate back to its alpha state until both issues are resolved.
To initiate the discussion towards resolving these problems, I have two questions:
1. Why do we need to change the original Prometheus metrics names in the Prometheus receiver?
While it is beneficial to populate as many fields as possible in an OTel metric object from the information available in a Prometheus endpoint, including the metric unit, changing the original metric name seems unnecessary. Unless we have predefined mappings, we cannot convert a Prometheus metric name into an OTel metric name compliant with the OTel spec. By simply removing the unit and `_total` suffixes, we end up with metric names that may be unfamiliar to users. For example, a commonly used Prometheus metric, `process_cpu_seconds_total`, would become `process_cpu` after the full normalization, which lacks references in both OTel and Prometheus contexts. However, if we retain the name `process_cpu_seconds_total` in the OTel metric name (with `s` as the unit), it is evident where the metric originates.

If we decide against changing the original name in the Prometheus receiver: `system.cpu.time` would become an OTel metric named `system_cpu_time_seconds_total` instead of `system_cpu_time`. I believe both `system_cpu_time_seconds_total` and `system_cpu_time`
metric names should be acceptable since it is not possible to revert back to `system.cpu.time`.

2. Do we want to provide an option for the end user to disable/enable the full normalization in the Prometheus exporter? If so, what would be the default setting?
Unlike the Prometheus receiver, the full normalization in the Prometheus exporter is important because it prevents losing the unit field value during conversion. However, there might be situations where users prefer to avoid additional suffixes and keep the metric name as close to the original OTel name as possible. Considering the feedback in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/21743, it seems reasonable to provide this option. If we do include it, what should the default behavior be? It is plausible that full normalization should be enabled by default, but alternative viewpoints are welcome.
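For reference, the exporter-side full normalization discussed in this thread can be sketched roughly as follows. This is a simplified illustration of the behavior (dots to underscores, unit suffix, `_total` for monotonic sums), not the actual `pkg/translator/prometheus` code; the `unitSuffix` table covers only the unit used in the examples above.

```go
package main

import (
	"fmt"
	"strings"
)

// unitSuffix maps a unit to the spelled-out suffix appended during
// full normalization. Only "s" -> "seconds" is included here as an
// example; the real translator handles many more units.
var unitSuffix = map[string]string{"s": "seconds"}

// normalize sketches full normalization on the exporter side:
// replace dots with underscores, append the spelled-out unit, and
// append _total for monotonic sums (counters).
func normalize(name, unit string, monotonicSum bool) string {
	out := strings.ReplaceAll(name, ".", "_")
	if suffix, ok := unitSuffix[unit]; ok && !strings.HasSuffix(out, "_"+suffix) {
		out += "_" + suffix
	}
	if monotonicSum && !strings.HasSuffix(out, "_total") {
		out += "_total"
	}
	return out
}

func main() {
	// The example from question 1 above.
	fmt.Println(normalize("system.cpu.time", "s", true)) // system_cpu_time_seconds_total
}
```

A disable knob for this behavior would essentially skip the two suffix steps while keeping the dot sanitization, since the latter is required for the exposition format regardless.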