Closed: dmitryax closed this issue 1 year ago.
Discussed this issue in the Prometheus WG meeting.
Q 1: @gouthamve mentioned that Prometheus is going to support dots in metric names going forward, so, potentially, we can restore OTel metric names in the OTel -> Prom -> OTel scenario by applying the full normalization. I'm still not sure it justifies changing all the Prometheus metrics in the Prometheus receiver, for the reasons mentioned above (`process_cpu_seconds_total` -> `process_cpu`). Maybe we can update the Prometheus exporter normalization and mark all the metrics coming from it (e.g., by adding a special note to the HELP section) so that the Prometheus receiver on the other side can read the hint and restore the original OTel metric names. If the hint contains the original OTel metric name, like `otel:system.cpu.time`, we don't even need to depend on the dots support. @open-telemetry/wg-prometheus WDYT?
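The HELP-hint idea could look something like the sketch below. This is purely illustrative: the `otel:` marker, the placement at the end of the HELP text, and the function names are assumptions for discussion, not an agreed-upon format.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical scheme: the Prometheus exporter appends a marker such as
// "otel:system.cpu.time" to the HELP text, and the Prometheus receiver
// strips it back out to recover the original OTel metric name.
const hintPrefix = "otel:"

// addHint appends the original OTel metric name to the HELP text.
func addHint(help, otelName string) string {
	return help + " " + hintPrefix + otelName
}

// extractHint returns the original OTel name if the HELP text carries a
// hint, plus the HELP text with the hint removed.
func extractHint(help string) (otelName, cleanHelp string, ok bool) {
	idx := strings.LastIndex(help, hintPrefix)
	if idx == -1 {
		return "", help, false
	}
	otelName = help[idx+len(hintPrefix):]
	cleanHelp = strings.TrimSpace(help[:idx])
	return otelName, cleanHelp, true
}

func main() {
	help := addHint("Total CPU time consumed", "system.cpu.time")
	name, clean, ok := extractHint(help)
	fmt.Println(help)
	fmt.Println(name, clean, ok)
}
```

With a scheme like this the receiver would not need dots support in Prometheus at all, since the dotted name travels out-of-band in the HELP text rather than in the metric name itself.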
Q 2: Looks like there are no objections to making the normalization on the Prometheus exporter side configurable and having it enabled by default.
Sorry I missed the discussion yesterday; I'm just getting back from leave and am catching up now. The response to enabling the feature gate makes me think our approach might have to change. If we tried to make the change at a future point, we would likely run into the same pushback, and I'm not sure the value we are getting from the change is worth the pain it would cause users right now.
My suggestion is that we enable normalization on prometheus exporters by default, but disable it on the prometheus receiver by default. It would essentially make only additive changes by default. Prom -> OTel -> Prom would still round trip successfully in that case, but OTel -> Prom -> OTel would end up with extra suffixes in addition to underscores. To round-trip OTel -> Prom -> OTel, a user would need to disable sanitization (`.` -> `_`) in their OTel SDK prometheus exporter; this knob would be introduced after prometheus allows dots. Since we want OTel SDK prometheus exporters to work with older prometheus servers, sanitization (`.` -> `_`) will have to stay the default behavior. That means regardless of whether we trim suffixes in the prometheus receiver by default, a user would have to change default behavior to be able to round-trip metric names.
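To make the round-trip problem concrete, here is a minimal sketch of the sanitization default described above. The `sanitize` function is illustrative, not the SDK's actual implementation; the point is that the mapping is lossy, so it cannot be reversed without extra information.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize models the default OTel SDK behavior: dots in metric names
// are replaced with underscores so that older Prometheus servers,
// which reject dots, accept the names. (Illustrative sketch only.)
func sanitize(name string) string {
	return strings.ReplaceAll(name, ".", "_")
}

func main() {
	// Two distinct OTel names collapse to the same Prometheus name,
	// which is why the receiver cannot recover the original dotted
	// name from the sanitized one.
	fmt.Println(sanitize("system.cpu.time")) // system_cpu_time
	fmt.Println(sanitize("system_cpu.time")) // system_cpu_time
}
```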
I'm not a fan of encoding/decoding information from/to the HELP text. It seems likely to break, and pollutes the help text with information users are unlikely to care about.
> My suggestion is that we enable normalization on prometheus exporters by default, but disable it on the prometheus receiver by default.
I like it. Echoes what I said here: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/21743#issuecomment-1602634542
> My suggestion is that we enable normalization on prometheus exporters by default, but disable it on the prometheus receiver by default.
@dashpole I agree with this in general. But WDYT about the suggestion from my previous comment about having the Prometheus exporter add an OTel metric hint to the HELP section? For example, if any metric has `otel:system.cpu.time` in the HELP section, the OTel Prometheus receiver would use that name instead. It'll make OTel -> Prom -> OTel work without dots support or any configuration option on the Prometheus receiver. Is that realistic?
> I'm not a fan of encoding/decoding information from/to the HELP text. It seems likely to break, and pollutes the help text with information users are unlikely to care about.
OK, let's proceed with the configuration options for the full normalization, which will be disabled on the receiver and enabled on the exporter. We probably don't need a feature gate for that, and the existing one can be deprecated.
This has been completed in the collector.
We recently attempted to activate the pkg.translator.prometheus.NormalizeName feature gate in the collector. This feature gate applies full metrics normalization by default in both the Prometheus receiver and exporter. However, the process did not go smoothly and was seen as an unnecessary disruptive change. Two issues were reported regarding this matter:
After discussing the topic in the Collector SIG on June 6th, a decision was made to revert the feature gate back to its alpha state until both issues are resolved.
To initiate the discussion towards resolving these problems, I have two questions:
1. Why do we need to change the original Prometheus metrics names in the Prometheus receiver?
While it is beneficial to populate as many fields as possible in an OTel metric object from the information available in a Prometheus endpoint, including the metric unit, changing the original metric name seems unnecessary. Unless we have predefined mappings, we cannot convert a Prometheus metric name into an OTel metric name compliant with the OTel spec. By simply removing the unit and `_total` suffixes, we end up with metric names that may be unfamiliar to users. For example, a commonly used Prometheus metric, `process_cpu_seconds_total`, would become `process_cpu` after the full normalization, which lacks references in both OTel and Prometheus contexts. However, if we retain the name `process_cpu_seconds_total` in the OTel metric name (with `s` as the unit), it is evident where the metric originates.

If we decide against changing the original name in the Prometheus receiver: `system.cpu.time` would become an OTel metric named `system_cpu_time_seconds_total` instead of `system_cpu_time`. I believe both `system_cpu_time_seconds_total` and `system_cpu_time`
metric names should be acceptable since it is not possible to revert back to `system.cpu.time`.

2. Do we want to provide an option for the end user to disable/enable the full normalization in the Prometheus exporter? If so, what would be the default setting?
Unlike the Prometheus receiver, the full normalization in the Prometheus exporter is important because it prevents losing the unit field value during conversion. However, there might be situations where users prefer to avoid additional suffixes and keep the metric name as close to the original OTel name as possible. Considering the feedback in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/21743, it seems reasonable to provide this option. If we do include it, what should the default behavior be? It is plausible that full normalization should be enabled by default, but alternative viewpoints are welcome.
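For reference, the exporter-side full normalization discussed in this thread can be sketched roughly as follows. This is a simplified illustration of the behavior (dots to underscores, unit suffix, `_total` for monotonic sums), not the actual `pkg/translator/prometheus` code; the `unitSuffix` table covers only the unit used in the examples above.

```go
package main

import (
	"fmt"
	"strings"
)

// unitSuffix maps a unit to the spelled-out suffix appended during
// full normalization. Only "s" -> "seconds" is included here as an
// example; the real translator handles many more units.
var unitSuffix = map[string]string{"s": "seconds"}

// normalize sketches full normalization on the exporter side:
// replace dots with underscores, append the spelled-out unit, and
// append _total for monotonic sums (counters).
func normalize(name, unit string, monotonicSum bool) string {
	out := strings.ReplaceAll(name, ".", "_")
	if suffix, ok := unitSuffix[unit]; ok && !strings.HasSuffix(out, "_"+suffix) {
		out += "_" + suffix
	}
	if monotonicSum && !strings.HasSuffix(out, "_total") {
		out += "_total"
	}
	return out
}

func main() {
	// The example from question 1 above.
	fmt.Println(normalize("system.cpu.time", "s", true)) // system_cpu_time_seconds_total
}
```

A disable knob for this behavior would essentially skip the two suffix steps while keeping the dot sanitization, since the latter is required for the exposition format regardless.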