open-telemetry / opentelemetry-specification


OTLP -> Prometheus: Instrumentation Scope needs clarification #3778

fstab opened this issue 9 months ago

fstab commented 9 months ago

While implementing open-telemetry/opentelemetry-java#6015 I found some corner cases when mapping the OTel instrumentation scope to Prometheus. I think it would be worthwhile to clarify these cases in the Prometheus and OpenMetrics Compatibility spec.

Good Case Example

Let's say we have two counters named processing.time: one defined in scopeA and one defined in scopeB. In Prometheus these two counters get merged into a single counter whose series are distinguished by the otel_scope_* labels:

# TYPE processing_time_seconds counter
# UNIT processing_time_seconds seconds
# HELP processing_time_seconds processing time in seconds
processing_time_seconds_total{a="b",otel_scope_name="scopeA",otel_scope_version="1.1"} 3.3
processing_time_seconds_total{a="b",otel_scope_name="scopeB",otel_scope_version="1.2"} 3.3

This good case is trivial because the metadata for the counters is exactly the same. The corner cases arise when there are differences.

What if the TYPE differs?

Thoughts:

What to do if the type differs?

My opinion is that responding to the scrape request with HTTP 500 is better than dropping metrics, because then the Prometheus server will surface the scrape failure and users can fix the issue, rather than having a metric silently missing.
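
To make the conflict concrete (the gauge variant below is hypothetical, just for illustration): if scopeA registered processing.time as a counter while scopeB registered it as a gauge, the exporter would have to emit two contradictory TYPE lines for the same metric name:

# TYPE processing_time_seconds counter
processing_time_seconds_total{a="b",otel_scope_name="scopeA",otel_scope_version="1.1"} 3.3
# TYPE processing_time_seconds gauge
processing_time_seconds{a="b",otel_scope_name="scopeB",otel_scope_version="1.2"} 3.3

The text format allows only one TYPE per metric name, so this output is invalid as written; the exporter has to either drop something or fail the scrape.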

What if the HELP text differs?

What if the UNIT differs?

I don't think metrics with different units can be merged, but how should the error be handled?

Curious what you think.

dashpole commented 9 months ago

Current Spec

The specification currently says:

Prometheus SDK exporters MUST NOT allow duplicate UNIT, HELP, or TYPE comments for the same metric name to be returned in a single scrape of the Prometheus endpoint. Exporters MUST drop entire metrics to prevent conflicting TYPE comments, but SHOULD NOT drop metric points as a result of conflicting UNIT or HELP comments. Instead, all but one of the conflicting UNIT and HELP comments (but not metric points) SHOULD be dropped. If dropping a comment or metric points, the exporter SHOULD warn the user through error logging.

So, for conflicting unit or help, exporters SHOULD keep one of them. I believe all SDKs that have implemented this currently keep the first HELP or UNIT, and cache it to ensure it doesn't change between scrapes.
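
As a sketch of that behavior (reusing the example from above; the second HELP text is made up): if scopeA's instrument is described as "processing time in seconds" and scopeB's as "time spent processing", the exporter keeps whichever comment it saw first but still exposes both series:

# TYPE processing_time_seconds counter
# UNIT processing_time_seconds seconds
# HELP processing_time_seconds processing time in seconds
processing_time_seconds_total{a="b",otel_scope_name="scopeA",otel_scope_version="1.1"} 3.3
processing_time_seconds_total{a="b",otel_scope_name="scopeB",otel_scope_version="1.2"} 3.3

scopeB's HELP text is dropped (with a warning logged), but its data points are kept.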

For conflicting (prometheus) TYPE, we currently drop the metric.

Differences

To summarize the differences between the current spec and your comment:

  1. UNIT conflict:
    • Current: Merge
    • Proposed: Drop/Fail
  2. How to "fail" when irreconcilable conflicts are found:
    • Current: Drop metric(s) and log a warning
    • Proposed: Fail the scrape (500 response)

Thoughts

Thankfully, unit conflicts should be difficult to hit, since we add a unit suffix when the unit is present, and otherwise discourage units in metric names. So to get a conflict, you would have to do something like the following.
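One hypothetical way to get there (the instrument names are made up, and I'm going by the name-mapping rules in the compat spec): scopeA registers a counter processing.time with unit s, which maps to processing_time_seconds with a UNIT comment, while scopeB registers a counter literally named processing.time.seconds with no unit set, which maps to the same Prometheus name but without any unit metadata:

# TYPE processing_time_seconds counter
# UNIT processing_time_seconds seconds
processing_time_seconds_total{otel_scope_name="scopeA"} 1.0
processing_time_seconds_total{otel_scope_name="scopeB"} 2.0

The single UNIT comment is only accurate for scopeA's series, so the exporter has to either merge (current behavior) or treat it as an error (drop/fail).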

My only concern with switching unit differences to drop/fail is that units can't currently be configured using Views, which means a user of two libraries with conflicting units doesn't have an easy workaround and will have to change the name of one of the metrics to resolve it (maybe that isn't a bad thing?). With unit comments having low adoption, it seems like unit conflicts in the collection layer (e.g. in the collector's prometheus exporter) could also be very common.

For drop vs fail the scrape, I think we should defer to the recommendations of the prometheus community. @fstab do you think we should fail scrapes for all irreconcilable differences? Additional cases include:

fstab commented 9 months ago

Thanks a lot @dashpole. I hadn't seen that the "Metric Metadata" section answers some of these points; I had only looked at the "Instrumentation Scope" section. Thanks for the pointer.

For drop vs fail the scrape, I think we should defer to the recommendations of the prometheus community.

I'll reach out and ask for opinions. I'll post the result here.

@fstab do you think we should fail scrapes for all irreconcilable differences?

Great question. My gut feeling:

fstab commented 9 months ago

I added an item to tomorrow's Prometheus Dev Summit agenda. If anyone's interested in joining, please do; everyone is welcome to participate and propose topics. The link to the calendar invite is at the top of the agenda doc.

fstab commented 9 months ago

Awesome news from the Prometheus Dev Summit: The Prometheus team decided to implement support for OTel's instrumentation scope. See Meeting Notes (2023-11-30) for reference.

Background

The internal data model in the Prometheus server already supports metrics with the same name but different metadata (TYPE, UNIT, HELP), as long as the labels differ. A common use case: service1 exposes a metric of type gauge named request_size_bytes and service2 exposes a metric of type unknown named request_size_bytes; you then have two metrics with the same name but different labels in the Prometheus server:

# TYPE request_size_bytes gauge
request_size_bytes{job="service1", ...} ...
# TYPE request_size_bytes unknown
request_size_bytes{job="service2", ...} ...

There is no issue with the name conflict because the job label differs. In the same way, there is no issue with name conflicts if the otel_scope_name label differs, even if job and instance are the same. Name conflicts are fine as long as the set of labels differs.
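
Applied to the scope case, the stored series would conceptually look like this (same job and instance, only the scope label differs; shown in the same notation as above, even though, as explained below, the text exposition format cannot actually carry this today):

# TYPE request_size_bytes gauge
request_size_bytes{job="service1", otel_scope_name="scopeA", ...} ...
# TYPE request_size_bytes unknown
request_size_bytes{job="service1", otel_scope_name="scopeB", ...} ...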

Why doesn't it work then?

The reason why name conflicts with OpenTelemetry instrumentation scope are an issue has nothing to do with Prometheus' internal data model; internally, Prometheus supports this. The issue is the exposition format: the Prometheus text format (and likewise the OpenMetrics format) does not allow metrics with the same name but different metadata (TYPE, HELP, UNIT).

What did the Prometheus team decide?

The Prometheus team decided to experiment with an # EOS (end of scope) marker in text format. This would look something like this:

# TYPE request_size_bytes gauge
# HELP request_size_bytes some help
my_metric{otel_scope_name="scope1", ...} ...
# EOS
# TYPE request_size_bytes unknown
# HELP request_size_bytes some other help
my_metric{otel_scope_name="scope2", ...} ...

If that works, we will be able to map OpenTelemetry's instrumentation scope to Prometheus without conflicts.
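
For the example from the top of this issue, that would also allow HELP (or TYPE and UNIT) to differ per scope, along these lines (purely illustrative; the exact marker syntax is still experimental, and the second HELP text is made up):

# TYPE processing_time_seconds counter
# UNIT processing_time_seconds seconds
# HELP processing_time_seconds processing time in seconds
processing_time_seconds_total{a="b",otel_scope_name="scopeA",otel_scope_version="1.1"} 3.3
# EOS
# TYPE processing_time_seconds counter
# UNIT processing_time_seconds seconds
# HELP processing_time_seconds time spent processing
processing_time_seconds_total{a="b",otel_scope_name="scopeB",otel_scope_version="1.2"} 3.3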

dashpole commented 7 months ago

@fstab Let me know if you need any help on the OTel side of things here.