fstab opened this issue 9 months ago
The specification currently says:
> Prometheus SDK exporters MUST NOT allow duplicate UNIT, HELP, or TYPE comments for the same metric name to be returned in a single scrape of the Prometheus endpoint. Exporters MUST drop entire metrics to prevent conflicting TYPE comments, but SHOULD NOT drop metric points as a result of conflicting UNIT or HELP comments. Instead, all but one of the conflicting UNIT and HELP comments (but not metric points) SHOULD be dropped. If dropping a comment or metric points, the exporter SHOULD warn the user through error logging.
So, for conflicting unit or help, exporters SHOULD keep one of them. I believe all SDKs that have implemented this currently keep the first HELP or UNIT, and cache it to ensure it doesn't change between scrapes.
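The keep-the-first-and-cache behavior can be sketched like this (all names here are illustrative, not taken from any actual SDK):

```python
# Sketch of "keep the first HELP seen, cache it, and drop conflicting ones
# with a warning". Class and attribute names are assumptions for illustration.
class MetadataCache:
    def __init__(self):
        self._help = {}     # metric name -> first HELP text seen
        self._dropped = []  # (name, dropped HELP) pairs, for error logging

    def resolve_help(self, name, help_text):
        cached = self._help.setdefault(name, help_text)
        if cached != help_text:
            # A conflicting HELP arrived later: drop it, keep the first one.
            self._dropped.append((name, help_text))
        return cached

cache = MetadataCache()
assert cache.resolve_help("my_metric", "first help") == "first help"
# A conflicting HELP on a later registration is dropped; the first is kept:
assert cache.resolve_help("my_metric", "other help") == "first help"
assert cache._dropped == [("my_metric", "other help")]
```

The cache also explains why the choice is stable across scrapes: once a HELP text is recorded for a name, every subsequent scrape exposes that same text.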
For conflicting (Prometheus) TYPE, we currently drop the metric.
To summarize the differences between the current spec and your comment:
Thankfully, unit conflicts should be difficult to hit, since we add a unit suffix when the unit is present, and otherwise discourage units in metric names. So to get a conflict, you would have to do something like:
- `my.metric.seconds`, unit: `` -> `my_metric_seconds`
- `my.metric`, unit: `s` -> `my_metric_seconds`
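A minimal sketch of why the suffix logic produces this collision (the helper names and the tiny unit table below are assumptions for illustration, not real exporter code):

```python
# Illustrative name mapping: dots become underscores, and a word form of the
# unit is appended as a suffix when a unit is present and not already there.
def unit_word(unit):
    # Tiny subset of a UCUM-to-word mapping; real exporters cover many more.
    return {"s": "seconds", "By": "bytes"}.get(unit, unit)

def prometheus_name(otel_name, unit):
    name = otel_name.replace(".", "_")  # sanitize dots
    if unit and not name.endswith("_" + unit_word(unit)):
        name += "_" + unit_word(unit)   # append the unit suffix
    return name

assert prometheus_name("my.metric.seconds", None) == "my_metric_seconds"
assert prometheus_name("my.metric", "s") == "my_metric_seconds"  # collision!
```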
My only concern with switching unit differences to drop/fail is that units can't currently be configured using Views, which means a user using two libraries with conflicting units doesn't have an easy fix: they will have to change the name of one of the metrics to resolve it (maybe that isn't a bad thing?). With unit comments having low adoption, it seems like unit conflicts in the collection layer (e.g. in the collector's Prometheus exporter) could also be very common.
For drop vs fail the scrape, I think we should defer to the recommendations of the prometheus community. @fstab do you think we should fail scrapes for all irreconcilable differences? Additional cases include:
Thanks a lot @dashpole. I didn't see that the "Metric Metadata" section answers some of these points, I just looked at the "Instrumentation Scope" section. Thanks for the pointer.
For drop vs fail the scrape, I think we should defer to the recommendations of the prometheus community.
I'll reach out and ask for opinions. I'll post the result here.
@fstab do you think we should fail scrapes for all irreconcilable differences?
Great question. My gut feeling:
I added an item to tomorrow's Prometheus Dev Summit agenda. If anyone's interested in joining, please do, everyone is welcome to participate and propose topics. Link to the calendar invite is at the top of the agenda doc.
Awesome news from the Prometheus Dev Summit: The Prometheus team decided to implement support for OTel's instrumentation scope. See Meeting Notes (2023-11-30) for reference.
Background
The internal data model in the Prometheus server already supports metrics with the same name but different metadata (TYPE, UNIT, HELP) as long as the labels are different. Common use case: `service1` exposes a metric of type gauge named `request_size_bytes` and `service2` exposes a metric of type unknown named `request_size_bytes`; then you would have two metrics with the same name but different labels in the Prometheus server:
```
# TYPE request_size_bytes gauge
request_size_bytes{job="service1", ...} ...
# TYPE request_size_bytes unknown
request_size_bytes{job="service2", ...} ...
```
There is no issue with the name conflict because the `job` label differs. In the same way, there is no issue with name conflicts if the `otel_scope_name` label differs, even if `job` and `instance` are the same. Name conflicts are fine as long as the set of labels differs.
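A toy model of that identity rule (purely illustrative, not Prometheus' actual implementation): a series is identified by its complete label set, with the metric name being just another label.

```python
# In Prometheus' data model a series is identified by its full label set,
# where the metric name is itself a label called __name__.
def series_id(labels):
    return frozenset(labels.items())

# Same name, different job label: two distinct series, no conflict.
s1 = series_id({"__name__": "request_size_bytes", "job": "service1"})
s2 = series_id({"__name__": "request_size_bytes", "job": "service2"})
assert s1 != s2

# Same name, same job and instance, but a differing otel_scope_name label
# alone is enough to make the series distinct.
s3 = series_id({"__name__": "request_size_bytes", "job": "svc",
                "instance": "a", "otel_scope_name": "scope1"})
s4 = series_id({"__name__": "request_size_bytes", "job": "svc",
                "instance": "a", "otel_scope_name": "scope2"})
assert s3 != s4
```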
Why doesn't it work then?
The reason why name conflicts with OpenTelemetry instrumentation scope are an issue has nothing to do with Prometheus' internal data model; internally, Prometheus supports this. The issue has to do with the exposition format: the Prometheus text format (and the OpenMetrics format) does not allow metrics with the same name but different metadata (TYPE, HELP, UNIT).
What did the Prometheus team decide?
The Prometheus team decided to experiment with an `# EOS` (end of scope) marker in the text format. This would look something like this:
```
# TYPE request_size_bytes gauge
# HELP request_size_bytes some help
request_size_bytes{otel_scope_name="scope1", ...} ...
# EOS
# TYPE request_size_bytes unknown
# HELP request_size_bytes some other help
request_size_bytes{otel_scope_name="scope2", ...} ...
```
If that works we will be able to map OpenTelemetry's instrumentation scope to Prometheus without conflicts.
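A rough sketch of what a serializer emitting such per-scope blocks might do (note the `# EOS` marker is experimental and its final syntax may differ; the data layout below is an assumption for illustration, not a real exporter):

```python
# Emit one metadata block per instrumentation scope, separated by "# EOS",
# so the same metric name may appear with different TYPE/HELP per scope.
def render(scopes):
    out = []
    for scope in scopes:
        out.append(f'# TYPE {scope["name"]} {scope["type"]}')
        out.append(f'# HELP {scope["name"]} {scope["help"]}')
        for labels, value in scope["samples"]:
            out.append(f'{scope["name"]}{{{labels}}} {value}')
        out.append("# EOS")  # end of scope: metadata may repeat after this
    return "\n".join(out)

text = render([
    {"name": "request_size_bytes", "type": "gauge", "help": "some help",
     "samples": [('otel_scope_name="scope1"', 1.0)]},
    {"name": "request_size_bytes", "type": "unknown", "help": "some other help",
     "samples": [('otel_scope_name="scope2"', 2.0)]},
])
assert text.count("# EOS") == 2
assert text.count("# TYPE request_size_bytes") == 2  # no longer a conflict
```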
@fstab Let me know if you need any help on the OTel side of things here.
While implementing open-telemetry/opentelemetry-java#6015 I found some corner cases when mapping the OTel instrumentation scope to Prometheus. I think it would be worthwhile to clarify these cases in the Prometheus and OpenMetrics Compatibility spec.
Good Case Example
Let's say we have two counters named `processing.time`: one defined with `scopeA` and one defined with `scopeB`. In Prometheus these two counters get merged into a single counter with different `otel_scope_...` attributes.

This good case is trivial because the metadata for the counters is exactly the same. The corner cases arise when there are differences.
What if the `TYPE` differs?

Thoughts:

- `TYPE` refers to the Prometheus type. Metrics might have different types in OpenTelemetry but the same type in Prometheus. If the Prometheus type is the same, the metrics can be merged. Example: `DOUBLE_GAUGE` and non-monotonic `LONG_SUM` are both `GAUGE` in Prometheus, hence they can be merged. `HISTOGRAM` and `EXPONENTIAL_HISTOGRAM` are also the same type in Prometheus, so they can be merged.
- An `INFO` metric in OpenMetrics format becomes a `GAUGE` metric in Prometheus text format. I don't think that's relevant here, because OTel can only produce `COUNTER`, `GAUGE`, `HISTOGRAM`, and `SUMMARY`. However, if OTel can produce more types we need to keep in mind that the other types depend on the exposition format.

What to do if the type differs?

- Merge them as `UNKNOWN`? This should work for all metrics except exponential histograms.
- My opinion is that responding to the scrape request with HTTP 500 is better than dropping metrics, because then the Prometheus server will show an issue with scraping and users can fix the issue, rather than having a metric silently missing.
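The mergeability rule above can be sketched as a lookup table (the OTel type names below are paraphrased for illustration, and the mapping only covers the cases mentioned):

```python
# Two metrics can be merged if their OTel point types map to the same
# Prometheus TYPE. Names are illustrative, not actual SDK identifiers.
OTEL_TO_PROM = {
    "DOUBLE_GAUGE": "gauge",
    "NON_MONOTONIC_LONG_SUM": "gauge",      # non-monotonic sums are gauges
    "MONOTONIC_SUM": "counter",
    "HISTOGRAM": "histogram",
    "EXPONENTIAL_HISTOGRAM": "histogram",   # same Prometheus TYPE
    "SUMMARY": "summary",
}

def mergeable(type_a, type_b):
    return OTEL_TO_PROM[type_a] == OTEL_TO_PROM[type_b]

assert mergeable("DOUBLE_GAUGE", "NON_MONOTONIC_LONG_SUM")
assert mergeable("HISTOGRAM", "EXPONENTIAL_HISTOGRAM")
assert not mergeable("DOUBLE_GAUGE", "MONOTONIC_SUM")  # gauge vs counter
```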
What if the `HELP` text differs?

- … `HELP` text?
- … `HELP` texts?
- … `HELP`?

What if the `UNIT` differs?

I don't think metrics with different units can be merged, but how to handle the error?
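If the answer ends up being "fail the scrape", the detection could look roughly like this (hypothetical names; the actual HTTP 500 wiring in an exporter is omitted):

```python
# Illustrative sketch: detect an irreconcilable UNIT conflict and raise,
# so the exporter can answer the scrape with HTTP 500 instead of silently
# dropping a metric.
class ScrapeError(Exception):
    """Raised to make the exporter respond to the scrape with HTTP 500."""

def check_units(metrics):
    seen = {}
    for name, unit in metrics:
        if name in seen and seen[name] != unit:
            raise ScrapeError(
                f"conflicting units for {name}: {seen[name]!r} vs {unit!r}")
        seen[name] = unit

check_units([("request_size", "By"), ("request_duration", "s")])  # ok
try:
    check_units([("request_size", "By"), ("request_size", "s")])
    failed = False
except ScrapeError:
    failed = True
assert failed  # the conflicting unit fails the scrape
```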
Curious what you think.