prometheus / common

Go libraries shared across Prometheus components and libraries.
Apache License 2.0
261 stars 320 forks source link

Potential bug in text parser #253

Open vikramraman opened 4 years ago

vikramraman commented 4 years ago

We use the text parser within our application to scrape prometheus metrics and are facing an issue related to one described here: https://github.com/uber/cadence/issues/3120

cadence exposes two histogram metrics named signal_info and signal_info_count.

When parsing the HELP line for the signal_info_count the parser appears to strip the _count suffix and sets the p.currentMF property: https://github.com/prometheus/common/blob/f39dfa2b000545b3a763b844012958c275a62266/expfmt/text_parse.go#L688

This ends up setting the property to the earlier signal_info histogram leading to the parser error here: https://github.com/prometheus/common/blob/f39dfa2b000545b3a763b844012958c275a62266/expfmt/text_parse.go#L485

brian-brazil commented 4 years ago

I'm not sure we ever formally defined this, however overlapping metric names like this would not be recommended. As this is the official parser, if it rejects it then this is a true positive.

In this case I'd suggest generally improving the metrics. Info as part of the name could be confused with the info metric pattern, and the count could be confused with a summary/histogram itself - which is why you're getting an error. There's also no indication in the name exactly what is being counted (entries in some struct?) nor what the unit of the size is (bytes?).

beorn7 commented 4 years ago

This is indeed intended behavior, at least "kind of"…

The code normalizes names of Summaries and Histograms (and their components) by pruning any of the "magic" suffixes. Then it reports the collision. On the part of the parser, the error message is confusing because the normalization is not reported.

However, if you use prometheus/client_golang to expose the metrics, the exposition will already fail. I assume the code in uber/cadence is rendering the exposition by its own means.

I totally understand your confusion because your two histograms actually don't create any direct name collisions in the text exposition. The reason to ban them nevertheless is that it would create a collision if you switched the type from Histogram to Summary (because the pre-calculated quantiles in the signal_info_count Summary would be called signal_info_count, colliding with the count component of the signal_info Summary (or even Histogram, FWIW).

On a more general note, I'm not a fan of the "magic" suffixes because handling collisions is as confusing and hairy as we see it right here. At least, the exact collision handling behavior needed to be spelled out. I guess/hope OpenMetrics will do exactly that. I would prefer to solve the problem for good, but apparently, that's not going to happen.