prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

On-scrape conversion of classic histograms to native histograms (opt-in flag) #13304

Open bwplotka opened 10 months ago

bwplotka commented 10 months ago

Proposal

Acceptance Criteria

Open Questions

Motivation

Using native histograms can be slightly different on the PromQL layer (e.g. new functions), but they are generally much cheaper for Prometheus and potential remote backends.

On top of that (the main rationale), native histograms are superior for remote write cases, as they naturally make the streaming more atomic/transactional at scale: the scraped information about a histogram is self-contained in one sample, instead of spread across multiple series that could be sent in different remote write streams/requests. This would be a huge improvement when adopting remote write (both 1.0 and 2.0).
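For illustration, a classic histogram in the text exposition format spans several series (every `le` bucket plus `_sum` and `_count`), which is exactly what can end up split across remote write requests; the metric name and values below are made up:

```
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 3
http_request_duration_seconds_bucket{le="0.5"} 7
http_request_duration_seconds_bucket{le="1"} 8
http_request_duration_seconds_bucket{le="+Inf"} 9
http_request_duration_seconds_sum 4.2
http_request_duration_seconds_count 9
```

A native histogram carries all of that information in a single sample.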

However, migration to native histograms will take time, mostly due to required instrumentation changes (even if it's as simple as upgrading/configuring the SDKs).

Doing an automatic migration, ideally in place, would be an epic way to have a one-off transition to the new histograms from a certain point in time. This is related to the DevSummit topic on transition strategies; I don't think we ever reached a conclusion on that.

Alternatives

cc @beorn7 @SuperQ @roidelapluie

SuperQ commented 10 months ago

While not as efficient in exposition, this would also allow clients to expose more classic histogram buckets without the downside of increased cardinality on Prometheus.

bboreham commented 10 months ago

This only works if the buckets in your classic histogram match some set of native histogram buckets.

Maybe if you added some error tolerance, like "convert to a native histogram if the maximum mismatch of any bucket boundary is <1%"?
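A minimal sketch of what such a tolerance check could look like (the function name and the 1% threshold are illustrative, not an existing Prometheus API):

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether every classic upper bound is within
// maxRelErr (relative error) of the corresponding candidate native boundary.
func withinTolerance(classic, candidate []float64, maxRelErr float64) bool {
	if len(classic) != len(candidate) {
		return false
	}
	for i := range classic {
		relErr := math.Abs(classic[i]-candidate[i]) / math.Abs(candidate[i])
		if relErr > maxRelErr {
			return false
		}
	}
	return true
}

func main() {
	classic := []float64{1, 2, 4, 8, 16}   // classic le= upper bounds (+Inf omitted)
	candidate := []float64{1, 2, 4, 8, 16} // candidate native boundaries in the same range
	fmt.Println(withinTolerance(classic, candidate, 0.01)) // true
}
```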

bwplotka commented 10 months ago

Yes, I assume there will be some error tolerance, perhaps configurable 👍🏽
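Purely as a sketch of what "configurable" might mean, e.g. per scrape config (all field names below are hypothetical; nothing like this exists yet):

```yaml
scrape_configs:
  - job_name: example
    # Hypothetical, illustrative fields only:
    convert_classic_histograms: true            # opt in to on-scrape conversion
    convert_classic_histograms_max_error: 0.01  # max relative bucket-boundary mismatch
```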

beorn7 commented 10 months ago

tl;dr: It was always the plan to do this, but we need custom bucket layouts #11277 first.

Longer version:

As @bboreham has mentioned already, converting a classic histogram into a native one only works well in the (unlikely) case that the bucket layout of the classic histogram closely matches the bucket layout of a native histogram. In practice, this will happen very rarely. The most plausible scenario is bucket boundaries like 1, 2, 4, 8, 16, …, which is schema 0 in the native histogram world. Even allowing a small-ish error tolerance will not create many more matches. We could use interpolation and a significantly higher resolution for the native histogram, filling the (many) native buckets that fall within the range of a single original classic bucket with equal shares of its count. This would create "equally bad" quantile estimations, maybe still at a somewhat lower resource cost. I'm not sure it is worth going down that path. It would also create confusion.
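For reference, a small sketch of the standard exponential boundaries referred to above: consecutive boundaries are powers of base = 2^(2^-schema), so schema 0 yields 1, 2, 4, 8, 16, … and higher schemas give finer resolution (the function name below is mine, not an existing Prometheus API):

```go
package main

import (
	"fmt"
	"math"
)

// boundary returns the i-th boundary of the exponential bucketing for the
// given schema; boundaries are consecutive powers of base = 2^(2^-schema).
func boundary(schema, i int) float64 {
	base := math.Pow(2, math.Pow(2, -float64(schema)))
	return math.Pow(base, float64(i))
}

func main() {
	// Schema 0: base = 2, so boundaries 1, 2, 4, 8, 16, …
	for i := 0; i <= 4; i++ {
		fmt.Printf("schema 0, boundary %d: %g\n", i, boundary(0, i))
	}
	// Schema 3: base = 2^(1/8) ≈ 1.0905, i.e. much finer resolution.
	fmt.Printf("schema 3, boundary 1: %.4f\n", boundary(3, 1))
}
```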

Custom bucket layouts (see #11277) would solve all of these problems: we could just directly emulate the classic histogram. And this very use case was one of the motivating factors for putting custom bucket layouts on the feature list. It is, however, quite involved, and we have many lower-hanging fruit to harvest first.

bwplotka commented 10 months ago

Good points.

I wonder if, despite the lack of custom bucket support, we could do some (opt-in) translation with some (big) error tolerance, even accepting all those "bad" consequences.

Rationales:

A) We could do this now.

~B) Even with custom bucket layouts, many downstream Prometheus users would still have exactly the same problem. That translation will be needed for systems which only support either static or exponential buckets (e.g. OTel and Google, but most likely everybody else who does not directly import the Prometheus DB) and have not implemented a mixed mode (or don't plan to). The difference is that it would not be directly a Prometheus problem.~

EDIT: I somehow assumed we want a "mixed" histogram, i.e. a sample with both exponential and custom buckets 😱 Verified with @beorn7 that's not the case; it's either one or the other 🙈

~So my question is: is there room for adding a no-custom-bucket mode for this conversion for now, and perhaps keeping it later? Once custom buckets land in native histograms we could either replace it or have two modes 🤔 @beorn7~

EDIT: Given the above mistake, it might indeed be much better to collaborate on the custom bucket work 🤔 How can I help 🙈