prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

Support for ingesting out of order exemplars #13577

Open dimitarvdimitrov opened 5 months ago

dimitarvdimitrov commented 5 months ago

Proposal

Background

OTLP ingestion can include out-of-order samples, so exemplars can also arrive out of order. Today such exemplars are rejected: an error is returned to the client and the data is discarded (code).
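The in-order-only rule can be sketched in Go as follows. This is a simplified, hypothetical model: the names `seriesExemplars` and `errOutOfOrderExemplar` are made up here, and the real validation also handles duplicates and other cases. An exemplar older than the newest exemplar already stored for its series is rejected and never reaches storage.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfOrderExemplar is a hypothetical stand-in for the error returned to
// the client when an exemplar is rejected as out of order.
var errOutOfOrderExemplar = errors.New("out of order exemplar")

// seriesExemplars is a hypothetical, simplified per-series view that only
// tracks how many exemplars were stored and the newest timestamp (ms).
type seriesExemplars struct {
	stored   int
	newestTs int64
}

// add applies a simplified in-order-only rule: an exemplar older than the
// newest one already stored for the series is rejected and dropped.
func (s *seriesExemplars) add(ts int64) error {
	if s.stored > 0 && ts < s.newestTs {
		return errOutOfOrderExemplar
	}
	s.newestTs = ts
	s.stored++
	return nil
}

func main() {
	s := &seriesExemplars{}
	for _, ts := range []int64{100, 200, 150} {
		fmt.Println(ts, s.add(ts))
	}
	// 100 <nil>
	// 200 <nil>
	// 150 out of order exemplar
}
```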

Proposal

Support out-of-order exemplars. Implementing this should be as simple as modifying the series' linked list of exemplars (code). We would still insert the exemplar into the next available slot in the circular buffer, as today (code), and then reorder the series' existing exemplars: traverse from the oldest exemplar of the series to find the insertion point for the new exemplar, then update the linked-list pointers accordingly.
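A minimal Go sketch of this approach, assuming a heavily simplified storage: a fixed-size ring of slots where each slot also records the slot of the next exemplar of the same series, plus a per-series index of the oldest and newest slots. All names (`circularStorage`, `entry`, `seriesIndex`, `add`, `unlink`) are hypothetical and do not mirror Prometheus' actual code. The new exemplar still goes into the next ring slot; only the series' linked list is reordered, by walking from the oldest exemplar.

```go
package main

import "fmt"

// Hypothetical, simplified types; the real storage also tracks labels,
// native-histogram exemplars, metrics and locking.
type exemplar struct {
	ts    int64 // milliseconds
	value float64
}

type entry struct {
	ex     exemplar
	series string // owning series; "" marks a free slot
	next   int    // slot of the next (newer) exemplar of the series, -1 = none
}

type seriesIndex struct{ oldest, newest int }

type circularStorage struct {
	buf    []entry
	cursor int // next slot to (over)write
	index  map[string]*seriesIndex
}

func newCircularStorage(size int) *circularStorage {
	return &circularStorage{buf: make([]entry, size), index: map[string]*seriesIndex{}}
}

// unlink removes slot i from its series' list. After reordering, the slot
// being overwritten is not necessarily the series' oldest exemplar.
func (c *circularStorage) unlink(i int) {
	s := c.buf[i].series
	idx := c.index[s]
	if idx.oldest == i {
		if idx.newest == i { // last exemplar of the series
			delete(c.index, s)
		} else {
			idx.oldest = c.buf[i].next
		}
		return
	}
	prev := idx.oldest
	for c.buf[prev].next != i {
		prev = c.buf[prev].next
	}
	c.buf[prev].next = c.buf[i].next
	if idx.newest == i {
		idx.newest = prev
	}
}

// add writes e into the next ring slot as today, then links it into the
// series' list at its timestamp position by walking from the oldest.
func (c *circularStorage) add(series string, e exemplar) {
	i := c.cursor
	if c.buf[i].series != "" {
		c.unlink(i) // evict the exemplar that lived in this slot
	}
	c.buf[i] = entry{ex: e, series: series, next: -1}
	c.cursor = (c.cursor + 1) % len(c.buf)

	idx, ok := c.index[series]
	if !ok {
		c.index[series] = &seriesIndex{oldest: i, newest: i}
		return
	}
	if e.ts < c.buf[idx.oldest].ex.ts { // older than all: becomes the head
		c.buf[i].next = idx.oldest
		idx.oldest = i
		return
	}
	prev := idx.oldest // find the last exemplar with ts <= e.ts
	for n := c.buf[prev].next; n != -1 && c.buf[n].ex.ts <= e.ts; n = c.buf[n].next {
		prev = n
	}
	c.buf[i].next = c.buf[prev].next
	c.buf[prev].next = i
	if idx.newest == prev {
		idx.newest = i
	}
}

// timestamps lists a series' exemplar timestamps, oldest first.
func (c *circularStorage) timestamps(series string) []int64 {
	var out []int64
	if idx, ok := c.index[series]; ok {
		for n := idx.oldest; n != -1; n = c.buf[n].next {
			out = append(out, c.buf[n].ex.ts)
		}
	}
	return out
}

func main() {
	c := newCircularStorage(4)
	for _, ts := range []int64{100, 300, 200, 50} {
		c.add(`{job="api"}`, exemplar{ts: ts, value: 1})
	}
	fmt.Println(c.timestamps(`{job="api"}`)) // [50 100 200 300]
	c.add(`{job="api"}`, exemplar{ts: 250, value: 1}) // overwrites slot 0 (ts=100)
	fmt.Println(c.timestamps(`{job="api"}`)) // [50 200 250 300]
}
```

One consequence the sketch surfaces: once a series' list has been reordered, the slot the ring overwrites next is no longer guaranteed to hold that series' oldest exemplar, so eviction may also need to search the list (as `unlink` does here).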

Side effects/considerations

bboreham commented 5 months ago

Your analysis seems sound. One thing I would add: benchmark the change to check its effect on performance.

Historically we have not looked much at exemplar performance; it hasn't shown up as a big element in profiles. One possible reason is that there are relatively few exemplars in the wild.

If handling out-of-order exemplars has a significant impact on performance, perhaps we could treat them like samples and keep a faster path for in-order exemplars.
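One way to keep in-order exemplars cheap is to compare against the series' newest exemplar first and only walk the list when the new exemplar is older. The sketch below is hypothetical: it uses a standalone pointer-based list (`seriesList`, `node`) instead of the circular buffer, and only illustrates the fast-path/slow-path split.

```go
package main

import "fmt"

// node is a hypothetical list node; a real implementation would link slots
// of the circular buffer instead of allocating nodes.
type node struct {
	ts   int64
	next *node
}

// seriesList keeps one series' exemplars ordered by timestamp.
type seriesList struct {
	oldest, newest *node
	fastAppends    int // in-order inserts, O(1)
	oooInserts     int // out-of-order inserts
}

func (l *seriesList) insert(ts int64) {
	n := &node{ts: ts}
	switch {
	case l.newest == nil: // first exemplar of the series
		l.oldest = n
		l.newest = n
	case ts >= l.newest.ts: // fast path: append after the newest
		l.newest.next = n
		l.newest = n
		l.fastAppends++
	case ts < l.oldest.ts: // out of order and older than everything: new head
		n.next = l.oldest
		l.oldest = n
		l.oooInserts++
	default: // slow path: walk from the oldest to the insertion point
		prev := l.oldest
		for prev.next.ts <= ts { // terminates because newest.ts > ts
			prev = prev.next
		}
		n.next = prev.next
		prev.next = n
		l.oooInserts++
	}
}

// values returns the timestamps, oldest first.
func (l *seriesList) values() []int64 {
	var out []int64
	for n := l.oldest; n != nil; n = n.next {
		out = append(out, n.ts)
	}
	return out
}

func main() {
	var l seriesList
	for _, ts := range []int64{10, 20, 30, 25, 5} {
		l.insert(ts)
	}
	fmt.Println(l.values(), l.fastAppends, l.oooInserts) // [5 10 20 25 30] 2 2
}
```

With this split, a workload of purely in-order exemplars never traverses the list, so its cost should stay close to today's append.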

dimitarvdimitrov commented 5 months ago

Notes from @fionaliao:

It's worth going through the out-of-order (OOO) design doc and checking whether anything there applies to exemplars. I think the only thing we need to consider from it is whether OOO exemplars need to be written to the write-behind log (WBL), as we do for OOO samples.

In the draft PR, there is no check that the out-of-order exemplar falls within the OOO time window.
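A possible shape for that check, sketched in Go under the assumption that exemplars would reuse the rule applied to out-of-order samples: accept anything no older than the head's newest timestamp minus the configured `out_of_order_time_window`. The function `validateExemplarTs` and both error values are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical errors; not Prometheus' actual error values.
var (
	errOutOfOrder = errors.New("out of order exemplar")
	errTooOld     = errors.New("exemplar outside out-of-order time window")
)

// validateExemplarTs is a hypothetical check. seriesNewest is the newest
// exemplar timestamp of the series and headMaxT the newest timestamp seen by
// the head; all timestamps are in milliseconds.
func validateExemplarTs(ts, seriesNewest, headMaxT int64, oooWindow time.Duration) error {
	if ts >= seriesNewest {
		return nil // in order for this series: accepted as today
	}
	if oooWindow <= 0 {
		return errOutOfOrder // out-of-order ingestion disabled
	}
	if ts < headMaxT-oooWindow.Milliseconds() {
		return errTooOld // older than the window allows
	}
	return nil
}

func main() {
	const headMaxT = 1_000_000 // ms
	window := 10 * time.Minute // 600_000 ms
	fmt.Println(validateExemplarTs(1_000_500, 1_000_000, headMaxT, window)) // <nil>
	fmt.Println(validateExemplarTs(900_000, 990_000, headMaxT, window))     // <nil>
	fmt.Println(validateExemplarTs(300_000, 990_000, headMaxT, window))     // exemplar outside out-of-order time window
	fmt.Println(validateExemplarTs(900_000, 990_000, headMaxT, 0))          // out of order exemplar
}
```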