grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Prometheus: unexpected out-of-order errors when writing metrics from an Alloy cluster with exemplars and native histograms #1117

Open thampiotr opened 3 months ago

thampiotr commented 3 months ago

What's wrong?

When remote-writing metrics with the prometheus.remote_write component from a cluster of Alloy instances to a backend that has out-of-order ingestion enabled (with a sufficiently large time window), and with exemplars and/or native histograms enabled, users can observe an increase in remote_write errors that correlates with restarts of, or addition of new instances to, the Alloy cluster. A sketch of such a pipeline is shown below.
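For reference, this is roughly the kind of pipeline that is affected. It is a minimal sketch: the component labels, target address, and endpoint URL are placeholders, and it assumes the standard clustering block of prometheus.scrape and the send_exemplars / send_native_histograms endpoint arguments of prometheus.remote_write:

prometheus.scrape "apps" {
  targets    = [{"__address__" = "app:8080"}]
  forward_to = [prometheus.remote_write.backend.receiver]

  // Run the scrape as part of the Alloy cluster, so targets are redistributed
  // when instances restart or new instances join.
  clustering {
    enabled = true
  }
}

prometheus.remote_write "backend" {
  endpoint {
    // Backend with out-of-order ingestion enabled and a sufficiently large window.
    url = "https://<backend>/api/v1/push"

    // Sending exemplars and native histograms is the combination that
    // triggers the out-of-order errors described in this issue.
    send_exemplars         = true
    send_native_histograms = true
  }
}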

Upon closer inspection, errors are in the form of:

ts=2024-06-19T17:38:16.946175Z level=error msg="non-recoverable error" ... url=(...) count=5 exemplarCount=1 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester ...: user=...: err: out of order exemplar. timestamp=2024-06-19T17:35:15Z, series=a_test_total{...}, exemplar={...}"

Or a similar error mentioning an out-of-order sample; for example, with a Mimir backend it looks like this:

server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester ingester-zone-a-9: user=9960: the sample has been rejected because another sample with a more recent timestamp has already been ingested and out-of-order samples are not allowed (err-mimir-sample-out-of-order).

If you look at the details of the "sample-out-of-order" error above and check which metrics are failing, they turn out to be samples for native histograms.

Root cause and additional findings

I have been able to verify that the issue relates almost exclusively to exemplars and native histograms: after disabling them, the errors went away and the success rate went to 100% in a cluster writing over 1.5 million samples per second.

After discussing with engineers closer to the topic, we believe that the root cause is missing upstream support for out-of-order ingestion of exemplars and native histograms (see https://github.com/prometheus/prometheus/issues/11220 and https://github.com/prometheus/prometheus/issues/13577).

Note that in practice, some backends will still process all the regular samples and will only drop the exemplars and native histograms, but the remote write client will get incorrect error counts due to limited information surfaced by the protocol when the batch is partially successful. See this issue in Prometheus for details.

Possible workarounds

Users who want to avoid these errors are advised to disable exemplars and native histograms until support for out-of-order ingestion of them is added to Prometheus, as in the sketch below.
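As a rough sketch (placeholder label and URL; send_exemplars and send_native_histograms are the relevant prometheus.remote_write endpoint arguments), the remote-write side of that workaround looks like this:

prometheus.remote_write "backend" {
  endpoint {
    url = "https://<backend>/api/v1/push"

    // Workaround: do not send exemplars or native histograms until
    // out-of-order ingestion for them is supported upstream.
    send_exemplars         = false
    send_native_histograms = false
  }
}

Depending on how native histograms enter the pipeline, they may also need to be disabled on the scrape or receive side.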

We could consider splitting exemplars and native histograms into a separate pipeline, but that would require some work in Alloy to support such filtering and would only be a short-lived measure until the upstream issues are resolved.

Steps to reproduce

System information

Likely every OS

Software version

v1.1.1 and likely every previous version

Configuration

No response

Logs

No response

thampiotr commented 3 months ago

Note: this issue can be used to track the status of the upstream issues (https://github.com/prometheus/prometheus/issues/11220 and https://github.com/prometheus/prometheus/issues/13577) - there's not much we can do right now to mitigate this.

github-actions[bot] commented 2 months ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!