Open thampiotr opened 3 months ago
Note: this issue can be used to track the status of the upstream issues (https://github.com/prometheus/prometheus/issues/11220 and https://github.com/prometheus/prometheus/issues/13577) - there's not much we can do right now to mitigate this.
What's wrong?
When remote-writing metrics using the prometheus.remote_write component to a backend with out-of-order ingestion enabled (and a sufficiently large time window) from a cluster of Alloy instances, with exemplars and/or native histograms enabled, users can observe an increase in remote_write errors. The increase correlates with restarts or with adding new instances to the Alloy cluster. Upon closer inspection, the errors are in the form of:
Or a similar error mentioning an out-of-order sample; for example, with a Mimir backend it would be:
If you look at details from the "sample-out-of-order" error above and check what metrics are failing, these will be samples for native histograms.
Root cause and additional findings
I was able to verify that the issue relates almost exclusively to exemplars and native histograms: once I disabled them, the errors went away and the success rate reached 100% in a cluster writing over 1.5 million samples per second.
After discussing with engineers closer to the topic, we believe that the root cause for this is missing upstream support for out-of-order ingestion of exemplars and native histograms:
Note that in practice, some backends will still process all the regular samples and will only drop the exemplars and native histograms, but the remote write client will get incorrect error counts due to limited information surfaced by the protocol when the batch is partially successful. See this issue in Prometheus for details.
Possible workarounds
Users who want to avoid these errors should disable exemplars and native histograms until support for out-of-order (OOO) ingestion is added to Prometheus.
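As a rough illustration of this workaround, the sketch below turns off both features on the remote-write side. It assumes the send_exemplars and send_native_histograms arguments of the endpoint block in prometheus.remote_write; the component label "default" and the URL are placeholders rather than values from this issue, so check the component reference for your Alloy version before relying on it.

```alloy
prometheus.remote_write "default" {
  endpoint {
    // Placeholder URL (assumption): replace with your backend's remote-write endpoint.
    url = "https://mimir.example.com/api/v1/push"

    // Stop forwarding exemplars to the backend.
    send_exemplars = false

    // Stop forwarding native histograms to the backend.
    send_native_histograms = false
  }
}
```

With both arguments set to false, only regular samples should be forwarded, which is the case that out-of-order ingestion already handles.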
We could consider splitting the exemplars and native histograms into a separate pipeline, but that would require some work on Alloy to support such filtering and would be short-lived, until the upstream issues are resolved.
Steps to reproduce
System information
Likely every OS
Software version
v1.1.1 and likely every previous version
Configuration
No response
Logs
No response