grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

err-mimir-ingester-max-series behaviour #5923

Open fredsig opened 1 year ago

fredsig commented 1 year ago

Describe the bug

We recently hit the maximum number of in-memory metric series per ingester. We expected samples for existing in-memory series to keep being appended, but once the limit kicked in we lost all of our metrics (ingestion stopped completely). During this period, Prometheus (running in agent mode) got mostly 5xx responses from the Mimir endpoint and kept retrying until we increased the maximum and manually recycled the ingesters and Prometheus.

To Reproduce

Steps to reproduce the behavior:

Configure Mimir with a maximum number of in-memory metric series per ingester (example from my Helm chart values file, for a maximum of ~1.2 million):

    ingester:
      instance_limits:
        max_series: 1200000

Write enough metric series to hit the per-ingester maximum. We could see a 500 error being sent back to Prometheus on the remote-write operation:

server returned HTTP status 500 Internal Server Error: failed pushing to ingester: rpc error: code = Unknown desc = user=main: the write request has been rejected because the ingester exceeded the allowed number of in-memory series (err-mimir-ingester-max-series). To adjust the related limit, configure -ingester.instance-limits.max-series, or contact your service administrator.

I believe Prometheus will keep retrying (by design) if a 500 HTTP error is returned by the remote-write endpoint. What we observed is that we couldn't see samples being appended to existing series: every pushed metric was being rejected. Ingesters had plenty of memory and we had no OOM issues (the ring was in a good state). We fixed the issue by increasing the value of max series per ingester, but we also had to recycle each ingester. In addition, we had to recycle Prometheus (running in agent mode) so it would start pushing metric series to Mimir again. The behaviour indicates that all of the metric series were being rejected during the remote write, and Prometheus would later retry the batch since it got a 5xx for it.

We could also see the following errors in the Prometheus logs during the event:

server returned HTTP status 500 Internal Server Error: failed pushing to ingester: rpc error: code = DeadlineExceeded desc = context deadline exceeded
ts=2023-09-04T14:03:15.921Z caller=dedupe.go:112 component=remote level=error remote_name=d25521 url=https://mimir-endpoint/api/v1/push msg="non-recoverable error" count=439 exemplarCount=0 err="context canceled"

Expected behavior

https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#MimirIngesterReachingSeriesLimit

When the limit on the number of in-memory series is reached, new series are rejected, while samples can still be appended to existing ones.

Environment

pstibrany commented 1 year ago

Expected behavior

https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#MimirIngesterReachingSeriesLimit

When the limit on the number of in-memory series is reached, new series are rejected, while samples can still be appended to existing ones.

Hello. This is in fact what's happening; however, if there is even a single new series in the push request and Mimir has already reached the limit, Mimir returns a 500 status code in the response. Unfortunately, that doesn't tell Prometheus which series were rejected and which were accepted, so Prometheus/Agent keeps retrying the same request over and over without advancing to newer data. That's why you don't see new data being added.
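
For illustration, here is a minimal Go sketch of that behaviour. It is not the actual Mimir code path, and the names (`ingester`, `push`, `errMaxSeriesReached`) are hypothetical; it only shows how a single new series over the limit fails the whole request, which the client then retries wholesale:

    // A deliberately simplified model of the behaviour described above, not the
    // real Mimir code: if any series in the batch is new while the ingester is
    // already at its in-memory series limit, the whole push fails.
    package main

    import (
        "errors"
        "fmt"
    )

    var errMaxSeriesReached = errors.New("err-mimir-ingester-max-series")

    type ingester struct {
        inMemory  map[string]struct{} // series currently held in memory
        maxSeries int                 // cf. -ingester.instance-limits.max-series
    }

    // push appends a batch of series. A single rejected new series turns the
    // entire request into an error, which the client sees as a 500.
    func (i *ingester) push(batch []string) error {
        for _, series := range batch {
            if _, ok := i.inMemory[series]; ok {
                continue // existing series: its sample could still be appended
            }
            if len(i.inMemory) >= i.maxSeries {
                return fmt.Errorf("failed pushing to ingester: %w", errMaxSeriesReached)
            }
            i.inMemory[series] = struct{}{}
        }
        return nil
    }

    func main() {
        ing := &ingester{
            inMemory:  map[string]struct{}{`up{job="a"}`: {}},
            maxSeries: 1,
        }
        // The batch mixes an existing series with a new one: the whole push
        // fails, so a retrying client resends both and never advances.
        fmt.Println(ing.push([]string{`up{job="a"}`, `up{job="b"}`}))
    }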

fredsig commented 1 year ago

Hi Peter, thanks for getting back on this.

Hello. This is in fact what's happening; however, if there is even a single new series in the push request and Mimir has already reached the limit, Mimir returns a 500 status code in the response. Unfortunately, that doesn't tell Prometheus which series were rejected and which were accepted, so Prometheus/Agent keeps retrying the same request over and over without advancing to newer data. That's why you don't see new data being added.

This is useful; now it makes sense why we couldn't see any new metrics being pushed when the limit kicked in. I presume that since metrics are pushed in batches, Prometheus gets a 500 for the entire batch (there is no granularity as to which samples were added successfully and which ones hit the limit). Prometheus clients running in normal TSDB mode retried later and there was no gap, but everything coming from Prometheus in agent mode was lost during that period (we have a very aggressive wal-truncate-frequency). Perhaps it would be good to clarify this behaviour when using Prometheus in agent mode as opposed to standard Prometheus with a full TSDB (or is this common to both?).
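
To make the retry semantics concrete, here is a simplified, hypothetical Go sketch of the sender side of this thread. It is not the Prometheus remote-write code; `drainQueue`, `pushResult`, and the simulated `push` function are made up for illustration. It only shows the two outcomes visible in the logs above: a recoverable error (such as a 500) makes the sender retry the same batch without advancing, while a non-recoverable error drops the batch and leaves a gap.

    // Hypothetical sketch: recoverable errors retry the same batch without
    // advancing; non-recoverable errors drop the batch, losing those samples.
    package main

    import "fmt"

    type pushResult int

    const (
        ok             pushResult = iota
        recoverable               // e.g. HTTP 500: retry the same batch
        nonRecoverable            // e.g. "context canceled": drop the batch
    )

    // drainQueue consumes queued remote-write batches; push stands in for the
    // HTTP call to the /api/v1/push endpoint.
    func drainQueue(queue [][]string, push func([]string) pushResult) (sent, dropped int) {
        for i := 0; i < len(queue); {
            switch push(queue[i]) {
            case ok:
                sent++
                i++ // advance to the next batch
            case recoverable:
                continue // retry the same batch; newer data queues up behind it
            case nonRecoverable:
                dropped++ // these samples are lost for good
                i++
            }
        }
        return sent, dropped
    }

    func main() {
        attempts := 0
        queue := [][]string{{"batch-0"}, {"batch-1"}, {"batch-2"}}
        push := func(batch []string) pushResult {
            switch batch[0] {
            case "batch-0": // fails twice with a 500, then succeeds
                attempts++
                if attempts <= 2 {
                    return recoverable
                }
                return ok
            case "batch-1": // hits a non-recoverable error and is dropped
                return nonRecoverable
            default:
                return ok
            }
        }
        sent, dropped := drainQueue(queue, push)
        fmt.Printf("sent=%d dropped=%d\n", sent, dropped) // sent=2 dropped=1
    }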