grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Unhealthy ingester pods are not recalibrating back to normal #8648

Open sid-jar opened 1 month ago

sid-jar commented 1 month ago

Describe the bug

We have been facing an issue where Mimir (after the MinIO update) makes use of dualstack URLs to access S3, which increased costs excessively when used with private endpoints. With the change from the following PR we were able to get rid of that issue. But since then there has been a persistent issue with the ingesters: every time one of the ingesters goes unhealthy, it does not come back to a healthy state unless it is manually 'forgotten' from the ingester ring page. Earlier there used to be only a downtime of two or more minutes until the ingester came back to a healthy state, which matters because the replication factor can only handle one ingester pod going down.

It is clear that the change from the PR and the observed behaviour are not directly related, but that change is the only apparent starting point of the behaviour.
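
For reference, the change boils down to disabling dualstack S3 endpoints on the blocks storage client. A minimal sketch in flag form follows; the exact option name is an assumption on my side, so verify it against the PR and the `mimir -help` output for your version:

    # Sketch only: disable dualstack S3 endpoints for blocks storage.
    # The flag name is assumed and may differ by Mimir version; verify with `mimir -help`.
    mimir \
      -blocks-storage.backend=s3 \
      -blocks-storage.s3.endpoint="s3.<region>.amazonaws.com" \
      -blocks-storage.s3.dualstack-enabled=false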

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir with 2.13.0-rc.0 as the base image
  2. Perform operations (read/write/others)

Expected behavior

Whenever an ingester goes down because of excessive load, it should recover and rejoin the ring as healthy on its own. Currently the unhealthy ingesters have to be manually forgotten to remediate the issue.
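
For completeness, the manual "forget" is done from the ring page served at /ingester/ring. A sketch of doing the same over HTTP is below; the `forget` form field, service name, namespace, and port are assumptions, so verify them against your deployment before relying on this:

    # Sketch: forget an unhealthy ingester from the ring without clicking through the UI.
    # The `forget` form field name, service name, namespace, and port are assumptions.
    kubectl -n mimir port-forward svc/mimir-distributor 8080:8080 &
    curl -X POST -d 'forget=mimir-ingester-zone-a-3' http://localhost:8080/ingester/ring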

Environment

Additional Context

If it helps, I can attach a screenshot of an unhealthy ingester that remained unhealthy for more than 12 hours, whereas before the change an ingester never stayed unhealthy for more than 8 minutes.

dimitarvdimitrov commented 1 month ago

Does the ingester log anything in this period? Anything at the start of it?

sid-jar commented 1 month ago

No, it doesn't. I sometimes wonder why the ingesters crash at all. I haven't been able to find the right direction on where to look, and would greatly appreciate it if one of you could help me there. I assume OOM is mostly the reason; sharing the resource utilization below in case it helps.

    mimir-ingester-zone-a-0   4573m   6187Mi
    mimir-ingester-zone-a-1     33m   5903Mi
    mimir-ingester-zone-a-2     13m   5990Mi
    mimir-ingester-zone-a-3     23m   6862Mi
    mimir-ingester-zone-a-4    452m   5888Mi
    mimir-ingester-zone-b-0     18m   5215Mi
    mimir-ingester-zone-b-1   4869m   6414Mi
    mimir-ingester-zone-b-2    940m   7263Mi
    mimir-ingester-zone-b-3     20m   5555Mi
    mimir-ingester-zone-b-4     36m   5259Mi
    mimir-ingester-zone-c-0   1029m   6184Mi
    mimir-ingester-zone-c-1     15m   4637Mi
    mimir-ingester-zone-c-2    160m   6561Mi
    mimir-ingester-zone-c-3     20m   5351Mi
    mimir-ingester-zone-c-4     20m   5490Mi
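
The snapshot above is from `kubectl top`; a sketch of watching it around the time an ingester goes unhealthy (the namespace and label selector are assumptions about our deployment):

    # Sketch: poll ingester CPU/memory every 30s; adjust namespace and labels to your setup.
    watch -n 30 'kubectl top pod -n mimir -l app.kubernetes.io/component=ingester'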

dimitarvdimitrov commented 1 month ago

The pod should have details on it when it crashes. This is one example:

   State:          Running
      Started:      Thu, 11 Jul 2024 18:04:25 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 11 Jul 2024 14:01:41 +0200
      Finished:     Thu, 11 Jul 2024 18:04:24 +0200
    Ready:          True
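
A sketch of pulling just that last-state information for every ingester at once (namespace and label selector are assumptions; adjust to your deployment):

    # Sketch: print the last termination reason and exit code for each ingester pod.
    kubectl get pods -n mimir -l app.kubernetes.io/component=ingester \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{end}'
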
dimitarvdimitrov commented 1 month ago

Can you share all the logs of the ingester before it enters the "frozen" state?
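
For reference, a sketch of collecting those logs, including the previous container's logs if the pod was OOM-killed and restarted (the namespace and pod name are examples; adjust to the ingester that went unhealthy):

    # Sketch: capture current and previous-container logs for an affected ingester.
    kubectl logs -n mimir mimir-ingester-zone-a-3 --timestamps > ingester-current.log
    kubectl logs -n mimir mimir-ingester-zone-a-3 --previous --timestamps > ingester-previous.log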