grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Ruler Pods OOM/spike in memory observed with warning log closing ingester client stream failed #8134

Open sivesh1989 opened 5 months ago

sivesh1989 commented 5 months ago

Issue: Ruler pods OOM / memory spikes observed with the warning "closing ingester client stream failed".
Mimir version: 2.11.0
Deployment: microservices mode
Environment: AKS cluster

We observe spikes in ruler pod memory utilization, and at the same time see the warning below:

```
method=Distributor.queryIngesterStream user=abc level=warn msg="closing ingester client stream failed" err="timed out waiting to exhaust stream after calling CloseSend, will continue exhausting stream in background"
```

Ruler pod memory utilization graph: [image attached]

Kindly let us know if there is a bug in how the rulers clean up resources, or help us resolve the issue.

GrgDev commented 1 month ago

We're running into this issue as well. We are currently looking into a couple of options to help and can report back later if they work.

There are two main things we're looking into:

  1. Split rule groups up more aggressively so each group is smaller. I believe it is rule groups, not individual rules, that get assigned to each ruler. If you have a rule group that's particularly heavy or large, a single ruler needs correspondingly more memory to handle it.
  2. Switch the rulers to remote evaluation mode. This should offload most of the resource consumption from the rulers to the queriers. However, when we last tried this, the new CPU load on the queriers was far more extreme than we expected, and we ended up rolling back the change.
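As a rough illustration of option 1, here is a minimal sketch of splitting an oversized rule group into several smaller groups before loading them into the ruler. The helper name, group names, and the `max_rules` threshold are all hypothetical, not anything from Mimir itself:

```python
# Hypothetical helper: split one large Prometheus-style rule group (as a dict
# matching the rule-file YAML structure) into several smaller groups, so the
# ruler can spread the evaluation load across more shards.

def split_rule_group(group, max_rules):
    """Return a list of groups, each holding at most max_rules rules."""
    rules = group["rules"]
    chunks = [rules[i:i + max_rules] for i in range(0, len(rules), max_rules)]
    return [
        {
            "name": f"{group['name']}-part{idx}",          # derived group names
            "interval": group.get("interval", "1m"),       # keep the original interval
            "rules": chunk,
        }
        for idx, chunk in enumerate(chunks, start=1)
    ]

# Example: a made-up group with 10 recording rules, split into chunks of 4.
big_group = {
    "name": "big-group",
    "interval": "1m",
    "rules": [
        {"record": f"job:metric:rate5m_{i}", "expr": "rate(metric[5m])"}
        for i in range(10)
    ],
}

smaller = split_rule_group(big_group, max_rules=4)
for g in smaller:
    print(g["name"], len(g["rules"]))  # big-group-part1 4 ... big-group-part3 2
```

The resulting groups can then be serialized back to rule-file YAML. Note that rules within one group evaluate sequentially and can depend on each other, so splitting should keep interdependent rules together.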