grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Ruler Pods OOM/spike in memory observed with warning log closing ingester client stream failed #8134

Open sivesh1989 opened 5 months ago

sivesh1989 commented 5 months ago

Issue: Ruler pods OOM / memory spikes observed with the warning "closing ingester client stream failed".
Mimir version: 2.11.0
Deployment: microservices mode
Environment: AKS cluster

We observe spikes in ruler pod memory utilization, and at the same time see the warning below:

```
method=Distributor.queryIngesterStream user=abc level=warn msg="closing ingester client stream failed" err="timed out waiting to exhaust stream after calling CloseSend, will continue exhausting stream in background"
```

Ruler pod memory utilization graph: [image attached]

Kindly let us know if there is a bug in how the rulers clean up resources, or help us resolve the issue.

GrgDev commented 1 month ago

We're running into this issue as well. We are currently looking into a couple of options to help and can report back later if they work.

There are two main things we're looking into:

  1. Split rule groups up more aggressively so each group is smaller. I believe it is rule groups, not individual rules, that get assigned to each ruler. If you have a rule group that's particularly heavy or large, a single ruler needs correspondingly more memory to handle it.
  2. Switch the rulers to remote evaluation mode. This should offload most of the resource consumption from the rulers to the queriers. However, when we last tried this, the new CPU load on the queriers was far more extreme than we expected, and we ended up rolling back the change.
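As a rough illustration of option 1, here is a minimal sketch of splitting an oversized rule group into several smaller groups before loading them into the ruler. The helper name, group names, and the `max_rules` threshold are all hypothetical, not anything from Mimir itself:

```python
# Hypothetical helper: split one large Prometheus-style rule group (as a dict
# matching the rule-file YAML structure) into several smaller groups, so the
# ruler can spread the evaluation load across more shards.

def split_rule_group(group, max_rules):
    """Return a list of groups, each holding at most max_rules rules."""
    rules = group["rules"]
    chunks = [rules[i:i + max_rules] for i in range(0, len(rules), max_rules)]
    return [
        {
            "name": f"{group['name']}-part{idx}",          # derived group names
            "interval": group.get("interval", "1m"),       # keep the original interval
            "rules": chunk,
        }
        for idx, chunk in enumerate(chunks, start=1)
    ]

# Example: a made-up group with 10 recording rules, split into chunks of 4.
big_group = {
    "name": "big-group",
    "interval": "1m",
    "rules": [
        {"record": f"job:metric:rate5m_{i}", "expr": "rate(metric[5m])"}
        for i in range(10)
    ],
}

smaller = split_rule_group(big_group, max_rules=4)
for g in smaller:
    print(g["name"], len(g["rules"]))  # big-group-part1 4 ... big-group-part3 2
```

The resulting groups can then be serialized back to rule-file YAML. Note that rules within one group evaluate sequentially and can depend on each other, so splitting should keep interdependent rules together.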