grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Docs: Add runbook for memcached "out of memory storing object" error #2235

Open zenador opened 2 years ago

zenador commented 2 years ago

Is your documentation request related to a feature? If so, which one?

The Memcached client can fail to store an item in the cache if the cache memory is full and memory can't be reclaimed. Debug logging must be enabled for the error to be shown:

level=debug ts=2022-06-23T10:28:27.220696365Z caller=memcached_client.go:406 name=frontend-cache msg="failed to store item to memcached" key=1@b2ae91c4319dafc4 sizeBytes=86848 server=10.70.1.208:11211 err="memcache: unexpected response line from \"set\": \"SERVER_ERROR out of memory storing object\\r\\n\""

The above is logged from here.

Hypothesis:

Memcached 1.5 and above use a segmented LRU by default (blog post). Items can be evicted by a background routine if they’re expired, or directly during a set operation if the cache memory is full. The latter operation is called a “direct reclaim”.

The query-frontend caches results with a 7d TTL. Since the load test has run for less than 7d, presumably all evictions are caused by direct reclaims. Unfortunately, the “direct_reclaims” Prometheus metric is not exposed, but the counter can be read with the stats command on memcached.

Looking at the memcached code, it seems that when you issue a set command and the cache is full, it tries to evict up to 10 items. If not enough room is made after 10 items have been freed, it returns “out of memory storing object”: https://github.com/memcached/memcached/blob/046c4bb5d8498420c13e5357c8299b60952b2595/items.c#L184

The hypothesis is that once the cache is full, we get an “out of memory storing object” each time we try to store an item which is bigger than the sum of the 10 least recently used items.
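To make the hypothesis concrete, here is a toy model of it. It is an assumption-laden sketch, not memcached's allocator: storeFails is a hypothetical function, it ignores slab classes entirely, and it only encodes the rule "free at most N items from the LRU tail, then give up". The tail item sizes are invented; the 86848-byte item size is the one from the log line in this issue.

```go
package main

import "fmt"

// storeFails models the hypothesis only: memory is full, and a store may free
// at most maxEvictions items from the LRU tail before giving up.
// tailSizes lists item sizes in bytes, least recently used first.
func storeFails(tailSizes []int, itemSize, maxEvictions int) bool {
	freed := 0
	for i := 0; i < len(tailSizes) && i < maxEvictions; i++ {
		freed += tailSizes[i]
		if freed >= itemSize {
			return false // enough room was reclaimed
		}
	}
	return true // gave up: "out of memory storing object"
}

func main() {
	// Invented tail: twelve small items of 1 KiB each.
	tail := make([]int, 12)
	for i := range tail {
		tail[i] = 1024
	}
	fmt.Println(storeFails(tail, 86848, 10)) // true: 10 KiB freed is less than 86848 bytes
	fmt.Println(storeFails(tail, 8192, 10))  // false: 8 evictions free 8192 bytes
}
```

If the hypothesis holds, failures should correlate with large items being stored while the LRU tail is dominated by small ones, which is something a reproduction could check.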

Describe the solution that you’d like or the expected outcome

First, try to reproduce the issue and test the above hypothesis.

Then, write a runbook on how to handle this error.

osg-grafana commented 11 months ago

@cristiangsp and @osg-grafana agree that this is Engineering driven, and therefore I am removing this from the Docs Team backlog.