grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Cross-tenant queries produce cardinality estimation cache keys that are too long #6931

Open chencs opened 11 months ago

chencs commented 11 months ago

Describe the bug

When running cross-tenant queries with the query-frontend results cache enabled, the query caching middleware concatenates all tenant IDs into the cache key (code). Memcached enforces a 250-byte limit on keys (code), so with enough tenants (or long enough tenant IDs) the key can exceed that limit, at which point memcached gets fail.
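A minimal Go sketch of the problem, assuming an illustrative key layout (the `tenant|tenant|...:suffix` format here is made up, not Mimir's actual key construction):

```go
package main

import (
	"fmt"
	"strings"
)

// memcachedMaxKeyLen is the key length limit enforced by memcached.
const memcachedMaxKeyLen = 250

// cacheKeyForTenants uses an illustrative layout: all tenant IDs joined
// with '|', followed by a suffix for the rest of the key.
func cacheKeyForTenants(tenantIDs []string, suffix string) string {
	return strings.Join(tenantIDs, "|") + ":" + suffix
}

func main() {
	// 20 tenants with ~20-character names already blow past the limit.
	tenants := make([]string, 20)
	for i := range tenants {
		tenants[i] = fmt.Sprintf("long-tenant-name-%03d", i)
	}

	key := cacheKeyForTenants(tenants, "819455e299e84fc0")
	fmt.Printf("key length = %d (limit %d), too long = %v\n",
		len(key), memcachedMaxKeyLen, len(key) > memcachedMaxKeyLen)
}
```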

To Reproduce

Steps to reproduce the behavior:

  1. Create multiple tenants whose names, concatenated, exceed 250 bytes in length.
  2. Query across all tenants using the wildcard tenant.
  3. See malformed key logs on memcached fetches.

Expected behavior

Keys should be short enough that they don't exceed the 250-byte limit. Perhaps that means that cache keys should be created and looked up on a per-tenant basis?

56quarters commented 11 months ago

One potential solution is to sort the tenant IDs (the parsing logic might already do this), hash them, and use the hash in the cache key instead of the tenant IDs directly.
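A minimal sketch of that idea (the function name and the choice of SHA-256 are assumptions for illustration, not Mimir code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// tenantHash sorts the tenant IDs so the result is order-independent,
// hashes the joined list, and returns a fixed-length hex digest that can
// be used in the cache key instead of the raw IDs.
func tenantHash(tenantIDs []string) string {
	sorted := append([]string(nil), tenantIDs...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, "|")))
	return hex.EncodeToString(sum[:])
}

func main() {
	// The same set of tenants always yields the same 64-character digest,
	// regardless of how many tenants there are or how long their names are.
	fmt.Println(tenantHash([]string{"tenant-b", "tenant-a", "tenant-c"}))
	fmt.Println(tenantHash([]string{"tenant-c", "tenant-a", "tenant-b"}))
}
```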

dimitarvdimitrov commented 11 months ago

The cache key is hashed and the hash is used as the memcached key. The full key is stored as part of the cached item to check for hash collisions, but it's not included verbatim in the memcached key.

https://github.com/grafana/mimir/blob/efa26e743027bf495c340a893939b2671ba05959/pkg/frontend/querymiddleware/split_and_cache.go#L458
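A sketch of that scheme, assuming an in-memory map as a stand-in for memcached (this is not the actual split_and_cache.go code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// item stores the full cache key alongside the value so that a hash
// collision can be detected on read.
type item struct {
	Key   string // full, un-hashed cache key
	Value []byte
}

var store = map[string]item{} // stand-in for memcached

func hashKey(fullKey string) string {
	sum := sha256.Sum256([]byte(fullKey))
	return hex.EncodeToString(sum[:]) // fixed length, safe for memcached
}

func set(fullKey string, value []byte) {
	store[hashKey(fullKey)] = item{Key: fullKey, Value: value}
}

func get(fullKey string) ([]byte, bool) {
	it, ok := store[hashKey(fullKey)]
	if !ok || it.Key != fullKey { // collision check against the stored key
		return nil, false
	}
	return it.Value, true
}

func main() {
	set("tenant-a|tenant-b:up:1700000000:1700003600", []byte("cached response"))
	v, ok := get("tenant-a|tenant-b:up:1700000000:1700003600")
	fmt.Println(ok, string(v))
}
```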

chencs commented 11 months ago

Ah, you're right, and the hashed key gets read back out here; I missed that. So this isn't a problem for the results cache, but it does appear to be a problem for some other cache, most likely the cardinality estimation cache, since I see the QS prefix in the logs.

rnerkar commented 11 months ago

It seems we have been hitting the same issue since adding new tenants two days ago.

We see the following in the query-frontend logs:

{"time":"2023-12-19T17:27:04.952051693Z","message":"ts=2023-12-19T17:27:04.950734513Z caller=memcached_client.go:335 level=warn name=frontend-cache msg=\"failed to fetch items from memcached\" numKeys=1 firstKey=1@QS:||...|:819455e299e84fc0:236529:0 err=\"malformed: key is too long or contains invalid characters\"", "tenant":"somename"}

When I check the combined length of all the tenant IDs, it exceeds 250 characters. @chencs are you seeing similar log entries?

chencs commented 11 months ago

@rnerkar yes, except my logs don't include that tenant key. My understanding is that the cardinality estimation cache is used to limit the number of shards a query can be broken into for parallelization (code).

From reading this issue, which describes the goals of implementing cardinality estimation for query sharding, I believe the failure mode (not being able to fetch cardinality estimation cache keys) can result in unnecessary over-sharding (a nuisance in terms of overhead), but not in unnecessary under-sharding (which could cause OOMs and query failures).
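A rough sketch of how an estimate could drive the shard count (illustrative only; the function, parameters, and fallback behaviour are assumptions, not Mimir's implementation):

```go
package main

import "fmt"

// shardCount picks enough shards to keep the per-shard series count under a
// target, but never more than the configured maximum. With no estimate
// available, it falls back to the maximum, i.e. the "over-sharding" failure
// mode described above.
func shardCount(estimatedSeries, targetSeriesPerShard, maxShards int) int {
	if estimatedSeries <= 0 {
		return maxShards // no estimate: shard as much as allowed
	}
	shards := (estimatedSeries + targetSeriesPerShard - 1) / targetSeriesPerShard
	if shards > maxShards {
		return maxShards
	}
	return shards
}

func main() {
	fmt.Println(shardCount(5_000, 100_000, 16)) // small query: 1 shard
	fmt.Println(shardCount(0, 100_000, 16))     // estimate unavailable: 16 shards
}
```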

rnerkar commented 11 months ago

Thanks for explaining, @chencs. So, with the understanding that these errors have no adverse effect on Mimir's performance other than small queries getting over-sharded: do we have a fix coming, or is there something to be done on our end? We have discussed using Redis instead of Memcached to work around this issue, but that's not finalized yet. I'll wait to hear back from you. Thanks in advance!

56quarters commented 11 months ago

Do we have a fix coming, or is there something to be done on our end?

We apply a hash to keys elsewhere to keep them a consistent length; we'll do the same here. No ETA, but it should be a simple fix.

We have discussed using Redis instead of Memcached to work around this issue, but that's not finalized yet. I'll wait to hear back from you. Thanks in advance!

We don't recommend using Redis since we (Grafana) don't use it in production and it's not part of the Helm chart or Jsonnet.

nabadger commented 8 months ago

We're hitting this issue as well.

Note that it also seems to increment thanos_memcached_operation_failures_total, so we actually get alerted on this.

rishabhkumar92 commented 3 months ago

We are hitting this issue as well. I saw that the related change was closed. Is there any planned work to address this?

chencs commented 3 months ago

A fix for this is in the backlog but not currently scheduled, since the quick fix proposed above did not turn out to be as simple as expected. @56quarters did list some alternative options in the related PR before closing it, though.