chencs opened this issue 11 months ago
One potential solution is to sort the tenant IDs (the parsing logic may already do this), hash them, and use the hash in the cache key instead of the tenant IDs directly.
The cache key is then hashed, and the hash is what's used to key memcached. The full key is stored as part of the cached item so hash collisions can be detected, but it isn't included verbatim in the memcached key.
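A minimal sketch of what that could look like, assuming hypothetical names (`hashedCacheKey`, `cachedItem`), a `|` separator between tenant IDs, and SHA-256 as the hash; this is not Mimir's actual implementation:

```go
package cachekey

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// hashedCacheKey builds a fixed-length memcached key from an arbitrary number
// of tenant IDs plus the rest of the cache key (time range, query, etc.).
func hashedCacheKey(tenantIDs []string, rest string) (memcachedKey, fullKey string) {
	ids := append([]string(nil), tenantIDs...)
	sort.Strings(ids) // make the key independent of tenant ordering

	fullKey = strings.Join(ids, "|") + ":" + rest
	sum := sha256.Sum256([]byte(fullKey))

	// 64 hex characters plus a short prefix: always well under memcached's 250-byte limit.
	// The "QS:" prefix here is only illustrative.
	memcachedKey = "QS:" + hex.EncodeToString(sum[:])
	return memcachedKey, fullKey
}

// A cached item would carry the full key so that, on a hit, the caller can
// compare it against the expected key and treat a mismatch as a collision (miss).
type cachedItem struct {
	FullKey string
	Value   []byte
}
```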
It seems we have been hitting the same issue since adding new tenants two days ago.
We see the following in the query-frontend logs:
{"time":"2023-12-19T17:27:04.952051693Z","message":"ts=2023-12-19T17:27:04.950734513Z caller=memcached_client.go:335 level=warn name=frontend-cache msg=\"failed to fetch items from memcached\" numKeys=1 firstKey=1@QS:
When I check the combined length of all the tenant IDs, it exceeds 250 characters. @chencs Are you seeing similar log entries?
@rnerkar yes, except my logs don't include that tenant key. My understanding is that the cardinality estimation cache is used to limit the number of shards a query can be broken into for parallelization (code).
From reading the issue describing the goals for implementing cardinality estimation for query sharding, I believe this failure mode (being unable to fetch cardinality estimation cache keys) can result in unnecessary over-sharding (a nuisance in terms of overhead), but not in under-sharding (which could cause OOMs and query failures).
Thanks for explaining, @chencs. So, with the understanding that these errors have no adverse effect on Mimir's performance other than small queries getting over-sharded: do we have a fix coming in, or is there something to be done from our end? We have had talks about using Redis instead of Memcached to get around this issue, but that's still not finalized. Will wait until I hear back from you. Thanks in advance!
> Do we have a fix coming in, or is there something to be done from our end?
We apply a hash to keys elsewhere to keep them a consistent length; we'll do the same here. No ETA, but it should be a simple fix.
> We have had talks about using Redis instead of Memcached to get around this issue, but that's still not finalized. Will wait until I hear back from you. Thanks in advance!
We don't recommend using Redis since we (Grafana) don't use it in production and it's not part of the Helm chart or Jsonnet.
We're hitting this issue as well.
Note that it also seems to increment thanos_memcached_operation_failures_total, so we actually get alerted on this.
We are also hitting this issue. I saw that the related change was closed. Is there any planned work to address this?
A fix for this is in the backlog but not currently scheduled, since the quick fix proposed above turned out not to be as simple as expected. @56quarters did list some alternative options in the related PR before closing it, though.
Describe the bug
When running cross-tenant queries with the query-frontend results cache enabled, the query caching middleware concatenates all tenant IDs into the cache key (code). Memcached keys have a 250-byte limit (code), so with enough tenants (or long enough tenant IDs) the key can exceed that limit, at which point gets fail.
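As a rough, back-of-the-envelope illustration (the exact key layout below is assumed, not Mimir's real format), the key length grows with the number and length of the tenant IDs:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// e.g. 10 tenants with ~30-character IDs
	tenants := make([]string, 10)
	for i := range tenants {
		tenants[i] = fmt.Sprintf("org-%026d", i)
	}

	// Joining them into a single cache key (separator and suffix are illustrative)
	key := strings.Join(tenants, "|") + ":QS:2023-12-19:up"
	fmt.Println(len(key), len(key) > 250) // 326 true — exceeds memcached's 250-byte key limit
}
```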
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Keys should be short enough that they don't exceed the 250-byte limit. Perhaps that means that cache keys should be created and looked up on a per-tenant basis?
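If the per-tenant approach turned out to be workable (it's an open question whether cross-tenant results can actually be cached and merged per tenant), the key derivation itself would be trivial; a sketch with hypothetical names:

```go
package cachekey

// perTenantKeys derives one short key per tenant instead of a single key that
// concatenates every tenant ID, so each key stays well under the 250-byte limit.
func perTenantKeys(tenantIDs []string, rest string) []string {
	keys := make([]string, 0, len(tenantIDs))
	for _, id := range tenantIDs {
		keys = append(keys, id+":"+rest)
	}
	return keys
}
```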