Open tredman opened 3 months ago
I fired up tracing and looked into a few queries with this issue. The traces are pretty huge but eyeballing it, this error consistently occurs in FilterChunkRefs calls that have at least one resultsCache hit.
This led me to run an experiment of setting:
```yaml
bloom_gateway:
  client:
    cache_results: false
```
And I am no longer able to reproduce this error. So my next question is: is this a bug, or do I simply have the bloom gateway results cache misconfigured? The Loki chart does not (and did not) have an obvious way to configure the `bloom_gateway` or `bloom_compactor` sections of the Loki config, so we had to set this up by hand and reused the results cache that the queriers use. Is it possible these need to be separate caches? The chart does not spin up a separate bloom gateway results cache (that I can see), so I'll try to hack that together and report back.
I noticed we had configured the results cache for the bloom gateway to point at the chunks cache, so I updated that from:
```yaml
bloom_gateway:
  client:
    cache_results: true
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-chunks-cache.observability.svc
```

to

```yaml
bloom_gateway:
  client:
    cache_results: true
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-results-cache.observability.svc
```
The error came back, so I started up an entirely different results cache (basically just cloned the results-cache StatefulSet with a new name and matching set of labels):
```yaml
bloom_gateway:
  client:
    cache_results: true
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-bloom-results-cache.observability.svc
```
and am still getting the above error. So it looks like this is somehow related to having the results cache enabled, but it's not clear whether it's a config problem or a bug. I've turned off bloom gateway results caching for now.
I'm getting the same issue on 3.1.0. But I'm also seeing a panic in the bloom-gateway at the same time as the index-gateway, with a slightly different error:
Disabling `results_cache` also fixes the issue for me.
**Describe the bug**
In the index-gateway, `bloomquerier.FilterChunkRefs` appears to panic because more "postFilter" chunks are returned than "preFilter" chunks. The actual panic is in the prometheus `Counter.Add` call, which panics if the value passed to it is less than 0. With debug logging enabled, I am able to see that `preFilterChunks` is sometimes smaller than `postFilterChunks`. Glancing at the code, the panic occurs when `filteredChunks` is computed, the value is < 0, and it is added to the prometheus counter. Here are some examples of `FilterChunkRefs` calls that appear to return filteredChunks values < 0. This causes the query to fail, but it doesn't occur consistently.
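A minimal sketch of the failure mode described above: this is not Loki's actual code, and the type and variable names below are illustrative. The `counter` type merely mimics the prometheus `Counter` contract, whose `Add` method panics when passed a negative delta.

```go
package main

import "fmt"

// counter mimics the prometheus Counter contract: Add panics when given
// a negative delta. (Sketch only; not Loki's actual metrics code.)
type counter struct{ v float64 }

func (c *counter) Add(delta float64) {
	if delta < 0 {
		panic("counter cannot decrease in value")
	}
	c.v += delta
}

func main() {
	// Hypothetical values matching the bug report: the bloom gateway
	// unexpectedly reports more chunks after filtering than before.
	preFilterChunks, postFilterChunks := 10, 12

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from panic:", r)
		}
	}()

	filteredChunks := preFilterChunks - postFilterChunks // -2
	var filtered counter
	filtered.Add(float64(filteredChunks)) // negative delta -> panic, query fails
	fmt.Println("not reached")
}
```

With a negative `filteredChunks`, the `Add` call panics and the surrounding query fails, which matches the intermittent errors seen whenever `postFilterChunks` exceeds `preFilterChunks`.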
**To Reproduce**
We're running the latest pre-release build for 3.1.0: `k208-ede6941` - was also able to reproduce this issue in the last release, `k207`. Here's a query we're running that triggers this. It only occurs when we're searching time periods that are covered by bloom filters - so the most recent data doesn't seem to trigger the issue, but if I run a query from `now-48h` to `now-47h` I can repro this.

**Expected behavior**
I would expect this query to run reliably, leveraging the bloom filters to filter out chunks that aren't needed in the search.
**Environment:**
**Screenshots, Promtail config, or terminal output**
Here is our Loki config for reference: