One potential issue is that overlaps are not handled very well: https://github.com/thanos-io/thanos/pull/2400/files
It could be related to the fact that GitLab is using Store GW in HA without a load balancer (querying both at the same time), so the data is definitely duplicated and the chunks unsorted.
@bwplotka We initially reproduced the issue going directly to the Prometheus sidecars, not the Store GW. These are in HA pairs as well.
Got some test data from @SuperQ, so hopefully I will be able to repro this locally :hugs:
Fingers crossed :crossed_fingers:
Looks like it is a dup of https://github.com/thanos-io/thanos/issues/1326 - let's continue the discussion there.
Actually, let's not be so sure; this might be different (here, deduplication does not cause it).
@SuperQ this repro is amazing, I can explore all the details. We definitely have overlapping and unsorted chunks. We should be able to find the problem in our algorithm soon, thanks!
BTW... I kind of overengineered (as you can imagine) and wrote `thanos tools storeapi serve --json=<file x>`, which can serve a JSON (protobuf-based) file as a Store API :tada:
So I can take your file (actually anything generated by `grpcurl`), put it into `thanos tools storeapi serve --json`, run a querier with `storeapi serve` connected as `--store`, and see your results:
The tooling looks like it works, but I think we don't have enough chunks to repro it :thinking:
Tried all sorts of time ranges, steps, and rate ranges... no luck:
All good everywhere...
cc @SuperQ, can you send me a bit wider time span? :thinking: Is this for sure the time span you can reproduce the problem with? What if this is caching in some layer above the Thanos Querier?
Here's another data sample that reproduces it. The time range to reproduce is this:
rate(gitlab_transaction_cache_read_hit_count_total[30m])
When I turn off dedupe, the issue goes away:
I don't see the full data @SuperQ (you gave only half of it, I think), but I can repro :tada:
Investigating
Step does not matter, it's a deduplication bug:
Finally, found the cause: https://github.com/thanos-io/thanos/pull/2528
Essentially there are 3 exactly identical overlapping chunks with stale markers. For some reason the counter iterator, deduplication, or bounded iterator does not handle this well.
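For reference, a rough sketch of that input shape (plain Go slices standing in for real TSDB chunks; the actual minimal repro data is in the PR above):

```go
// Illustrative only: three identical, fully overlapping chunks for the same
// series, each ending in a staleness marker, concatenated as a Store API
// client might hand them to the querier's iterators.
package main

import (
	"fmt"
	"math"
)

// staleNaN is the bit pattern Prometheus uses for staleness markers.
var staleNaN = math.Float64frombits(0x7ff0000000000002)

type sample struct {
	t int64
	v float64
}

func main() {
	chunk := []sample{{1000, 42}, {2000, 43}, {3000, staleNaN}}

	// The response effectively contained the same chunk three times, so a
	// naive concatenation yields duplicated, overlapping, unsorted samples.
	series := append(append(append([]sample{}, chunk...), chunk...), chunk...)

	for _, s := range series {
		fmt.Printf("t=%d v=%v stale=%v\n", s.t, s.v, math.Float64bits(s.v) == 0x7ff0000000000002)
	}
	// The counter, dedup, and bounded iterators all have to cope with this input.
}
```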
Found the root cause and described a minimal repro case: https://github.com/thanos-io/thanos/pull/2528#issuecomment-620659141
Closing https://github.com/thanos-io/thanos/issues/1326 as a duplicate; now we know for sure.
Not super clear how to fix the issue long term (: Some deep dive https://docs.google.com/spreadsheets/d/13A8ChunqbVdRq9j5kqrtfzknO6mvVFQPUkXwUiBuV_4/edit?usp=sharing
Let's say two replicas scrape the same counter from the same application.
Accounting for counter resets correctly in a generic deduplication algorithm is really hard, as presented in this spreadsheet. This is because each replica has a different view of the END value of the counter (different scrape times!).
Crafting a deduplication algorithm when we know it's a counter metric is fairly easy. The problem is... we don't know. So ideally we need a generic dedup algorithm for replicas.
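To illustrate the problem from the spreadsheet, here is a toy sketch (not Thanos code, and made-up numbers): two replicas scrape the same counter at offset times, each replica corrects its own resets based on the last value it saw, and the two corrected series then disagree, so the merged result is no longer monotone:

```go
// Toy model of why replica deduplication of counters is hard.
// Assumption: resets are corrected per replica (as a per-series counter
// iterator would do) and the corrected series are merged afterwards.
package main

import (
	"fmt"
	"sort"
)

type sample struct {
	t int64   // scrape timestamp (seconds)
	v float64 // counter value
}

// adjustResets rewrites one replica's counter so it never decreases,
// adding the last seen value whenever a drop (reset) is detected.
func adjustResets(in []sample) []sample {
	out := make([]sample, len(in))
	var offset, last float64
	for i, s := range in {
		if s.v < last {
			offset += last // assume a counter reset happened here
		}
		last = s.v
		out[i] = sample{s.t, s.v + offset}
	}
	return out
}

func main() {
	// The true counter grows 1/s and resets to 0 at t=100.
	a := []sample{{0, 0}, {60, 60}, {120, 20}, {180, 80}}     // replica A scrapes at t=0,60,...
	b := []sample{{30, 30}, {90, 90}, {150, 50}, {210, 110}}  // replica B scrapes at t=30,90,...

	merged := append(adjustResets(a), adjustResets(b)...)
	sort.Slice(merged, func(i, j int) bool { return merged[i].t < merged[j].t })

	for _, s := range merged {
		fmt.Printf("t=%3d adjusted=%g\n", s.t, s.v)
	}
	// A's correction used END value 60, B's used 90, so the merged series
	// drops from 90 (t=90) to 80 (t=120): rate() sees a reset that never happened.
}
```

Replica A last saw 60 before the reset while replica B last saw 90, so their corrections differ by 30 and the merged stream dips, which `rate()` then reads as an extra reset.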
Any ideas @brancz @brian-brazil @beorn7 @SuperQ ? (:
The current idea is to have special deduplication for counters. Generally, we don't know whether a metric is a counter at the offline level (unless it's downsampled data, then we know). However, for the query path it's clear: it is a counter if a `rate`-like function was in the hints from PromQL. So we can use a special counter-based dedup there.
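A rough sketch of that query-time switch (names here are hypothetical, not actual Thanos code; the only assumption is that the storage layer can see something like Prometheus' `SelectHints`, whose `Func` field carries the surrounding PromQL function name):

```go
package main

import "fmt"

// queryHints mirrors the shape of the hints PromQL passes down to storage.
type queryHints struct {
	Func  string // surrounding PromQL function, e.g. "rate", "increase", "irate"
	Range int64  // range in milliseconds, if any
}

// looksLikeCounter reports whether the hints tell us the series is being used
// as a counter, in which case a counter-aware dedup could be selected instead
// of the generic penalty-based one.
func looksLikeCounter(h queryHints) bool {
	switch h.Func {
	case "rate", "increase", "irate", "resets":
		return true
	}
	return false
}

func main() {
	h := queryHints{Func: "rate", Range: 30 * 60 * 1000} // e.g. rate(metric[30m])
	if looksLikeCounter(h) {
		fmt.Println("use counter-aware deduplication")
	} else {
		fmt.Println("use generic replica deduplication")
	}
}
```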
At the offline rewrite / deduplication level, for raw data, we have no idea what the type is. However, for a quick win we could for now not worry about offline dedup yet and just solve the query-time issue.
Then maybe we can figure out something else for the offline case: a generic dedup that works for those, or something based on the `_total` metric name suffix (but that's sketchy).
Or... we could collaborate on a different dedup algorithm for the future. Maybe scrape-interval based? (downside: what if the scrape interval changes?)
Help wanted for review!
> (downside: What if scrape interval changes)
That's not common, but you could depend on no one having a scrape interval over 2 minutes, as that's not sane for other reasons.
That's quite a good idea :+1: (for https://github.com/thanos-io/thanos/issues/2547)
It actually saddens me that Prometheus “by design” doesn't really cope with scrape intervals >2m. I'd love to see future Prometheus versions lifting that arbitrary limit, and I'd therefore prefer if Thanos didn't bake that limit into its own design, too.
Interestingly, I'd also love to see a future Prometheus version have first-class support for metric types. That would then also solve your problem of how to safely recognize a counter.
Found by GitLab; we were investigating offline with @SuperQ.
Their issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9293
This is only reproducible with large rate ranges (`[30m]`+), which means it has to do with chunk ordering or overlaps.