querier: Rate over deduplicated counter from many replicas can lead to double reset account. #2401

Closed bwplotka closed 4 years ago

Found by GitLab; we investigated it offline with @SuperQ.

Their issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9293

This is only reproducible with large rate ranges ([30m] and up), which suggests it has to do with chunk ordering or overlaps.

One potential issue is that overlaps are not handled well: https://github.com/thanos-io/thanos/pull/2400/files

It could also be related to the fact that GitLab runs Store GW in HA without a load balancer (querying both at the same time), so the data (chunks) is certainly duplicated and unsorted.

@bwplotka We initially reproduced the issue by querying the Prometheus sidecars directly, not Store GW. These are in HA pairs as well.

Got some test data from @SuperQ so hopefully will be able to repro locally :hugs:

Fingers crossed :crossed_fingers:

Looks like this is a duplicate of https://github.com/thanos-io/thanos/issues/1326 - let's continue the discussion there.

Actually let's not be so sure, this might be different (here deduplication does not cause it)

@SuperQ this repro is amazing, I can explore all the details. We definitely have overlapping and unsorted chunks. We should be able to find the problem in our algorithm soon, thanks!

BTW... I kind of overengineered this (as you can imagine) and wrote thanos tools storeapi serve --json=<file x>, which can serve a JSON file (protobuf based) as a Store API :tada:

So I can take your file (actually anything generated by grpcurl), put it into thanos tools storeapi serve --json, run the querier, connect storeapi serve as --store, and see your results:

Tooling looks like it works, but I think we don't have enough chunks to repro it :thinking:

Tried all sorts of time ranges, steps and rate ranges.. no luck:

All good everywhere...

cc @SuperQ, can you send me a slightly wider time span? :thinking: Are you sure this is the time span you can reproduce the problem with? What if this is caching in some layer above Thanos Querier?

Here's another data sample that reproduces it. The time range to reproduce is this:

End time: 2020-04-24 04:00
Range: 4h
Step: 1800
Query: rate(gitlab_transaction_cache_read_hit_count_total[30m])

data-web-08.json.gz
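For anyone replaying the parameters above against their own querier, here is a minimal sketch that turns them into two range-query URLs, one with deduplication and one without. The querier address is hypothetical, the end time is assumed to be UTC (the comment does not say), and the `dedup` request parameter is assumed to be available on the querier's `query_range` endpoint.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption: hypothetical querier address; replace with your own.
QUERIER = "http://localhost:10902"
QUERY = "rate(gitlab_transaction_cache_read_hit_count_total[30m])"


def query_range_url(end, lookback, step_seconds, query, dedup):
    """Build a query_range URL covering [end - lookback, end]."""
    start = end - lookback
    params = {
        "query": query,
        "start": int(start.timestamp()),
        "end": int(end.timestamp()),
        "step": step_seconds,
        # Assumption: the querier accepts a boolean `dedup` parameter.
        "dedup": "true" if dedup else "false",
    }
    return f"{QUERIER}/api/v1/query_range?{urlencode(params)}"


# Assumption: "2020-04-24 04:00" is interpreted as UTC.
end = datetime(2020, 4, 24, 4, 0, tzinfo=timezone.utc)
url_dedup = query_range_url(end, timedelta(hours=4), 1800, QUERY, dedup=True)
url_raw = query_range_url(end, timedelta(hours=4), 1800, QUERY, dedup=False)
print(url_dedup)
print(url_raw)
```

Fetching both URLs and comparing the results side by side is one way to check whether a spike only shows up in the deduplicated view.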

When I turn off dedupe, the issue goes away:

I don't see the full data @SuperQ (I think you gave only half of it), but I can repro :tada:

Investigating

Step does not matter, it's a deduplication bug:

Finally found the cause: https://github.com/thanos-io/thanos/pull/2528

Essentially, there are 3 overlapping chunks that are exactly the same, with stale markers. For some reason the counter iterator, the deduplication, or the bounded iterator does not handle this well.

Found the root cause and described a minimal repro case: https://github.com/thanos-io/thanos/pull/2528#issuecomment-620659141

Closing https://github.com/thanos-io/thanos/issues/1326 as duplicate, now we know for sure.

It's not super clear how to fix the issue long term (: Here is a deep dive: https://docs.google.com/spreadsheets/d/13A8ChunqbVdRq9j5kqrtfzknO6mvVFQPUkXwUiBuV_4/edit?usp=sharing

TL;DR: The problem is with deduplicating a counter series from 2 or more Prometheus replicas.

Let's say they scrape the same counter from the same application.

Accounting for resets correctly in a generic deduplication algorithm is really hard, as presented in this spreadsheet. This is because each replica has a different view of the END value of each counter (different scrape times!).

Crafting a deduplication algorithm when we know it's a counter metric is quite trivial. The problem is... we don't know. So ideally we need a generic dedup algorithm for replicas.
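To make the cost of a mismatched view concrete, here is a small sketch of rate-style reset handling, where any drop in value is treated as a counter reset and the post-drop value is counted in full. The sample values are invented for illustration, and the sketch simply assumes that the merged series contains one sample from another replica that is slightly behind the previous one; how such a drop arises in practice is what the spreadsheet and the linked PR explore.

```python
def increase(samples):
    """Total increase over (timestamp, value) samples.

    Any drop in value is treated as a counter reset, so the value
    seen after the drop is counted in full.
    """
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        total += value if value < prev else value - prev
        prev = value
    return total


# Hypothetical single-replica view: the counter grows by 30 per scrape.
single_replica = [(0, 1000.0), (30, 1030.0), (60, 1060.0)]

# Hypothetical merged view: the sample at t=40 comes from another replica
# and is assumed to be slightly lower than the one emitted at t=30.
merged_with_drop = [(0, 1000.0), (30, 1030.0), (40, 1025.0), (60, 1060.0)]

print(increase(single_replica))    # 60.0
print(increase(merged_with_drop))  # 1090.0
```

A drop of only 5 is read as a reset, so the whole counter value is added again, which is the kind of inflated rate reported in this issue.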

Any ideas @brancz @brian-brazil @beorn7 @SuperQ ? (:

My current idea to move this forward:

The current idea is to have special deduplication for counters. Generally, we don't know which metrics are counters at the offline level (unless it's downsampled data, then we know). However, for the query path it's clear: it is a counter if the rate func was in the hints from PromQL. So we can use a special counter-based dedup there.

At the offline rewrite / deduplication level, for raw data, we have no idea what the type is. However, as a quick win, we could for now not worry about offline dedup and just solve the query issues.

Then maybe we can figure out something else for the offline case. Maybe a generic dedup that will work for it, or something based on the _total metric name suffix (but that's sketchy).

Or... we should collaborate on a different dedup algorithm for the future. Maybe scrape interval based? (downside: What if the scrape interval changes)
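A rough sketch of what a hint-driven, counter-aware pass could look like follows. It is only an illustration of the idea, not the code from the fix PR: the set of function names, the `(timestamp, value, replica)` tuple layout, and the "hold the last value when a replica switch would lower it" heuristic are all assumptions made for this example.

```python
# Assumption: function names that imply the selected series is a counter.
COUNTER_FUNCS = {"rate", "irate", "increase"}


def is_counter_hint(func_name):
    """Return True if the hinted PromQL function implies counter semantics."""
    return func_name in COUNTER_FUNCS


def hold_on_replica_switch(merged):
    """Counter-aware pass over an already deduplicated series.

    `merged` is a list of (timestamp, value, replica) tuples in timestamp
    order. If the value drops exactly where the source replica changes,
    the last emitted value is held instead, so the drop is not later read
    as a counter reset.
    """
    out = []
    last_value = None
    last_replica = None
    for ts, value, replica in merged:
        switched = last_replica is not None and replica != last_replica
        if switched and value < last_value:
            value = last_value
        out.append((ts, value))
        last_value, last_replica = value, replica
    return out


# Hypothetical merged series with a small drop at the switch from a to b.
merged = [(0, 1000.0, "a"), (30, 1030.0, "a"), (40, 1025.0, "b"), (60, 1060.0, "b")]

if is_counter_hint("rate"):
    print(hold_on_replica_switch(merged))
```

This heuristic has obvious limits: a genuine reset that coincides with a replica switch is hidden for one sample, and later samples from the new replica can still fall below the held value. It only shows where the query-time hint would plug in.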

Fix: https://github.com/thanos-io/thanos/pull/2548 Tests: https://github.com/thanos-io/thanos/pull/2528

Help wanted for review!

> (downside: What if scrape interval changes)

That's not common, but you could depend on no one having a scrape interval over 2 minutes, as that's not sane for other reasons.

That's quite a good idea :+1: (for https://github.com/thanos-io/thanos/issues/2547)

It actually saddens me that Prometheus “by design” doesn't really cope with scrape intervals >2m. I'd love to see future Prometheus versions lift that arbitrary limit, and I'd therefore prefer that Thanos not bake that limit into its own design, too.

Interestingly, I'd also love to see a future Prometheus version with first-class support for metric types. That would then also solve your problem of how to safely recognize a counter.
