thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
12.73k stars 2.04k forks source link

querier: Rate over deduplicated counter from many replicas can lead to double reset account. #2401

Closed bwplotka closed 4 years ago

bwplotka commented 4 years ago

Found by GitLab, we were investigating offline with @SuperQ

Their issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9293

This can be only reproducible with large rates [30m+] which means it has to do with chunks ordering or overlaps.

bwplotka commented 4 years ago

Some potential issue is that overlaps are not handled really well: https://github.com/thanos-io/thanos/pull/2400/files

bwplotka commented 4 years ago

It can be something related to fact that GitLab is using Store GW in HA without loadbalancer (querying both in same time), so the data is duplicated and unsorted (chunks) for sure.

SuperQ commented 4 years ago

@bwplotka We initially reproduced the issue with direct to Prometheus sidecars, not Store GW. These are in HA pairs as well.

bwplotka commented 4 years ago

Got some test data from @SuperQ so hopefully will be able to repro locally :hugs:

Fingers crossed :crossed_fingers:

bwplotka commented 4 years ago

Looks like it is dup of https://github.com/thanos-io/thanos/issues/1326 - let's continue discussion there.

bwplotka commented 4 years ago

Actually let's not be so sure, this might be different (here deduplication does not cause it)

bwplotka commented 4 years ago

@SuperQ this repro is so amazing. can explore all details. Definitely we have overlapping and unsorted chunks. We should be able to find a problem in our algorithm soon, thanks!

BTW... I kind of overengineered (as you can imagine) and wrote thanos tools storeapi serve --json=<file x> which can serve JSON (protobuf based) and as Store API :tada:

So I can get your file (actually anything generated by grpcurl and put into thanos tools storeapi serve --json , run querier and connect storeapi serve as --store, and see your results:

image

bwplotka commented 4 years ago

Tooling looks like it works, but I think we don't have enough chunks to repro it :thinking:

Tried all sorts of time ranges, steps and rate ranges.. no luck:

image

All good everywhere...

cc @SuperQ , can you send me bit wider time span? :thinking: is this for sure time span you can reproduce the problem with? What if this is caching, some layer above Thanos Querier?

SuperQ commented 4 years ago

Here's another data sample that reproduces it. The time range to reproduce is this:

data-web-08.json.gz image

When I turn off dedupe, the issue goes away:

image

bwplotka commented 4 years ago

I don't see full data @SuperQ (you gave only half of it I think), but I can repro :tada:

image

bwplotka commented 4 years ago

Investigating

bwplotka commented 4 years ago

Step does not matter, it's deduplication bug:

image

bwplotka commented 4 years ago

Finally, found cause https://github.com/thanos-io/thanos/pull/2528

Essentially there are 3, exactly the same overlapping chunks with stale markers. For some reasons counter iterator, deduplication or bounded iterator does not handle this well.

bwplotka commented 4 years ago

Found root-cause, described minimal repro case: https://github.com/thanos-io/thanos/pull/2528#issuecomment-620659141

Closing https://github.com/thanos-io/thanos/issues/1326 as duplicate, now we know for sure.

bwplotka commented 4 years ago

Not super clear how to fix the issue long term (: Some deep dive https://docs.google.com/spreadsheets/d/13A8ChunqbVdRq9j5kqrtfzknO6mvVFQPUkXwUiBuV_4/edit?usp=sharing

bwplotka commented 4 years ago

TL;DR: The problem is with deduplicating a counter series from 2 or more Prometheus replicas.

Let's say they scrape the same counter from the same application.

Accounting resets correctly in generic deduplication algorithm data is really hard as presented in this spreadsheet. This is due to a different view of END value for each counter by different replicas (different scrape time!).

Crafting a deduplication algorithm when we know it's counter metric is quite trivial. The problem is... we don't know. So ideally we need a generic dedup algorithm for replicas.

Any ideas @brancz @brian-brazil @beorn7 @SuperQ ? (:

My current idea to move this forward:

The current idea is to actually have special deduplication for counters. Generally, we don't know what metric is a counter on the offline level (unless it's downsampled data, then we know). However, for query part it's clear. It is counter if rate func was in hints from PromQL. So we can use special counter-based dedup.

On offline rewrite / deduplication level, for raw data, we have no idea what type is. However, for a quick win, we, for now, could just not worry about offline dedup yet and just solve query issues.

Then we can maybe for the offline figure something else. Maybe generic dedup that will work for those, or something that will base on _total metric name (but that's sketchy)

bwplotka commented 4 years ago

Or... we should collaborate on different dedup algorithm for future. Maybe scrape interval based? (downside: What if scrape interval changes)

bwplotka commented 4 years ago

Fix: https://github.com/thanos-io/thanos/pull/2548 Tests: https://github.com/thanos-io/thanos/pull/2528

bwplotka commented 4 years ago

Help wanted for review!

brian-brazil commented 4 years ago

(downside: What if scrape interval changes)

That's not common, but you could depend on noone having a scrape interval over 2 minutes as that's not sane for other reasons.

bwplotka commented 4 years ago

That's quite good idea :+1: (for https://github.com/thanos-io/thanos/issues/2547)

beorn7 commented 4 years ago

It actually saddens me that Prometheus “by design” doesn't really cope with scrape intervals >2m. I'd love to see future Prometheus versions lifting that arbitrary limit, and I'd therefore prefer if Thanos didn't bake in that limit into its own design, too.

Interestingly, I'd also love to see future Prometheus version to have 1st class support for metric types. That would then also solve your problem of how to safely recognize a counter.