thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Deduplication returning deduped and non-deduped results in 0.31.0+ #6257

Closed: diranged closed this issue 1 year ago

diranged commented 1 year ago

Thanos, Prometheus and Golang version used:

Thanos Version: 0.31.0
Prometheus Version: 2.42.0

Object Storage Provider:

S3

What happened:

Slack Thread: https://cloud-native.slack.com/archives/CK5RSSC10/p1680534545839619

We recently upgraded Thanos from v0.28.1 to v0.31.0… and we're now seeing some odd behavior with portions of our queries. Basically, queries that pull data from the last few hours are all correct. However, longer queries return duplicate data as long as "use deduplication" is checked (which has been the default setting forever). The really weird part is that if we uncheck that box, we suddenly get the right data.

We have the following CLI flags configured on our thanos-query pods:

            - --query.replica-label=replica
            - --query.replica-label=prometheus_replica
            - --query.replica-label=cluster
            - --query.replica-label=__replica__
            - --query.replica-label=tenant_id
            - --query.replica-label=receive
            - --query.replica-label=prometheus
            - --query.replica-label=thanos_ruler_replica
            - --query.auto-downsampling
            - --query.max-concurrent-select=25
            - --store.response-timeout=1m
            - --query.max-concurrent=500
            - --query.metadata.default-time-range=6h


What you expected to happen:

I expect to get correct results, as happened in the 0.28.0, 0.29.0, 0.30.0, and 0.30.2 releases.

Anything else we need to know:

After discovering the issue, we rolled back to 0.28.0 and then upgraded one release at a time... the problem only occurs when we upgrade to the 0.31.0 release.

fpetkovski commented 1 year ago

This is likely introduced in https://github.com/thanos-io/thanos/pull/5988.

Do you know which component returns non-deduplicated results, and are all Thanos components updated to 0.31? Also, are all replica labels external labels, or are some of them stored inside TSDBs as series labels?

diranged commented 1 year ago

All of the components are running 0.31.0 when the problem happens. I believe some of the labels are stored inside TSDBs because they are coming from our prometheus scraper pods (which push into the receive pods) - but I'm not totally sure.

fpetkovski commented 1 year ago

@GiedriusS maybe we should strip replica labels from series as well?

zhangrj commented 1 year ago

same problem here: https://github.com/thanos-io/thanos/issues/6248

zhangrj commented 1 year ago

> This is likely introduced in #5988.
>
> Do you know which component returns non-deduplicated results, and are all Thanos components updated to 0.31? Also, are all replica labels external labels, or are some of them stored inside TSDBs as series labels?

According to my test, it's the store gateway (I'm not sure whether the ruler has this problem).

fpetkovski commented 1 year ago

@zhangrj do you have any replica labels that are stored in TSDB and are not external labels?

zhangrj commented 1 year ago

> @zhangrj do you have any replica labels that are stored in TSDB and are not external labels?

Sorry, I don't have any.

amro-krg commented 1 year ago

Same issue here, instantly fixed after a rollback to 0.30.2

Very strange indeed. I have 2 Prometheuses and the replica label is set. In Thanos Query, the query myquery{instance="10.0.0.1"} returns 1 series only, all good. myquery{instance="10.0.0.2"} also returns 1 series. However, myquery{instance=~"10.0.0.*"} doesn't return 2 series as it should, but 4! And they have the EXACT same labels, 2 by 2. If I disable deduplication I can see the different prometheus_replica values.

Thanos-query seems to have trouble merging different series based on replica-label, but only when the query returns more than one series.

GiedriusS commented 1 year ago

Do you use Thanos Receive? Also, could you please print the list of series verbatim with all of their labels like you see them on Thanos Query with deduplication off?

amro-krg commented 1 year ago

Yes, we use Thanos Receive and Thanos Store, and the problem does indeed come from Receive. Take a look:

Query 1: node_os_info{instance=~"hulbdcfr1-00(1|2).*"} with deduplication off:


node_os_info{env="production", id="centos", id_like="rhel fedora", instance="hulbdcfr1-001.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}
node_os_info{env="production", id="centos", id_like="rhel fedora", instance="hulbdcfr1-002.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}
node_os_info{env="production-nl1", id="centos", id_like="rhel fedora", instance="hulbdcfr1-001.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}
node_os_info{env="production-nl1", id="centos", id_like="rhel fedora", instance="hulbdcfr1-002.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}

same query with deduplication on:


node_os_info{id="centos", id_like="rhel fedora", instance="hulbdcfr1-001.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}
node_os_info{id="centos", id_like="rhel fedora", instance="hulbdcfr1-001.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}
node_os_info{id="centos", id_like="rhel fedora", instance="hulbdcfr1-002.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}
node_os_info{id="centos", id_like="rhel fedora", instance="hulbdcfr1-002.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}

You can see these metrics are doubled, and have the exact same labels.

Query 2: Another query for one specific node only, without deduplication: node_os_info{instance=~"hulbdcfr1-001.*"}

node_os_info{env="production", id="centos", id_like="rhel fedora", instance="hulbdcfr1-001.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}
node_os_info{env="production-nl1", id="centos", id_like="rhel fedora", instance="hulbdcfr1-001.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}

and with deduplication:

node_os_info{id="centos", id_like="rhel fedora", instance="hulbdcfr1-001.REDACTED:9106", job="Node Exporter", name="CentOS Linux", pretty_name="CentOS Linux 7 (Core)", receive_cluster="eu1", receive_replica="0", tenant_id="default-tenant", version="7 (Core)", version_id="7"}

With this last one, Thanos-query does indeed merge as it's supposed to.

Again, just rolling back from 0.31.0 to 0.30.2 and redeploying, without changing a single byte, solves the problem :/

fpetkovski commented 1 year ago

Do you happen to have prometheus HA pairs that both remote write to Receivers? Trying to understand where the prometheus_replica label is coming from.

GiedriusS commented 1 year ago

> @GiedriusS maybe we should strip replica labels from series as well?

I wonder if this could be the culprit. I'll try to reproduce, but I haven't been able to so far.

amro-krg commented 1 year ago

@fpetkovski yes, 2 Prometheuses with the exact same targets (no sharding whatsoever) are remote-writing to Thanos Receive:

  remote_write:
    - url: http://xx.xx.xx.xx:10908/api/v1/receive

where 10908 is given to thanos-receive as arg --remote-write.address 0.0.0.0:10908

One prometheus is set with

    external_labels:
      env: production

and the other

    external_labels:
      env: production-nl1

and of course, thanos-query has: --query.replica-label=env

fpetkovski commented 1 year ago

@amro-krg you are running into https://github.com/thanos-io/thanos/issues/6257#issuecomment-1497844767.

In https://github.com/thanos-io/thanos/pull/5988 we made an optimization which improves query performance by deduplicating on external labels only. Even though prometheus_replica is an external label in Prometheus, it is remote-written into a Receiver's TSDB. Each receiver in your case has two prometheus_replica values in its TSDB, and this is why deduplication fails.

@diranged This could explain the issue you are seeing as well if you have some sort of HA scraping + remote write setup.

Link to the proposal: https://github.com/thanos-io/thanos/blob/main/docs/proposals-accepted/20221129-avoid-global-sort.md.

hagaibarel commented 1 year ago

Hitting the same issue as @amro-krg with a similar setup: Prometheus writes to Receive via remote write, and the querier returns deduped and non-deduped results with the prometheus_replica label.

Any suggestions on workaround or how to handle this?

fpetkovski commented 1 year ago

I don't know a way to mitigate this issue right now. Writing to Receivers with HA prometheus pairs is not something we predicted as a potential use case.

KubeCon EU is happening next week so it will be a good opportunity for us to look into solutions. We will probably need to cut 0.31.1 once we have a fix.

trevorriles commented 1 year ago

We are seeing this same issue without receive in the picture.

We've since rolled back but I can try and reproduce again in a dev environment next week.

We just have Prometheus + sidecar, store, and ruler. All behind the query servers.

fpetkovski commented 1 year ago

@trevorriles that would be great, especially if you can share which replica labels you use and what the incorrect query result looks like.

wallee94 commented 1 year ago

I believe this comes from the stores not re-ordering the series after adding the external labels. You can replicate it by adding the metrics:

metric{}
metric{alocal="label"}

And an external label ext="label". Before the external label, metric{} < metric{alocal="label"} because it has fewer labels. However, after adding it, both sidecar and ruler returned this when I tested them:

metric{ext="label"}
metric{alocal="label", ext="label"}

and since alocal < ext, the order is incorrect. If we have two stores returning the same duplicated data, ProxyResponseHeap will yield the following:

metric{ext="label"}
metric{alocal="label", ext="label"}
metric{ext="label"}
metric{alocal="label", ext="label"}

This makes dedupResponseHeap ignore the duplication. I tried to work out a solution in the ProxyResponseHeap Less function, but I think this should probably be re-sorted in the stores instead. The same incorrect order is present in the Prometheus remote read response to the sidecar.
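For illustration, here is a minimal sketch of the ordering flip described above, assuming only the Prometheus labels package (github.com/prometheus/prometheus/model/labels) rather than the actual Thanos store code; the metric name and the alocal/ext labels are taken from the example in this comment:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// As stored in the TSDB: the series with fewer labels sorts first.
	a := labels.FromStrings("__name__", "metric")
	b := labels.FromStrings("__name__", "metric", "alocal", "label")
	fmt.Println(labels.Compare(a, b) < 0) // true: metric{} < metric{alocal="label"}

	// After the store attaches the external label ext="label", the relative
	// order flips because "alocal" sorts before "ext".
	aExt := labels.FromStrings("__name__", "metric", "ext", "label")
	bExt := labels.FromStrings("__name__", "metric", "alocal", "label", "ext", "label")
	fmt.Println(labels.Compare(aExt, bExt) < 0) // false: the response is no longer sorted
}
```

If two stores both return series in this flipped order, the proxy's merge sees interleaved duplicates exactly as shown above.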

fpetkovski commented 1 year ago

I think you are right; this could be one way the issue happens. We should probably exclude external labels from the comparison because they interfere with the sorting order.

fpetkovski commented 1 year ago

I think there are two separate problems here. The first one is that external labels can affect the ordering of series coming from tsdb. This one could potentially be fixed by #6299.

The second problem is deduplicating by labels which are stored inside the tsdb and are not external labels. In this case, removing replica labels will lead to unsorted data. This scenario can happen when two HA prometheus instances write into a receiver. I do not see a way to fix this other than to introduce a flag in the querier (e.g. dedup-by-internal-labels) to tell it that it should resort the data before deduplication.

GiedriusS commented 1 year ago

IMO having yet another flag for that would be a mess, and it would be hard to explain to users whether it is needed or not. Would there be any downsides to removing the given labels from the response regardless of whether they are external labels? With deduplication, the replica labels disappear either way.

fpetkovski commented 1 year ago

This will not solve the problem; we can have a single store response like:

metric{replica="a", z="1"}
metric{replica="a", z="2"}
metric{replica="b", z="1"}
metric{replica="b", z="2"}

If we dedup by replica and we strip it from the response, we end up with an unsorted series set:

metric{z="1"}
metric{z="2"}
metric{z="1"}
metric{z="2"}

Also, we already do this: https://github.com/thanos-io/thanos/blob/6d838e7142d97cce81bcaada932992a4c40a5d80/pkg/store/tsdb.go#L178

Alternatively we can detect if dedup labels are replica labels everywhere, and avoid resorting in that case. However, then we need to set the same replica label in all stores if we want to avoid the resort.

fpetkovski commented 1 year ago

@GiedriusS wdyt about a solution that uses bloom filters to detect dedup by internal labels: https://github.com/thanos-io/thanos/pull/6317.

I haven't tested this in a staging environment to see how much memory/bandwidth overhead we would have, but I can do that next week once I am back in the office. It could also open the door to additional bloom filters (e.g. on metric names).
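For context, here is a minimal, self-contained sketch of the general idea: each store could advertise (for example via a Bloom filter) which label names exist inside its TSDB, so the querier can detect when a requested replica label is an internal label. This only illustrates the data structure; the actual implementation in that PR may differ, and all names below are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a minimal, illustrative Bloom filter over label names.
type bloomFilter struct {
	bits []bool
	k    int // number of hash functions
}

func newBloomFilter(m, k int) *bloomFilter {
	return &bloomFilter{bits: make([]bool, m), k: k}
}

// positions derives k bit positions via double hashing over FNV-1a.
func (b *bloomFilter) positions(s string) []int {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32
	pos := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		pos[i] = int((h1 + uint64(i)*h2) % uint64(len(b.bits)))
	}
	return pos
}

func (b *bloomFilter) Add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p] = true
	}
}

// Test reports whether s may have been added (false positives are possible).
func (b *bloomFilter) Test(s string) bool {
	for _, p := range b.positions(s) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	// Label names stored inside a hypothetical Receiver TSDB.
	internal := newBloomFilter(1024, 4)
	for _, name := range []string{"instance", "job", "prometheus_replica"} {
		internal.Add(name)
	}

	// The querier can now detect that deduplicating by prometheus_replica
	// touches an internal label, so the slower buffer + resort path is needed.
	fmt.Println(internal.Test("prometheus_replica")) // true
	fmt.Println(internal.Test("receive_replica"))    // most likely false
}
```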

douglascamata commented 1 year ago

@fpetkovski @GiedriusS @bwplotka: I know that #5988 is a huge PR that introduces a lot of changes... but this bug is kind of critical for anyone running the 0.31.0 release. Given that it will take some time to fix, would it be better to put the behavior introduced by #5988 behind an experimental flag until the fix is done? WDYT?

GiedriusS commented 1 year ago

Let's start the work on this by introducing e2e tests that show the bug, and then we can work on adding fixes.

douglascamata commented 1 year ago

@GiedriusS yesterday I was able to write some e2e tests that show the bug. I'm adding one more scenario and will get a PR opened. 👍

douglascamata commented 1 year ago

@GiedriusS test to reproduce the issue available at #6377.

bwplotka commented 1 year ago

Thanks for reporting @diranged and reproducing @douglascamata, I will take a look later this week too. Sounds like we missed some edge cases, sorry for that.

avidpontoon commented 1 year ago

Hi all, is this likely to be what I'm seeing here: https://github.com/thanos-io/thanos/discussions/6419? It sounds the same from reading through the comments.

Cho1409 commented 1 year ago

Got the same error when upgrading from 0.29.0 to 0.31.0.

Everything is fine now after rolling back to 0.30.0.

GiedriusS commented 1 year ago

I got some time to think about this and fix it, so hopefully this will get done soon.

GiedriusS commented 1 year ago

To recap: dedup used to work by saving all storepb.SeriesResponse into a slice and then resorting everything so that series would be sorted by replica labels first. In 0.31.0 we added withoutReplicaLabels as per the proposal: https://thanos.io/tip/proposals-accepted/20221129-avoid-global-sort.md/. The idea was to remove replica labels before getting the response so that we could use k-way merge to merge everything in order.

However, we missed that a StoreAPI might start sending unsorted responses with this functionality. Consider the following:

a=1,b=1
a=1,b=2
a=2,b=1
a=2,b=2

If the replica label is a then the response becomes:

b=1
b=2
b=1
b=2

Because storage.SeriesSet is an iterator over a sorted series set, it means that to have lazy, in-order evaluation of the iterator we'd need to buffer everything that comes between the two b=1 series in the response from a StoreAPI.

If a is an external label then the situation changes a bit. This means that the response would look like:

a=1,b=1
a=1,b=2
a=1,b=1
a=1,b=2

Since external labels always have the same value, it means that such a response is not possible and it would look something like this instead:

a=1,b=1,c=1
a=1,b=2,c=1
a=1,b=1
a=1,b=2

Deduplication on external labels makes it so that even with the removal of the external label, the response is sorted and we can use k-way merge directly.

This is the difference between deduplication on internal labels and external labels. Due to this, it seems like we would inevitably have to buffer everything to provide a general deduplication algorithm. But, we should avoid buffering as much as possible. The small ad-hoc benchmark in the original proposal shows a huge performance improvement.

Thus, my suggestion would be to:

- if deduplication works on internal labels (or internal + external labels), then we have to buffer + resort;
- if deduplication works only on external labels, then everything can stay as is.

I also suggest removing this https://github.com/thanos-io/thanos/issues/6257#issuecomment-1514160495 hack because a StoreAPI must always send a sorted series set. If it doesn't then that's a bug in the StoreAPI implementation.

I filed https://github.com/thanos-io/thanos/issues/6563 and https://github.com/prometheus/prometheus/issues/12605 for these bugs. But these are other problems that have impacted deduplication since the beginning of Thanos, I think. So, let's fix the original problem first.
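To illustrate the suggested split, here is a minimal sketch assuming the Prometheus labels package; it is not the actual Thanos code path, and needsResort/resort are hypothetical helpers. If any requested replica label is not an external label of a store, the stripped response can arrive unsorted and has to be buffered and resorted before streaming deduplication; if all replica labels are external labels, the existing k-way merge can be used directly.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/prometheus/prometheus/model/labels"
)

// needsResort reports whether deduplicating by replicaLabels can break the
// sort order of a store's response: that is the case whenever one of the
// replica labels is stored inside the TSDB rather than being an external label.
func needsResort(replicaLabels []string, externalLabels map[string]struct{}) bool {
	for _, l := range replicaLabels {
		if _, ok := externalLabels[l]; !ok {
			return true
		}
	}
	return false
}

// resort restores the invariant that the series set is sorted by label set,
// which the streaming (k-way merge) deduplication relies on.
func resort(series []labels.Labels) {
	sort.Slice(series, func(i, j int) bool {
		return labels.Compare(series[i], series[j]) < 0
	})
}

func main() {
	// Series from one store after the internal replica label "a" has been
	// stripped (compare the unsorted b=1/b=2 example above).
	series := []labels.Labels{
		labels.FromStrings("__name__", "metric", "b", "1"),
		labels.FromStrings("__name__", "metric", "b", "2"),
		labels.FromStrings("__name__", "metric", "b", "1"),
		labels.FromStrings("__name__", "metric", "b", "2"),
	}

	replicaLabels := []string{"a"}
	external := map[string]struct{}{} // "a" is an internal label in this store

	if needsResort(replicaLabels, external) {
		resort(series)
	}
	for _, s := range series {
		fmt.Println(s)
	}
}
```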

fpetkovski commented 1 year ago

> Thus, my suggestion would be to:
>
> - if deduplication works on internal labels (or internal + external labels), then we have to buffer + resort;
> - if deduplication works only on external labels, then everything can stay as is.

Can we then adapt https://github.com/thanos-io/thanos/pull/6317 to resort in stores instead of doing it in the querier?

gkope commented 1 year ago

The issue was happening here as well with Prometheus v2.45.0 and Thanos v0.31.0.

I'm using a model with observer/observee clusters. The Prometheus instances in my observee clusters don't have a Thanos sidecar, so I scrape their /federate endpoint with the prometheus-thanos in my observer cluster. When I have 2 replicas of my prometheus-thanos, I sometimes get the duplicated labelset error and sometimes an empty result.

We rolled back to Thanos version 0.30.0 and everything works fine now.

ahurtaud commented 1 year ago

Can confirm the fix is working in our setup. Thank you @fpetkovski!

benjaminhuo commented 1 year ago

Cannot wait any longer for a new release that includes this fix! Thanks, @fpetkovski