thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Issue with deduplication algorithm in Thanos #7364

Open pardha-visa opened 4 months ago

pardha-visa commented 4 months ago

We have a fairly straightforward Thanos setup consisting of a Querier, two Prometheus replicas, and their two corresponding sidecars, each co-located with its own Prometheus instance. Both Prometheus replicas share the exact same configuration and scrape the same set of targets. The sidecars use the Prometheus remote read API for querying.

Recently we saw that, for one of the targets, one of the Prometheus replicas experienced scrape failures due to timeouts, which created data collection gaps. The other Prometheus replica, however, didn't face any such issues and had no data collection gaps.

Our expectation was that, when querying data for this target via the Thanos Querier, these gaps would be filled automatically by the deduplication algorithm. However, this didn't happen: Thanos selected data from the replica that had the gaps.
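
For illustration, here is a minimal sketch of the behaviour we expected. It simply merges the two replica series by timestamp and drops samples that land within half a scrape interval of an already-kept sample, so gaps in one replica get filled from the other. This is plain Go written for this issue, not Thanos's actual penalty-based deduplication code; the sample values are the ones from the raw data further down.

```go
package main

import (
	"fmt"
	"sort"
)

type sample struct {
	t int64 // milliseconds
	f float64
}

// naiveDedup merges two replica series and keeps a sample only if it is at
// least minGapMs away from the previously kept one.
func naiveDedup(a, b []sample, minGapMs int64) []sample {
	merged := append(append([]sample{}, a...), b...)
	sort.Slice(merged, func(i, j int) bool { return merged[i].t < merged[j].t })

	var out []sample
	for _, s := range merged {
		if len(out) == 0 || s.t-out[len(out)-1].t >= minGapMs {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Replica with scrape gaps (occ-node-A) and replica without gaps (oce-node-A).
	replicaA := []sample{
		{1713668216753, 9389.87}, {1713668306753, 9390.03}, {1713668336753, 9390.33},
		{1713668426753, 9391.36}, {1713668456753, 9391.38}, {1713668486753, 9393.49},
	}
	replicaB := []sample{
		{1713668224198, 9389.94}, {1713668254198, 9389.95}, {1713668284198, 9390.02},
		{1713668314198, 9390.03}, {1713668344198, 9390.33}, {1713668374198, 9390.83},
		{1713668404198, 9391.13}, {1713668434198, 9391.38}, {1713668464198, 9391.61},
		{1713668494198, 9393.53},
	}

	// With a 30s scrape interval, 15s is a reasonable "same sample" threshold.
	for _, s := range naiveDedup(replicaA, replicaB, 15_000) {
		fmt.Printf("%d %.2f\n", s.t, s.f)
	}
}
```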

Here's the graph with deduplication disabled (first replica selected):

Screenshot 2024-05-16 at 9 35 38 AM

Here's the graph with deduplication disabled (second replica selected):

Screenshot 2024-05-16 at 9 35 50 AM

Here's the graph with deduplication enabled:

Screenshot 2024-05-16 at 9 36 04 AM

Here is the raw data from both the replicas for the same time range:

Raw data for this timeseries from both the replicas

Query = node_cpu_seconds_total{mode='iowait',instance='<masked>',cpu="0"}[5m]

_replica=occ-node-A
9389.87 @1713668216.753
9390.03 @1713668306.753
9390.33 @1713668336.753
9391.36 @1713668426.753
9391.38 @1713668456.753
9393.49 @1713668486.753

_replica=oce-node-A
9389.94 @1713668224.198
9389.95 @1713668254.198
9390.02 @1713668284.198
9390.03 @1713668314.198
9390.33 @1713668344.198
9390.83 @1713668374.198
9391.13 @1713668404.198
9391.38 @1713668434.198
9391.61 @1713668464.198
9393.53 @1713668494.198

Thanos version: 0.33.0
Prometheus version: 2.51.1
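
For reference, this is roughly how the per-replica raw samples above can be pulled for comparison. A minimal sketch, assuming the Thanos Querier's Prometheus-compatible HTTP API is reachable at localhost:9090 and that the optional dedup parameter is honoured on /api/v1/query (the endpoint address and timestamp here are placeholders):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	q := url.Values{}
	// An instant query of a range selector returns the raw samples per series.
	q.Set("query", `node_cpu_seconds_total{mode="iowait",instance="<masked>",cpu="0"}[5m]`)
	q.Set("time", "1713668500")
	q.Set("dedup", "false") // keep both _replica series visible

	resp, err := http.Get("http://localhost:9090/api/v1/query?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // matrix result: one series per _replica label
}
```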

MichaHoffmann commented 4 months ago

The query in the UI uses [1m]; from the samples it looks like you have a 30s scrape interval. Does it also happen with [5m]? I wrote a quick test with your given inputs and the resulting series looks roughly like this:

    samples: []sample{
        {t: 1713668216000, f: 9389.87},
        {t: 1713668224000, f: 9389.94},
        {t: 1713668254000, f: 9389.95},
        {t: 1713668284000, f: 9390.02},
        {t: 1713668314000, f: 9390.03},
        {t: 1713668344000, f: 9390.33},
        {t: 1713668374000, f: 9390.83},
        {t: 1713668404000, f: 9391.13},
        {t: 1713668434000, f: 9391.38},
        {t: 1713668464000, f: 9391.61},
        {t: 1713668494000, f: 9393.53},
    },
MichaHoffmann commented 4 months ago

It looks like all samples are there and have a proper 30s scrape interval between them. It could be that your 1m windows are aligned in a way that only one sample falls inside a window, which would break rate. I think this is an issue of too small a window; the deduplication result looks mostly correct to me, except that we have one sample too many at the beginning.
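
To make the "window too small" point concrete: rate() only produces a value when at least two samples fall inside the range window, so any [1m] window that catches fewer than two samples shows up as a gap in the graph even though samples exist outside it. A rough sketch (the timestamps are the ones from the replica that had scrape failures; the 30s-aligned window ends are arbitrary) that counts samples per window:

```go
package main

import "fmt"

// samplesInWindow counts samples with timestamps in (end-rangeMs, end].
func samplesInWindow(ts []int64, end, rangeMs int64) int {
	n := 0
	for _, t := range ts {
		if t > end-rangeMs && t <= end {
			n++
		}
	}
	return n
}

func main() {
	// Timestamps (ms) from the replica with scrape gaps (occ-node-A above).
	ts := []int64{
		1713668216753, 1713668306753, 1713668336753,
		1713668426753, 1713668456753, 1713668486753,
	}

	// Walk 30s steps across the range and count samples per 1m window;
	// windows with fewer than two samples yield no rate() value.
	for end := int64(1713668250000); end <= 1713668490000; end += 30_000 {
		fmt.Printf("window ending %d contains %d sample(s)\n",
			end, samplesInWindow(ts, end, 60_000))
	}
}
```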

jnyi commented 4 months ago

We have evidence of a dedup logic bug as well; here is the proof:

The first graph shows data points missing when getting results from Thanos Receive with replicationFactor == 3. Because we were doing a rolling update of the receiver pods, one copy was definitely absent; however, even with deduplication enabled, the graph still has dips:

No Dedup

Screenshot 2024-05-16 at 10 36 34 PM

With Dedup

Screenshot 2024-05-16 at 10 36 41 PM

After the data was compacted and served from the store, the results became correct, with no dips:

Screenshot 2024-05-20 at 10 22 21 AM
jnyi commented 4 months ago

This seems related to https://github.com/thanos-io/thanos/issues/981.