Open pardha-visa opened 4 months ago
The query in the UI uses [1m]; from the sample it looks like you have a 30s scrape interval. Does it also happen with [5m]? I wrote a quick test with your given inputs, and the result series looks like:
samples: []sample{
{t: 1713668216000, f: 9389.87},
{t: 1713668224000, f: 9389.94},
{t: 1713668254000, f: 9389.95},
{t: 1713668284000, f: 9390.02},
{t: 1713668314000, f: 9390.03},
{t: 1713668344000, f: 9390.33},
{t: 1713668374000, f: 9390.83},
{t: 1713668404000, f: 9391.13},
{t: 1713668434000, f: 9391.38},
{t: 1713668464000, f: 9391.61},
{t: 1713668494000, f: 9393.53},
},
It looks like all samples are there, with the proper 30s scrape interval between them. It could be that your 1m windows are aligned in a way that only one sample falls inside a window, which would break rate(). I think this is an issue of a too-small window; the deduplication result looks roughly correct to me, except that there is one sample too many at the beginning.
We have evidence of a dedup logic bug as well; here is the proof:
The first graph shows data points missing when querying results from Thanos Receive with replicationFactor == 3. Because we were rolling-updating the receiver pods, one copy was definitely absent; however, even with deduplication enabled, the results still have dips:
No Dedup
With Dedup
After the data was compacted and returned from Store, the results became correct, with no dips:
This seems related to this issue: https://github.com/thanos-io/thanos/issues/981
We have a pretty straightforward Thanos setup consisting of a Querier, two Prometheus replicas, and their two corresponding sidecars, each co-located with its own Prometheus instance. Both Prometheus replicas share the exact same configuration and scrape the same set of targets. The sidecars use the Prometheus remote read API for querying.
Recently we saw that for one target, one of the Prometheus replicas experienced scrape failures due to timeouts, which created data collection gaps. The other Prometheus replica, however, didn't face any such issues and had no data collection gaps.
Our expectation was that when querying data for this target via Thanos Querier, these gaps would be automatically filled by the deduplication algorithm. However, this didn't happen, and Thanos selected data from the replica that had the gaps.
Here's the graph with deduplication disabled (first replica selected):
Here's the graph with deduplication disabled (second replica selected):
Here's the graph with deduplication enabled:
Here is the raw data from both the replicas for the same time range:
Raw data for this timeseries from both the replicas
Query =
node_cpu_seconds_total{mode='iowait',instance='<masked>',cpu="0"}[5m]
_replica=occ-node-A 9389.87 @1713668216.753 9390.03 @1713668306.753 9390.33 @1713668336.753 9391.36 @1713668426.753 9391.38 @1713668456.753 9393.49 @1713668486.753
_replica=oce-node-A 9389.94 @1713668224.198 9389.95 @1713668254.198 9390.02 @1713668284.198 9390.03 @1713668314.198 9390.33 @1713668344.198 9390.83 @1713668374.198 9391.13 @1713668404.198 9391.38 @1713668434.198 9391.61 @1713668464.198 9393.53 @1713668494.198
Thanos version: 0.33.0
Prometheus version: 2.51.1