thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

increase/rate got abnormally large result on aggregated data sources #7146

Open Qookyozy opened 9 months ago

Qookyozy commented 9 months ago

Thanos, Prometheus and Golang version used: thanos receive: v0.32.5, thanos query: v0.33.0, thanos query-frontend: v0.33.0

What happened: increase/rate returned abnormally large results on aggregated data sources. [screenshot]

What you expected to happen: increase/rate returns correct results on aggregated data sources.

How to reproduce it (as minimally and precisely as possible): When two clusters holding identical data are aggregated through Thanos Query, the increase/rate functions produce abnormally large results at the points where metric chunks switch.
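For illustration, a query of the following shape is the kind that triggers the symptom. This is only a minimal sketch: the metric name, time range, and Querier address are placeholders, not values from this environment.

```shell
# Minimal sketch of the kind of range query that shows the symptom; the
# metric name, time range, and Querier address are placeholders.
curl -sG 'http://localhost:10902/api/v1/query_range' \
  --data-urlencode 'query=increase(http_requests_total[5m])' \
  --data-urlencode 'start=2024-02-20T00:00:00Z' \
  --data-urlencode 'end=2024-02-20T01:00:00Z' \
  --data-urlencode 'step=60s' \
  --data-urlencode 'dedup=true'
```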

Full logs to relevant components:

Anything else we need to know: We have conducted the following validations (a comparison sketch follows the list):

  1. For the same metric, during metric chunk switching (120 samples), the results are normal for independent data sources but abnormal for the aggregated data source.
  2. For different metrics, during metric chunk switching (120 samples), the results are normal for independent data sources but abnormal for the aggregated data source.
  3. For different metrics with different collection periods, the anomaly consistently occurs once 120 samples have been collected.
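A rough sketch of the comparison behind these validations, assuming a hypothetical metric and Querier address: querying the aggregated Querier once with deduplication on and once with it off lets the deduplicated result be compared against the raw per-replica series around a chunk boundary.

```shell
# Compare the deduplicated result with the raw per-replica series around a
# chunk boundary; metric name and Querier address are placeholders.
for dedup in true false; do
  curl -sG 'http://querier-aggregated:10902/api/v1/query' \
    --data-urlencode 'query=increase(http_requests_total[5m])' \
    --data-urlencode "dedup=${dedup}"
  echo
done
```
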
MichaHoffmann commented 9 months ago

I wonder if it's related to what was discussed here before: https://cloud-native.slack.com/archives/CL25937SP/p1697127041904839.

MichaHoffmann commented 9 months ago

Can you describe your cluster a bit more, please? That would help in figuring out what's going on!

Qookyozy commented 9 months ago

@MichaHoffmann Here is our cluster architecture and the reasoning behind it. Thanks for your help!

Alarm datasource: queries both clusters, so that alarms do not become unavailable when a single cluster fails.
Dashboard datasource: queries only cluster B, so that large dashboard queries cannot OOM the receivers of both clusters and affect alerting.

[architecture diagram]
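As a sketch of how the alarm datasource might be wired up (the endpoint addresses here are hypothetical, and the replica label is the one confirmed further down):

```shell
# Hypothetical sketch of the alarm-datasource Querier fanning out to the
# Receive components of both clusters; addresses are placeholders.
thanos query \
  --http-address=0.0.0.0:10902 \
  --endpoint=receive-a.cluster-a.example:10901 \
  --endpoint=receive-b.cluster-b.example:10901 \
  --query.replica-label=receive-cluster
```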

MichaHoffmann commented 9 months ago

The Compactor compacts together blocks uploaded from both clusters, right? And it's configured with the replica label "receive-cluster", right?

Qookyozy commented 9 months ago

@MichaHoffmann

Compactor compacts together blocks uploaded from both clusters again, right? No.
It's configured with replica label "receive-cluster", right? Yes.
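Sketched as flags (not the actual configuration; the bucket config filename is a placeholder), this means each Compactor only sees its own cluster's bucket, so blocks from A and B are never compacted together, and cross-cluster deduplication happens only in the Querier via the replica label:

```shell
# Sketch only: Compactor B compacts blocks from cluster B's bucket alone;
# no cross-cluster compaction. The bucket config filename is a placeholder.
thanos compact \
  --objstore.config-file=bucket-b.yaml \
  --wait
```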

A more complete architectural diagram is shown below. [architecture diagram]

Component startup parameters (collapsed sections in the original issue): Compactor B, Receive B, Receive A

Qookyozy commented 8 months ago

Hello @MichaHoffmann, have there been any recent developments on this issue?