thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

The difference between the raw metrics and downsampling metrics #7800

Open anarcher opened 3 weeks ago

anarcher commented 3 weeks ago

Thanos, Prometheus and Golang version used: thanos:0.36.1

Object Storage Provider: S3

What happened: There is a difference between the raw metrics and the downsampled metrics, as shown in the attached screenshots. (I couldn't see any particular issues in compaction.) Could there be a reason for this difference? Is there any specific area I should check?

kube_pod_info had the following skip-series warning in the logs:

ts=2024-10-06T11:38:49.185869258Z caller=streamed_block_writer.go:116 level=warn msg="empty chunks happened, skip series" series="{__cluster__='prod-kr-a-k8s', __name__='kube_pod_info', __replica__='prometheus-agent-k8s-thanos-0', cluster='prod-kr-a', container='kube-rbac-proxy-main', created_by_kind='Workflow', created_by_name='sync-ehr-1727999700', env='prod', host_ip='10.128.91.30', host_network='false', instance='10.128.72.3:8443', job='kube-state-metrics', namespace='katalog', node='ip-10-128-91-30.ap-northeast-2.compute.internal', pod='sync-ehr-1727999700-hook-621784931', pod_ip='10.128.91.196', priority_class='default', prometheus='addon-monitoring/agent-k8s-thanos', region='kr', role='service', uid='839e9db9-035d-4c4b-854a-e6862a7ece28'}"
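
For context, here is a minimal, hypothetical Go sketch of the kind of guard that emits this warning: a series whose chunk list ends up empty is skipped rather than written to the downsampled block. The function name, signature, and surrounding code are my own illustration, not the actual streamed_block_writer.go source.

package main

import (
	"os"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/tsdb/chunks"
)

// writeSeries sketches the skip-empty-series guard (illustrative only).
func writeSeries(logger log.Logger, lset labels.Labels, chks []chunks.Meta) error {
	if len(chks) == 0 {
		// The series is dropped from the downsampled block entirely, which is
		// one way raw and downsampled queries over the same range can diverge.
		level.Warn(logger).Log("msg", "empty chunks happened, skip series", "series", lset.String())
		return nil
	}
	// ... otherwise, write the series labels and its chunks to the output block ...
	return nil
}

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	// A series with no chunks takes the warning path above.
	_ = writeSeries(logger, labels.FromStrings("__name__", "kube_pod_info"), nil)
}

If many short-lived series (for example, pods created by Workflow jobs like the one in the log) hit this path, the downsampled block can contain fewer series than the raw block, which would show up as a different trend in aggregate queries.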

Running thanos tools bucket verify does not report any issues for the downsampled block:

thanos tools bucket verify --objstore.config-file=./cfg/thanos-p01.yaml --id=01J9B6AC49SWMBZRE5G4Q333EK --issues=index_known_issues
ts=2024-10-06T12:45:59.363563Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-10-06T12:45:59.36678Z caller=verify.go:138 level=info verifiers=index_known_issues msg="Starting verify task"
ts=2024-10-06T12:45:59.366804Z caller=index_issue.go:33 level=info verifiers=index_known_issues verifier=index_known_issues msg="started verifying issue" with-repair=false
ts=2024-10-06T12:46:03.995076Z caller=fetcher.go:623 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.628019583s duration_ms=4628 cached=1438 returned=244 partial=0
ts=2024-10-06T13:18:32.289597Z caller=index_issue.go:75 level=info verifiers=index_known_issues verifier=index_known_issues msg="verified issue" with-repair=false
ts=2024-10-06T13:18:32.295277Z caller=verify.go:157 level=info verifiers=index_known_issues msg="verify task completed"
ts=2024-10-06T13:18:32.377922Z caller=main.go:174 level=info msg=exiting
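
A follow-up check (my suggestion, not something from the thread) is to compare the metadata of the raw block and its downsampled counterpart with thanos tools bucket inspect, which prints per-block statistics such as time range, resolution, and series/sample/chunk counts:

thanos tools bucket inspect --objstore.config-file=./cfg/thanos-p01.yaml

If the downsampled block reports noticeably fewer series than the raw block it was produced from, that lines up with series being skipped during downsampling.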

What you expected to happen: The trend in the raw data and the downsampled data should be similar.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Anything else we need to know:

dosubot[bot] commented 3 weeks ago

Hello, @anarcher! I'm here to help with any bugs, questions, or contributions you have. Let's tackle this issue together!

The difference between raw metrics and downsampled metrics in Thanos could be due to empty chunks in the affected series, as indicated by the "empty chunks" skip-series warning for kube_pod_info. During downsampling, Thanos skips empty XOR chunks, which can lead to discrepancies between raw and downsampled data [1].

To address this, check for empty chunks in the kube_pod_info series and ensure the data is correctly populated before downsampling. This might involve verifying the ingestion pipeline or looking for issues in the data source that could produce empty chunks. Additionally, ensure there are no overlapping blocks with the same external labels and time periods, as overlaps can also affect the downsampling process [2]; see the example below.
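
As a hedged example of that overlap check, thanos tools bucket verify can be pointed at a different issue verifier. The verifier name below is taken from my recollection of the upstream verifier registry and may differ between Thanos versions, so treat it as an assumption to confirm against thanos tools bucket verify --help:

thanos tools bucket verify --objstore.config-file=./cfg/thanos-p01.yaml --issues=overlapped_blocks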

To continue talking to Dosu, mention @dosu.