thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Returned result not stable when query large range data #6988

Open suxiaoxiaomm opened 9 months ago

suxiaoxiaomm commented 9 months ago

Hi experts, I run a range query over the last 30 days like this: (sum_over_time(app_availability_5m[30d]))/count_over_time(app_availability_5m[30d])

When I query from the Grafana UI, the returned value is not stable: sometimes it is 100.000%, and a few seconds later, when I click Run Queries again, it returns 99.954%.

My thanos-query has 4 endpoints: thanos-storegateway, thanos-ruler, and 2 replicas of Prometheus (each running thanos-sidecar). I suspect that when the value is 100.000%, the result comes from Prometheus (thanos-sidecar), which only holds the last 2 days of data, and when the value is 99.954%, the result comes from thanos-storegateway, which holds the full 30 days. 99.954% is the correct result that I expect.

Is my assumption correct? How can I make sure I only get the result from thanos-storegateway?

Thanks for your help!

MichaHoffmann commented 9 months ago

Hey,

The way the query is written, I would expect it to always be 1 (if the metric always has value 1), right? Can you try adding an "offset 10m" to see if that fixes the instability?

 (sum_over_time(app_availability_5m[30d] offset 10m))/count_over_time(app_availability_5m[30d] offset 10m)

But generally I don't think I understand the query.

suxiaoxiaomm commented 9 months ago

Hi @MichaHoffmann, thanks for the reply. The metric actually has 2 possible values: 1 or 0.

I am sure that at several points this metric's value is 0, which brings the ratio down to 99.954%.

I also tried this query: count_over_time(app_availability_5m[30d]). It should return around 8640 (the metric is recorded every 5 minutes, so 30 days should give 30 * 24 * (60 / 5) records). However, it sometimes returns around 630, as in the screenshot below: [image]

What I expected: [image]
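The arithmetic in this comment can be checked with a short sketch. It assumes exactly one sample every 5 minutes with no gaps; the figure of 4 zero-valued samples is an illustrative assumption (it is not stated in the thread) chosen because it reproduces the reported 99.954%.

```python
SCRAPE_INTERVAL_MIN = 5   # from the thread: one sample every 5 minutes
WINDOW_DAYS = 30          # from the query: [30d]

# Expected count_over_time result for a gap-free 30-day window.
expected_samples = WINDOW_DAYS * 24 * (60 // SCRAPE_INTERVAL_MIN)


def availability_pct(zero_samples: int, total_samples: int) -> float:
    """sum_over_time / count_over_time for a 0/1 metric, as a percentage."""
    return 100.0 * (total_samples - zero_samples) / total_samples


def covered_days(sample_count: int) -> float:
    """How many days of data a given sample count corresponds to."""
    return sample_count * SCRAPE_INTERVAL_MIN / (60 * 24)


print(expected_samples)                          # full window
print(round(availability_pct(4, 8640), 3))       # ASSUMPTION: 4 zero samples
print(covered_days(630))                         # window implied by ~630 samples
```

A count of about 630 corresponds to a little over 2 days of samples, which would be consistent with the result being computed only from the sidecars' roughly 2 days of local data, though the thread does not confirm this directly.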

I also tried applying the offset; the result is still not stable.

How does Thanos decide, among all its Store API endpoints, which values to return to the client? I still think the cause may be that the result comes from the Prometheus thanos-sidecar, which has incomplete data.

BTW, I have already disabled partial responses by passing "--no-labels.partial-response"/"--no-query.partial-response" to thanos-query-frontend and thanos-query, but it does not seem to help.

suxiaoxiaomm commented 9 months ago

I checked the debug log of thanos-query and found that I sometimes get the error below (NOT every time I get a wrong value; it is random, and sometimes there is no such error log but I still get a wrong result). There is NO error log on the thanos-storegateway side.

level=error ts=2023-12-18T13:28:07.580694458Z caller=proxy.go:299 component=proxy request="min_time:1700313720000 max_time:1702906080000 matchers:<name:\"__name__\" value:\"app_availability_5m\" > max_resolution_window:12000 aggregates:COUNT step:60000 range:2592000000 " err="fetch series for {cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-0\"},{cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-1\"},{prometheus_replica=\"observability-thanos-ruler-0\", ruler_cluster=\"aks-001\"} Addr: 172.20.7.167:10901 LabelSets: {cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-0\"},{cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-1\"},{prometheus_replica=\"observability-thanos-ruler-0\", ruler_cluster=\"aks-001\"} Mint: 1675296000012 Maxt: 1702886400000: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.20.7.167:10901: connect: connection refused\""
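The millisecond timestamps in this log line are easier to read as UTC dates. The sketch below only converts the four values copied from the log; the interpretation in the note after it is a reading of those values, not something the log states.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms: int) -> str:
    """Convert Unix epoch milliseconds to an ISO 8601 UTC string."""
    base = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return (base + timedelta(milliseconds=ms % 1000)).isoformat(timespec="milliseconds")


# Values copied from the log line above.
request_min, request_max = 1700313720000, 1702906080000   # min_time / max_time
store_mint, store_maxt = 1675296000012, 1702886400000     # Mint / Maxt of the failing endpoint

for name, ms in [("request min_time", request_min), ("request max_time", request_max),
                 ("store Mint", store_mint), ("store Maxt", store_maxt)]:
    print(f"{name:17s} {ms_to_utc(ms)}")

# Length of the time range the failing endpoint advertises, in days.
store_span_days = (store_maxt - store_mint) / 86_400_000
print(f"store span: {store_span_days:.1f} days")
```

The request covers roughly the 30 days before the log timestamp, while the endpoint that refused the connection advertises several hundred days of data. That looks more like an endpoint serving long-term blocks than a sidecar holding about 2 days, but this is an inference from the numbers only.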

MichaHoffmann commented 9 months ago

Can you maybe request raw data? I think you might be getting downsampled data from the store gateway right now.

suxiaoxiaomm commented 9 months ago

> Can you maybe request raw data? I think you might get downsampled data in storage gateway right now.

I did try querying raw data and still randomly got partial data back. As shown below, at the last point the value dropped to around 630, while it should normally be 8640.

image

image

MichaHoffmann commented 9 months ago

Which version of Thanos again? Is it the latest? Also, how often is this metric collected? If it's every 5 minutes, can you try a bigger offset maybe? Is it always the "latest" datapoint in the graph?

suxiaoxiaomm commented 9 months ago

No, it is not the latest. Thanos version: thanos:0.29.0-scratch-r0

> Also how often is this metric collected?

This metric is collected every 5 minutes.

Yes, I can try a bigger offset, such as 1h. But can you explain why the offset would help in this situation?

> Is it always the "latest" datapoint in the graph?

No, sometimes it is in the middle, then it goes back to normal.

MichaHoffmann commented 9 months ago

Can you use the latest version? I thought the offset might help in cases where recent data is not yet fully collected. If it's sometimes in the middle, then that theory doesn't hold, sadly.

suxiaoxiaomm commented 9 months ago

@MichaHoffmann I observed that thanos-storegateway got OOMKilled multiple times because too much data was loaded. I guess that because of this, thanos-query ignored thanos-storegateway as an endpoint and only used the values returned by the Prometheus thanos-sidecar.

Is there any configuration to specify that, for this expression, I only want thanos-storegateway as my data source? If thanos-storegateway is down, I can accept No Data as the result.

MichaHoffmann commented 9 months ago

> @MichaHoffmann I observed that thanos-storegateway got OOMKilled multiple times because too much data was loaded. I guess that because of this, thanos-query ignored thanos-storegateway as an endpoint and only used the values returned by the Prometheus thanos-sidecar.
>
> Is there any configuration to specify that, for this expression, I only want thanos-storegateway as my data source? If thanos-storegateway is down, I can accept No Data as the result.

You can use store filtering: https://thanos.io/tip/components/query.md/#store-filtering or configure querier to not ask the store at all probably!
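As a rough sketch of the store filtering suggestion: the linked documentation describes a `storeMatch[]` parameter on the Querier HTTP API. The snippet below only builds such a request URL and does not send it. The Querier address `http://thanos-query:10902` and the store gateway address `thanos-storegateway:10901` are hypothetical, and matching on `__address__` is an assumption that should be checked against the docs for the Thanos version in use.

```python
from urllib.parse import urlencode

# HYPOTHETICAL addresses, for illustration only.
QUERIER_URL = "http://thanos-query:10902"
STORE_GATEWAY_ADDR = "thanos-storegateway:10901"

# The ratio query from the original question.
PROMQL = (
    "(sum_over_time(app_availability_5m[30d]))"
    "/count_over_time(app_availability_5m[30d])"
)


def build_query_url(promql: str, store_addr: str) -> str:
    """Build an instant-query URL restricted to one Store API endpoint."""
    params = [
        ("query", promql),
        # ASSUMPTION: restrict the fan-out to the store with this address.
        ("storeMatch[]", f'{{__address__="{store_addr}"}}'),
        # Ask for an error instead of a partial result if the store fails.
        ("partial_response", "false"),
    ]
    return f"{QUERIER_URL}/api/v1/query?{urlencode(params)}"


url = build_query_url(PROMQL, STORE_GATEWAY_ADDR)
print(url)
```

With the filter in place and partial responses disabled, a request made while the store gateway is unreachable would be expected to fail rather than silently fall back to the sidecars, which matches the "No Data is acceptable" requirement above.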

yeya24 commented 9 months ago

> Is there any configuration to specify that, for this expression, I only want thanos-storegateway as my data source? If thanos-storegateway is down, I can accept No Data as the result.

I suggest that you run a new Thanos Query deployment that points to the Store Gateway only.