Open suxiaoxiaomm opened 9 months ago
Hey,
The way the query is written, I would expect it to always be 1 (if the metric always has the value 1), right? Can you try giving it an "offset 10m" to see whether that fixes the instability?
```
(sum_over_time(app_availability_5m[30d] offset 10m)) / count_over_time(app_availability_5m[30d] offset 10m)
```
But generally, I don't think I understand the query.
Hi @MichaHoffmann, thanks for the reply. The metric actually has two possible values: 1 or 0.
I am sure there are several points where this metric's value is 0, which makes the ratio come out to 99.954%.
Actually, I tried this query: count_over_time(app_availability_5m[30d]). It should return something around 8640 (the metric is recorded every 5 minutes, so over 30 days there should be 30 × 24 × (60 / 5) = 8640 records). However, it sometimes returns around 630, as in the snapshot below:
What I expected:
I also tried applying the offset; the result is still not stable.
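For reference, the expected and observed sample counts can be checked with quick arithmetic. This is just a sketch; the comparison to a ~2-day window assumes the sidecar retains roughly the last two days of data, as discussed later in this thread:

```python
# Sanity-check of the expected vs. observed sample counts.
# Assumption (from the thread): one sample every 5 minutes, 30-day window.
SCRAPE_INTERVAL_MIN = 5

expected = 30 * 24 * (60 // SCRAPE_INTERVAL_MIN)  # 30 days x 24 h x 12 samples/h
print(expected)  # 8640

# The ~630 samples sometimes returned would only cover:
observed = 630
days_covered = observed * SCRAPE_INTERVAL_MIN / (60 * 24)
print(round(days_covered, 1))  # 2.2 -- suspiciously close to a ~2-day retention window
```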
I am wondering how Thanos decides, among all its Store API endpoints, which values to return to the client. I still think the cause may be that the result comes from the Prometheus Thanos sidecar, which has incomplete data.
BTW, I have already disabled partial responses by passing "--no-labels.partial-response" / "--no-query.partial-response" to thanos-query-frontend and thanos-query, but it does not seem to help.
I checked the debug log for thanos-query and found that I sometimes get the error below (NOT every time I get wrong values, but randomly; sometimes there is no such error log and I still get a wrong result). There is NO error log on the thanos-storegateway side:
```
level=error ts=2023-12-18T13:28:07.580694458Z caller=proxy.go:299 component=proxy request="min_time:1700313720000 max_time:1702906080000 matchers:<name:\"__name__\" value:\"app_availability_5m\" > max_resolution_window:12000 aggregates:COUNT step:60000 range:2592000000 " err="fetch series for {cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-0\"},{cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-1\"},{prometheus_replica=\"observability-thanos-ruler-0\", ruler_cluster=\"aks-001\"} Addr: 172.20.7.167:10901 LabelSets: {cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-0\"},{cluster=\"aks-001\", prometheus_replica=\"prometheus-observability-1\"},{prometheus_replica=\"observability-thanos-ruler-0\", ruler_cluster=\"aks-001\"} Mint: 1675296000012 Maxt: 1702886400000: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.20.7.167:10901: connect: connection refused\""
```
Can you maybe request raw data? I think you might get downsampled data in storage gateway right now.
I did try running the query against raw data and still got randomly partial data. As below: at the last point you can see the value dropped to around 630, when it normally should be 8640.
Which version of Thanos again? Is it the latest? Also, how often is this metric collected? If it's every 5 minutes, can you try a bigger offset? Is it always the "latest" datapoint in the graph?
No, it is not the latest. Thanos version: thanos:0.29.0-scratch-r0.
> Also, how often is this metric collected?

This metric is collected every 5 minutes.
And yes, I can try a bigger offset, such as 1h. But can you explain why the offset would help in this situation?
> Is it always the "latest" datapoint in the graph?

No, sometimes it is in the middle, then back to normal.
Can you use the latest version? I thought the offset might help in cases where recent data is not yet fully collected, but if it's sometimes in the middle, then that theory sadly doesn't hold.
@MichaHoffmann I observed that the thanos-storegateway got OOMKilled multiple times because it loaded too much data. I guess that because of this, thanos-query ignored thanos-storegateway as an endpoint and thus only used the values returned by the Prometheus thanos-sidecar.
I am wondering if there is any configuration to specify that, for this expression, I only want thanos-storegateway as my data source. If thanos-storegateway is down, I can accept "No Data" as the result.
You can use store filtering: https://thanos.io/tip/components/query.md/#store-filtering, or probably configure the querier to not ask the other stores at all!
> I am wondering if there is any configuration to specify that, for this expression, I only want thanos-storegateway as my data source. If thanos-storegateway is down, I can accept "No Data" as the result.
I suggest that you run a separate Thanos Query deployment that points to the Store Gateway only.
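A minimal sketch of what such a dedicated deployment could look like. All names, the image tag, and the endpoint address are placeholders to adapt to your environment; the key point is that the `--endpoint` flags list only the store gateway, omitting the sidecars and ruler:

```yaml
# Hypothetical container spec for a store-gateway-only querier.
containers:
  - name: thanos-query-storeonly
    image: quay.io/thanos/thanos:v0.34.0   # placeholder version
    args:
      - query
      - --http-address=0.0.0.0:9090
      - --grpc-address=0.0.0.0:10901
      # Point ONLY at the store gateway; no sidecar or ruler endpoints,
      # so results can never mix in the sidecar's short-retention data.
      - --endpoint=dnssrv+_grpc._tcp.thanos-storegateway.monitoring.svc
```

Grafana can then use this querier as a dedicated data source for long-range queries; if the store gateway is down, the query simply returns no data instead of silently falling back to the sidecar.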
Hi Experts, I have a range query for the last 30 days, like this:
```
(sum_over_time(app_availability_5m[30d])) / count_over_time(app_availability_5m[30d])
```
When I query from the Grafana UI, the returned value is not stable: sometimes it is 100.000%, and a few seconds later, when I click Run Queries again, it returns 99.954%.
My Thanos query has basically 4 endpoints: thanos-storegateway, thanos-ruler, and 2 replicas of Prometheus (running thanos-sidecar). I am thinking that when the value is 100.000%, it is the result from Prometheus (thanos-sidecar), as it only contains the last 2 days' data; and when the value is 99.954%, it is the result from thanos-storegateway, which truly contains the last 30 days' data. Obviously, 99.954% is the correct result that I expect.
I am wondering whether my assumption is correct. How can I ensure I only get the result from thanos-storegateway?
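For what it's worth, the two observed values are numerically consistent with this theory: a window containing no zero-valued samples yields exactly 100%, while a handful of zeros across the full 30-day window (about 4 out of 8640, a hypothetical count chosen for illustration) reproduces the 99.954% figure:

```python
# Rough consistency check for the two observed ratios.
# The zero count (4) is a hypothetical illustration, not taken from real data.
total = 8640                             # full 30-day window at one sample per 5 min
zeros = 4                                # assumed number of samples where the app was down
print(f"{(total - zeros) / total:.3%}")  # 99.954%
print(f"{total / total:.3%}")            # 100.000% -- what a zero-free window returns
```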
Thanks for your help!