thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos fails to calculate an instant result on downsampled data #6573

Open konstantin-tkachuk opened 1 year ago

konstantin-tkachuk commented 1 year ago

I have recently deployed Thanos with the goal of being able to use long term storage for metrics. I am using this chart to deploy Thanos: https://github.com/bitnami/charts/tree/main/bitnami/thanos

  1. Chart: 12.6.2
  2. Thanos: docker.io/bitnami/thanos:0.31.0-scratch-r5
  3. Object Storage: S3 in AWS

Instead of Prometheus I am using Thanos Receive and Grafana Agent.

Thanos query arguments:

    query
    --log.level=info
    --log.format=logfmt
    --grpc-address=0.0.0.0:10901
    --http-address=0.0.0.0:10902
    --query.replica-label=replica
    --endpoint=dnssrv+_grpc._tcp.thanos-storegateway.thanos.svc.cluster.local
    --endpoint=dnssrv+_grpc._tcp.thanos-receive.thanos.svc.cluster.local
    --query.auto-downsampling

What happened: Goal: I wish to show the total number of requests that happened in a time period of 1d.

Scraping seems to work fine, the data is written to S3 and I can see the downsampled blocks if I open the Bucketweb UI:

[image]

When I try to query the downsampled data using Thanos Query in Graph form it also works fine. The upper graph is for recent data, when raw data is still available. The bottom graph is for data that is only available in 5m downsampled blocks.

[Screenshot 2023-07-31 at 15 35 48]

You can see that the data is available in downsampled form and Thanos is able to visualise it.

However, I am interested in the Table view (in Grafana it would be an "Instant" query) - if I simply switch to the Table tabs, the following is shown:

[image]

You can see that for the query where raw data is available, Thanos successfully calculates the total number of requests over the past 1d, as specified in the query. However, for the bottom query, where only 5m downsampled data is available, it only shows "Empty query result".

What you expected to happen: Instead of empty query result I expect to see the calculated result of the query.

(I see this behaviour fully reflected in Grafana as well, when I am trying to show the data in a table there, independent of which min step I configure for the query there.)

What is going wrong here? Have I misconfigured something or is this a bug? Or is my expectation for this to work wrong and this is some inherent limitation of Thanos that I am not yet aware of? If so, can you suggest a workaround or alternative means to achieve my desired goal? I feel like being able to check the total number of requests per month over the past year is an absolutely standard use case of long term storage for metrics and should be supported somehow.

How to reproduce it (as minimally and precisely as possible): I assume it is enough to run a sum(increase(...)) query as an instant query on data that is only available in downsampled form.

Full logs to relevant components: I don't see any errors in any of the Thanos components.

MichaHoffmann commented 1 year ago

Hey,

Instead of sum(increase(...)) can you try another range vector query like max_over_time(X[1d]); just for the sake of debugging.

Also, is it possible to provide the raw data to ease debugging and reproducing? (Like it was done here: https://github.com/thanos-io/thanos/issues/2401#issuecomment-619061475)

fpetkovski commented 1 year ago

I see that the range query chart for downsampled data gets cut off after a certain point. Are you able to see the full range or do you also have problems with range queries?

konstantin-tkachuk commented 1 year ago

Hi,

thanks for the response. Based on your answer I assume that it should work. 😃

> Instead of sum(increase(...)) can you try another range vector query like max_over_time(X[1d])

It behaves the same way: it shows the data in a graph, but not in the table view.

> I see that the range query chart for downsampled data gets cut off after a certain point. Are you able to see the full range or do you also have problems with range queries?

It's just the way the screenshot cuts off; the data is fully available.

> Also, is it possible to provide the raw data to ease debugging and reproducing?

I have looked into this a bit and hope the following links will help with debugging:

  1. Link to the Thanos Query UI (…&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=sum(increase(http_response_timers_count%7Bnamespace%3D~%22deliveryplatform-multitenant-caas%22%2C%20database%3D%22all%22%2C%20pod%3D~%22caas-rest-api.*%22%2C%20collection%3D%22all%22%2C%20code%3D~%22%5C%5Cdxx%22%2C%20type%3D%22COLLECTION%22%7D%5B1d%5D))&g1.tab=0&g1.stacked=0&g1.range_input=1w&g1.max_source_resolution=5m&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D&g1.end_input=2023-08-07%2013%3A02%3A21&g1.moment_input=2023-08-07%2013%3A02%3A21) in our DEV environment (which is configured the same as the prod one and has the same problem). You can access it with admin/github. I already added similar queries as seen in the screenshots.
  2. I have chosen one of the 5m downsampled blocks, downloaded it from S3 and made it available here. I hope you can somehow import this to your system; if anything is missing please don't hesitate to ask.

Thank you for taking the time to look into this! I hope the above helps narrow it down.

MichaHoffmann commented 1 year ago

Oh wow, that's perfect, thank you! I'll have a look in a bit.

fpetkovski commented 1 year ago

So the resolution is calculated based on the query range step, the formula is simply step size / 5. For the range query the resolution is 2490s, and the requested resolution is 5m. For an instant query, we don't really have a step and I think Thanos tries to query the raw resolution. This could in fact be a gap in the current implementation for downsampling and in this case we might want to estimate the resolution based on the selector range.
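The rule described above can be sketched in a few lines. This is only an illustration of the `step size / 5` formula from the comment, not Thanos' actual code: the function names are hypothetical, the step sizes are made up, and the fallback to raw resolution for instant queries is an assumption based on the behaviour reported in this thread.

```python
# Resolutions Thanos stores, in seconds: raw (0), 5m and 1h downsampled blocks.
RESOLUTIONS = (0, 300, 3600)


def max_source_resolution(step_seconds=None):
    """Coarsest resolution a query may read, using the step / 5 rule.

    Assumption: an instant query has no step, so it falls back to 0,
    which means only raw data is eligible.
    """
    if step_seconds is None:
        return 0.0
    return step_seconds / 5


def usable_resolutions(max_resolution):
    """Stored resolutions that are no coarser than max_resolution."""
    return [r for r in RESOLUTIONS if r <= max_resolution]


if __name__ == "__main__":
    # Hypothetical step sizes in seconds; None stands for an instant query.
    for step in (None, 60, 1500, 18000):
        print(step, usable_resolutions(max_source_resolution(step)))
```

With no step, only the raw resolution is eligible, which would explain an empty result once the raw blocks have aged out of retention.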

MichaHoffmann commented 1 year ago

If I replay the query in the browser and add the "max_source_resolution=300000" query parameter, it returns data:

    {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1691413341,"4042.0929376461977"]}]}}
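For anyone who wants to replay this outside the browser, here is a minimal sketch that builds such an instant query URL. The host name is a placeholder, the expression is a shortened stand-in for the real one, and `instant_query_url` is a hypothetical helper; only the `/api/v1/query` path, the `max_source_resolution` parameter and its value come from this thread.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, expr, time, max_source_resolution=None):
    """Build an instant query URL, optionally pinning max_source_resolution."""
    params = {"query": expr, "time": time}
    if max_source_resolution is not None:
        params["max_source_resolution"] = max_source_resolution
    return f"{base_url}/api/v1/query?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host; 10902 is the --http-address port from the issue.
    base = "http://thanos-query.example:10902"
    # Shortened stand-in for the expression used in the issue.
    expr = "sum(increase(http_response_timers_count[1d]))"
    print(instant_query_url(base, expr, 1691413341))
    print(instant_query_url(base, expr, 1691413341, "300000"))
```

Comparing the responses for the two printed URLs shows whether the explicit parameter is what makes the downsampled data visible.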

MichaHoffmann commented 1 year ago

There is this option in the querier:

    instantDefaultMaxSourceResolution := extkingpin.ModelDuration(
        cmd.Flag(
            "query.instant.default.max_source_resolution",
            "default value for max_source_resolution for instant queries. If not set, defaults to 0s only taking raw resolution into account. 1h can be a good value if you use instant queries over time ranges that incorporate times outside of your raw-retention.",
        ).Default("0s").Hidden(),
    )

konstantin-tkachuk commented 1 year ago

> There is this option in the querier:

@MichaHoffmann are you saying query.instant.default.max_source_resolution is a hidden command line flag I should attempt to set and see whether the behaviour changes? Should I also remove the query.auto-downsampling then or is that still needed for the range queries?

MichaHoffmann commented 1 year ago

Yeah, it would be interesting to see if it works with the flag. Auto-downsampling is fine to keep, I think.

konstantin-tkachuk commented 1 year ago

> Yeah, would be interesting if it works with the flag. Autodownsampling is fine to keep i think.

Change is live and at first look it seems to have solved the problem! First look into Grafana seems to also show that it works correctly now. Thank you! 👍

For me this seems to be a solution to my problem, but I'm curious about the next steps on Thanos side. Will this extra parameter become the "intended behaviour", in which case I recommend that you update the documentation here or will there be some changes in a future version so that it works out of the box?

konstantin-tkachuk commented 1 year ago

I'm going to remove access to the links with the reproduction data sometime tomorrow. Please download what you might need before then. Thanks again!

MichaHoffmann commented 1 year ago

Thanks for the great repro. I don't know yet how to address the issue properly.

mrliptontea commented 8 months ago

Hi,

Ran into this today. Setting --query.instant.default.max_source_resolution=1h fixed it for instant queries, but only if --query.auto-downsampling was enabled too, which I found curious.

However, I still see this issue when querying label values, e.g. when Grafana dashboards are looking up variables/filters.

For example, request:

/api/v1/label/cluster/values?match[]=up{job%3D"kube-state-metrics"}&end=1693562400

returns no values, despite the fact that this query:

/api/v1/query?query=up{job%3D"kube-state-metrics"}&time=1693562400

does.
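To make the comparison repeatable, the two requests can be generated from the same matcher and timestamp. This is a sketch only: the host name is a placeholder and both helper functions are hypothetical; the API paths, label name, matcher and timestamp are the ones quoted above.

```python
from urllib.parse import urlencode


def label_values_url(base_url, label, matcher, end):
    """URL for the label values lookup, restricted by a series matcher."""
    query = urlencode({"match[]": matcher, "end": end})
    return f"{base_url}/api/v1/label/{label}/values?{query}"


def instant_query_url(base_url, expr, time):
    """URL for an instant query evaluated at the given timestamp."""
    query = urlencode({"query": expr, "time": time})
    return f"{base_url}/api/v1/query?{query}"


if __name__ == "__main__":
    base = "http://thanos-query.example:10902"  # placeholder host
    matcher = 'up{job="kube-state-metrics"}'
    timestamp = 1693562400
    print(label_values_url(base, "cluster", matcher, timestamp))
    print(instant_query_url(base, matcher, timestamp))
```

Both URLs carry the same selector and the same point in time, so any difference in the responses comes from how each endpoint handles downsampled data.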

This makes it hard to view historical graphs, because I need to ensure the time range ends in the past 48h where I still have raw data, otherwise all filters and panels will show "No data".

I tried Thanos from v0.28.0 to v0.33.0 plus various combinations of flags, but nothing seems to help.