grafana / athena-datasource

Apache License 2.0

Retry / rate-limit queries that failed due to S3 throttling #320

Open skuzzle opened 2 months ago

skuzzle commented 2 months ago

Is your feature request related to a problem? Please describe. We sometimes see S3 throttling errors in the UI on some of our dashboards. This happens even for queries that are already cached via Athena's query result reuse feature. I understand these errors may be rooted in sub-optimal partitioning/data layout in our Athena setup (which we are unable to change at the moment). However, I think throttling can naturally occur when there is a lot of data to crawl through. As I understand S3, throttling happens while S3 is scaling up to the number of concurrent requests it needs to handle, so it signals the client to slow down its request rate. The Grafana Athena datasource currently does not handle this situation gracefully.

Describe the solution you'd like If my understanding of S3 throttling is correct, queries that fail because of S3 throttling should be retried client-side with a backoff mechanism. I understand that introducing a rate limit might not be straightforward, as it likely requires tracking some global state on the Grafana server.

Describe alternatives you've considered Sadly, I've found no alternatives yet. In a perfect world Athena itself would handle this situation more gracefully, but we have found no configuration options for that.

Additional context We have automation in place that tests all of our dashboards' Athena queries against Grafana's /api/ds/query endpoint. These tests hit the same throttling issues, and we were able to overcome them by adding a retry mechanism and stepwise lowering the request rate.

iwysiu commented 2 months ago

Hi @skuzzle, thanks for the feature request! I looked into it and can understand it being an issue, though most of the advice I've seen involves changing the Athena configuration rather than the querying. I'll move it into the backlog for us to consider.