prometheus-community / stackdriver_exporter

Google Stackdriver Prometheus exporter
Apache License 2.0

Error 429: Query aborted. Please reduce the query rate. #304

Open skykwame opened 6 months ago

skykwame commented 6 months ago

We are getting a few ...

ts=2024-01-29T10:29:44.926Z caller=monitoring_collector.go:212 level=error msg="Error while getting Google Stackdriver Monitoring metrics" err="googleapi: Error 429: Query aborted. Please reduce the query rate., rateLimitExceeded

Could you introduce an argument that allows a short sleep between API calls to ease the pressure on the backend?

Probably somewhere here https://github.com/prometheus-community/stackdriver_exporter/blob/master/collectors/monitoring_collector.go#L323

Google is looking into it, as we don't seem to be hitting our quotas. However, after making changes to your source, adding a slight delay between API calls seems to help.

kgeckhart commented 2 months ago

I'm not sure that adding an arbitrary sleep is the best way to combat being rate limited. At the moment, the GCP client being used has a handful of params that can be helpful for tuning here: https://github.com/prometheus-community/stackdriver_exporter/blob/master/stackdriver_exporter.go#L66-L84

I would suggest adjusting the retry-statuses to include a 429 which will give you a delay + retries in the event of a rate limit. FYI @hspens since you commented on the closed PR.

brodin commented 1 month ago

Seeing similar issues, is there anything in the configuration that I can tweak? 🙏

kgeckhart commented 1 month ago

Running with --stackdriver.retry-statuses 503 --stackdriver.retry-statuses 429 will include 429s in the default retry policy. I'm not sure how much that will help, though. Depending on which API is giving you a 429, you might need to go further with your tuning.

GCP's monitoring API docs do not have a published quota and instead suggest using the quota dashboard. If you can determine which quota is being breached, it would help with tuning. For example, a breach of "Time series queries per minute" would indicate you should increase your scrape interval.