Closed by janiskemper 3 months ago
All calls made through hcloud-go should also be visible through metrics, with their endpoint and response code. So if you already scrape the metrics of the provider pods, you should already have some data on this.
The rate limit might also be consumed by other services like CCM. We are aware of at least one bug in hccm that can cause some tight request loop and I am currently investigating it (no GH issue yet).
@apricote thanks for the input. Can you give us more info about the issue you are investigating? We are still searching for the root cause of why we run into the rate limit and haven't found it yet. Or maybe some info on how we can see the loop in CCM? Is it visible through normal logs?
@apricote another thing: you told me in the past that you had HCloud rate limits in your system. Did your system eventually stabilize itself? Or did you have to intervene manually and maybe shut something down?
Because I fear that HCloud always gives us a few requests back, and those requests are immediately consumed by the n objects that need API calls. Therefore, we always run into the rate limit again.
So far I have a bunch of customer HCCMs that call GET /v1/servers/{id} for some of the servers every few seconds. The default rate limit is 1 req/s, so this quickly burns through the allowed requests. I have not yet seen anything in the logs that indicates this, but the metrics should show it. I am still looking into the extent and possible causes and will keep you updated once I know more.
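To illustrate how quickly such a polling loop drains the account's request budget, here is a minimal sketch of the token-bucket arithmetic. The 3600-request bucket and 1 req/s refill are the documented defaults mentioned below; the 2 req/s polling rate is a hypothetical number for illustration.

```go
package main

import "fmt"

// drainSeconds models a token bucket: starting from a full bucket of size
// `bucket`, with requests arriving at reqPerSec and tokens refilling at
// refillPerSec, it returns how many seconds until the bucket is empty
// (-1 if the request rate never outpaces the refill).
func drainSeconds(bucket, reqPerSec, refillPerSec float64) float64 {
	net := reqPerSec - refillPerSec
	if net <= 0 {
		return -1 // refill keeps up; the bucket never drains
	}
	return bucket / net
}

func main() {
	// Hypothetical: controllers polling GET /v1/servers/{id} at a combined
	// 2 req/s against the default 1 req/s refill and 3600-request bucket.
	fmt.Printf("bucket empty after %.0f s\n", drainSeconds(3600, 2, 1))
	// → bucket empty after 3600 s
}
```

So even a modest overshoot (2 req/s vs. 1 req/s) exhausts a full bucket within an hour, after which every further request is rate limited.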
My system did stabilize after deploying CAPH with the fixes we built back then (#925, #933, #934).
For the rate limit, I really recommend scraping the metrics from HCCM, csi-driver and CAPH so you always know how many requests you send from inside the cluster, and possibly alerting on them. A query like `sum by (api_endpoint, job) (rate(hcloud_api_requests_total[60m]))` should show which service makes the requests and for which endpoint. Anything above 1 in total is going to cause issues (with the default rate limits for the account). Right now my CAPH scraping is actually broken, so I do not know how much it uses.
If your reconcile loop does more than one request, it might start to run, fail partway through, make no progress, and be restarted. Depending on the number of parallel reconciles and the backoff you use, this can maneuver you into a situation where you cannot recover without pausing the reconciles for a bit, so you can "save up" some rate limit to later burst and finish the reconciles.
thanks for the info @apricote :) it sounds plausible that in a larger cluster we quickly go beyond one request per second if we reconcile everything at once. I think we should implement some random backoff so that not all services call the API at the same time after the rate limit is lifted.
For the rate limit there are three relevant numbers: bucket size, current bucket and replenish rate. By default these are set to 3600 requests, 3600 requests and 1 request/s.
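One consequence of these defaults worth spelling out: once the bucket is fully drained, recovery is slow. A minimal sketch of the recovery time, using the default numbers from above:

```go
package main

import "fmt"

// secondsToRefill returns how long a fully drained token bucket needs to
// fill back up, given the bucket size and the replenish rate.
func secondsToRefill(bucketSize, replenishPerSec float64) float64 {
	return bucketSize / replenishPerSec
}

func main() {
	// Default HCloud numbers: 3600-request bucket, 1 request/s replenish.
	fmt.Printf("full refill takes %.0f s\n", secondsToRefill(3600, 1))
	// → full refill takes 3600 s
}
```

So a completely burned bucket takes a full hour to recover, and any requests sent during that hour push the recovery point further out, which matches the "pause the reconciles to save up rate limit" advice above.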
We had a race condition that is solved in https://github.com/syself/cluster-api-provider-hetzner/pull/1347. It happened because we use the ClusterClass and changed the spec of the HetznerCluster that is managed by the ClusterClass.
CAPI acted kind of like a GitOps controller that wanted to get back to the original state, so the two controllers changed the spec of the HetznerCluster back and forth.
This should be fixed now, so I don't think there is any (other) issue with stabilizing the environment!
/kind bug
What steps did you take and what happened: We have an environment that constantly stays in the rate limit.
What did you expect to happen: The system should stabilize after the rate limit has been reached
Anything else you would like to add: We should do the following: