syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner :rocket: The best way to manage Kubernetes clusters on Hetzner, fully declarative, Kubernetes-native and with self-healing capabilities
https://caph.syself.com

Difficulties of stabilizing environment after HCloud rate limit has been reached #1331

Closed. janiskemper closed this issue 3 months ago.

janiskemper commented 3 months ago

/kind bug

What steps did you take and what happened: We have an environment that constantly stays at the rate limit.

What did you expect to happen: The environment should stabilize after the rate limit has been reached.

Anything else you would like to add: We should do the following:

  1. Log all API calls to HCloud and see whether there are any unexpected calls or, in the worst case, some loop that makes us reach the rate limit quickly. If this is the root cause of our issue, we don't have to change anything in the rate-limit handling.
  2. Try to reproduce the following situation: we get only a few API requests back, and these requests are immediately consumed again by our reconciling objects.
  3. If the behavior in 2. can be reproduced, we can avoid it by implementing a longer requeue timeout as well as randomness (jitter) in the reconciling of the objects, so that not all of them reconcile immediately after the rate limit is lifted (see the sketch below).
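
A minimal sketch of what point 3 could look like in a controller-runtime reconciler. The `rateLimitRequeueBase` value and the `requeueAfterRateLimit` helper are illustrative assumptions, not CAPH's actual implementation:

```go
package controllers

import (
	"math/rand"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// rateLimitRequeueBase is an illustrative value; the real timeout would have
// to be tuned against the replenish rate of the account.
const rateLimitRequeueBase = 5 * time.Minute

// requeueAfterRateLimit returns a Result with a randomized delay so that
// objects which all failed on the same rate limit do not reconcile again at
// exactly the same moment once the limit is lifted.
func requeueAfterRateLimit() ctrl.Result {
	jitter := time.Duration(rand.Int63n(int64(rateLimitRequeueBase))) // up to +100% of the base
	return ctrl.Result{RequeueAfter: rateLimitRequeueBase + jitter}
}
```
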
apricote commented 3 months ago

All calls made through hcloud-go should be visible through metrics, too, with their endpoint and response code. So if you already scrape the metrics of the provider pods, you should already have some data on this.
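
For reference, a minimal sketch of how those metrics get exposed when the hcloud-go client is constructed. It assumes the `WithInstrumentation` client option (the exact registry argument type differs between hcloud-go v1 and v2), and the `/metrics` handler wiring is illustrative:

```go
package main

import (
	"net/http"
	"os"

	"github.com/hetznercloud/hcloud-go/v2/hcloud"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Registry that hcloud-go reports its request metrics into
	// (e.g. hcloud_api_requests_total, labeled by endpoint and status code).
	registry := prometheus.NewRegistry()

	_ = hcloud.NewClient(
		hcloud.WithToken(os.Getenv("HCLOUD_TOKEN")),
		hcloud.WithInstrumentation(registry),
	)

	// Expose the registry so Prometheus can scrape it.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":8080", nil)
}
```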

The rate limit might also be consumed by other services like the CCM. We are aware of at least one bug in HCCM that can cause a tight request loop, and I am currently investigating it (no GH issue yet).

janiskemper commented 3 months ago

@apricote thanks for the input. Can you give us more info on the issue you are investigating? We are still searching for the root cause of why we run into the rate limit and haven't found it yet. Or maybe some info on how we can see the loop in the CCM? Is it visible through normal logs?

janiskemper commented 3 months ago

@apricote another thing: you told me in the past that you ran into HCloud rate limits in your system. Did your system stabilize itself eventually, or did you have to intervene manually and maybe shut something down?

I fear that HCloud always gives us back a few requests, and those are immediately consumed by the n objects that need API calls. Therefore, we always run into the rate limit again.

apricote commented 3 months ago

So far I have a bunch of customer HCCMs that call GET /v1/servers/{id} for some of the servers every few seconds. The default rate limit is 1 req/s, so this quickly burns through the allowed requests. I have not seen anything in the logs yet that indicates this; the metrics should show it, though. Still looking into the extent and possible causes of this. Will keep you updated once I know more.
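
To put illustrative numbers on this (not measured from a real system): polling just 10 servers every 5 seconds already produces 2 requests per second, double the default replenish rate, before any other consumer in the cluster makes a single call.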

My system did stabilize after deploying CAPH with the fixes we built back then (#925, #933, #934).

For the rate limit, I really recommend scraping the metrics from HCCM, csi-driver and CAPH so you always know how many requests you send from inside the cluster, and possibly alerting on this. A query like `sum by (api_endpoint, job) (rate(hcloud_api_requests_total[60m]))` should show which service makes the requests and for which endpoint. Anything above 1 in total is going to cause issues (with the default rate limits for the account). Right now my CAPH scraping is actually broken, so I do not know how much it uses.

If your reconcile loop does more than one request, it might start to run, fail somewhere along the way, make no progress, and be restarted. Depending on the number of parallel reconciles and the backoff you have, this can maneuver you into a situation where you cannot recover without pausing the reconciles for a bit, so you can "save up" some rate limit to later burst and finish the reconciles.
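
A minimal sketch of what such a pause could look like inside a reconcile function, assuming hcloud-go's `hcloud.IsError` / `ErrorCodeRateLimitExceeded` helpers; the `ExampleReconciler` type, the `doHCloudCalls` helper and the `rateLimitPause` value are illustrative, not CAPH's actual error handling:

```go
package controllers

import (
	"context"
	"time"

	"github.com/hetznercloud/hcloud-go/v2/hcloud"
	ctrl "sigs.k8s.io/controller-runtime"
)

// rateLimitPause is an illustrative value: long enough to let the token
// bucket refill before the object is reconciled again.
const rateLimitPause = 10 * time.Minute

type ExampleReconciler struct {
	// The hcloud client, Kubernetes client, etc. would live here.
}

// doHCloudCalls stands in for the HCloud API calls made during a reconcile.
func (r *ExampleReconciler) doHCloudCalls(ctx context.Context, req ctrl.Request) error {
	return nil
}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.doHCloudCalls(ctx, req); err != nil {
		// When the API reports the rate limit, do not keep retrying with the
		// default exponential backoff; requeue after a long pause instead so
		// the remaining budget can be spent on finishing other reconciles.
		if hcloud.IsError(err, hcloud.ErrorCodeRateLimitExceeded) {
			return ctrl.Result{RequeueAfter: rateLimitPause}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```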

janiskemper commented 3 months ago

Thanks for the info @apricote :) It sounds plausible that in a larger cluster we quickly go beyond the one request per second if everything reconciles at once. I think we should implement some random backoff time so that not all services call the API at the same time after the rate limit is lifted.

apricote commented 3 months ago

For the rate limit there are three relevant numbers: Bucket Size, Current Bucket and Replenish Rate. These are set to 3600 requests, 3600 requests and 1 request/s by default.
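
As an illustrative calculation (the 2 req/s load is made up): with a full bucket of 3600 requests and a combined load of 2 requests per second from all services, the bucket drains at a net 1 request per second and is empty after 3600 seconds, i.e. about an hour. From then on only the 1 request per second replenish rate is available until the combined load drops below it again.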

janiskemper commented 3 months ago

We had a race condition that is solved in https://github.com/syself/cluster-api-provider-hetzner/pull/1347. It happened because we use a ClusterClass and changed the spec of the HetznerCluster that is managed by the ClusterClass.

CAPI acted kind of like a GitOps controller that wanted to get back to the original state, so the two controllers kept changing the spec of the HetznerCluster back and forth.

This should be fixed now, so I don't think there is any (other) issue with stabilizing the environment!