Reducing hcloud API calls for hcloudmachines that are up and running

janiskemper commented 3 weeks ago

/kind proposal

Sometimes we hit the rate limit because the caph controller makes too many calls to the hcloud API.

Checks we do for a running hcloudmachine

validate that labels are still correctly set and let the machine fail if not
server status is off and gets switched on
server doesn't exist anymore, so it gets created again (this is actually a problem in itself: we don't want to re-create a server for the same machine object if the first one was deleted)
server is not attached to network anymore and gets attached again
server (control plane) is not a target of the load balancer anymore and gets added again

All of these checks react to an action by the user. The user removes a label from the server in the HCloud UI, and our validation fails. The user deletes a server manually, and we notice. The user removes the server from the load balancer or the network, and we add it again.

These are potentially valid use cases - the question is whether they are so relevant that we need to keep them.

Possible ways of handling these checks

One extreme (the current behavior): Make all API calls to check everything in every reconcile loop.

The other extreme: Stop doing any API calls once the server is up and running. If something is wrong with the server, the Machine Health Checks should discover that. We don't do anything if the user actively misconfigures something and for example removes a server from the load balancer.

Middle way 1: Do specific checks and stop doing all others. For example, we could stop checking that the server is part of the network and continue checking that it is added as a target to the load balancer. Any combination of the checks that are important to us is possible.
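A minimal sketch of how such a selection could look, assuming each check is wrapped in a small function. The `check` type, `runEnabled`, and the check names are hypothetical and not taken from the caph code base; the no-op functions stand in for real hcloud API calls.

```go
package main

import "fmt"

// check is a hypothetical wrapper around one verification of a running server.
// In a real controller, run would make the corresponding hcloud API call.
type check struct {
	name string
	run  func() error
}

// runEnabled executes only the checks whose name is enabled and
// returns how many checks (and therefore API calls) were actually made.
func runEnabled(checks []check, enabled map[string]bool) (int, error) {
	ran := 0
	for _, c := range checks {
		if !enabled[c.name] {
			continue // skipped: no API call for this check
		}
		ran++
		if err := c.run(); err != nil {
			return ran, fmt.Errorf("check %q failed: %w", c.name, err)
		}
	}
	return ran, nil
}

func main() {
	noop := func() error { return nil } // placeholder for a real API call
	checks := []check{
		{name: "labels", run: noop},
		{name: "power", run: noop},
		{name: "network", run: noop},
		{name: "loadbalancer", run: noop},
	}
	// Keep only the load balancer check, as in the example above.
	enabled := map[string]bool{"loadbalancer": true}
	ran, err := runEnabled(checks, enabled)
	fmt.Println(ran, err)
}
```

With this shape, the set of enabled checks could come from a flag or from the cluster spec, so different users can pick the combination that matters to them.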

Middle way 2: Heavily cache API calls once a server is running. We could also use a cache so that we do not call the API as regularly. If something goes wrong, we would notice it later, but we would notice it eventually.
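To make the caching idea concrete, here is a rough sketch of a TTL cache in front of a server status lookup. Everything in it is an assumption for illustration: `statusCache`, `newStatusCache`, the `fetch` callback (which stands in for the real hcloud API call), and the five-minute `defaultTTL` are not part of caph or the hcloud client.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultTTL is an assumed value chosen for this sketch only.
const defaultTTL = 5 * time.Minute

type entry struct {
	status    string
	fetchedAt time.Time
}

// statusCache is a hypothetical TTL cache for server status lookups.
type statusCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   func(serverID int64) (string, error) // stands in for the API call
	entries map[int64]entry
}

func newStatusCache(ttl time.Duration, fetch func(int64) (string, error)) *statusCache {
	return &statusCache{ttl: ttl, fetch: fetch, entries: make(map[int64]entry)}
}

// get returns the cached status while it is younger than the TTL and
// otherwise calls fetch. Errors are not cached, so a failed lookup is retried.
// The lock is held during fetch for simplicity, which serializes lookups.
func (c *statusCache) get(serverID int64) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[serverID]; ok && time.Since(e.fetchedAt) < c.ttl {
		return e.status, nil
	}
	status, err := c.fetch(serverID)
	if err != nil {
		return "", err
	}
	c.entries[serverID] = entry{status: status, fetchedAt: time.Now()}
	return status, nil
}

func main() {
	calls := 0
	fetch := func(id int64) (string, error) { calls++; return "running", nil }
	c := newStatusCache(defaultTTL, fetch)
	for i := 0; i < 3; i++ {
		status, _ := c.get(42)
		fmt.Println(status)
	}
	fmt.Println("upstream calls:", calls) // prints "upstream calls: 1"
}
```

The trade-off is the one described above: a server that is deleted or powered off by hand is only noticed once its cache entry expires, so the TTL bounds the detection delay.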

Any thoughts?

I'm curious to hear opinions, also from people outside of Syself! The overall goal is to reduce the number of API calls, which can be rather high. Hundreds of calls per hour for a stable (not scaling) cluster is normal.

A similar question could be asked also for the general load balancer, placement group and network configuration, which we reconcile in the hetznercluster-controller. I'm also looking forward to opinions there!

apricote commented 3 weeks ago

I do think all of the above requests should be fine. My question would be how often you are triggering reconciles of the HCloudMachines controller and if that can be optimized.

I described my previous solution to investigate this here: https://github.com/syself/cluster-api-provider-hetzner/issues/926#issuecomment-1728075863

janiskemper commented 3 weeks ago

I think that we can probably slightly improve the work that you have started already, @apricote.

I wrote down my thoughts here mainly because a large use case is more likely to run into a rate limit than a smaller one.

This is in general just a question of optimizing rather than fixing a certain bug or issue.

By default we currently reconcile all objects every three minutes. This is a controller-runtime setting, and it can also be used as a parameter to reduce the number of reconciles.
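As a back-of-the-envelope illustration of how the sync period drives the call volume, here is a small sketch. The machine count and the number of API calls per reconcile are made-up inputs, and `callsPerHour` is a hypothetical helper. Only the three-minute default comes from this thread.

```go
package main

import (
	"fmt"
	"time"
)

// defaultSyncPeriod is the three-minute default mentioned above.
const defaultSyncPeriod = 3 * time.Minute

// callsPerHour estimates the steady-state API calls caused by periodic
// resyncs alone, ignoring reconciles triggered by events.
func callsPerHour(machines, callsPerReconcile int, syncPeriod time.Duration) int {
	if syncPeriod <= 0 {
		return 0 // no periodic resync configured
	}
	reconcilesPerHour := int(time.Hour / syncPeriod)
	return machines * callsPerReconcile * reconcilesPerHour
}

func main() {
	// Assumed inputs: 10 machines, 3 API calls per reconcile.
	fmt.Println(callsPerHour(10, 3, defaultSyncPeriod))   // 600
	fmt.Println(callsPerHour(10, 3, 2*defaultSyncPeriod)) // 300
}
```

Under these assumptions, doubling the sync period halves the periodic calls, which is why the parameter is a cheap first lever before changing any of the checks themselves.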
