syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner 🚀 Kubernetes Infrastructure as Software 🔧 Terraform/Kubespray/kOps alternative for running Kubernetes on Hetzner
https://caph.syself.com
Apache License 2.0

Reducing hcloud API calls for hcloudmachines that are up and running #1336

Open janiskemper opened 3 weeks ago

janiskemper commented 3 weeks ago

/kind proposal

Sometimes we hit the rate limit because the caph controller makes too many calls to the hcloud API.

Checks we do for a running hcloudmachine

Each of the following checks responds to an action of the user. The user removes a label from the server in the HCloud UI, and we cannot validate it. The user deletes a server manually, and we notice that. The user removes the server from the load balancer or the network, and we add it again.

These are potentially valid use cases - the question is whether they are so relevant that we need to keep them.

Possible ways of handling these checks

One extreme (the current behavior): Make all API calls to check everything in every reconcile loop.

The other extreme: Stop doing any API calls once the server is up and running. If something is wrong with the server, the Machine Health Checks should discover that. We don't do anything if the user actively misconfigures something and for example removes a server from the load balancer.

Middle way 1: Do specific checks and stop doing all others. For example, we could stop checking that the server is part of the network and continue checking that it is added as a target to the load balancer. Any combination of the things that are important to us is possible.
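A minimal sketch of how such a selection could look, assuming a hypothetical `CheckPolicy` type and check names that only mirror the checks listed in this issue. None of these identifiers come from the caph code base.

```go
package main

import "fmt"

// Check names one verification against the hcloud API. The names are
// hypothetical and only mirror the checks listed in this issue.
type Check string

const (
	CheckServerExists Check = "server-exists"
	CheckLabels       Check = "labels"
	CheckNetwork      Check = "network"
	CheckLoadBalancer Check = "load-balancer"
)

// allChecks fixes the order in which checks run.
var allChecks = []Check{CheckServerExists, CheckLabels, CheckNetwork, CheckLoadBalancer}

// CheckPolicy marks which checks stay enabled once a server is running.
type CheckPolicy map[Check]bool

// checksToRun returns every check while the server is still being
// provisioned, and only the enabled ones once it is running.
func checksToRun(policy CheckPolicy, running bool) []Check {
	if !running {
		return append([]Check(nil), allChecks...)
	}
	var selected []Check
	for _, c := range allChecks {
		if policy[c] {
			selected = append(selected, c)
		}
	}
	return selected
}

func main() {
	// Example policy: keep only the load balancer check for running servers.
	policy := CheckPolicy{CheckLoadBalancer: true}
	fmt.Println("provisioning:", checksToRun(policy, false))
	fmt.Println("running:", checksToRun(policy, true))
}
```

With a policy like this, the trade-off becomes explicit: every check that is switched off saves API calls per reconcile, at the cost of no longer repairing that kind of manual change.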

Middle way 2: Heavily cache API calls once a server is running. We could also use a cache so that we do not call the API regularly. If something goes wrong, we would notice it later, but we would notice it eventually.

Any thoughts?

I'm curious to hear opinions, also from people outside of Syself! The overall goal is to reduce the number of API calls, which can be rather high. Hundreds of calls per hour for a stable (not scaling) cluster is normal.

A similar question could be asked also for the general load balancer, placement group and network configuration, which we reconcile in the hetznercluster-controller. I'm also looking forward to opinions there!

apricote commented 3 weeks ago

I do think all of the above requests should be fine. My question would be how often you are triggering reconciles of the HCloudMachines controller and if that can be optimized.

I described my previous solution to investigate this here: https://github.com/syself/cluster-api-provider-hetzner/issues/926#issuecomment-1728075863

janiskemper commented 3 weeks ago

I think that we can probably slightly improve on the work that you have already started, @apricote.

I wrote down my thoughts here mainly because a large cluster is more likely to run into a rate limit than a smaller one.

This is in general just a question of optimizing rather than fixing a certain bug or issue.

By default, we currently reconcile all objects every three minutes. This is a controller-runtime setting that can also be set via a parameter to reduce the number of reconciliations.