:bug: Improve handling of bm server reboot timeouts - Githubissues

syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner 🚀 Kubernetes Infrastructure as Software 🔧 Terraform/Kubespray/kOps alternative for running Kubernetes on Hetzner

https://caph.syself.com

Apache License 2.0

:bug: Improve handling of bm server reboot timeouts #1327

Closed janiskemper closed 2 weeks ago

janiskemper commented 1 month ago

What this PR does / why we need it: Bare metal servers can currently time out while rebooting. However, the current timeout handling has two problems:

- The timeout is too long, so MachineHealthChecks usually trigger before it is reached. This PR reduces it.
- Previously, the server was simply rebooted again when the timeout was reached. If the timeout is actually reached, though, something is probably wrong with the server and we don't want to continue. We now set a permanent error instead.

Additionally, there are some code improvements:

- Checking whether a reboot has been triggered requires a call to the Hetzner API. This check is now done only when necessary, which saves unnecessary API calls.
- We set a permanent error if a reboot is marked as failed, so that we stop reconciling.
- The ProvisionSucceeded condition of a host is saved in case of a permanent error, so the user can see that the permanent error happened and why.
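
A minimal sketch of how these three points might fit together, under stated assumptions: the `Host`, `Condition`, and `RebootAPI` types, the `locallyConfirmed` flag, and the `countingAPI` test double are all hypothetical and simplified. They are not the provider's actual types or the real Hetzner client.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// Host is a hypothetical, minimal host model for this sketch.
type Host struct {
	ServerID       int
	RebootFailed   bool        // set when a reboot was marked as failed
	PermanentError string      // non-empty means reconciliation should stop
	Conditions     []Condition // saved together with the permanent error
}

// RebootAPI is a hypothetical interface standing in for the remote API call.
type RebootAPI interface {
	RebootTriggered(serverID int) (bool, error)
}

// countingAPI is a test double that counts how often the API is called.
type countingAPI struct {
	triggered bool
	calls     int
}

func (c *countingAPI) RebootTriggered(int) (bool, error) {
	c.calls++
	return c.triggered, nil
}

// setPermanentError records the error and a ProvisionSucceeded=False
// condition, so the user can see that the error happened and why.
func setPermanentError(h *Host, reason, msg string) {
	h.PermanentError = msg
	h.Conditions = append(h.Conditions, Condition{
		Type: "ProvisionSucceeded", Status: "False", Reason: reason, Message: msg,
	})
}

// reconcileReboot reports whether the reboot is known to have been triggered.
// locallyConfirmed is an assumed cheap signal that makes the API call
// unnecessary; the PR does not say which signal the real code uses.
func reconcileReboot(h *Host, api RebootAPI, locallyConfirmed bool) (bool, error) {
	if h.RebootFailed {
		setPermanentError(h, "RebootFailed",
			fmt.Sprintf("reboot of server %d failed", h.ServerID))
		return false, nil
	}
	if locallyConfirmed {
		return true, nil // skip the API call entirely
	}
	triggered, err := api.RebootTriggered(h.ServerID)
	if err != nil {
		return false, fmt.Errorf("checking reboot state: %w", err)
	}
	return triggered, nil
}
```

In this sketch the cheap checks run first and the API is only queried as a last resort, so a host with a failed reboot or a locally confirmed reboot costs no API call at all.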

TODOs:

[x] squash commits
[ ] include documentation
[ ] add unit tests