Closed thecodeassassin closed 2 months ago
Are you able to get the server into the rescue system? This is a known problem with some servers. What you are seeing is the current (not ideal) reaction of the controller; we have already merged better error handling. The problem is probably the server, though, and not CAPH.
Yeah, the server is perfectly accessible; it reboots and enters rescue mode.
{"level":"INFO","time":"2024-05-04T15:18:34.117Z","file":"controllers/hetznerbaremetalmachine_controller.go:82","message":"Machine Controller has not yet set OwnerRef","controller":"hetznerbaremetalmachine","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalMachine","HetznerBareMetalMachine":{
Also somehow it's no longer updating the name of the baremetal server(s).
It boots into rescue mode and then does nothing.
Last Transition Time: 2024-05-04T15:18:35Z
Reason: WaitingForNodeRef
Severity: Info
Status: False
Type: NodeHealthy
CAPI keeps throwing this error:
I0505 01:05:50.371509 1 machine_controller_noderef.go:60] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/europe-north-md-1-n89bb-pscwh" namespace="default" name="europe-north-md-1-n89bb-pscwh" reconcileID="b505c126-e886-4894-9700-fec3b6991de1" MachineSet="default/europe-north-md-1-n89bb" MachineDeployment="default/europe-north-md-1" Cluster="default/europe-north" HetznerBareMetalMachine="default/europe-north-md-1-n89bb-pscwh"
E0505 01:05:50.372039 1 controller.go:329] "Reconciler error" err="failed to retrieve Spec.ProviderID from infrastructure provider for Machine \"europe-north-md-1-n89bb-pscwh\" in namespace \"default\": field not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/europe-north-md-1-n89bb-pscwh" namespace="default" name="europe-north-md-1-n89bb-pscwh" reconcileID="b505c126-e886-4894-9700-fec3b6991de1"
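The error above says CAPI is still waiting for the infrastructure provider to report spec.providerID. A minimal sketch for listing which infrastructure machines have not reported it yet, assuming the objects were first dumped to a file with something like `kubectl get hetznerbaremetalmachines -n default -o json > machines.json` (the resource name in that command, the file name, and the helper name `missing_provider_id` are assumptions for illustration, not part of CAPH or CAPI):

```python
import json
import sys


def missing_provider_id(machine_list: dict) -> list[str]:
    """Return "namespace/name" for every item whose spec.providerID is unset or empty.

    `machine_list` is expected to look like the output of `kubectl get ... -o json`,
    i.e. a dict with an "items" list of Kubernetes objects.
    """
    names = []
    for item in machine_list.get("items", []):
        provider_id = item.get("spec", {}).get("providerID")
        if not provider_id:
            meta = item.get("metadata", {})
            names.append(f'{meta.get("namespace", "")}/{meta.get("name", "")}')
    return names


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python check_provider_id.py machines.json
    with open(sys.argv[1], encoding="utf-8") as fh:
        for name in missing_provider_id(json.load(fh)):
            print(f"no providerID yet: {name}")
```

An empty result would mean every listed machine already has a providerID; any name printed is a machine the provider has not finished provisioning, which matches the "field not found" error CAPI keeps logging.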
Machine Controller has not yet set OwnerRef
This is something CAPI has to do and apparently didn't (yet). That's not related to the host object being stuck during provisioning.
Regarding the other issue, where the host got stuck in "registering": do you have any logs, events, or status info from the HetznerBareMetalHost that look interesting? There will be some problem (I thought it was with rescue, but it might be something else). We will see it somewhere for sure.
@janiskemper
Closing this issue. The problem was that the machine hosting the CAPH controller could not connect over SSH because of a firewall issue.
Maybe in the future this should be included in error handling.
If you have some logs, we could maybe do that. This is indeed a case we haven't run into (internally) until now.
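For anyone hitting the same symptom (host boots into rescue and then nothing happens), a quick way to rule out the firewall cause described above is to test from the machine or pod running the controller whether a TCP connection to the server's SSH port can be opened at all. This is a minimal sketch using only the Python standard library; the helper name `can_reach_tcp` and the default port 22 are assumptions for illustration, and it only checks TCP reachability, not SSH authentication:

```python
import socket
import sys


def can_reach_tcp(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts or networks.
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_ssh.py <server-ip> [port]
    target = sys.argv[1]
    target_port = int(sys.argv[2]) if len(sys.argv) > 2 else 22
    status = "reachable" if can_reach_tcp(target, target_port) else "NOT reachable"
    print(f"{target}:{target_port} is {status}")
```

If this reports the port as not reachable from the controller's network while the server is verifiably in rescue mode, an outbound firewall rule is a likely suspect.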
/kind bug
What steps did you take and what happened: Followed the guide here:
What did you expect to happen:
Kamaji Control plane + 2 bare metal hosts
Anything else you would like to add:
Environment:
cluster-api-provider-hetzner version: v1.0.0-beta.33
Kubernetes version: (use kubectl version)
OS (e.g. from /etc/os-release):
Output of go run github.com/guettli/check-conditions@latest all:
https://gist.github.com/thecodeassassin/71401f01d4da1b85929990b2a3b1ceee