HetznerBaremetalHosts stuck in: host is still provisioning - state "registering"

thecodeassassin commented 2 months ago

/kind bug

What steps did you take and what happened: Followed the guide here:

What did you expect to happen:

Kamaji Control plane + 2 bare metal hosts

Anything else you would like to add:

clusterctl -n europe-north describe cluster europe-north
NAME                                                   READY  SEVERITY  REASON                       SINCE  MESSAGE                                                                    
Cluster/europe-north                                   True                                          10m                                                                                
├─ClusterInfrastructure - HetznerCluster/europe-north  True                                          10m                                                                                
├─ControlPlane - KamajiControlPlane/europe-north                                                                                                                                        
└─Workers                                                                                                                                                                               
  ├─MachineDeployment/europe-north-baremetal           False  Warning   WaitingForAvailableMachines  11m    Minimum availability requires 2 replicas, current 0 available               
  │ └─2 Machines...                                    False  Info      StillProvisioning            10m    See europe-north-baremetal-c496z-2q8fz, europe-north-baremetal-c496z-9lw49  
  └─MachineDeployment/europe-north-cloud               True                                          11m                                                                                
clusterctl -n europe-north describe cluster europe-north --show-conditions=all
NAME                                                   READY  SEVERITY  REASON                       SINCE  MESSAGE                                                                    
Cluster/europe-north                                   True                                          10m                                                                                
│           ├─ControlPlaneInitialized                  True                                          11m                                                                                
│           ├─ControlPlaneReady                        True                                          11m                                                                                
│           └─InfrastructureReady                      True                                          10m                                                                                
├─ClusterInfrastructure - HetznerCluster/europe-north  True                                          10m                                                                                
│             ├─ControlPlaneEndpointSet                True                                          11m                                                                                
│             ├─HCloudTokenAvailable                   True                                          11m                                                                                
│             ├─PlacementGroupsSynced                  True                                          11m                                                                                
│             ├─TargetClusterReady                     True                                          11m                                                                                
│             └─TargetClusterSecretReady               True                                          10m                                                                                
├─ControlPlane - KamajiControlPlane/europe-north                                                                                                                                        
│             ├─InfrastructureClusterPatched           True             Succeeded                    11m                                                                                
│             ├─KamajiControlPlaneIsInitialized        True             Succeeded                    11m                                                                                
│             ├─KamajiControlPlaneIsReady              True             Succeeded                    11m                                                                                
│             ├─KubeadmResourcesCreated                True             Succeeded                    11m                                                                                
│             ├─TenantControlPlaneAddressReady         True             Succeeded                    11m                                                                                
│             └─TenantControlPlaneCreated              True             Succeeded                    11m                                                                                
└─Workers                                                                                                                                                                               
  ├─MachineDeployment/europe-north-baremetal           False  Warning   WaitingForAvailableMachines  11m    Minimum availability requires 2 replicas, current 0 available               
  │ │           ├─Available                            False  Warning   WaitingForAvailableMachines  11m    Minimum availability requires 2 replicas, current 0 available               
  │ │           └─MachineSetReady                      False  Warning   ScalingUp                    11m    Scaling up MachineSet to 2 replicas (actual 0)                              
  │ └─2 Machines...                                    False  Info      StillProvisioning            10m    See europe-north-baremetal-c496z-2q8fz, europe-north-baremetal-c496z-9lw49  
  └─MachineDeployment/europe-north-cloud               True                                          11m                                                                                
                ├─Available                            True                                          11m

Environment:

cluster-api-provider-hetzner version: v1.0.0-beta.33

Kubernetes version: (use kubectl version)

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.3", GitCommit:"434bfd82814af038ad94d62ebe59b133fcb50506", GitTreeState:"clean", BuildDate:"2022-10-12T10:47:25Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.4+k3s1", GitCommit:"94e29e2ef5d79904f730e2024c8d1682b901b2d5", GitTreeState:"clean", BuildDate:"2024-04-25T17:33:09Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/arm64"}
WARNING: version difference between client (1.25) and server (1.29) exceeds the supported minor version skew of +/-1

OS (e.g. from /etc/os-release):

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalMachineTemplate
metadata:
  name: europe-north-md-1
  namespace: europe-north
spec:
  template:
    spec:
      installImage:
        swraid: 1
        swraidLevel: 1
        image:
          path: /root/.oldroot/nfs/images/Ubuntu-2204-jammy-amd64-base.tar.gz
        partitions:
        - fileSystem: esp
          mount: /boot/efi
          size: 512M
        - fileSystem: ext4
          mount: /boot
          size: 1024M
        - fileSystem: ext4
          mount: /
          size: all
        postInstallScript: |
          #!/bin/bash
          mkdir -p /etc/cloud/cloud.cfg.d && touch /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
          echo "network: { config: disabled }" > /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
          apt-get update && apt-get install -y cloud-init apparmor apparmor-utils
          cloud-init clean --logs
      sshSpec:
        portAfterCloudInit: 22
        portAfterInstallImage: 22
        secretRef:
          key:
            name: sshkey-name
            privateKey: ssh-privatekey
            publicKey: ssh-publickey
          name: robot-ssh

Output of go run github.com/guettli/check-conditions@latest all https://gist.github.com/thecodeassassin/71401f01d4da1b85929990b2a3b1ceee

janiskemper commented 2 months ago

Are you able to get the server into the rescue system? This is a problem with some of the servers, and what you see is the current (not ideal) reaction of the controller. We have already merged better error handling. The problem is probably the server, though, and not CAPH.

thecodeassassin commented 2 months ago

Yeah, the server is perfectly accessible. It reboots and enters rescue mode.

thecodeassassin commented 2 months ago

{"level":"INFO","time":"2024-05-04T15:18:34.117Z","file":"controllers/hetznerbaremetalmachine_controller.go:82","message":"Machine Controller has not yet set OwnerRef","controller":"hetznerbaremetalmachine","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalMachine","HetznerBareMetalMachine":{

Also somehow it's no longer updating the name of the baremetal server(s).

It boots into rescue mode and then does nothing.

    Last Transition Time:  2024-05-04T15:18:35Z
    Reason:                WaitingForNodeRef
    Severity:              Info
    Status:                False
    Type:                  NodeHealthy

thecodeassassin commented 2 months ago

CAPI keeps throwing this error:

I0505 01:05:50.371509       1 machine_controller_noderef.go:60] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/europe-north-md-1-n89bb-pscwh" namespace="default" name="europe-north-md-1-n89bb-pscwh" reconcileID="b505c126-e886-4894-9700-fec3b6991de1" MachineSet="default/europe-north-md-1-n89bb" MachineDeployment="default/europe-north-md-1" Cluster="default/europe-north" HetznerBareMetalMachine="default/europe-north-md-1-n89bb-pscwh"
E0505 01:05:50.372039       1 controller.go:329] "Reconciler error" err="failed to retrieve Spec.ProviderID from infrastructure provider for Machine \"europe-north-md-1-n89bb-pscwh\" in namespace \"default\": field not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/europe-north-md-1-n89bb-pscwh" namespace="default" name="europe-north-md-1-n89bb-pscwh" reconcileID="b505c126-e886-4894-9700-fec3b6991de1"

janiskemper commented 2 months ago

Machine Controller has not yet set OwnerRef

This is something CAPI has to do and apparently didn't (yet). That's not related to the host object being stuck during provisioning.

As for the other issue, the host being stuck in "registering": do you have any logs, events, or status info from the HetznerBareMetalHost that look interesting? There is some problem (I assumed with rescue), but it might also be something else. It will show up somewhere for sure.
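
One way to pull the interesting status info out of a host object is to list only the conditions that are not "True". The following is a minimal Python sketch, not part of CAPH or CAPI. It assumes the object exposes the usual `status.conditions` list with `type`, `status`, `reason`, and `message` fields; `failing_conditions` and `summarize` are hypothetical helper names.

```python
from __future__ import annotations


def failing_conditions(obj: dict) -> list[dict]:
    """Return the entries of status.conditions whose status is not "True"."""
    # Tolerate objects that have no status or no conditions yet.
    conditions = (obj.get("status") or {}).get("conditions") or []
    return [c for c in conditions if c.get("status") != "True"]


def summarize(obj: dict) -> list[str]:
    """Format each failing condition as 'Type: Status (Reason) Message'."""
    lines = []
    for c in failing_conditions(obj):
        line = (
            f'{c.get("type", "")}: {c.get("status", "")} '
            f'({c.get("reason", "")}) {c.get("message", "")}'
        )
        lines.append(line.strip())
    return lines
```

To use it, save the object with `kubectl get <resource> <name> -n <namespace> -o json`, load it with `json.load`, and print the result of `summarize`. The resource name and namespace are left as placeholders here.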

thecodeassassin commented 2 months ago

@janiskemper

Closing this issue. The problem was that the machine hosting the CAPH controller could not use the SSH port because of a firewall issue.

Maybe in the future this should be included in error handling.
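
For anyone hitting the same symptom, a reachability check along these lines can rule a firewall in or out. This is a minimal Python sketch, not something CAPH provides. `can_reach` is a hypothetical helper, and the default port 22 is an assumption taken from `portAfterInstallImage` in the template above. It only tests that a TCP connection can be opened; it does not attempt an SSH login.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and unreachable networks.
        return False
```

Run it from the same node or pod network as the CAPH controller, for example `can_reach("<server-ip>")`. A result of `False` there, while the same call succeeds from a laptop, would point at egress filtering on the management cluster side.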

janiskemper commented 2 months ago

If you have some logs, we could maybe do that. This is indeed a case we haven't had (internally) until now.

syself / cluster-api-provider-hetzner

HetznerBaremetalHosts stuck in: host is still provisioning - state "registering" #1293