bm-node routing broken after cloud-init (wrong gateway, duplicate routes)

Privatecoder commented 1 year ago

/kind bug /lifecycle active

What steps did you take and what happened:

Create a three-node cluster with flavor hetzner-baremetal-control-planes-remediation / Ubuntu 20.04 HWE.
Nodes with IPs in the same subnet (in my case 94.130.10.24 and 94.130.10.27) cannot communicate with each other because the subnet route is set with the wrong gateway:

=> Non-working routing after the HetznerBareMetalHost has finished rebooting and is trying to join the cluster:

Pinging between the nodes after ssh-ing into either machine (first and second control plane) fails with Destination Host Unreachable.

routing looks like this in rescue-mode:

This can be fixed manually by replacing the wrong gateway (0.0.0.0):

route del -net 94.130.10.0 gw 0.0.0.0 netmask 255.255.255.192 enp35s0
route add -net 94.130.10.0 gw 94.130.10.1 netmask 255.255.255.192 enp35s0

Also run route del -net 0.0.0.0 gw 94.130.10.1 netmask 0.0.0.0 enp35s0, as the first route (the default route) is set up twice.
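
Both symptoms can also be spotted by filtering the routing table instead of reading it by eye. The following is a minimal sketch, assuming the net-tools column layout printed by `netstat -rn` or `route -n` (Destination, Gateway, Genmask, Flags, ..., Iface). `check_routes` is a hypothetical helper name, not part of any existing tool. It flags subnet routes with gateway 0.0.0.0 because that is the setting this report identifies as broken; on other networks such a route may be perfectly normal.

```shell
#!/bin/bash
# check_routes (hypothetical helper): reads `netstat -rn` / `route -n` style
# output on stdin and reports
#   (a) subnet routes whose gateway is 0.0.0.0 (host routes are ignored), and
#   (b) default routes that appear more than once for the same gateway/interface.
check_routes() {
  awk '
    # Skip header lines: only rows whose first field is an IPv4 address count.
    $1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { next }

    # Default route: count occurrences per gateway and interface.
    $1 == "0.0.0.0" && $3 == "0.0.0.0" { seen[$2 " dev " $NF]++; next }

    # Subnet route without a gateway (netmask other than /32).
    $2 == "0.0.0.0" && $3 != "255.255.255.255" {
      print "on-link subnet route: " $1 "/" $3 " dev " $NF
    }

    END {
      for (k in seen)
        if (seen[k] > 1)
          print "duplicate default route: via " k " (x" seen[k] ")"
    }
  '
}
```

Usage would be `netstat -rn | check_routes`; the function only reads text and never modifies the routing table.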

What did you expect to happen:

Correct gateway in routing, no duplicate routes.

Environment:

cluster-api-provider-hetzner version: v1.0.0-beta.7
Kubernetes version: v1.24.8
OS: Ubuntu 20.04 HWE

janiskemper commented 1 year ago

@batistein will have a look into this

Privatecoder commented 1 year ago

As in the other issue: I am happy to re-create the cluster as many times as required to help debug this.

Privatecoder commented 1 year ago

bm-worker nodes are also affected – (md-1) routes in rescue:

and after installimage:

Routing on HCloud machines seems fine:

This is likely a cloud-init config issue, but I don't know where to look right now.

janiskemper commented 1 year ago

We talked about it quickly. We use autosetup, cloud init, and installimage. The network configuration comes only from Hetzner through installimage. Therefore, we probably cannot do anything about it.

Instead of doing it manually, you could also use pre kubeadm commands to automate things - just as an idea.

Privatecoder commented 1 year ago

I had this in preKubeadmCommands to work around the issue; however, it was overwritten on the bm-workers (it stayed intact on the control planes, though):

- export IFACE=$(netstat -rn | awk '{if($4=="U")print $8}')
- export GATEWAY=$(netstat -rn | awk -v I="$IFACE" '{if($4=="UH" && $8==I)print $1}')
- export MASK=$(netstat -rn | awk -v I="$IFACE" '{if($4=="U" && $8==I)print $3}')
- export NET_IP=$(netstat -rn | awk -v I="$IFACE" '{if($4=="U" && $8==I)print $1}')
- route del -net 0.0.0.0 gw $GATEWAY netmask 0.0.0.0 $IFACE
- route del -net $NET_IP gw 0.0.0.0 netmask $MASK $IFACE
- route add -net $NET_IP gw $GATEWAY netmask $MASK $IFACE
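
The values derived above can be previewed before anything is changed. The sketch below prints the `route` commands instead of running them. It assumes the same `netstat -rn` column layout, exactly one subnet route with flags `U`, and a host route with flags `UH` whose destination is the gateway; `route_fix_commands` is a hypothetical name.

```shell
#!/bin/bash
# route_fix_commands (hypothetical helper): reads `netstat -rn` output on stdin
# and prints the commands that would replace the subnet route's 0.0.0.0 gateway
# with the real gateway. It only prints; it does not touch the routing table.
route_fix_commands() {
  awk '
    $4 == "U"  { iface = $8; net = $1; mask = $3 }  # subnet route without gateway
    $4 == "UH" { gw[$8] = $1 }                      # host route pointing at the gateway
    END {
      # Give up if either piece of information is missing.
      if (iface == "" || gw[iface] == "") exit 1
      printf "route del -net %s gw 0.0.0.0 netmask %s %s\n", net, mask, iface
      printf "route add -net %s gw %s netmask %s %s\n", net, gw[iface], mask, iface
    }
  '
}
```

Running `netstat -rn | route_fix_commands` and reviewing the output first makes it easier to notice a wrong netmask or interface before the commands go into preKubeadmCommands.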

Privatecoder commented 1 year ago

> We talked about it quickly. We use autosetup, cloud init, and installimage. The network configuration comes only from Hetzner through installimage. Therefore, we probably cannot do anything about it.

As routing is fine in Rescue

and very basic in installs through the Robot (I thought they use installimage as well, so I expected the same issue here)

I created a ticket with Hetzner and will see what they say about it.

Privatecoder commented 1 year ago

Hetzner did some tests and here is their reply:

Apparently cloud-init generates its own additional DHCP entry for netplan despite the existing static configuration. This of course leads to problems and a duplicate network configuration.

You probably have to set appropriate parameters so that cloud-init uses the existing network configuration instead of creating its own "basic configuration".
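
One way to check a node for the extra DHCP entry described in this reply is to list the netplan files that enable DHCP. This is a minimal sketch, assuming the netplan files live in /etc/netplan with a .yaml extension; `list_dhcp_netplan` is a hypothetical helper name. On Ubuntu, cloud-init typically renders its network configuration to /etc/netplan/50-cloud-init.yaml, so that is the file one would expect to show up.

```shell
#!/bin/bash
# list_dhcp_netplan [DIR] (hypothetical helper): print the netplan YAML files
# under DIR (default /etc/netplan) that switch on dhcp4 or dhcp6.
# Prints nothing and returns non-zero if no file enables DHCP.
list_dhcp_netplan() {
  local dir="${1:-/etc/netplan}"
  grep -lE '^[[:space:]]*dhcp[46]:[[:space:]]*(true|yes|on)[[:space:]]*$' \
    "$dir"/*.yaml 2>/dev/null
}
```

If `list_dhcp_netplan` prints a file alongside the static configuration written during installation, both would be applied, which matches the duplicate routes seen above.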

@batistein

Privatecoder commented 1 year ago

Disabling network-config for cloud-init on bm-control-planes and bm-workers fixes the issue and, as per Hetzner's suggestion, should be applied to every bm-template:

postInstallScript: |
  #!/bin/bash
  mkdir -p /etc/cloud/cloud.cfg.d && touch /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
  echo "network: { config: disabled }" > /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
  apt-get update && apt-get install -y cloud-init apparmor apparmor-utils
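
To confirm on a running node that the drop-in is actually in place, something like the following could be used. It is a sketch, not part of the templates: both function names are hypothetical, and the directory is a parameter only so the functions can be tried against a scratch directory instead of /etc/cloud/cloud.cfg.d.

```shell
#!/bin/bash
# disable_cloud_init_network [DIR] (hypothetical helper): write the drop-in
# that tells cloud-init not to render any network configuration.
disable_cloud_init_network() {
  local dir="${1:-/etc/cloud/cloud.cfg.d}"
  mkdir -p "$dir"
  printf 'network: {config: disabled}\n' > "$dir/99-custom-networking.cfg"
}

# cloud_init_network_disabled [DIR] (hypothetical helper): succeed if any
# .cfg drop-in in DIR contains the "network: {config: disabled}" setting.
cloud_init_network_disabled() {
  local dir="${1:-/etc/cloud/cloud.cfg.d}"
  grep -qsE '^network:[[:space:]]*[{][[:space:]]*config:[[:space:]]*disabled[[:space:]]*[}]' \
    "$dir"/*.cfg
}
```

For example, `cloud_init_network_disabled && echo "cloud-init networking is disabled"` run on the node after provisioning shows whether the postInstallScript took effect.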

batistein commented 1 year ago

That's very interesting! Thank you very much for digging into it! Would you be able to make a PR to the templates?

Privatecoder commented 1 year ago

sure!

syself / cluster-api-provider-hetzner

bm-node routing broken after cloud-init (wrong gateway, duplicate routes) #457