syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner :rocket: The best way to manage Kubernetes clusters on Hetzner, fully declarative, Kubernetes-native and with self-healing capabilities
https://caph.syself.com
Apache License 2.0

bm-node routing broken after cloud-init (wrong gateway, duplicate routes) #457

Closed Privatecoder closed 1 year ago

Privatecoder commented 1 year ago

/kind bug
/lifecycle active

What steps did you take and what happened:

Routing is broken after the HetznerBareMetalHost has finished rebooting and tries to join the cluster:

[screenshot: non-working routing table]

Pinging between the nodes after SSHing into either machine (the first and the second control plane) does not work; it fails with Destination Host Unreachable.

routing looks like this in rescue-mode:

[screenshot: working routing table in rescue mode]

This can be fixed manually by replacing the wrong gateway (0.0.0.0):

route del -net 94.130.10.0 gw 0.0.0.0 netmask 255.255.255.192 enp35s0
route add -net 94.130.10.0 gw 94.130.10.1 netmask 255.255.255.192 enp35s0

In addition, the first route (the default route) is set up twice, so one copy has to be removed:

route del -net 0.0.0.0 gw 94.130.10.1 netmask 0.0.0.0 enp35s0

What did you expect to happen:

Correct gateway in routing, no duplicate routes.
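The two symptoms described above can be checked mechanically. The following is a minimal sketch, assuming `netstat -rn` style output on stdin; `check_routes` is a hypothetical helper name and not part of CAPH or any Hetzner tooling. It reports routes that appear more than once and subnet routes whose gateway is 0.0.0.0, which this issue treats as wrong on Hetzner bare metal.

```shell
#!/bin/bash
# Sketch only: check_routes is a hypothetical helper, not part of CAPH.
# It reads `netstat -rn` style output on stdin and prints one line per finding.
check_routes() {
  awk '
    # Skip the two header lines of the netstat output.
    $1 == "Kernel" || $1 == "Destination" { next }
    NF >= 8 {
      # A route is identified by destination, gateway, netmask and interface.
      key = $1 " " $2 " " $3 " " $8
      if (seen[key]++) print "duplicate: " key
      # Flags exactly "U" with gateway 0.0.0.0: subnet route without a gateway.
      if ($4 == "U" && $2 == "0.0.0.0") print "on-link: " $1 "/" $3 " via " $8
    }
  '
}

# Usage on a live host (requires net-tools): netstat -rn | check_routes
```

On a healthy node the function should print nothing; any output points at one of the two problems from this report.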

Environment:

janiskemper commented 1 year ago

@batistein will have a look into this

Privatecoder commented 1 year ago

As in the other issue: I am happy to re-create the cluster as many times as required to help debug this.

Privatecoder commented 1 year ago

bm-worker nodes are also affected. Routes on md-1 in rescue mode:

image

and after installimage:

image

Routing on HCloud machines seems fine:

image

This is likely a cloud-init configuration issue, but I don't know where to look right now.

janiskemper commented 1 year ago

We talked about it quickly. We use autosetup, cloud init, and installimage. The network configuration comes only from Hetzner through installimage. Therefore, we probably cannot do anything about it.

Instead of doing it manually, you could also use pre kubeadm commands to automate things - just as an idea.

Privatecoder commented 1 year ago

I had this in preKubeadmCommands to work around the issue; however, it was overwritten on the bm-workers (it stayed intact on the control planes, though):

# Derive interface, gateway, netmask and network address from the routing table.
- export IFACE=$(netstat -rn | awk '{if($4=="U")print $8}')
- export GATEWAY=$(netstat -rn | awk -v I="$IFACE" '{if($4=="UH" && $8==I)print $1}')
- export MASK=$(netstat -rn | awk -v I="$IFACE" '{if($4=="U" && $8==I)print $3}')
- export NET_IP=$(netstat -rn | awk -v I="$IFACE" '{if($4=="U" && $8==I)print $1}')
# Drop one copy of the duplicated default route, then re-add the subnet route via the gateway.
- route del -net 0.0.0.0 gw $GATEWAY netmask 0.0.0.0 $IFACE
- route del -net $NET_IP gw 0.0.0.0 netmask $MASK $IFACE
- route add -net $NET_IP gw $GATEWAY netmask $MASK $IFACE

Privatecoder commented 1 year ago

> We talked about it quickly. We use autosetup, cloud init, and installimage. The network configuration comes only from Hetzner through installimage. Therefore, we probably cannot do anything about it.

As routing is fine in Rescue

image

and very basic in installs through the Robot (I thought they use installimage as well, so I expected the same issue there)

image

I created a ticket with Hetzner and will see what they say about it.

Privatecoder commented 1 year ago

Hetzner did some tests and here is their reply:

Apparently cloud-init generates its own additional DHCP entry for netplan despite the existing static configuration. This of course leads to problems and a duplicate network configuration.

Probably you have to set appropriate parameters so that cloud-init uses the existing network configuration instead of creating its own "basic configuration".
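One way to confirm this diagnosis on an affected node is to look for netplan files that enable DHCP next to the static configuration. The following is a minimal sketch; `list_dhcp_netplan` is a hypothetical helper name, and the default directory `/etc/netplan` is an assumption about the installed image. On Ubuntu, cloud-init usually writes its rendered configuration to `50-cloud-init.yaml`, but the file names on a given host may differ.

```shell
#!/bin/bash
# Sketch only: list_dhcp_netplan is a hypothetical helper, not part of CAPH.
# Prints the netplan files in the given directory that enable DHCP.
list_dhcp_netplan() {
  # Assumption: netplan configuration lives in /etc/netplan unless overridden.
  local dir="${1:-/etc/netplan}"
  # -l prints matching file names only; errors for an empty directory are hidden.
  grep -lE '^[[:space:]]*dhcp[46]:[[:space:]]*(true|yes|on)' "$dir"/*.yaml 2>/dev/null
}

# Usage on a live host: list_dhcp_netplan
```

If this prints a file in addition to the static configuration written during installation, the node has the duplicate configuration described in Hetzner's reply.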

@batistein

Privatecoder commented 1 year ago

Disabling cloud-init's network configuration on bm-control-planes and bm-workers fixes the issue and, as per Hetzner's suggestion, should be applied in every bm template:

postInstallScript: |
  #!/bin/bash
  mkdir -p /etc/cloud/cloud.cfg.d && touch /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
  echo "network: { config: disabled }" > /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
  apt-get update && apt-get install -y cloud-init apparmor apparmor-utils
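To verify after provisioning that the override from the postInstallScript is actually in place, a check along the following lines could be used. This is a minimal sketch; `cloud_init_network_disabled` is a hypothetical helper name, and it only recognizes the one-line form `network: { config: disabled }` written by the script above, not an equivalent multi-line YAML spelling.

```shell
#!/bin/bash
# Sketch only: cloud_init_network_disabled is a hypothetical helper, not part of CAPH.
# Returns 0 if any *.cfg drop-in in the directory disables cloud-init networking.
cloud_init_network_disabled() {
  local dir="${1:-/etc/cloud/cloud.cfg.d}"
  # -q: no output, -s: no error messages for missing files.
  # Matches only the one-line flow-style form used in the postInstallScript.
  grep -qsE 'network:[[:space:]]*\{[[:space:]]*config:[[:space:]]*disabled[[:space:]]*\}' "$dir"/*.cfg
}

# Usage on a live host:
#   cloud_init_network_disabled && echo "cloud-init networking is disabled"
```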
image

batistein commented 1 year ago

That's very interesting! Thank you very much for digging into it! Would you be able to make a PR to the templates?

Privatecoder commented 1 year ago

sure!