Closed Privatecoder closed 1 year ago
@batistein will have a look into this
As in the other issue: I am happy to re-create the cluster as many times as required to help debug.
bm-worker nodes (md-1) are also affected. Routes in rescue:
and after installimage:
Routing on HCloud machines seems fine:
This is likely a cloud-init config issue, but I don't know where to look right now.
We talked about it quickly. We use autosetup, cloud init, and installimage. The network configuration comes only from Hetzner through installimage. Therefore, we probably cannot do anything about it.
Instead of doing it manually, you could also use preKubeadmCommands to automate things, just as an idea.
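For example, roughly like this (a minimal sketch; the resource and field names come from the Cluster API kubeadm bootstrap config, the name is a placeholder, and other required fields are omitted):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # placeholder
spec:
  kubeadmConfigSpec:
    preKubeadmCommands:
      # runs on each node before kubeadm init/join
      - echo "fix routing here"
  # replicas, version, machineTemplate etc. omitted

For the workers, the same list can go into the KubeadmConfigTemplate under spec.template.spec.preKubeadmCommands.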
I had this in preKubeadmCommands to work around the issue; however, it was overwritten on the bm-workers (it stayed intact on the control planes, though):
# interface and subnet taken from the network route (Flags "U"), gateway from the host route (Flags "UH")
- export IFACE=$(netstat -rn | awk '{if($4=="U")print $8}')
- export GATEWAY=$(netstat -rn | awk -v I="$IFACE" '{if($4=="UH" && $8==I)print $1}')
- export MASK=$(netstat -rn | awk -v I="$IFACE" '{if($4=="U" && $8==I)print $3}')
- export NET_IP=$(netstat -rn | awk -v I="$IFACE" '{if($4=="U" && $8==I)print $1}')
# drop the duplicate default route and repoint the subnet route at the gateway
- route del -net 0.0.0.0 gw $GATEWAY netmask 0.0.0.0 $IFACE
- route del -net $NET_IP gw 0.0.0.0 netmask $MASK $IFACE
- route add -net $NET_IP gw $GATEWAY netmask $MASK $IFACE
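For completeness, the same idea with iproute2 instead of the deprecated net-tools commands, as an untested sketch (it assumes the broken state described above, i.e. a duplicate default route plus an on-link subnet route on a single interface; adjust before use):

preKubeadmCommands:
  # interface and gateway from the current default route
  - export IFACE=$(ip -o -4 route show default | awk 'NR==1{print $5}')
  - export GATEWAY=$(ip -o -4 route show default | awk 'NR==1{print $3}')
  # directly connected subnet route written by the duplicate configuration
  - export SUBNET=$(ip -o -4 route show dev "$IFACE" proto kernel scope link | awk 'NR==1{print $1}')
  # drop the duplicate default route and point the subnet route at the gateway
  - ip route del default via "$GATEWAY" dev "$IFACE"
  - ip route replace "$SUBNET" via "$GATEWAY" dev "$IFACE"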
As routing is fine in Rescue and in very basic installs through the Robot (though they use installimage as well, so I expected the same issue there), I created a ticket with Hetzner to see what they say about it.
Hetzner did some tests and here is their reply:
Apparently cloud-init generates an additional DHCP entry of its own for netplan despite the existing static configuration. This of course leads to problems and a duplicate network configuration.
You probably have to set the appropriate parameters so that cloud-init uses the existing network configuration instead of creating its own "basic configuration".
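To illustrate the duplicate configuration (my reading of their reply, not part of it): on Ubuntu, cloud-init's generated fallback config typically lands in /etc/netplan/50-cloud-init.yaml and looks roughly like this, alongside the static network setup written by installimage:

# /etc/netplan/50-cloud-init.yaml (illustrative example)
network:
  version: 2
  ethernets:
    enp35s0:
      dhcp4: true   # DHCP entry that conflicts with the static /26 + gateway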
@batistein
Disabling the network config for cloud-init on bm-control-planes and bm-workers fixes the issue and should be applied to every bm-template, as per Hetzner's suggestion:
postInstallScript: |
  #!/bin/bash
  mkdir -p /etc/cloud/cloud.cfg.d && touch /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
  echo "network: { config: disabled }" > /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
  apt-get update && apt-get install -y cloud-init apparmor apparmor-utils
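For context, in the templates this ends up under the installImage section of the HetznerBareMetalMachineTemplate, roughly like this (a sketch of the field layout as I understand it; the name is a placeholder and image/partition settings are omitted):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerBareMetalMachineTemplate
metadata:
  name: bm-worker-template   # placeholder
spec:
  template:
    spec:
      installImage:
        # image: and partitions: omitted, keep your existing values
        postInstallScript: |
          #!/bin/bash
          mkdir -p /etc/cloud/cloud.cfg.d
          echo "network: { config: disabled }" > /etc/cloud/cloud.cfg.d/99-custom-networking.cfg
          apt-get update && apt-get install -y cloud-init apparmor apparmor-utils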
That's very interesting! Thank you very much for digging into it! Would you be able to make a PR to the templates?
sure!
/kind bug
/lifecycle active
What steps did you take and what happened:
hetzner-baremetal-control-planes-remediation / Ubuntu 20.04 HWE => non-working routing after the HetznerBareMetalHost has finished rebooting and tries to join the cluster. Pinging between the nodes (ssh-ing into either machine, first control plane and second) does not work: Destination Host Unreachable.
Routing looks like this in rescue mode:
This can be fixed manually by replacing the wrong gateway (0.0.0.0):
route del -net 94.130.10.0 gw 0.0.0.0 netmask 255.255.255.192 enp35s0
route add -net 94.130.10.0 gw 94.130.10.1 netmask 255.255.255.192 enp35s0
and also:
route del -net 0.0.0.0 gw 94.130.10.1 netmask 0.0.0.0 enp35s0
as the default route is set up twice.
What did you expect to happen:
Correct gateway in routing, no duplicate routes.
Environment: