siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Can't get VIP to work on 1.7.0 on 2 test VM's on Hyper-V tried a bunch of times (1.6.7 worked first time) #8662

Closed rluiten closed 2 months ago

rluiten commented 2 months ago

Bug Report

I can't get VIP to work in Talos 1.7.0 on Hyper-V no matter what I do. The VIP never gets assigned. I tried the same setup on Talos Linux 1.6.7 with the 1.6.7 version of talosctl and it worked immediately.

Description

I was setting up a simple 2-node cluster to learn about Talos. I tried configuring the VIP at initial install as documented below, and also tried adding the VIP afterward instead of as part of the initial config. The VIP is never assigned on 1.7.0 for me. My two VMs are assigned 10.19.67.201 and 10.19.67.202, and my VIP is 10.19.67.200. The VMs are set to static IPs in the DHCP server on my LAN.

Logs

I'm not sure how to split logs from actions, so the results are inline with the steps under Environment.

Environment

talosctl Client v1.7.0

talosctl gen config talos-cluster https://10.19.67.200:6443 \
  --config-patch @patches/interface-names.yaml \
  --config-patch @patches/dhcp.yaml \
  --config-patch-control-plane @patches/vip.yaml \
  --config-patch @patches/install-disk.yaml

I have attached the generated controlplane.yaml as the file controlplane.yaml.txt. The keys and such are not active anymore, so I left them in place.

patches/interface-names.yaml

---
machine:
  install:
    extraKernelArgs:
      - net.ifnames=0

patches/dhcp.yaml

---
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true

patches/vip.yaml

---
machine:
  network:
    interfaces:
      - interface: eth0
        vip:
          ip: 10.19.67.200

patches/install-disk.yaml

---
machine:
    install:
        disk: /dev/sda

talosctl -n 10.19.67.201 -e 10.19.67.201 apply-config -f controlplane.yaml --insecure
talosctl -n 10.19.67.202 -e 10.19.67.202 apply-config -f controlplane.yaml --insecure

After I see the warning containing the message talosctl bootstrap on both servers, I run:

talosctl -n 10.19.67.201 -e 10.19.67.201 bootstrap

With Talos 1.7.0, the VIP of .200 never appears on the dashboard, even after all services show Healthy.

Screenshot of the Talos 1.7.0 Hyper-V consoles: image

Back on 1.7.0, I get the kubeconfig with talosctl -n 10.19.67.201 -e 10.19.67.201 kubeconfig. It contains server: https://10.19.67.200:6443 as expected.

Running kubectl get nodes fails, as expected, because .200 is never assigned to either of the Talos nodes.

Unable to connect to the server: dial tcp 10.19.67.200:6443: connect: no route to host

If I modify my kubeconfig to point at https://10.19.67.201:6443, then kubectl version returns:

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

With Talos 1.6.7, the VIP of .200 appears on the dashboard using config files identical to those used on 1.7.0. I use talosctl 1.6.7 with the Talos 1.6.7 metal-amd64.iso.

Screenshot of 1.6.7 soon after bootstrap. Seconds after sending bootstrap, the .200 VIP is attached to the Talos node, as seen in the screenshot: image

After the services are healthy, I get the kubeconfig. It contains server: https://10.19.67.200:6443 as expected, and kubectl get nodes -o wide returns:

NAME            STATUS   ROLES           AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-3uj-nsx   Ready    control-plane   86s    v1.29.3   10.19.67.201   <none>        Talos (v1.6.7)   6.1.82-talos     containerd://1.7.13
talos-sm3-0y4   Ready    control-plane   100s   v1.29.3   10.19.67.202   <none>        Talos (v1.6.7)   6.1.82-talos     containerd://1.7.13

controlplane.yaml.txt

smira commented 2 months ago

You can figure out the problem yourself by doing some simple debugging steps.

rluiten commented 2 months ago

That has pointed me in the right direction. It does appear I do not have an eth0, which is interesting.

The linked VIP documentation, which I had read through before, says:

The predictable network interface names features can be disabled by specifying net.ifnames=0 in the kernel command line.

It appears that setting it via the YAML patch from my first post does not work in 1.7.0, although it did in 1.6.7. The patch does appear to be applied where the controlplane.yaml file expects it to be, though.

smira commented 2 months ago

You don't need to enforce net.ifnames=0, as the VIP documentation itself describes a way to select a single network interface using a device selector.

But your problem is most probably that you are using a disk image, which skips the install, so net.ifnames=0 is not applied. You can confirm by inspecting the kernel command line with talosctl read /proc/cmdline.
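A minimal sketch of that check, assuming the node address from this issue. The has_ifnames_disabled helper is hypothetical (it is not part of talosctl) and only tests whether a kernel command line string contains net.ifnames=0. The talosctl calls are left as comments because they need a live node.

```shell
# Hypothetical helper: succeeds if the kernel command line passed as $1
# contains the net.ifnames=0 argument as a whole word.
has_ifnames_disabled() {
  case " $1 " in
    *" net.ifnames=0 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live node (needs talosctl and a reachable node):
#   cmdline="$(talosctl -n 10.19.67.201 -e 10.19.67.201 read /proc/cmdline)"
#   has_ifnames_disabled "$cmdline" && echo "net.ifnames=0 applied" || echo "net.ifnames=0 missing"
#
# Listing the links shows which interface names actually exist on the node:
#   talosctl -n 10.19.67.201 -e 10.19.67.201 get links
```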

rluiten commented 2 months ago

I just confirmed that using the deviceSelector physical: true to attach the VIP works in 1.7.0.
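For reference, a sketch of what such a patch could look like, written here as a replacement for patches/vip.yaml. It assumes each VM has a single physical NIC, so that physical: true matches exactly one link. Folding dhcp: true into the same entry is an assumption, not something confirmed in this thread.

```shell
# Write a VIP patch that selects the NIC by property instead of by name.
mkdir -p patches
cat > patches/vip.yaml <<'EOF'
---
machine:
  network:
    interfaces:
      - deviceSelector:
          physical: true   # assumes one physical NIC per VM
        dhcp: true         # assumption: replaces the eth0 entry in dhcp.yaml
        vip:
          ip: 10.19.67.200
EOF
```

With a selector like this, the patch no longer depends on the interface being named eth0, so the net.ifnames=0 kernel argument should not be needed for the VIP.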

As you said, cmdline does not show the ifnames argument, so it is not being set.

I don't understand what you mean by "using disk image", which skips the install and the parameter.

The descriptive name "extraKernelArgs" in the config seems to document a way to do this, but you are correct that it is not doing it. I read at least one blog post saying that was how to do it as well. Live and learn.

Thanks for your help.

smira commented 2 months ago

extraKernelArgs is part of the machine.install section, so it is applied on install, which only happens when Talos is actually installed.

If you boot from a disk image, then there's no install (Talos is already installed). Image Factory supports generating disk images with custom kernel args if needed, but skipping net.ifnames=0 in this case is easier.
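A sketch of the Image Factory route, for completeness. It assumes the schematic format described in the Image Factory documentation (customization.extraKernelArgs). The file name schematic.yaml is arbitrary, and the upload command is shown only as a comment and should be checked against the current docs.

```shell
# Write an Image Factory schematic that bakes the kernel argument into the image.
cat > schematic.yaml <<'EOF'
customization:
  extraKernelArgs:
    - net.ifnames=0
EOF

# Upload (assumed endpoint, per the Image Factory docs); the response should
# contain a schematic ID that is then used when downloading the disk image:
#   curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics
```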

rluiten commented 2 months ago

I thought it was installing it, as that is what the console dashboard says when I run apply-config.