siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

QEMU dev environment unable to bring up NTP because of failure to reach 8.8.8.8 DNS #8296

Open dsseng opened 9 months ago

dsseng commented 9 months ago

Bug Report

The QEMU test cluster created by following the guide fails because the nodes cannot reach the 8.8.8.8 DNS resolver. 8.8.8.8 can be pinged from the host, so I believe it's a VM config issue or a compatibility problem.

Description

validating CIDR and reserving IPs
generating PKI and tokens
creating "/tmp/tl-dev-home/.talos/cni/bin"
creating "/tmp/tl-dev-home/.talos/cni/cache"
creating "/tmp/tl-dev-home/.talos/cni/conf.d"
downloading CNI bundle from "https://github.com/siderolabs/talos/releases/download/v1.7.0-alpha.0/talosctl-cni-bundle-amd64.tar.gz" to "/tmp/tl-dev-home/.talos/cni/bin"
creating state directory in "/tmp/tl-dev-home/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating controlplane nodes
creating worker nodes
waiting for API
^CSignal received, aborting, press Ctrl+C once again to abort immediately...
bootstrap error: 3 error(s) occurred:
        rpc error: code = DeadlineExceeded desc = context deadline exceeded
        rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.5.0.2:50000: connect: no route to host"
        context canceled
HOME=/tmp/tl-dev-home sudo --preserve-env=HOME _out/talosctl-linux-amd64 cluster create \
    --provisioner=qemu \
    --registry-mirror docker.io=http://172.20.0.1:5000 \
    --registry-mirror registry.k8s.io=http://172.20.0.1:5001  \
    --registry-mirror gcr.io=http://172.20.0.1:5003 \
    --registry-mirror ghcr.io=http://172.20.0.1:5004 \
    --registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005 \
    --install-image=127.0.0.1:5005/siderolabs/installer:v1.7.0-alpha.0-19-g5324d3916 \
    --controlplanes 1 \
    --workers 2 \
    --with-bootloader=false \
    --extra-uefi-search-paths "/tmp"

HOME is set to a directory under /tmp to keep the installation temporary (I typically do one-time stuff and build artifacts on tmpfs). The extra search path contains OVMF_CODE.fd and OVMF_VARS.fd because the provisioner couldn't find them in the locations where SUSE puts them by default.

Logs

==> /tmp/tl-dev-home/.talos/clusters/talos-default/talos-default-worker-1.log <==
[   33.843549] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}

==> /tmp/tl-dev-home/.talos/clusters/talos-default/talos-default-controlplane-1.log <==
[   34.823032] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}

==> /tmp/tl-dev-home/.talos/clusters/talos-default/talos-default-worker-2.log <==
[   34.657992] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}

==> /tmp/tl-dev-home/.talos/clusters/talos-default/talos-default-worker-1.log <==
[   34.844676] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}

==> /tmp/tl-dev-home/.talos/clusters/talos-default/talos-default-controlplane-1.log <==
[   35.824837] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}

==> /tmp/tl-dev-home/.talos/clusters/talos-default/talos-default-worker-2.log <==
[   35.660024] [talos] failed looking up "pool.ntp.org", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup pool.ntp.org on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}
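
To see at a glance which nodes are affected, the per-node logs can be summarized. This is only a minimal sketch, assuming the logs live in one directory as shown above; `count_unreachable` is a hypothetical helper name, not a talosctl command.

```shell
# Count "network is unreachable" lines in each node log of a cluster directory.
# count_unreachable is a hypothetical helper, not part of talosctl.
count_unreachable() {
    log_dir="$1"
    for log in "$log_dir"/*.log; do
        [ -e "$log" ] || continue   # skip if the glob matched nothing
        # grep -c prints the number of matching lines; "|| true" keeps the
        # function usable under "set -e" when a log has no matches.
        count=$(grep -c 'network is unreachable' "$log" || true)
        printf '%s %s\n' "$(basename "$log")" "$count"
    done
}
```

For the cluster in this report it could be called as `count_unreachable /tmp/tl-dev-home/.talos/clusters/talos-default`.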

Environment

dsseng commented 9 months ago

Same if I use HOME=/tmp/tl-dev-home sudo --preserve-env=HOME talosctl cluster create --provisioner=qemu --extra-uefi-search-paths "/tmp" with a talosctl binary from talos 1.5.6

dsseng commented 9 months ago

Alright, this is a firewalld issue: it was resolved after stopping firewalld. I'll keep the issue open, as it should be documented somewhere that the firewall needs to ignore the interface.

I couldn't find out how to make firewalld accept everything, so I just suspend it, since I develop inside a network that is already firewalled and my only open ports are the ones from Talos and a hardened ssh.
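
A rough sketch of that workaround, assuming firewalld runs as the systemd unit `firewalld`: stop it for the duration of one command and start it again afterwards. `with_firewalld_stopped` and the `SYSTEMCTL` override are hypothetical names introduced here for illustration.

```shell
# Run a command with firewalld stopped, then start firewalld again.
# with_firewalld_stopped is a hypothetical helper. SYSTEMCTL may be
# overridden (for example for a dry run); the default uses sudo.
with_firewalld_stopped() {
    systemctl_cmd="${SYSTEMCTL:-sudo systemctl}"
    $systemctl_cmd stop firewalld || return 1
    status=0
    "$@" || status=$?
    # Start firewalld again even if the wrapped command failed.
    $systemctl_cmd start firewalld
    return "$status"
}
```

It could wrap the cluster creation, for example `with_firewalld_stopped talosctl cluster create --provisioner=qemu`. The sketch does not restart firewalld if the calling shell itself is killed.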

smira commented 9 months ago

We are not using Red Hat-based OSes with firewalld, so we don't know what the problem is. If you find a solution, please update the docs/fix it. In theory it should work, as CNI utils which set up networking should work directly with firewalld.

dsseng commented 9 months ago

Okay, let's keep this open for others who stumble across this problem. If I think of a solution I'll reply.

frezbo commented 3 months ago

Okay, this should fix it for users running firewalld:

sudo firewall-cmd --permanent --new-zone=talos
sudo firewall-cmd --permanent --zone=talos --set-target=ACCEPT
sudo firewall-cmd --permanent --zone=talos --add-interface="talos+"
sudo firewall-cmd --permanent --zone=talos --add-interface="veth+"
sudo firewall-cmd --reload
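
To confirm the zone took effect after the reload, the output of `firewall-cmd --info-zone=talos` can be inspected. The helper below is a hedged sketch: `zone_is_open` is a hypothetical name, and it assumes the output contains a `target:` line and an `interfaces:` line, which may vary between firewalld versions.

```shell
# Read `firewall-cmd --info-zone=<zone>` output from stdin and succeed only if
# the zone target is ACCEPT and a talos+ interface is bound to it.
# zone_is_open is a hypothetical helper; the output format is an assumption.
zone_is_open() {
    info=$(cat)
    printf '%s\n' "$info" |
        grep -Eq '^[[:space:]]*target:[[:space:]]*ACCEPT[[:space:]]*$' || return 1
    printf '%s\n' "$info" |
        grep -Eq '^[[:space:]]*interfaces:.*talos\+' || return 1
}
```

Usage: `sudo firewall-cmd --info-zone=talos | zone_is_open && echo "talos zone accepts traffic"`.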

dsseng commented 1 month ago

Apparently this broke again

frezbo commented 1 month ago

> Apparently this broke again

If you're using Docker, it messes with the firewall rules. I have this in /etc/docker/daemon.json:

{"iptables":false}

note that this will break non host network docker containers

dsseng commented 1 month ago

> note that this will break non host network docker containers

:(

frezbo commented 1 month ago

> > note that this will break non host network docker containers
>
> :(

I use podman for such cases, and buildkit runs with the host network, so all Talos development should just work.

dsseng commented 1 month ago

Well, maybe. But for now, on a secure network, I can just disable firewalld while testing.

paddy-hack commented 1 month ago

This behavior can also manifest itself when behind a proxy that does not let NTP traffic through. I work around that by setting an internal NTP server via a --config-patch that looks like

machine:
  time:
    servers:
      - ntp.example.com

paddy-hack commented 1 month ago

Forgot to mention that I also pass a --nameservers option to set a pair of internal name servers.
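
Put together, that workaround might look roughly like the sketch below. `write_time_patch`, the file name `ntp-patch.yaml`, and the name server addresses in the usage note are placeholders introduced here, not values from this thread.

```shell
# Write a config patch that points Talos at an internal NTP server.
# write_time_patch is a hypothetical helper name.
write_time_patch() {
    patch_file="$1"
    ntp_server="$2"
    cat > "$patch_file" <<EOF
machine:
  time:
    servers:
      - $ntp_server
EOF
}
```

After `write_time_patch ntp-patch.yaml ntp.example.com`, the patch could be passed along with the name servers, for example `talosctl cluster create --provisioner=qemu --config-patch @ntp-patch.yaml --nameservers 10.0.0.53,10.0.0.54`, with the two addresses replaced by the internal name servers.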