k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

DNS problem for k3s multicloud cluster #10900

Open allnightlong opened 1 month ago

allnightlong commented 1 month ago

Discussed in https://github.com/k3s-io/k3s/discussions/10897

Originally posted by **allnightlong** September 15, 2024

I'm building my cluster with nodes from different datacenters. The cluster has lived in one datacenter for some time with 5 nodes (1 server + 4 agents). Now I'm adding a new node in a different datacenter, using this tutorial as an example: https://docs.k3s.io/networking/distributed-multicloud#embedded-k3s-multicloud-solution

For the server:

```
--node-external-ip= --flannel-backend=wireguard-native --flannel-external-ip
```

For the agent:

```
--node-external-ip=
```

The problem is that none of the agent's pods can resolve any hostname. I'm using the official DNS debugging guide (https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/), and `nslookup` fails for both internal and external requests.

Internal:

```
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
;; connection timed out; no servers could be reached

command terminated with exit code 1
```

External:

```
kubectl exec -i -t dnsutils -- nslookup goo.gl
;; connection timed out; no servers could be reached

command terminated with exit code 1
```

External with Cloudflare's DNS:

```
kubectl exec -i -t dnsutils -- nslookup goo.gl 1.1.1.1
Server:   1.1.1.1
Address:  1.1.1.1#53

Non-authoritative answer:
Name: goo.gl
Address: 142.250.193.238
Name: goo.gl
Address: 2404:6800:4002:81d::200e
```

What could be causing this DNS issue, and how can I resolve it?

P.S. I'm using k3s version `v1.30.4+k3s1` (the latest at the time of writing) for both the server and the agents.

![2024-09-15_02-50-58](https://github.com/user-attachments/assets/54ca6f0d-5e26-40aa-b6d3-c1265b84c182)
brandond commented 1 month ago

This indicates that the wireguard mesh between nodes isn't functioning properly, and DNS traffic between the affected node and the node running the coredns pod is being dropped. Ensure that you've opened all the correct ports for wireguard, and that node external IPs are set correctly so wireguard can establish the mesh between nodes.
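To make the port check concrete, here is a sketch (not taken verbatim from the k3s docs) that prints `nc` probes for the ports this setup needs. The target below is the redacted server address from this thread; substitute each node's external IP in turn, and note that a UDP "success" from `nc` is only best-effort, since UDP has no handshake:

```shell
#!/bin/sh
# Sketch: print connectivity probes for the ports k3s needs with
# --flannel-backend=wireguard-native (see docs.k3s.io/installation/requirements):
#   6443/tcp  - Kubernetes API (agents -> servers)
#   10250/tcp - kubelet metrics (all nodes <-> all nodes)
#   51820/udp - flannel wireguard-native (all nodes <-> all nodes)
# The script only prints the commands, so you can run them from each node.
target="146.185.xxx.xxx"   # redacted server IP from this thread; substitute your own
cmds=$(for probe in 6443:tcp 10250:tcp 51820:udp; do
  port=${probe%%:*}
  proto=${probe##*:}
  if [ "$proto" = "udp" ]; then
    echo "nc -vzu $target $port"
  else
    echo "nc -vz $target $port"
  fi
done)
printf '%s\n' "$cmds"
```

Because the mesh is full, these probes need to succeed between every pair of nodes, not just agent-to-server.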

allnightlong commented 1 month ago

Hi, @brandond , thank you for the answer.

Here is my cluster state:

k get no -o wide                                                                                                                    
NAME           STATUS     ROLES                       AGE   VERSION        INTERNAL-IP   EXTERNAL-IP       OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
core           Ready      control-plane,core,master   15d   v1.30.4+k3s1   10.0.1.4      146.185.xxx.xxx   Ubuntu 24.04.1 LTS   6.8.0-41-generic   containerd://1.7.20-k3s1
node-iota      Ready      node                        8d    v1.30.4+k3s1   10.0.1.2      <none>            Ubuntu 24.04.1 LTS   6.8.0-41-generic   containerd://1.7.20-k3s1
node-kappa     Ready      node                        22h   v1.30.4+k3s1   10.0.1.99     109.120.xxx.xx    Ubuntu 24.04.1 LTS   6.8.0-44-generic   containerd://1.7.20-k3s1
node-lambda    Ready      node                        22h   v1.30.4+k3s1   10.0.1.98     109.120.xxx.xx    Ubuntu 24.04.1 LTS   6.8.0-44-generic   containerd://1.7.20-k3s1
node-theta     Ready      node                        8d    v1.30.4+k3s1   10.0.1.8      <none>            Ubuntu 24.04.1 LTS   6.8.0-41-generic   containerd://1.7.20-k3s1

The main node core and the nodes node-iota and node-theta are in dc1. Nodes node-kappa and node-lambda are in dc2.

I'm checking connectivity according to the requirements page: https://docs.k3s.io/installation/requirements#networking.

From the core node I'm able to connect to node-lambda by TCP to port 10250 and by UDP to port 51820.

From node-lambda I can connect to core by TCP port 6443, TCP port 10250, and UDP port 51820.

Here is my config for core server node:

cat /etc/systemd/system/k3s.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--tls-san' \
        'core.xxx.cloud' \
        '--node-external-ip=146.185.xxx.xxx' \
        '--flannel-backend=wireguard-native' \
        '--flannel-external-ip' \
        '--bind-address=0.0.0.0' \
        '--kubelet-arg=allowed-unsafe-sysctls=net.ipv6.*' \
        '--kubelet-arg=allowed-unsafe-sysctls=net.ipv4.*' \

Here is my config for node-lambda agent node:

cat /etc/systemd/system/k3s-agent.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s-agent.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    agent \
        '--kubelet-arg=allowed-unsafe-sysctls=net.ipv6.*' \
        '--kubelet-arg=allowed-unsafe-sysctls=net.ipv4.*' \
        '--node-ip=10.0.1.98' \
        '--node-external-ip=109.120.xxx.xx' \

TBH, I'm not sure which direction I should go in at this point, so any suggestions are welcome.

brandond commented 1 month ago

@manuelbuil do you have any tips on how to check wireguard connectivity between nodes?

manuelbuil commented 1 month ago

@allnightlong could you run the following commands:

1. Install `wireguard-tools` and then execute `sudo wg` on the node where `dnsutils` is running.
2. Find the IP of the coredns pod (`$COREDNSIP`) and then execute `kubectl exec -i -t dnsutils -- nslookup goo.gl $COREDNSIP` and see if that works.
3. Can you ping `$COREDNSIP` from the node where `dnsutils` is running?

allnightlong commented 1 month ago

Hi @manuelbuil, thank you for the answers. Here is my cluster's state:

1. From `node-lambda` (in datacenter 2) I execute `sudo wg`:

```
sudo wg
interface: flannel-wg
  public key: UxoKiZzDtXIwVgpYKXSucgqm52oB+k4GT2LjDK6t0mI=
  private key: (hidden)
  listening port: 51820

peer: DIMwbxQYU3uKGxnLrY0N4/hp9u9oAvQg/dQOJAYLiVk=
  endpoint: 146.185.xxx.xxx:51820
  allowed ips: 10.42.0.0/24
  latest handshake: 24 seconds ago
  transfer: 18.61 MiB received, 17.27 MiB sent
  persistent keepalive: every 25 seconds

peer: +wGbtSsm5PDnDPB9N6n/SlKi3aeiKi2gsgEyeQBs7Wc=
  endpoint: 109.120.xxx.xxx:51820
  allowed ips: 10.42.8.0/24
  latest handshake: 1 minute, 40 seconds ago
  transfer: 221.55 KiB received, 303.81 KiB sent
  persistent keepalive: every 25 seconds
```

146.185.xxx.xxx is the `core` server node (datacenter 1).
109.120.xxx.xxx is the `node-kappa` agent node (datacenter 2).

2. I've got the `dnsutils` pod running on `node-lambda` (datacenter 2):

```
kubectl exec -i -t dnsutils -- nslookup goo.gl 10.42.6.128
;; communications error to 10.42.6.128#53: timed out
;; communications error to 10.42.6.128#53: timed out
;; communications error to 10.42.6.128#53: timed out
;; no servers could be reached

command terminated with exit code 1
```


If I run `dnsutils` on `node-iota` (datacenter 1), the connection is OK:

```
kubectl exec -i -t dnsutils -- nslookup goo.gl 10.42.6.128
Server:   10.42.6.128
Address:  10.42.6.128#53

Non-authoritative answer:
Name: goo.gl
Address: 64.233.165.138
Name: goo.gl
Address: 64.233.165.113
Name: goo.gl
Address: 64.233.165.100
Name: goo.gl
Address: 64.233.165.101
Name: goo.gl
Address: 64.233.165.139
Name: goo.gl
Address: 64.233.165.102
Name: goo.gl
Address: 2a00:1450:4010:c08::66
Name: goo.gl
Address: 2a00:1450:4010:c08::64
Name: goo.gl
Address: 2a00:1450:4010:c08::71
Name: goo.gl
Address: 2a00:1450:4010:c08::65
```


3. Ping:

```
ping 10.42.6.128
PING 10.42.6.128 (10.42.6.128) 56(84) bytes of data.
From 10.42.9.0 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: Required key not available
From 10.42.9.0 icmp_seq=2 Destination Host Unreachable
ping: sendmsg: Required key not available
From 10.42.9.0 icmp_seq=3 Destination Host Unreachable
ping: sendmsg: Required key not available
From 10.42.9.0 icmp_seq=4 Destination Host Unreachable
ping: sendmsg: Required key not available
From 10.42.9.0 icmp_seq=5 Destination Host Unreachable
ping: sendmsg: Required key not available
From 10.42.9.0 icmp_seq=6 Destination Host Unreachable
ping: sendmsg: Required key not available
^C
--- 10.42.6.128 ping statistics ---
6 packets transmitted, 0 received, +6 errors, 100% packet loss, time 5146ms
```
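As a side note, `Required key not available` is how kernel wireguard typically reports that no peer covers the destination address. Rather than eyeballing `sudo wg` output on every node, the handshake table can be checked mechanically. A sketch follows; the sample data stands in for real output (the third peer key is hypothetical), and on a real node you would feed in `sudo wg show flannel-wg latest-handshakes` instead:

```shell
#!/bin/sh
# Sketch: flag WireGuard peers with stale or missing handshakes.
# `wg show flannel-wg latest-handshakes` prints one "<pubkey> <unix-ts>"
# line per peer (tab-separated; a timestamp of 0 means never connected).
# Default IFS in `read` handles both tabs and spaces.
now=1726500000                      # on a real node: now=$(date +%s)
handshakes="DIMwbxQYU3uKGxnLrY0N4/hp9u9oAvQg/dQOJAYLiVk= 1726499976
+wGbtSsm5PDnDPB9N6n/SlKi3aeiKi2gsgEyeQBs7Wc= 1726499900
HypotheticalSilentPeerKey= 0"       # hypothetical peer that never handshook

report=$(printf '%s\n' "$handshakes" | while read -r key ts; do
  if [ "$ts" -eq 0 ]; then
    echo "NEVER $key"               # tunnel never established to this peer
  elif [ $((now - ts)) -gt 180 ]; then
    echo "STALE $key ($((now - ts))s ago)"
  else
    echo "OK    $key ($((now - ts))s ago)"
  fi
done)
printf '%s\n' "$report"
```

Any `NEVER` or `STALE` peer points at the node pair whose wireguard traffic is being dropped.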

brandond commented 1 month ago

Run those tests on all the nodes. You need full connectivity between all cluster members, since the coredns pod may run on any node.
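Acting on that, a small loop can at least enumerate the per-node checks (node names are the ones from this thread; the loop only prints the commands, since where you run `ssh` from depends on your access):

```shell
#!/bin/sh
# Sketch: the wireguard mesh is full, so every node must reach every other
# node's endpoint. Print one handshake-table check per node.
nodes="core node-iota node-kappa node-lambda node-theta"
cmds=$(for n in $nodes; do
  echo "ssh $n -- sudo wg show flannel-wg latest-handshakes"
done)
printf '%s\n' "$cmds"
```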

allnightlong commented 1 month ago

You are right, @brandond. The coredns pod is on node-iota, but I can connect to it from node-theta (dc1):

kubectl exec -i -t dnsutils -- nslookup goo.gl 10.42.6.128                                      
Server:         10.42.6.128
Address:        10.42.6.128#53

Non-authoritative answer:
Name:   goo.gl
Address: 64.233.165.138
Name:   goo.gl
Address: 64.233.165.102
Name:   goo.gl
Address: 64.233.165.100
Name:   goo.gl
Address: 64.233.165.113
Name:   goo.gl
Address: 64.233.165.101
Name:   goo.gl
Address: 64.233.165.139
Name:   goo.gl
Address: 2a00:1450:4010:c08::66
Name:   goo.gl
Address: 2a00:1450:4010:c08::64
Name:   goo.gl
Address: 2a00:1450:4010:c08::8a
Name:   goo.gl
Address: 2a00:1450:4010:c08::71

allnightlong commented 1 month ago

I think I've figured out the problem. It was a combination of 2 factors:

  1. Only the server node in datacenter 1 had an EXTERNAL-IP configured. The other two agent nodes (iota and theta) had only an INTERNAL-IP.
  2. The DNS pod was running on an agent node (iota).

My expectation was that connectivity only needed to be established between each agent node and the server node, and that k3s would set up the VPN between all nodes through the server node. Apparently, each node needs a public IP for this stack to work.

Another expectation was that all system pods would run on the server node. Apparently this is not the case either. Thank you @brandond, @manuelbuil for helping me sort things out.

My only request would be to make the documentation clearer about this, as I spent quite some time trying to figure out the problem.

Also, I didn't find any config option to move all kube-system pods to the server node. Is that possible?

manuelbuil commented 1 month ago

Great that you found the problem! Thanks for putting in the effort.

> My expectation was that connectivity only needed to be established between each agent node and the server node, and that k3s would set up the VPN between all nodes through the server node. Apparently, each node needs a public IP for this stack to work.

We can add more information in the docs, but right now they do state that K3s uses wireguard to establish a VPN mesh for cluster traffic. What you are describing would be a VPN star topology (hub-and-spoke), not a mesh.

allnightlong commented 1 month ago

thank you, for clearing things out for me!

brandond commented 1 month ago

> My expectation was that connectivity only needed to be established between each agent node and the server node, and that k3s would set up the VPN between all nodes through the server node.

As Manuel (and the docs) said, wireguard is a full mesh. What you're asking for is closer to what tailscale does; if you want something more like a star or hub-and-spoke topology, you should look into using tailscale. This is covered in the docs.

> Another expectation was that all system pods would run on the server node.

I'm curious where this expectation came from. There is nothing special about pods in the kube-system namespace, they will run on any available node in the cluster, same as any other pod.

allnightlong commented 1 month ago

In my setup, core is a command-only node where the task distributor is located. All the other nodes are CPU-intensive worker nodes.

I've already run into a problem where the core node was also a worker, and due to the high load k3s was very slow to respond to kubectl commands.

That's why I don't want any of the system-important pods to run anywhere but core. I've managed to move even the system-upgrade-controller pod to core with:

spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: In
        values:
          - "true"

but I don't know how to force the DNS pod to run on the main node.
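For what it's worth, one possible approach (an untested sketch, not an official k3s mechanism): coredns is a normal Deployment in `kube-system`, so a `nodeSelector` on the control-plane label, the same label used in the system-upgrade-controller spec above, should pin it to `core`. The caveat is that k3s manages its packaged coredns manifest and may re-apply it, undoing the patch; a more durable setup is to start the server with `--disable=coredns` and deploy your own CoreDNS with the selector baked in. The script below only writes the patch file:

```shell
#!/bin/sh
# Sketch (untested against a live cluster): write a strategic-merge patch
# that pins the coredns Deployment to control-plane nodes via nodeSelector.
cat > /tmp/coredns-pin.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
EOF
echo "wrote /tmp/coredns-pin.yaml"
# To apply (requires cluster access, hence commented out here; note k3s
# may revert this when it re-applies its packaged manifest):
# kubectl -n kube-system patch deployment coredns --patch-file /tmp/coredns-pin.yaml
```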

shakibamoshiri commented 1 month ago

  1. Make sure your coredns pod is up and running.
  2. Get the coredns service IP and test it (e.g. `dig one.com @10.43.0.10`).
  3. Make sure your machine's firewall does not block the k3s IP ranges (pods and services); check iptables and nft.
  4. Ping the k3s server's flannel IP from each agent.
  5. Use `wg show <INTERFACE>` to see whether the interfaces have communicated.
  6. Enable WireGuard debug logging to see what the tunnel is doing.
  7. Check `ip route show` to verify you have routes for flannel; sometimes when WireGuard is restarted but k3s is not, the routes are gone, and to recreate them you need to restart the k3s agents or add the routes manually.
  8. Run `tcpdump -qni any udp dst port 53`, then from the agent nodes run `dig one.com @<YOUR_COREDNS_SVC_IP>` and check the tcpdump output.