k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

wireguard not populating with connections to other nodes #3287

Closed (iameli closed this issue 2 years ago)

iameli commented 3 years ago

Environmental Info: K3s Version:

k3s version v1.20.6+k3s1 (8d043282)
go version go1.15.10

Node(s) CPU architecture, OS, and Version:

Linux dp2811 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: Four servers at the moment I'm hitting this problem (I'm in the process of spinning up the servers). The systemd config looks like:

ExecStart=/usr/local/bin/k3s server \
  --disable=traefik \
  --kube-apiserver-arg=feature-gates='ServiceTopology=true,EndpointSlice=true,EndpointSliceProxying=true' \
  --cluster-cidr="10.42.0.0/16" \
  --service-cidr="10.43.0.0/16" \
  --tls-san=mdw-admin.livepeer.engineering \
  --default-local-storage-path=/home/data \
  --token="[redacted]" \
  --disable=servicelb \
  --flannel-backend=wireguard \
  --node-external-ip=143.244.61.205
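
For context, each additional server was joined with a near-identical unit that points back at the first server. A minimal sketch (hypothetical values: 1.1.1.1 stands in for the first server's address, and each joining server passes its own public IP):

ExecStart=/usr/local/bin/k3s server \
  --server=https://1.1.1.1:6443 \
  --token="[redacted]" \
  --flannel-backend=wireguard \
  --node-external-ip=3.3.3.3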

Describe the bug: I had a single-server k3s cluster with the WireGuard backend running last night. I recently tried adding additional servers to it. The additional servers came online, and they have WireGuard connections to each other. Here's server 3.3.3.3 successfully connected to 2.2.2.2 and 4.4.4.4:

> wg show
interface: flannel.1
  public key: BKyc3q6MpDFaVqLYrPU3NX7kmr9RahhgQ7JvYV0XFSg=
  private key: (hidden)
  listening port: 51820

peer: XjnFZcY1o/sbom6/8Z6SSuWbq0cbdMa/w4DOWC5q8Do=
  endpoint: 4.4.4.4:51820
  allowed ips: 10.42.1.0/24
  latest handshake: 25 seconds ago
  transfer: 6.11 KiB received, 4.30 KiB sent
  persistent keepalive: every 25 seconds

peer: pWv5a3iIfdURa/wlPK5wivy9KleCgeWL//ZJ2eAFbyY=
  endpoint: 2.2.2.2:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 50 seconds ago
  transfer: 6.11 KiB received, 4.30 KiB sent
  persistent keepalive: every 25 seconds

The original server supposedly has all of this configuration, but none of the connections are open:

> wg show
interface: flannel.1
  public key: OeFOZblQVwEkYBEwRGx0cefR+ChN+KNYM1vJjYs70w0=
  private key: (hidden)
  listening port: 51820

peer: MwE3kq82mJGbBc55suKWQNq1+Tn5/DjHCFp05BrmalI=
  endpoint: 4.4.4.4:51820
  allowed ips: (none)
  transfer: 0 B received, 78.91 KiB sent
  persistent keepalive: every 25 seconds

peer: BECrn2WJkNGay50K405E7B7OT3TKoRevg9xZw5xupwc=
  endpoint: 2.2.2.2:51820
  allowed ips: (none)
  transfer: 0 B received, 78.48 KiB sent
  persistent keepalive: every 25 seconds

peer: BKyc3q6MpDFaVqLYrPU3NX7kmr9RahhgQ7JvYV0XFSg=
  endpoint: 3.3.3.3:51820
  allowed ips: 10.42.3.0/24
  transfer: 0 B received, 78.05 KiB sent
  persistent keepalive: every 25 seconds
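
For what it's worth, the missing routes can be pushed back in by hand as a temporary check that the tunnel itself still works; this only mirrors what flannel should be doing, and flannel may well overwrite it on its next sync. Using 4.4.4.4's peer key and its subnet from the healthy output above:

wg set flannel.1 peer MwE3kq82mJGbBc55suKWQNq1+Tn5/DjHCFp05BrmalI= allowed-ips 10.42.1.0/24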

I tried to fix this with a systemctl restart k3s, and now nothing shows up at all on the 1.1.1.1 server:

> wg show
interface: flannel.1
  public key: xOsXaDV5NZx0ZBbpuj9TGvwrFBYHJKs3f3bf0p5i534=
  private key: (hidden)
  listening port: 51820
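
If I understand flannel correctly, each node's WireGuard key is recorded in its node annotations (flannel.alpha.coreos.com/backend-data), so comparing that against the live interface should show whether the keys have drifted:

# what the node is actually using
wg show flannel.1 public-key
# what flannel has recorded for this node (assumption: backend-data carries the key)
kubectl get node dp2811 -o jsonpath='{.metadata.annotations}'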

Steps To Reproduce: Not sure yet.

Expected behavior: Established WireGuard connections between all nodes. Or, failing that, k3s should presumably repair the local networking environment on its own.

Actual behavior: No connections; all traffic to other nodes gets blackholed.
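
A quick way to see the blackholing from 1.1.1.1 (10.42.2.1 is hypothetical; it would be the flannel gateway on the 2.2.2.2 node if the usual .1 convention holds):

# packets go out over flannel.1 but nothing ever comes back
ping -c 3 10.42.2.1
tcpdump -ni flannel.1 icmp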

EDIT: Trying to diagnose further... it looks like the other servers in the cluster can't contact each other either, even though the WireGuard connections between them are up. Flannel is clearly having problems here; I'm not sure why yet.

k3s check-config says I'm okay, but it does come up with this. Aren't those routes the ones that k3s itself created, though?

System:
- /usr/sbin iptables v1.8.4 (legacy): ok
- swap: should be disabled
- routes: default CIDRs 10.42.0.0/16 or 10.43.0.0/16 already routed
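
Presumably that check is just spotting the per-node flannel routes, which can be listed directly:

ip route show | grep -E '10\.42\.|10\.43\.'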

EDIT 2: Seemingly resolved with a rolling systemctl restart k3s across all affected servers... the routes came back one at a time. Interesting.
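
For anyone hitting the same thing, the rolling restart amounted to (hypothetical hostnames):

for host in server1 server2 server3 server4; do
  ssh "$host" 'systemctl restart k3s'
  sleep 30   # give flannel time to re-sync before touching the next server
done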

ieugen commented 3 years ago

We have also encountered issues when adding new nodes with WireGuard via Ansible. We believe that rotating the WireGuard keys (the Ansible role does that on update) somehow breaks the k3s connections.

Restarting the servers fixed the issue, but we are planning to migrate away from WireGuard to a local private network.
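
One way to confirm the key-rotation theory before/after a provisioning run (a sketch; it assumes the key flannel uses is the one on flannel.1):

# capture the key before the Ansible run, then compare after it finishes
wg show flannel.1 public-key > /tmp/wg-key-before
# ... run the playbook ...
diff /tmp/wg-key-before <(wg show flannel.1 public-key)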

iameli commented 3 years ago

@ieugen Interesting. I'm also using Ansible, but with my own playbooks, not https://github.com/k3s-io/k3s-ansible or anything like that. I wonder if it has to do with restarting k3s at the wrong time? Our playbook restarts the k3s service two or three times over the course of a run; that could be the issue.
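
If overlapping restarts are the problem, serializing them from the control machine might avoid it. A rough sketch using an ad-hoc Ansible call (k3s_servers is whatever inventory group the playbook targets):

ansible k3s_servers --forks 1 -b -m systemd -a 'name=k3s state=restarted'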

stale[bot] commented 2 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.