squat / kilo

Kilo is a multi-cloud network overlay built on WireGuard and designed for Kubernetes (k8s + wg = kg)
https://kilo.squat.ai
Apache License 2.0

Single server control plane (kubeadm 1.22), no connectivity on fresh install. #247

Closed · alekc closed this issue 2 years ago

alekc commented 3 years ago

I am setting up a cluster and ran into an issue when using Kilo as the only CNI.

kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:37:34Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/arm64"}

uname -a
Linux k8s-master-01 5.11.0-1019-oracle #20~20.04.1-Ubuntu SMP Tue Sep 21 14:20:46 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"

I installed Kilo with:

kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/crds.yaml
kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/kilo-kubeadm.yaml
# kubectl get po -A
NAMESPACE     NAME                                    READY   STATUS             RESTARTS        AGE
kube-system   coredns-78fcd69978-l5rdb                1/1     Running            0               73m
kube-system   coredns-78fcd69978-nqxtj                1/1     Running            0               73m
kube-system   etcd-k8s-master-01                      1/1     Running            32              73m
kube-system   kilo-g4kgq                              1/1     Running            0               2m11s
kube-system   kpubber-7lsgw                           0/1     CrashLoopBackOff   8 (2m14s ago)   26m
kube-system   kube-apiserver-k8s-master-01            1/1     Running            4               73m
kube-system   kube-controller-manager-k8s-master-01   1/1     Running            7               73m
kube-system   kube-proxy-zrr5m                        1/1     Running            0               73m
kube-system   kube-scheduler-k8s-master-01            1/1     Running            18              73m
# kubectl get nodes
NAME            STATUS   ROLES                  AGE   VERSION
k8s-master-01   Ready    control-plane,master   73m   v1.22.2
# kubectl get nodes k8s-master-01 -o yaml | grep annot -A 10
  annotations:
    kilo.squat.ai/discovered-endpoints: '{}'
    kilo.squat.ai/endpoint: 240.238.90.130:51820
    kilo.squat.ai/force-endpoint: 240.238.90.130
    kilo.squat.ai/granularity: location
    kilo.squat.ai/internal-ip: 10.0.0.27/24
    kilo.squat.ai/key: zUcsZwT7Qlz/wmzKfHzHNP5oCNaHpVLbcSiR9G64zgU=
    kilo.squat.ai/last-seen: "1633713905"
    kilo.squat.ai/location: oracle-alekc
    kilo.squat.ai/wireguard-ip: 10.4.0.1/16
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock

# wg
interface: kilo0
  public key: zUcsZwT7Qlz/wmzKfHzHNP5oCNaHpVLbcSiR9G64zgU=
  private key: (hidden)
  listening port: 51820
IPTABLES

```
# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-NODEPORTS  all  --  anywhere             anywhere             /* kubernetes health check service ports */
KUBE-EXTERNAL-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes externally-visible service portals */
KUBE-FIREWALL  all  --  anywhere             anywhere
KILO-IPIP  ipencap--  anywhere             anywhere             /* Kilo: jump to IPIP chain */
DROP       ipencap--  anywhere             anywhere             /* Kilo: reject other IPIP traffic */

Chain FORWARD (policy DROP)
target     prot opt source               destination
KUBE-FORWARD  all  --  anywhere             anywhere             /* kubernetes forwarding rules */
KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes externally-visible service portals */
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
KUBE-FIREWALL  all  --  anywhere             anywhere

Chain DOCKER (1 references)
target     prot opt source               destination

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-ISOLATION-STAGE-2 (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Chain KILO-IPIP (1 references)
target     prot opt source               destination
ACCEPT     all  --  k8s-master-1.subnet09021850.vcn09021850.oraclevcn.com  anywhere  /* Kilo: allow IPIP traffic */

Chain KUBE-EXTERNAL-SERVICES (2 references)
target     prot opt source               destination

Chain KUBE-FIREWALL (2 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere             /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
DROP       all  -- !localhost/8          localhost/8          /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere             ctstate INVALID
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

Chain KUBE-KUBELET-CANARY (0 references)
target     prot opt source               destination

Chain KUBE-NODEPORTS (1 references)
target     prot opt source               destination

Chain KUBE-PROXY-CANARY (0 references)
target     prot opt source               destination

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
```

My understanding is that at this point (especially since this is just a single node), networking should already be working.

However, this is what's happening:

# kubectl get pods -o wide -A
NAMESPACE     NAME                                    READY   STATUS             RESTARTS        AGE   IP           NODE            NOMINATED NODE   READINESS GATES
kube-system   coredns-78fcd69978-l5rdb                1/1     Running            0               81m   10.244.0.3   k8s-master-01   <none>           <none>
kube-system   coredns-78fcd69978-nqxtj                1/1     Running            0               81m   10.244.0.4   k8s-master-01   <none>           <none>
kube-system   etcd-k8s-master-01                      1/1     Running            32              81m   10.0.0.27    k8s-master-01   <none>           <none>
kube-system   kilo-g4kgq                              1/1     Running            0               10m   10.0.0.27    k8s-master-01   <none>           <none>
kube-system   kpubber-7lsgw                           0/1     CrashLoopBackOff   9 (4m30s ago)   35m   10.244.0.2   k8s-master-01   <none>           <none>
kube-system   kube-apiserver-k8s-master-01            1/1     Running            4               81m   10.0.0.27    k8s-master-01   <none>           <none>
kube-system   kube-controller-manager-k8s-master-01   1/1     Running            7               81m   10.0.0.27    k8s-master-01   <none>           <none>
kube-system   kube-proxy-zrr5m                        1/1     Running            0               81m   10.0.0.27    k8s-master-01   <none>           <none>
kube-system   kube-scheduler-k8s-master-01            1/1     Running            18              81m   10.0.0.27    k8s-master-01   <none>           <none>

kubectl run --rm -it --image=alpine -- ash
# ping 10.244.0.3
PING 10.244.0.3 (10.244.0.3): 56 data bytes
^C
--- 10.244.0.3 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss

/ #  nslookup kubernetes.default
;; connection timed out; no servers could be reached

cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local vcn09021850.oraclevcn.com
options ndots:5

If I remove Kilo and install Flannel, for example:

# kubectl delete -f https://raw.githubusercontent.com/squat/kilo/main/manifests/kilo-kubeadm.yaml
configmap "kilo" deleted
serviceaccount "kilo" deleted
clusterrole.rbac.authorization.k8s.io "kilo" deleted
clusterrolebinding.rbac.authorization.k8s.io "kilo" deleted
daemonset.apps "kilo" deleted

# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created

rm /etc/cni/net.d/10-kilo.conflist
reboot

# kubectl run --rm -it --image=alpine -- ash
# nslookup kubernetes.default
Server:     10.96.0.10
Address:    10.96.0.10:53

** server can't find kubernetes.default: NXDOMAIN

** server can't find kubernetes.default: NXDOMAIN

/ #

# ping google.com
PING google.com (142.250.179.238): 56 data bytes
64 bytes from 142.250.179.238: seq=0 ttl=117 time=1.336 ms
^C
--- google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.336/1.336/1.336 ms
/ #

everything works. If I go backwards (delete Flannel and the CNI config, install Kilo, reboot), networking stops working again.
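
Roughly, the steps I mean by "going backwards" (just a sketch; I'm assuming Flannel's CNI config ends up at /etc/cni/net.d/10-flannel.conflist, which may differ on other setups):

```
# remove Flannel and its CNI config
kubectl delete -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
rm /etc/cni/net.d/10-flannel.conflist

# reinstall Kilo
kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/crds.yaml
kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/kilo-kubeadm.yaml

reboot
```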

p.s.

# kubectl logs kilo-r5sf5 -n kube-system
{"caller":"main.go:273","msg":"Starting Kilo network mesh 'f90288133d5543398d032913a7d558960ccf2ad0'.","ts":"2021-10-08T17:55:14.142951243Z"}
{"caller":"cni.go:61","component":"kilo","err":"failed to read IPAM config from CNI config list file: no IP ranges specified","level":"warn","msg":"failed to get CIDR from CNI file; overwriting it","ts":"2021-10-08T17:55:14.243883511Z"}
{"caller":"cni.go:69","component":"kilo","level":"info","msg":"CIDR in CNI file is empty","ts":"2021-10-08T17:55:14.243920391Z"}
{"CIDR":"10.244.0.0/24","caller":"cni.go:74","component":"kilo","level":"info","msg":"setting CIDR in CNI file","ts":"2021-10-08T17:55:14.243932511Z"}
{"caller":"mesh.go:545","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2021-10-08T17:55:14.49968136Z"}
leonnicolas commented 2 years ago

Sorry @alekc for the long response time. Ubuntu uses the default policy DROP in the FORWARD chain of the filter table. To test whether this is the problem, can you SSH onto the node and run `iptables -t filter -P FORWARD ACCEPT`?
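
For example, roughly like this (just for testing; the policy change is not persistent across reboots):

```
# check the current default policy on the FORWARD chain (here it shows "policy DROP")
iptables -t filter -L FORWARD | head -n 1

# temporarily switch the default policy to ACCEPT to test connectivity
iptables -t filter -P FORWARD ACCEPT
```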

If you are using the `squat/kilo:latest` image, you can specify `--iptables-forward-rules=true` as an arg to kg. Note that if you have leader nodes (with the FORWARD DROP policy) in locations with more than one node, you will have to wait until #248 is merged to get full connectivity.
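
For example, one rough way to add the flag (a sketch only, assuming the Kilo container is the first container in the `kilo` DaemonSet installed by `kilo-kubeadm.yaml`):

```
kubectl -n kube-system patch daemonset kilo --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--iptables-forward-rules=true"}]'
```

Alternatively, add the flag to the container args by hand with `kubectl -n kube-system edit daemonset kilo`.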

leonnicolas commented 2 years ago

Also, I think you should not use the force-endpoint annotation without a port; I believe it will just be ignored. https://kilo.squat.ai/docs/annotations#force-endpoint
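
For example, something like this would set it with the port included (using the IP and WireGuard listen port from your output above; adjust as needed):

```
kubectl annotate node k8s-master-01 --overwrite \
  kilo.squat.ai/force-endpoint=240.238.90.130:51820
```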

alekc commented 2 years ago

I will reset the node (in the meantime I've proceeded with zerotrust) and see if that's the case. Sadly, #248 might prevent me from adopting the solution anyway, but at the very least I will know whether it works on a single-node setup.

alekc commented 2 years ago

It took me a while to test this out. I can preliminarily confirm that adding `--iptables-forward-rules=true` fixes the issue on a single node (it feels like this should be mentioned somewhere in the README, since it's going to be a major blocker for anyone attempting to install Kilo on Ubuntu).

I will try to deploy a 3-node cluster later on and see whether connectivity is at the expected level.

leonnicolas commented 2 years ago

Hey @alekc, feel free to reopen this issue or open a PR mentioning this in the README.