Closed idcmp closed 1 year ago
hi @idcmp can you share the Kilo logs from the laptop VM? This is the most important piece of information here for why the node is unready.
Here are some logs from the kilo pod running on the laptop VM:
{"caller":"main.go:280","msg":"Starting Kilo network mesh '0.5.0'.","ts":"2022-05-05T13:19:38.55270756Z"}
{"caller":"cni.go:61","component":"kilo","err":"failed to read IPAM config from CNI config list file: no IP ranges specified","level":"warn","msg":"failed to get CIDR from CNI file; overwriting it","ts":"2022-05-05T13:19:38.660242599Z"}
{"caller":"cni.go:69","component":"kilo","level":"info","msg":"CIDR in CNI file is empty","ts":"2022-05-05T13:19:38.660405863Z"}
{"CIDR":"10.244.1.0/24","caller":"cni.go:74","component":"kilo","level":"info","msg":"setting CIDR in CNI file","ts":"2022-05-05T13:19:38.660480957Z"}
{"caller":"mesh.go:309","component":"kilo","event":"add","level":"info","node":{"Endpoint":{},"Key":[8,....8],"NoInternalIP":false,"InternalIP":{"IP":"ec2.internal.ip","Mask":"////AA=="},"LastSeen":1651756768,"Leader":true,"Location":"aws","Name":".compute.internal","PersistentKeepalive":0,"Subnet":{"IP":"10.244.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="},"DiscoveredEndpoints":{"s...I=":{"IP":"my.external.laptop.ip","Port":52412,"Zone":""}},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T13:19:38.782897818Z"}
{"caller":"mesh.go:375","component":"kilo","level":"info","msg":"overriding endpoint","new endpoint":"10.0.2.15:51820","node":"fedora","old endpoint":"","ts":"2022-05-05T13:19:38.78309792Z"}
Looking at journalctl, there's a bunch of:
May 05 06:28:19 fedora kubelet[113625]: E0505 06:28:19.456144 113625 file.go:104] "Unable to read config path" err="/etc/kubernetes/manifests: couldn't parse as pod(Object 'Kind' is missing in 'null'), please check config file" path="/etc/kubernetes/manifests"
...
May 05 06:28:21 fedora kubelet[113625]: E0505 06:28:21.503848 113625 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-bxmdk_kube-system(931c6970-dc21-4834-a73f-db7f33baacdd)\"" pod="kube-system/kube-proxy-bxmdk" podUID=931c6970-dc21-4834-a73f-db7f33baacdd
and of course
May 05 06:28:34 fedora kubelet[113625]: E0505 06:28:34.466078 113625 kubelet.go:2344] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
My dmesg
has wg debug messages:
[40101.276150] wireguard: kilo0: Keypair 260 destroyed for peer 19
[40101.276152] wireguard: kilo0: Keypair 262 created for peer 19
[40101.276155] wireguard: kilo0: Sending keepalive packet to peer 19 (ec2.public.ip:51820)
[40176.492541] wireguard: kilo0: Retrying handshake with peer 19 (ec2.public.ip:51820) because we stopped hearing back after 15 seconds
[40176.492596] wireguard: kilo0: Sending handshake initiation to peer 19 (ec2.public.ip:51820)
[40176.540046] wireguard: kilo0: Receiving handshake response from peer 19 (ec2.public.ip:51820)
[40176.540056] wireguard: kilo0: Keypair 261 destroyed for peer 19
[40176.540057] wireguard: kilo0: Keypair 263 created for peer 19
[40176.540061] wireguard: kilo0: Sending keepalive packet to peer 19 (ec2.public.ip:51820)
Hope this helps, just let me know if there are other logs you'd like!
Hi @idcmp, you mentioned that the Kilo pod on the laptop VM was crashlooping, but the logs you shared don't mention anything about a crash. Are there more logs that do include this? If not, is there any other indication as to why the pod is crashlooping?
I'll see if I can follow the logs and catch it in the act (logs -p
doesn't seem to be working).
Here's a describe of the pod:
[root@ip-172-31-0-114 ~]# kubectl describe pod kilo-7mkm8 -n kube-system
Name: kilo-7mkm8
Namespace: kube-system
Priority: 0
Node: fedora/10.0.2.15
Start Time: Thu, 05 May 2022 04:59:31 -0700
Labels: app.kubernetes.io/name=kilo
app.kubernetes.io/part-of=kilo
controller-revision-hash=dbcdbbbdb
pod-template-generation=1
Annotations: <none>
Status: Running
IP: laptop.vm.internal.ip
IPs:
IP: laptop.vm.internal.ip
Controlled By: DaemonSet/kilo
Init Containers:
install-cni:
Container ID: containerd://b07eee639a7ca47b81cedbee6a4086dcb2b5f3069775233e4304738eaf14cd29
Image: squat/kilo:0.5.0
Image ID: docker.io/squat/kilo@sha256:feb4d02f3c5a6d51ba9687b5f75d3e6b63ebb90675058a6a04f36ec46ff214ef
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
set -e -x; cp /opt/cni/bin/* /host/opt/cni/bin/; TMP_CONF="$CNI_CONF_NAME".tmp; echo "$CNI_NETWORK_CONFIG" > $TMP_CONF; rm -f /host/etc/cni/net.d/*; mv $TMP_CONF /host/etc/cni/net.d/$CNI_CONF_NAME
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 05 May 2022 07:03:53 -0700
Finished: Thu, 05 May 2022 07:03:53 -0700
Ready: True
Restart Count: 4
Environment:
CNI_CONF_NAME: 10-kilo.conflist
CNI_NETWORK_CONFIG: <set to the key 'cni-conf.json' of config map 'kilo'> Optional: false
Mounts:
/host/etc/cni/net.d from cni-conf-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k55bv (ro)
Containers:
kilo:
Container ID: containerd://475caf1dca40a5509e9fe2fb7d18799da9e3130ceafd01f8b057d6ce3355e99c
Image: squat/kilo:0.5.0
Image ID: docker.io/squat/kilo@sha256:feb4d02f3c5a6d51ba9687b5f75d3e6b63ebb90675058a6a04f36ec46ff214ef
Port: 1107/TCP
Host Port: 1107/TCP
Args:
--kubeconfig=/etc/kubernetes/kubeconfig
--hostname=$(NODE_NAME)
State: Running
Started: Thu, 05 May 2022 07:04:44 -0700
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 05 May 2022 07:01:18 -0700
Finished: Thu, 05 May 2022 07:03:52 -0700
Ready: True
Restart Count: 23
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/cni/net.d from cni-conf-dir (rw)
/etc/kubernetes from kubeconfig (ro)
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/kilo from kilo-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k55bv (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-conf-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
kilo-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kilo
HostPathType:
kubeconfig:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kube-proxy
Optional: false
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
kube-api-access-k55bv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3h50m (x170 over 5h42m) kubelet Back-off restarting failed container
Normal Killing 3h45m (x21 over 5h44m) kubelet Stopping container kilo
Normal Pulled 3h40m (x24 over 5h45m) kubelet Container image "squat/kilo:0.5.0" already present on machine
Name: kube-proxy
Namespace: kube-system
Labels: app=kube-proxy
Annotations: kubeadm.kubernetes.io/component-config.hash: sha256:cb60c18eab70a0ff3b37943616f791331d863a180d6371236b15c186913967ae
Data
====
config.conf:
----
apiVersion: kubeproxy.config.k8s.io/v1alpha1
bindAddress: 0.0.0.0
bindAddressHardFail: false
clientConnection:
acceptContentTypes: ""
burst: 0
contentType: ""
kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
qps: 0
clusterCIDR: 10.244.0.0/16
configSyncPeriod: 0s
conntrack:
maxPerCore: null
min: null
tcpCloseWaitTimeout: null
tcpEstablishedTimeout: null
detectLocal:
bridgeInterface: ""
interfaceNamePrefix: ""
detectLocalMode: ""
enableProfiling: false
healthzBindAddress: ""
hostnameOverride: ""
iptables:
masqueradeAll: false
masqueradeBit: null
minSyncPeriod: 0s
syncPeriod: 0s
ipvs:
excludeCIDRs: null
minSyncPeriod: 0s
scheduler: ""
strictARP: false
syncPeriod: 0s
tcpFinTimeout: 0s
tcpTimeout: 0s
udpTimeout: 0s
kind: KubeProxyConfiguration
metricsBindAddress: ""
mode: ""
nodePortAddresses: null
oomScoreAdj: null
portRange: ""
showHiddenMetricsForVersion: ""
udpIdleTimeout: 0s
winkernel:
enableDSR: false
forwardHealthCheckVip: false
networkName: ""
rootHnsEndpointName: ""
sourceVip: ""
kubeconfig.conf:
----
apiVersion: v1
kind: Config
clusters:
- cluster:
certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
server: https://ec2.external.hostname:6443
name: default
contexts:
- context:
cluster: default
namespace: default
user: default
name: default
current-context: default
users:
- name: default
user:
tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
BinaryData
====
Events: <none>
# kubectl describe cm kilo -n kube-system
Name: kilo
Namespace: kube-system
Labels: app.kubernetes.io/name=kilo
Annotations: <none>
Data
====
cni-conf.json:
----
{
"cniVersion":"0.4.0",
"name":"kilo",
"plugins":[
{
"name":"kubernetes",
"type":"bridge",
"bridge":"kube-bridge",
"isDefaultGateway":true,
"forceAddress":true,
"mtu": 1420,
"ipam":{
"type":"host-local"
}
},
{
"type":"portmap",
"snat":true,
"capabilities":{
"portMappings":true
}
}
]
}
BinaryData
====
Events: <none>
Caught one:
# kubectl logs -f -n kube-system kilo-7mkm8
Defaulted container "kilo" out of: kilo, install-cni (init)
{"caller":"main.go:280","msg":"Starting Kilo network mesh '0.5.0'.","ts":"2022-05-05T18:09:35.628352703Z"}
{"caller":"cni.go:61","component":"kilo","err":"failed to read IPAM config from CNI config list file: no IP ranges specified","level":"warn","msg":"failed to get CIDR from CNI file; overwriting it","ts":"2022-05-05T18:09:35.737697765Z"}
{"caller":"cni.go:69","component":"kilo","level":"info","msg":"CIDR in CNI file is empty","ts":"2022-05-05T18:09:35.73777305Z"}
{"CIDR":"10.244.1.0/24","caller":"cni.go:74","component":"kilo","level":"info","msg":"setting CIDR in CNI file","ts":"2022-05-05T18:09:35.737800051Z"}
{"caller":"mesh.go:309","component":"kilo","event":"add","level":"info","node":{"Endpoint":{},"Key":[8,2.....8],"NoInternalIP":false,"InternalIP":{"IP":"172.31.0.114","Mask":"////AA=="},"LastSeen":1651774147,"Leader":true,"Location":"aws","Name":"ip-......compute.internal","PersistentKeepalive":0,"Subnet":{"IP":"10.244.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="},"DiscoveredEndpoints":{},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T18:09:35.855577419Z"}
{"caller":"mesh.go:375","component":"kilo","level":"info","msg":"overriding endpoint","new endpoint":"laptop.vm.internal.ip:51820","node":"fedora","old endpoint":"","ts":"2022-05-05T18:09:35.856321777Z"}
{"caller":"mesh.go:544","component":"kilo","diff":"port: old='\\x00', new=\"0xc000560020\"","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-05T18:09:35.925980935Z"}
{"caller":"mesh.go:309","component":"kilo","event":"update","level":"info","node":{"Endpoint":{},"Key":[8,2.....8],"NoInternalIP":false,"InternalIP":{"IP":"ec2.instance.internal.ip","Mask":"////AA=="},"LastSeen":1651774327,"Leader":true,"Location":"aws","Name":"private.ip.hostname.compute.internal","PersistentKeepalive":0,"Subnet":{"IP":"10.244.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="},"DiscoveredEndpoints":{"s59...0I=":{"IP":"laptop.router.public.ip","Port":52382,"Zone":""}},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T18:12:07.756161466Z"}
{"caller":"main.go:299","msg":"caught interrupt; gracefully cleaning up; see you next time!","ts":"2022-05-05T18:13:31.193565763Z"}
Here are the events from the pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 4h22m (x170 over 6h15m) kubelet Back-off restarting failed container
Normal Pulled 4h12m (x24 over 6h17m) kubelet Container image "squat/kilo:0.5.0" already present on machine
Normal Killing 4h2m (x24 over 6h16m) kubelet Stopping container kilo
Normal SandboxChanged 4h2m (x24 over 6h16m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 3h59m kubelet Stopping container kilo
Normal SandboxChanged 3h59m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 3h59m kubelet Container image "squat/kilo:0.5.0" already present on machine
Normal Created 3h59m kubelet Created container install-cni
Normal Started 3h59m kubelet Started container install-cni
Normal Pulled 3h59m kubelet Container image "squat/kilo:0.5.0" already present on machine
Normal Created 3h59m kubelet Created container kilo
Normal Started 3h59m kubelet Started container kilo
Normal Created 3h46m kubelet Created container install-cni
Normal Created 3h46m kubelet Created container kilo
Normal Pulled 3h46m kubelet Container image "squat/kilo:0.5.0" already present on machine
Normal Started 3h46m kubelet Started container install-cni
Normal Started 3h46m kubelet Started container kilo
Normal Pulled 3h46m kubelet Container image "squat/kilo:0.5.0" already present on machine
Normal Pulled 3m49s (x2 over 3h48m) kubelet Container image "squat/kilo:0.5.0" already present on machine
Normal Started 3m49s (x2 over 3h48m) kubelet Started container install-cni
Normal Created 3m49s (x2 over 3h48m) kubelet Created container install-cni
Normal SandboxChanged 3m49s (x2 over 3h48m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 3m49s kubelet Stopping container kilo
Normal Pulled 3m48s (x2 over 3h48m) kubelet Container image "squat/kilo:0.5.0" already present on machine
Normal Created 3m48s (x2 over 7m45s) kubelet Created container kilo
Normal Started 3m48s (x2 over 7m45s) kubelet Started container kilo
Okay, I'm still experiencing wireguard dropouts, (as described in Situation), but both nodes think they're Ready
now. I restarted the laptop VM so possibly there was some junk network configuration lingering previously. The kilo
pod on the laptop VM still just gracefully cleans up and exits 4 minutes after the last event update it prints out.
Thanks @idcmp those logs are really helpful :) It's quite funny that Kilo is getting terminated. It seems that Kubernetes is shutting it down for some reason. Did you by chance add liveness checks to the daemonset?
Ah, it sounds like maybe it's related to this:
Pod sandbox changed, it will be killed and re-created.
I'll try testing on k8s v1.24 (and updating our ewe tests). So far our tests have been on v1.23. I wonder if this has caused an issue.
I wanted to add that this issue sounds to me like it is a bit more fundamental than WireGuard/Kilo/networking. It seems like something is breaking with the container runtime interrupting Kilo. Once we figure out why Kilo is being killed prematurely, we can investigate any networking issues. In other words, there is no reason why Kubernetes should kill the Kilo Pod just because the network (and thus the node) isn't yet read.
Could you share any details about the container runtime? Are there any differences between the VM's runtime and the EC2 node?
It's quite funny that Kilo is getting terminated. It seems that Kubernetes is shutting it down for some reason. Did you by chance add liveness checks to the daemonset?
Nope. No changes from what was fetched.
The EC2 instance is running Linux 5.10.109-104.500.amzn2.x86_64 with containerd 1.4.13, the laptop VM (which as you've probably guessed runs Fedora), runs Linux 5.17.5-200.fc35.x86_64 and containerd 1.6.2 in VirtualBox (as spun up by Vagrant).
I really think the problem relates to the EC2 instance having an Endpoint=
in its kilo0 config for the laptop VM Peer
, so when rekeying happens it tries to connect to the laptop (which it can't, because it's behind at least two NATs).
I really think the problem relates to the EC2 instance having an Endpoint= in its kilo0 config for the laptop VM Peer, so when rekeying happens it tries to connect to the laptop (which it can't, because it's behind at least two NATs).
This could be part of the explanation why networking isn't working, however wouldn't explain why the Kilo Pod is getting killed and the node is unready.
Added --log-level=debug to the kilo daemonset, will post an update when it dies.
Ran chronyc
first so both machines are within a second of each other:
Crashing kilo:
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:37:56.104998312Z"}
{"DiscoveredEndpoints":{},"caller":"mesh.go:817","component":"kilo","level":"debug","msg":"Discovered WireGuard NAT Endpoints","ts":"2022-05-05T21:38:13.127679088Z"}
{"allowedIPs":[{"IP":"10.244.0.0","Mask":"////AA=="},{"IP":"ec2.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.0.0","Mask":"////AA=="}],"component":"kilo","endpoint":"ec2.public.ip:51820","hostnames":["ip-.compute.internal"],"leader":0,"level":"debug","location":"location:aws","msg":"generated segment","privateIPs":["ec2.private.ip"],"ts":"2022-05-05T21:38:13.127713095Z"}
{"allowedIPs":[{"IP":"10.244.1.0","Mask":"////AA=="},{"IP":"laptop.vm.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.1.0","Mask":"////AA=="}],"component":"kilo","endpoint":"laptop.vm.private.ip:51820","hostnames":["fedora"],"leader":0,"level":"debug","location":"location:laptop","msg":"generated segment","privateIPs":["laptop.vm.private.ip"],"ts":"2022-05-05T21:38:13.127785181Z"}
{"caller":"topology.go:239","component":"kilo","hostname":"fedora","leader":true,"level":"debug","location":"location:laptop","msg":"generated topology","privateIP":"laptop.vm.private.ip/24","subnet":"10.244.1.0/24","ts":"2022-05-05T21:38:13.127804122Z","wireGuardIP":"10.4.0.2/16"}
{"caller":"mesh.go:368","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2022-05-05T21:38:13.441011818Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:38:13.451682025Z"}
{"caller":"mesh.go:282","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{},"Key":[17....6],"NoInternalIP":false,"InternalIP":{"IP":"laptop.private.ip","Mask":"////AA=="},"LastSeen":1651786693,"Leader":false,"Location":"laptop","Name":"fedora","PersistentKeepalive":10000000000,"Subnet":{"IP":"10.244.1.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.2","Mask":"//8AAA=="},"DiscoveredEndpoints":{},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T21:38:13.45170901Z"}
{"caller":"main.go:299","msg":"caught interrupt; gracefully cleaning up; see you next time!","ts":"2022-05-05T21:38:16.194754485Z"}
Non-crashing kilo:
{"caller":"mesh.go:368","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2022-05-05T21:37:56.092114638Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:37:56.094154858Z"}
{"caller":"mesh.go:282","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{},"Key":[8,...8],"NoInternalIP":false,"InternalIP":{"IP":"ec2.private.ip","Mask":"////AA=="},"LastSeen":1651786676,"Leader":true,"Location":"aws","Name":"ip-.compute.internal","PersistentKeepalive":0,"Subnet":{"IP":"10.244.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="},"DiscoveredEndpoints":{"s59IE....nlmbK0I=":{"IP":"216.180.70.97","Port":52382,"Zone":""}},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T21:37:56.09426365Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:38:13.42907842Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:38:16.250928634Z"}
{"caller":"mesh.go:291","component":"kilo","event":"update","in-mesh":true,"level":"debug","msg":"received non ready node","node":{"Endpoint":null,"Key":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"NoInternalIP":false,"InternalIP":null,"LastSeen":0,"Leader":false,"Location":"laptop","Name":"fedora","PersistentKeepalive":10000000000,"Subnet":{"IP":"10.244.1.0","Mask":"////AA=="},"WireGuardIP":null,"DiscoveredEndpoints":null,"AllowedLocationIPs":null,"Granularity":""},"ts":"2022-05-05T21:38:16.250977227Z"}
{"caller":"mesh.go:309","component":"kilo","event":"update","level":"info","node":{"Endpoint":null,"Key":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"NoInternalIP":false,"InternalIP":null,"LastSeen":0,"Leader":false,"Location":"laptop","Name":"fedora","PersistentKeepalive":10000000000,"Subnet":{"IP":"10.244.1.0","Mask":"////AA=="},"WireGuardIP":null,"DiscoveredEndpoints":null,"AllowedLocationIPs":null,"Granularity":""},"ts":"2022-05-05T21:38:16.251872892Z"}
{"DiscoveredEndpoints":{},"caller":"mesh.go:817","component":"kilo","level":"debug","msg":"Discovered WireGuard NAT Endpoints","ts":"2022-05-05T21:38:16.25321525Z"}
{"allowedIPs":[{"IP":"10.244.0.0","Mask":"////AA=="},{"IP":"ec2.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.0.0","Mask":"////AA=="}],"component":"kilo","endpoint":"ec2.public.ip:51820","hostnames":["ip-.compute.internal"],"leader":0,"level":"debug","location":"location:aws","msg":"generated segment","privateIPs":["ec2.private.ip"],"ts":"2022-05-05T21:38:16.253256396Z"}
{"caller":"topology.go:239","component":"kilo","hostname":"ip-.compute.internal","leader":true,"level":"debug","location":"location:aws","msg":"generated topology","privateIP":"ec2.private.ip/24","subnet":"10.244.0.0/24","ts":"2022-05-05T21:38:16.253392438Z","wireGuardIP":"10.4.0.1/16"}
{"caller":"mesh.go:544","component":"kilo","diff":"number of peers: old=1, new=0","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-05T21:38:16.290128511Z"}
{"caller":"mesh.go:545","component":"kilo","config":{"PrivateKey":[11...0],"ListenPort":51820,"FirewallMark":null,"ReplacePeers":true,"Peers":[]},"level":"debug","msg":"changing wg config","ts":"2022-05-05T21:38:16.290180001Z"}
{"DiscoveredEndpoints":{},"caller":"mesh.go:817","component":"kilo","level":"debug","msg":"Discovered WireGuard NAT Endpoints","ts":"2022-05-05T21:38:26.069421992Z"}
{"allowedIPs":[{"IP":"10.244.0.0","Mask":"////AA=="},{"IP":"ec2.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.0.0","Mask":"////AA=="}],"component":"kilo","endpoint":"ec2.public.ip:51820","hostnames":["ip-.compute.internal"],"leader":0,"level":"debug","location":"location:aws","msg":"generated segment","privateIPs":["ec2.private.ip"],"ts":"2022-05-05T21:38:26.069532304Z"}
{"caller":"topology.go:239","component":"kilo","hostname":"ip-.compute.internal","leader":true,"level":"debug","location":"location:aws","msg":"generated topology","privateIP":"ec2.private.ip/24","subnet":"10.244.0.0/24","ts":"2022-05-05T21:38:26.070919063Z","wireGuardIP":"10.4.0.1/16"}
{"caller":"mesh.go:368","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2022-05-05T21:38:26.104728892Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:38:26.104780613Z"}
{"caller":"mesh.go:282","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{},"Key":[8....8],"NoInternalIP":false,"InternalIP":{"IP":"ec2.private.ip","Mask":"////AA=="},"LastSeen":1651786706,"Leader":true,"Location":"aws","Name":"ip-.compute.internal","PersistentKeepalive":0,"Subnet":{"IP":"10.244.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="},"DiscoveredEndpoints":{},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T21:38:26.104888963Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:38:32.565002006Z"}
{"caller":"mesh.go:309","component":"kilo","event":"update","level":"info","node":{"Endpoint":{},"Key":[17....6],"NoInternalIP":false,"InternalIP":{"IP":"laptop.vm.private.ip","Mask":"////AA=="},"LastSeen":1651786712,"Leader":false,"Location":"laptop","Name":"fedora","PersistentKeepalive":10000000000,"Subnet":{"IP":"10.244.1.0","Mask":"////AA=="},"WireGuardIP":null,"DiscoveredEndpoints":null,"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T21:38:32.565059163Z"}
{"DiscoveredEndpoints":{},"caller":"mesh.go:817","component":"kilo","level":"debug","msg":"Discovered WireGuard NAT Endpoints","ts":"2022-05-05T21:38:32.56638394Z"}
{"allowedIPs":[{"IP":"10.244.0.0","Mask":"////AA=="},{"IP":"ec2.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.0.0","Mask":"////AA=="}],"component":"kilo","endpoint":"ec2.public.ip:51820","hostnames":["ip-.compute.internal"],"leader":0,"level":"debug","location":"location:aws","msg":"generated segment","privateIPs":["ec2.private.ip"],"ts":"2022-05-05T21:38:32.566422323Z"}
{"allowedIPs":[{"IP":"10.244.1.0","Mask":"////AA=="},{"IP":"laptop.vm.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.1.0","Mask":"////AA=="}],"component":"kilo","endpoint":"laptop.vm.private.ip:51820","hostnames":["fedora"],"leader":0,"level":"debug","location":"location:laptop","msg":"generated segment","privateIPs":["laptop.vm.private.ip"],"ts":"2022-05-05T21:38:32.566456917Z"}
{"caller":"topology.go:239","component":"kilo","hostname":"ip-.compute.internal","leader":true,"level":"debug","location":"location:aws","msg":"generated topology","privateIP":"ec2.private.ip/24","subnet":"10.244.0.0/24","ts":"2022-05-05T21:38:32.567166717Z","wireGuardIP":"10.4.0.1/16"}
{"caller":"mesh.go:544","component":"kilo","diff":"number of peers: old=0, new=1","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-05T21:38:32.613991981Z"}
{"caller":"mesh.go:545","component":"kilo","config":{"PrivateKey":[1...0],"ListenPort":51820,"FirewallMark":null,"ReplacePeers":true,"Peers":[{"PublicKey":[17....6],"Remove":false,"UpdateOnly":false,"PresharedKey":null,"Endpoint":{"IP":"laptop.vm.private.ip","Port":51820,"Zone":""},"PersistentKeepaliveInterval":0,"ReplaceAllowedIPs":true,"AllowedIPs":[{"IP":"10.244.1.0","Mask":"////AA=="},{"IP":"laptop.vm.private.ip","Mask":"/////w=="},{"IP":"10.4.0.2","Mask":"/////w=="}]}]},"level":"debug","msg":"changing wg config","ts":"2022-05-05T21:38:32.614048003Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:38:32.635629998Z"}
{"caller":"mesh.go:309","component":"kilo","event":"update","level":"info","node":{"Endpoint":{},"Key":[179....6],"NoInternalIP":false,"InternalIP":{"IP":"laptop.vm.private.ip","Mask":"////AA=="},"LastSeen":1651786712,"Leader":false,"Location":"laptop","Name":"fedora","PersistentKeepalive":10000000000,"Subnet":{"IP":"10.244.1.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.2","Mask":"//8AAA=="},"DiscoveredEndpoints":null,"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T21:38:32.635677001Z"}
{"caller":"mesh.go:803","component":"kilo","endpoint":"laptop.vm.private.ip:51820","former-endpoint":"laptop.vm.private.ip:51820","latest-handshake":"0001-01-01T00:00:00Z","level":"debug","msg":"WireGuard Update NAT Endpoint","node":"fedora","same":true,"ts":"2022-05-05T21:38:32.636412618Z"}
{"DiscoveredEndpoints":{},"caller":"mesh.go:817","component":"kilo","level":"debug","msg":"Discovered WireGuard NAT Endpoints","ts":"2022-05-05T21:38:32.636451405Z"}
{"allowedIPs":[{"IP":"10.244.0.0","Mask":"////AA=="},{"IP":"ec2.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.0.0","Mask":"////AA=="}],"component":"kilo","endpoint":"ec2.public.ip:51820","hostnames":["ip-.compute.internal"],"leader":0,"level":"debug","location":"location:aws","msg":"generated segment","privateIPs":["ec2.private.ip"],"ts":"2022-05-05T21:38:32.636707965Z"}
{"allowedIPs":[{"IP":"10.244.1.0","Mask":"////AA=="},{"IP":"laptop.vm.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.1.0","Mask":"////AA=="}],"component":"kilo","endpoint":"laptop.vm.private.ip:51820","hostnames":["fedora"],"leader":0,"level":"debug","location":"location:laptop","msg":"generated segment","privateIPs":["laptop.vm.private.ip"],"ts":"2022-05-05T21:38:32.636748256Z"}
{"caller":"topology.go:239","component":"kilo","hostname":"ip-.compute.internal","leader":true,"level":"debug","location":"location:aws","msg":"generated topology","privateIP":"ec2-private-ip/24","subnet":"10.244.0.0/24","ts":"2022-05-05T21:38:32.63677386Z","wireGuardIP":"10.4.0.1/16"}
{"caller":"mesh.go:803","component":"kilo","endpoint":"laptop.PUBLIC.IP.ADDRESS:52382","former-endpoint":"laptop.private.ip:51820","latest-handshake":"2022-05-05T21:38:37.002033221Z","level":"debug","msg":"WireGuard Update NAT Endpoint","node":"fedora","same":false,"ts":"2022-05-05T21:38:56.084826409Z"}
{"DiscoveredEndpoints":{"s59IEyyKc0YEhhi6Q5h1Rey8YZ8RNnZSbOZZnlmbK0I=":{"IP":"216.180.70.97","Port":52382,"Zone":""}},"caller":"mesh.go:817","component":"kilo","level":"debug","msg":"Discovered WireGuard NAT Endpoints","ts":"2022-05-05T21:38:56.084952079Z"}
{"allowedIPs":[{"IP":"10.244.1.0","Mask":"////AA=="},{"IP":"laptop.vm.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.1.0","Mask":"////AA=="}],"component":"kilo","endpoint":"laptop.vm.private.ip:51820","hostnames":["fedora"],"leader":0,"level":"debug","location":"location:laptop","msg":"generated segment","privateIPs":["laptop.vm.private.ip"],"ts":"2022-05-05T21:38:56.085033834Z"}
{"allowedIPs":[{"IP":"10.244.0.0","Mask":"////AA=="},{"IP":"ec2.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.0.0","Mask":"////AA=="}],"component":"kilo","endpoint":"ec2.public.ip:51820","hostnames":["ip-.compute.internal"],"leader":0,"level":"debug","location":"location:aws","msg":"generated segment","privateIPs":["ec2.private.ip"],"ts":"2022-05-05T21:38:56.08650359Z"}
{"caller":"topology.go:239","component":"kilo","hostname":"ip-.compute.internal","leader":true,"level":"debug","location":"location:aws","msg":"generated topology","privateIP":"ec2.private.ip/24","subnet":"10.244.0.0/24","ts":"2022-05-05T21:38:56.087024217Z","wireGuardIP":"10.4.0.1/16"}
{"caller":"mesh.go:368","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2022-05-05T21:38:56.124735182Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:38:56.126058867Z"}
{"caller":"mesh.go:282","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{},"Key":[8...8],"NoInternalIP":false,"InternalIP":{"IP":"ec2.private.ip","Mask":"////AA=="},"LastSeen":1651786736,"Leader":true,"Location":"aws","Name":"ip-.compute.internal","PersistentKeepalive":0,"Subnet":{"IP":"10.244.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="},"DiscoveredEndpoints":{"s59IEyyKc0YEhhi6Q5h1Rey8YZ8RNnZSbOZZnlmbK0I=":{"IP":"216.180.70.97","Port":52382,"Zone":""}},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T21:38:56.126092994Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:39:02.592118461Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:39:25.448517125Z"}
{"caller":"mesh.go:282","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{},"Key":[8...8],"NoInternalIP":false,"InternalIP":{"IP":"ec2.private.ip","Mask":"////AA=="},"LastSeen":1651786736,"Leader":true,"Location":"aws","Name":"ip-.compute.internal","PersistentKeepalive":0,"Subnet":{"IP":"10.244.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="},"DiscoveredEndpoints":{"s59IEyyKc0YEhhi6Q5h1Rey8YZ8RNnZSbOZZnlmbK0I=":{"IP":"216.180.70.97","Port":52382,"Zone":""}},"AllowedLocationIPs":null,"Granularity":"location"},"ts":"2022-05-05T21:39:25.448677271Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:39:25.448904043Z"}
{"caller":"mesh.go:803","component":"kilo","endpoint":"216.180.70.97:52382","former-endpoint":"laptop.vm.private.ip:51820","latest-handshake":"2022-05-05T21:38:37.002033221Z","level":"debug","msg":"WireGuard Update NAT Endpoint","node":"fedora","same":false,"ts":"2022-05-05T21:39:26.099158903Z"}
{"DiscoveredEndpoints":{"s59IEyyKc0YEhhi6Q5h1Rey8YZ8RNnZSbOZZnlmbK0I=":{"IP":"216.180.70.97","Port":52382,"Zone":""}},"caller":"mesh.go:817","component":"kilo","level":"debug","msg":"Discovered WireGuard NAT Endpoints","ts":"2022-05-05T21:39:26.099273113Z"}
{"allowedIPs":[{"IP":"10.244.0.0","Mask":"////AA=="},{"IP":"ec2.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.0.0","Mask":"////AA=="}],"component":"kilo","endpoint":"ec2.public.ip:51820","hostnames":["ip-.compute.internal"],"leader":0,"level":"debug","location":"location:aws","msg":"generated segment","privateIPs":["ec2.private.ip"],"ts":"2022-05-05T21:39:26.100461957Z"}
{"allowedIPs":[{"IP":"10.244.1.0","Mask":"////AA=="},{"IP":"laptop.vm.private.ip","Mask":"/////w=="}],"allowedLocationIPs":null,"caller":"topology.go:191","cidrs":[{"IP":"10.244.1.0","Mask":"////AA=="}],"component":"kilo","endpoint":"laptop.vm.private.ip:51820","hostnames":["fedora"],"leader":0,"level":"debug","location":"location:laptop","msg":"generated segment","privateIPs":["laptop.vm.private.ip"],"ts":"2022-05-05T21:39:26.101160484Z"}
{"caller":"topology.go:239","component":"kilo","hostname":"ip-.compute.internal","leader":true,"level":"debug","location":"location:aws","msg":"generated topology","privateIP":"ec2.private.ip/24","subnet":"10.244.0.0/24","ts":"2022-05-05T21:39:26.102044874Z","wireGuardIP":"10.4.0.1/16"}
{"caller":"mesh.go:368","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2022-05-05T21:39:26.137148488Z"}
{"caller":"mesh.go:280","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2022-05-05T21:39:26.138095411Z"}
@squat is there anything i can do to help?
This is a duplicate of #189. I think we should close this issue.
Setup
kubeadm init ...
kubeadm join ...
(not as control plane)manifests/kilo-kubeadm.yaml
from this repoecho "module wireguard +p" >/sys/kernel/debug/dynamic_debug/control
KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Situation
Relevant log on the laptop VM:
CrashLoopBackOff
.Description
The wg configuration for the EC2 instance has
Endpoint=my.external.laptop.ip:52412
listed in the Peer configuration stanza. I think when key rotation happens, the EC2 instance tries to connect back to the laptop VM (which it can't), and the handshake errors are from the mismatch of the laptop VM trying to connect to the EC2 instance at the same time.I think if the Endpoint= wasn't listed in the Peer stanza, the laptop VM would just reconnect after rekeying and life would continue on. I'm hoping that would also fix why the laptop VM is still in
NotReady
too.