Open sbaildon opened 2 years ago
I think it is because your laptop's endpoint has been discovered since #146, and Kilo now wants to reapply the spec of your laptop's peer, which has a nil endpoint: the actual endpoint has been added, so spec and reality have diverged. Let me check why I haven't noticed this with my laptop. Maybe this is wrong.
What is the Peer spec of your laptop? Did you set persistent-keep-alive to 0? Because the endpoint is not updated if it is 0: https://github.com/squat/kilo/blob/05e8ded744207571389e208353209016c449ba79/pkg/mesh/topology.go#L275
Brilliant, that's exactly what's happening. I've added a persistentKeepalive and the network stays stable.
Defining a peer with a persistent keep alive of 0
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0
still sees kilo attempt to reconcile the mesh network; see the third log line, ~30 seconds after apply:
{"caller":"mesh.go:344","component":"kilo","event":"add","level":"info","peer":{"PublicKey":[75,56,108,29,170,111,39,46,181,186,188,199,76,11,241,220,139,187,0,217,78,248,241,155,176,252,191,152,166,60,83,93],"Remove":false,"UpdateOnly":false,"PresharedKey":null,"PersistentKeepaliveInterval":0,"ReplaceAllowedIPs":false,"AllowedIPs":[{"IP":"10.5.0.1","Mask":"/////w=="}],"Endpoint":null,"Name":"laptop"},"ts":"2022-05-25T00:50:29.118108442Z"}
{"caller":"mesh.go:544","component":"kilo","diff":"number of peers: old=1, new=2","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:29.16908714Z"}
{"caller":"mesh.go:544","component":"kilo","diff":"peer endpoints: nil value","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:59.040795773Z"}
Is the intention of this code path to prevent mesh reconciliation if pka == nil || pka == 0? Or am I misunderstanding?
FWIW, I'm not bothered about keeping otherwise silent connections alive through NAT.
Some mysterious behaviour I don't quite understand: I have a peer configuration called phone that is intended for my, well, uh, phone, which didn't cause mesh reconciliation (I'm tailing kilo's logs). My phone is connected to the same WiFi network; there's no cellular involved here.
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0
---
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: phone
spec:
  allowedIPs:
  - 10.5.0.2/32
  publicKey: urgVgSoHEwG5/7q0k5NpjWSBpAyxPfhvdT/v0zd561o=
  persistentKeepalive: 0
Taking a stab in the dark that something is up with the laptop peer, I created a third peer, dummy, and connected from my laptop. No good; there's mesh reconciliation there too.
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: dummy
spec:
  allowedIPs:
  - 10.5.0.3/32
  publicKey: AzckRiPfM30PNbyX/kxCv59YlIfaoj/hVU7LPkxuuAw=
  persistentKeepalive: 0
Okay, so now thinking something is up with the clients, I migrated the laptop peer config to my phone and connected from there. No good; reconciliation again. I tried dummy from my phone. Also reconciliation.
So now the reverse: export the phone peer and import it on my laptop. Strange; there's no reconciliation at all. For whatever reason the phone peer doesn't cause any undesired behaviour.
I moved the private key from dummy to phone, kept the rest the same; mesh reconciliation.
Reset phone back to the original keypair: no reconciliation.
🤯
I have an issue where, when I connect an outside peer (e.g. my laptop) to the cluster, kilo sees that the configurations aren't the same and recreates the mesh to reconcile the differences. However, the config is never as expected and kilo will constantly attempt to reconcile, killing the network every ~30 seconds.
I'm going to keep debugging, but I created this issue just in case you know what's up before I spend time here.
I added some prints to see what was going on:
level.Info(logger).Log("reason", "peer endpoints", "c", c, "b", b)
Turns out my laptop peer, 10.5.0.1, has a configured endpoint in oldConf, b, but is null in the new conf, c, and that's what's causing kilo to reconcile the differences.