squat / kilo

Kilo is a multi-cloud network overlay built on WireGuard and designed for Kubernetes (k8s + wg = kg)
https://kilo.squat.ai
Apache License 2.0

Repeated attempts to reconcile mesh network #253

Open sbaildon opened 2 years ago

sbaildon commented 2 years ago

I have an issue where, when I connect an outside peer (e.g. my laptop) to the cluster, Kilo sees that the old and new WireGuard configurations differ and recreates the mesh to reconcile them. However, the resulting config never matches what is expected, so Kilo keeps trying to reconcile, killing the network every ~30 seconds.

I'm going to keep debugging, but I created this issue just in case you know what's up before I spend time here.

I added some prints to see what was going on:

level.Info(logger).Log("reason", "peer endpoints", "c", c, "b", b)

B

```
{
  "b": {
    "Interface": { "ListenPort": 51820, "PrivateKey": "redacted=" },
    "Peers": [
      {
        "AllowedIPs": [
          { "IP": "10.0.0.2", "Mask": "/////w==" },
          { "IP": "10.4.0.2", "Mask": "/////w==" },
          { "IP": "10.42.0.0", "Mask": "////AA==" }
        ],
        "Endpoint": { "DNS": "", "IP": "10.0.0.2", "Port": 51820 },
        "PersistentKeepalive": 0,
        "PresharedKey": null,
        "PublicKey": "MDN2K2trTzZmVGNTSW42MktibGs2d3BkMW5pdnEyOElXVU0wU3hhQ3AxMD0=",
        "LatestHandshake": "2021-11-14T13:15:34Z"
      },
      {
        "AllowedIPs": [ { "IP": "10.5.0.2", "Mask": "/////w==" } ],
        "Endpoint": null,
        "PersistentKeepalive": 0,
        "PresharedKey": null,
        "PublicKey": "WFZjZDhEQjloZFAxUENTeXh1QVBha3BCOVpqRCt1TWdCUld2Q3lJbDAxZz0=",
        "LatestHandshake": "0001-01-01T00:00:00Z"
      },
      {
        "AllowedIPs": [ { "IP": "10.5.0.1", "Mask": "/////w==" } ],
        "Endpoint": { "DNS": "", "IP": "91.130.160.180", "Port": 57943 },
        "PersistentKeepalive": 0,
        "PresharedKey": null,
        "PublicKey": "YjJxN1ZaeEpiZnl3Nlh6ZFRQR1JkSGJqVHRIblpwVlZwY1FhNHpyTmtWRT0=",
        "LatestHandshake": "2021-11-14T13:17:02Z"
      }
    ]
  },
  "caller": "conf.go:355",
  "level": "info",
  "reason": "peer endpoints",
  "ts": "2021-11-14T13:17:30.217962107Z"
}
```

C

```
{
  "c": {
    "Interface": { "ListenPort": 51820, "PrivateKey": "redacted=" },
    "Peers": [
      {
        "AllowedIPs": [
          { "IP": "10.0.0.2", "Mask": "/////w==" },
          { "IP": "10.4.0.2", "Mask": "/////w==" },
          { "IP": "10.42.0.0", "Mask": "////AA==" }
        ],
        "Endpoint": { "DNS": "", "IP": "10.0.0.2", "Port": 51820 },
        "PersistentKeepalive": 0,
        "PresharedKey": null,
        "PublicKey": "MDN2K2trTzZmVGNTSW42MktibGs2d3BkMW5pdnEyOElXVU0wU3hhQ3AxMD0=",
        "LatestHandshake": "0001-01-01T00:00:00Z"
      },
      {
        "AllowedIPs": [ { "IP": "10.5.0.2", "Mask": "/////w==" } ],
        "Endpoint": null,
        "PersistentKeepalive": 0,
        "PresharedKey": null,
        "PublicKey": "WFZjZDhEQjloZFAxUENTeXh1QVBha3BCOVpqRCt1TWdCUld2Q3lJbDAxZz0=",
        "LatestHandshake": "0001-01-01T00:00:00Z"
      },
      {
        "AllowedIPs": [ { "IP": "10.5.0.1", "Mask": "/////w==" } ],
        "Endpoint": null,
        "PersistentKeepalive": 0,
        "PresharedKey": null,
        "PublicKey": "YjJxN1ZaeEpiZnl3Nlh6ZFRQR1JkSGJqVHRIblpwVlZwY1FhNHpyTmtWRT0=",
        "LatestHandshake": "0001-01-01T00:00:00Z"
      }
    ]
  },
  "caller": "conf.go:355",
  "level": "info",
  "reason": "peer endpoints",
  "ts": "2021-11-14T13:17:30.217962107Z"
}
```

Turns out my laptop peer, 10.5.0.1, has a configured endpoint in the old conf, b, but a null endpoint in the new conf, c, and that's what is causing Kilo to reconcile the differences.
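The manual comparison of the two dumps can be sketched in Go. This is only an illustration: `Endpoint`, `Peer`, and `endpointDiffs` are hypothetical names invented here, not Kilo's own types or functions, and the peer labels stand in for the long public keys in the dump.

```go
package main

import "fmt"

// Endpoint and Peer are hypothetical, trimmed-down types for illustration;
// they are not Kilo's own structs.
type Endpoint struct {
	IP   string
	Port int
}

type Peer struct {
	PublicKey string
	Endpoint  *Endpoint
}

// endpointDiffs returns the public keys of peers whose endpoint differs
// between the old and new peer lists. A nil endpoint on one side and a
// set endpoint on the other counts as a difference.
func endpointDiffs(oldPeers, newPeers []Peer) []string {
	byKey := make(map[string]*Endpoint, len(oldPeers))
	for _, p := range oldPeers {
		byKey[p.PublicKey] = p.Endpoint
	}
	var diffs []string
	for _, p := range newPeers {
		old, ok := byKey[p.PublicKey]
		if !ok {
			continue // added or removed peers are a separate kind of diff
		}
		switch {
		case old == nil && p.Endpoint == nil:
			// both unset: no difference
		case old == nil || p.Endpoint == nil:
			diffs = append(diffs, p.PublicKey)
		case *old != *p.Endpoint:
			diffs = append(diffs, p.PublicKey)
		}
	}
	return diffs
}

func main() {
	// Values taken from the b (old) and c (new) dumps; keys shortened to labels.
	b := []Peer{
		{PublicKey: "node", Endpoint: &Endpoint{IP: "10.0.0.2", Port: 51820}},
		{PublicKey: "peer-10.5.0.2"},
		{PublicKey: "laptop", Endpoint: &Endpoint{IP: "91.130.160.180", Port: 57943}},
	}
	c := []Peer{
		{PublicKey: "node", Endpoint: &Endpoint{IP: "10.0.0.2", Port: 51820}},
		{PublicKey: "peer-10.5.0.2"},
		{PublicKey: "laptop"},
	}
	fmt.Println(endpointDiffs(b, c)) // prints [laptop]
}
```

Only the laptop peer is reported, which matches the observation above: its endpoint is set in b and null in c.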

leonnicolas commented 2 years ago

I think this is because your laptop's endpoint has been discovered automatically since #146. The spec of your laptop's Peer has a nil endpoint, but the actual endpoint has been added to the live configuration, so spec and reality have diverged and Kilo wants to reapply the spec. Let me check why I haven't noticed this with my laptop. Maybe this is wrong.

leonnicolas commented 2 years ago

What is the Peer spec of your laptop? Did you set persistent-keep-alive to 0? Because the endpoint is not updated if it is 0: https://github.com/squat/kilo/blob/05e8ded744207571389e208353209016c449ba79/pkg/mesh/topology.go#L275

sbaildon commented 2 years ago

> What is the Peer spec of your laptop? Did you set persistent-keep-alive to 0? Because the endpoint is not updated if it is 0:
>
> https://github.com/squat/kilo/blob/05e8ded744207571389e208353209016c449ba79/pkg/mesh/topology.go#L275

Brilliant, that's exactly what's happening. I've added a persistentKeepalive and the network stays stable.
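One way to read the explanation above, as a hedged Go sketch rather than a copy of `topology.go`: the discovered endpoint is only carried into the desired configuration when the keepalive is non-zero, so with a keepalive of 0 the desired endpoint stays nil while the live one is set. `desiredEndpoint` and `sameEndpoint` are hypothetical helpers, and the keepalive value 25 is an assumption chosen for the example.

```go
package main

import "fmt"

// Endpoint is a hypothetical, trimmed-down type for illustration.
type Endpoint struct {
	IP   string
	Port int
}

// desiredEndpoint is a hypothetical helper: it keeps the endpoint from the
// Peer spec unless a keepalive is configured and an endpoint was discovered.
func desiredEndpoint(spec, discovered *Endpoint, keepalive int) *Endpoint {
	if keepalive > 0 && discovered != nil {
		return discovered
	}
	return spec
}

// sameEndpoint compares two endpoints, treating nil as equal only to nil.
func sameEndpoint(a, b *Endpoint) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

func main() {
	// The endpoint WireGuard learned for the laptop, taken from the dump above.
	live := &Endpoint{IP: "91.130.160.180", Port: 57943}
	for _, keepalive := range []int{0, 25} {
		want := desiredEndpoint(nil, live, keepalive)
		fmt.Printf("keepalive=%d reconcile=%t\n", keepalive, !sameEndpoint(want, live))
	}
	// prints:
	// keepalive=0 reconcile=true
	// keepalive=25 reconcile=false
}
```

Under these assumptions, a keepalive of 0 leaves desired and live configuration permanently different, which would explain a reconcile on every sync.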

sbaildon commented 2 years ago

Defining a peer with a persistent keepalive of 0:

apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0

Kilo still attempts to reconcile the mesh network; see the third log line, ~30 seconds after the apply:

{"caller":"mesh.go:344","component":"kilo","event":"add","level":"info","peer":{"PublicKey":[75,56,108,29,170,111,39,46,181,186,188,199,76,11,241,220,139,187,0,217,78,248,241,155,176,252,191,152,166,60,83,93],"Remove":false,"UpdateOnly":false,"PresharedKey":null,"PersistentKeepaliveInterval":0,"ReplaceAllowedIPs":false,"AllowedIPs":[{"IP":"10.5.0.1","Mask":"/////w=="}],"Endpoint":null,"Name":"laptop"},"ts":"2022-05-25T00:50:29.118108442Z"}

{"caller":"mesh.go:544","component":"kilo","diff":"number of peers: old=1, new=2","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:29.16908714Z"}

{"caller":"mesh.go:544","component":"kilo","diff":"peer endpoints: nil value","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:59.040795773Z"}
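To match a logged peer against a Peer spec, it helps to decode the log fields. The sketch below assumes the log was produced with Go's `encoding/json`, which prints byte arrays as lists of numbers and byte slices as base64; `keyString` and `maskPrefixLen` are hypothetical helper names.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net"
)

// loggedKey is the "PublicKey" byte array from the "add" event above.
var loggedKey = []byte{
	75, 56, 108, 29, 170, 111, 39, 46, 181, 186, 188, 199, 76, 11, 241, 220,
	139, 187, 0, 217, 78, 248, 241, 155, 176, 252, 191, 152, 166, 60, 83, 93,
}

// keyString renders raw key bytes in the base64 form used in Peer specs.
func keyString(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// maskPrefixLen decodes a base64-encoded IP mask from the log and returns
// its prefix length.
func maskPrefixLen(b64 string) (int, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return 0, err
	}
	ones, bits := net.IPMask(raw).Size()
	if bits == 0 {
		return 0, fmt.Errorf("%q is not a canonical mask", b64)
	}
	return ones, nil
}

func main() {
	fmt.Println(keyString(loggedKey))
	n, err := maskPrefixLen("/////w==")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("10.5.0.1/%d\n", n)
	// prints:
	// SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
	// 10.5.0.1/32
}
```

The decoded key and allowed IP match the laptop Peer spec above. The same helper returns 24 for the `////AA==` mask that appears on 10.42.0.0 in the earlier dump.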

Is the intention of this code path to prevent mesh reconciliation if pka == nil || pka == 0? Or am I misunderstanding?

https://github.com/squat/kilo/blob/4be792ea543a9c2656574ec060b335c587244a3d/pkg/mesh/topology.go#L291

FWIW, I'm not bothered about keeping otherwise silent connections alive through NAT

sbaildon commented 2 years ago

Some mysterious behaviour I don't quite understand: I have a peer configuration called phone, intended for my, well, uh, phone, which didn't cause mesh reconciliation (I'm tailing Kilo's logs). My phone is connected to the same WiFi network; there's no cellular involved here.

apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0
---
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: phone
spec:
  allowedIPs:
  - 10.5.0.2/32
  publicKey: urgVgSoHEwG5/7q0k5NpjWSBpAyxPfhvdT/v0zd561o=
  persistentKeepalive: 0

Taking a stab in the dark that something is up with the laptop peer, I created a third peer, dummy, and connected from my laptop. No good; there's mesh reconciliation there too.

apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: dummy
spec:
  allowedIPs:
  - 10.5.0.3/32
  publicKey: AzckRiPfM30PNbyX/kxCv59YlIfaoj/hVU7LPkxuuAw=
  persistentKeepalive: 0

Okay, so now thinking something is up with the clients, I migrate the laptop peer config to my phone and connect from there. No good; reconciliation again. I try dummy from my phone. Also reconciliation.

So now the reverse—export the phone peer and import it on my laptop. Strange—there's no reconciliation at all. For whatever reason the phone peer doesn't cause any undesired behaviour.

sbaildon commented 2 years ago

I moved the private key from dummy to phone, kept the rest the same; mesh reconciliation.

Reset phone back to the original keypair—no reconciliation.

🤯