k4yt3x / wg-meshconf

WireGuard full mesh configuration generator.
GNU General Public License v3.0
938 stars 105 forks source link

Bug: `PersistentKeepalive` is added to the wrong side. #30

Closed rudolfbyker closed 1 year ago

rudolfbyker commented 1 year ago

Here is a sample database.csv file:

"Name","Address","Endpoint","AllowedIPs","ListenPort","PersistentKeepalive","FwMark","PrivateKey","DNS","MTU","Table","PreUp","PostUp","PreDown","PostDown","SaveConfig"
"alpha","10.0.0.1","alpha.example.com","","51820","","","aCqUml43tjjw3SBVM+M9IyrApZ5pthzrFBMxWKnLPE0=","","","","","","","",""
"beta","10.0.0.2","beta.example.com","","51820","","","cJE+l3HYxY5eHzxUfvNP7i5nbR7TjaSVYwqxvjW4Hl4=","","","","","","","",""
"gamma","10.0.0.3","","","51820","25","","4B6WZu3OyCXIOWyuN7dFKh/FGsVIFzkSuGsszdMsLlg=","","","","","","","",""

This generates the following config:

alpha:

[Interface]
# Name: alpha
Address = 10.0.0.1
PrivateKey = aCqUml43tjjw3SBVM+M9IyrApZ5pthzrFBMxWKnLPE0=
ListenPort = 51820

[Peer]
# Name: beta
PublicKey = PBFFzDSLOn0CyxF2d0SWo8F0xktozEQFflmIYgYDg14=
Endpoint = beta.example.com:51820
AllowedIPs = 10.0.0.2

[Peer]
# Name: gamma
PublicKey = HV5qryi3YcrhKQd/4A0h6xxWr+ARlncT06K+BY9XsCU=
AllowedIPs = 10.0.0.3
PersistentKeepalive = 25

beta:

[Interface]
# Name: beta
Address = 10.0.0.2
PrivateKey = cJE+l3HYxY5eHzxUfvNP7i5nbR7TjaSVYwqxvjW4Hl4=
ListenPort = 51820

[Peer]
# Name: alpha
PublicKey = KE5NYiNewB3VwoIHGPXBCDxGphf6m3gGUyLyhy7Vd2A=
Endpoint = alpha.example.com:51820
AllowedIPs = 10.0.0.1

[Peer]
# Name: gamma
PublicKey = HV5qryi3YcrhKQd/4A0h6xxWr+ARlncT06K+BY9XsCU=
AllowedIPs = 10.0.0.3
PersistentKeepalive = 25

gamma:

[Interface]
# Name: gamma
Address = 10.0.0.3
PrivateKey = 4B6WZu3OyCXIOWyuN7dFKh/FGsVIFzkSuGsszdMsLlg=
ListenPort = 51820

[Peer]
# Name: alpha
PublicKey = KE5NYiNewB3VwoIHGPXBCDxGphf6m3gGUyLyhy7Vd2A=
Endpoint = alpha.example.com:51820
AllowedIPs = 10.0.0.1

[Peer]
# Name: beta
PublicKey = PBFFzDSLOn0CyxF2d0SWo8F0xktozEQFflmIYgYDg14=
Endpoint = beta.example.com:51820
AllowedIPs = 10.0.0.2

I have read all I could find about PersistentKeepalive (which is not a lot), and it sounds to me like it should only be specified in the config of the node which is behind NAT, which is gamma in this case. But in the above example, it's specified everywhere except in gamma's config.

Is the bug in wg-meshconf, or in my understanding of PersistentKeepalive?

My use case is to have a VPN of VMs, some of which are behind NAT, without public IPs, but most of which have public IPs. The servers without the public IPs should route traffic through one of the servers which DO have public IPs to reach other servers with don't have public IPs.

And yes, I'm trying to replace tinc :)

k4yt3x commented 1 year ago

Long story short, follow this:

PersistentKeepalive should be specified on the node BEHIND the NAT.

You'll need this at all because the nodes that have a public IP can always be reached without an existing NAT connection, where as nodes behind NAT cannot be reached actively without maintaining a NAT connection on the firewall or whatever that's forwarding the connection. Therefore, you use PersistentKeepalive to keep that connection alive so the outside will always be able to launch an active connection to the node behind NAT. It won't work the other way around because the node outside cannot reach the node behind NAT, so the initial connection won't get established.

k4yt3x commented 1 year ago

Is the bug in wg-meshconf, or in my understanding of PersistentKeepalive?

With that said, your understanding seems to be correct. I'll take a look at the code, it might be a bug.

rudolfbyker commented 1 year ago

Thanks for the feedback :)

I have done some tests … It seems like PersistenKeepalive sort-of works when specified in the "wrong" (i.e. the public) VM's config file, but only after you manually make the first connection from the "correct" (i.e. private behind NAT) VM. A simple ping -c 1 ... works. After that, the public one keeps it going by pinging back every 25 seconds. But when it breaks down for whatever reason, I need to ping again manually from the private VM to get it working again.

So it would be nice if we could turn this around in the config generator.

rudolfbyker commented 1 year ago

Looking into the code, it seems that the problem here is that PersistentKeepAlive is listed as a "peer attribute", which in the genconfig method is treated as something that should be set in the [Peer] section of OTHER nodes. But this specific attribute should be set on that specific peer, since it should call out to other peers to let them know where it is, since it does not have a public endpoint.

So I think we need to distinguish between two types of [Peer] attributes:

PersistentKeepAlive is implemented as the former, but it should be the latter.

rudolfbyker commented 1 year ago

@k4yt3x my PR #33 fixes this bug. Please consider merging.

k4yt3x commented 1 year ago

Hey @rudolfbyker thanks for the PR. Just give me a bit of time to verify that everything works fine and the change makes sense.