gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: Routing flaws for the netmaker server in the presence of multiple networks and egress servers #1433

Open pquan opened 2 years ago

pquan commented 2 years ago

Contact Details

No response

What happened?

When the netmaker server is part of one or more networks with universal egress servers (i.e. private vpn case), it fails to connect to peers / clients and the internet in general.

I'll explain my setup in more detail so the statement above is easier to understand.

I have multiple two-node networks, where the second node is a universal egress node (i.e. it routes all internet traffic out). Clients that connect to this egress node have all their connectivity redirected through the node.

Unfortunately, the egress node "pulls" the connectivity for the netmaker server too. It creates routes for the netmaker server through the egress node.

With two or more such networks, the netmaker node tries to establish multiple, redundant routes through the different networks. This is the first issue. I get multiple route additions like this:

route add a.b.0.0/mask via net-1
route add a.b.0.0/mask via net-2
route add a.b.0.0/mask via net-3
...

Of course, the second and subsequent routes fail.
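The failure can be reproduced in a toy model. The following is a minimal sketch, not Netmaker code: it assumes the kernel treats a route as a duplicate when the destination prefix and metric match, regardless of the outgoing device, which is consistent with the `RTNETLINK answers: File exists` lines in the log output. The `RouteTable` class and `install_egress_routes` helper are hypothetical names used only for illustration.

```python
class RouteTable:
    """Toy model of a single routing table (hypothetical, for illustration)."""

    def __init__(self):
        # Key: (destination prefix, metric). Value: outgoing device.
        self._routes = {}

    def add(self, prefix, dev, metric=0):
        key = (prefix, metric)
        if key in self._routes:
            # Mirrors the "File exists" error seen in the log output.
            raise FileExistsError(
                f"{prefix} metric {metric} already via {self._routes[key]}"
            )
        self._routes[key] = dev

    def devices_for(self, prefix):
        """Return the devices that currently hold a route for this prefix."""
        return [dev for (p, _), dev in sorted(self._routes.items()) if p == prefix]


def install_egress_routes(table, prefix, devices):
    """Add the same prefix once per interface; return the interfaces that failed."""
    failed = []
    for dev in devices:
        try:
            table.add(prefix, dev)
        except FileExistsError:
            failed.append(dev)
    return failed


table = RouteTable()
failed = install_egress_routes(table, "0.0.0.0/5", ["nm-vpn4", "nm-vpn1", "nm-vpn2"])
print(failed)  # interfaces whose route could not be added
```

Only the first interface ends up holding the route; every later network is left without one, which matches the behaviour described above.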

The second issue is that the netmaker server itself loses connectivity to the internet and to some other clients, so it's impossible for them to connect. This is because a packet arrives at netmaker on one interface and the reply is routed back through the egress node. Subject to firewall and martian filters, most of these packets are dropped, so only one of the networks can work.

This is a design flaw that needs addressing, either by excluding the netmaker server from the egress gateways (the easier option, perhaps as a setting) or by allowing more sophisticated routing to be specified for each node.

I think, as a first band-aid, the netmaker server should not pull the egress routes on any network. If I may suggest, this node should not participate in any network anyway, since it is the controlling node and is exposed to the internet.
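The band-aid could look roughly like the filter below. This is only a sketch under stated assumptions, not how Netmaker implements it: "internet-wide" is taken to mean any IPv4 range not contained in RFC 1918 space, and `routes_to_install` and its `is_server` flag are hypothetical names.

```python
import ipaddress

# Assumption: anything outside RFC 1918 space counts as an internet-wide range.
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private_range(cidr):
    """True if the CIDR block lies entirely inside one RFC 1918 network."""
    net = ipaddress.ip_network(cidr)
    return any(net.version == p.version and net.subnet_of(p) for p in PRIVATE_NETS)


def routes_to_install(egress_ranges, is_server):
    """Return the egress ranges a node should route; the server skips public ones."""
    if not is_server:
        return list(egress_ranges)
    return [r for r in egress_ranges if is_private_range(r)]


ranges = ["0.0.0.0/5", "8.0.0.0/7", "10.20.0.0/16", "192.168.1.0/24"]
server_routes = routes_to_install(ranges, is_server=True)
client_routes = routes_to_install(ranges, is_server=False)
print(server_routes)
```

Ordinary nodes keep every egress range, while the server keeps only the private subnets, so it never routes its own internet traffic through an egress node.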

As a second step, I'd suggest adding a setting to each egress node for turning on its routes for the other nodes, and adding a metric to each such route. I'd add another setting that controls whether the egress routes are pushed to pure clients (i.e. WireGuard external clients) for each node. This would allow a single network with multiple egress nodes with proper metrics, and a single network with multiple egress nodes where clients can connect to the one of their choice (i.e. a distributed VPN with multiple ingress and egress nodes).
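To illustrate the metric idea, here is a small sketch of how a node might choose between several egress nodes that advertise the same prefix. It assumes a lower metric is preferred and that something else marks an egress node as down; the `EgressRoute` type, the `up` flag, and the node names are hypothetical and do not correspond to existing Netmaker settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressRoute:
    prefix: str
    node: str       # hypothetical egress node name
    metric: int     # lower is preferred
    up: bool = True  # assumed to be maintained by some health check


def pick_egress(routes, prefix):
    """Return the reachable egress node with the lowest metric, or None."""
    candidates = [r for r in routes if r.prefix == prefix and r.up]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.metric).node


routes = [
    EgressRoute("0.0.0.0/0", "egress-a", metric=100),
    EgressRoute("0.0.0.0/0", "egress-b", metric=200),
]
primary = pick_egress(routes, "0.0.0.0/0")

# Simulate egress-a going down: the higher-metric route takes over.
degraded = [EgressRoute("0.0.0.0/0", "egress-a", metric=100, up=False), routes[1]]
fallback = pick_egress(degraded, "0.0.0.0/0")
print(primary, fallback)
```

With distinct metrics, the same prefix can be offered by several egress nodes without the routes colliding, and the preference order is explicit.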

Version

v0.14.6

What OS are you using?

Linux, Windows

Relevant log output

netmaker       | [netmaker] 2022-07-29 07:59:57 error running command: /sbin/ip -4 route add 0.0.0.0/5 dev nm-vpn4
netmaker       | [netmaker] 2022-07-29 07:59:57 RTNETLINK answers: File exists
....
netmaker       | [netmaker] 2022-07-29 07:59:57 error running command: /sbin/ip -4 route add 0.0.0.0/5 dev nm-vpn1
netmaker       | [netmaker] 2022-07-29 07:59:57 RTNETLINK answers: File exists
...
netmaker       | [netmaker] 2022-07-29 07:59:57 error running command: /sbin/ip -4 route add 0.0.0.0/5 dev nm-vpn2
netmaker       | [netmaker] 2022-07-29 07:59:57 RTNETLINK answers: File exists
...

When this happens on other clients I get:

2022-07-29 09:58:52.332: [TUN] [nm-net32] Sending handshake initiation to peer 1 (net.mak.er.ip:51821)
2022-07-29 09:58:52.489: [TUN] [nm-net32] Handshake for peer 1 (net.mak.er.ip:51821) did not complete after 5 seconds, retrying (try 2)

To fix the issue I have to manually remove the internet routes on the netmaker server by

ip route del 0.0.0.0/5
....

mattkasun commented 2 years ago

How are you defining your VPN egress? Are you using 0.0.0.0/0 or the multiple ranges described in the docs? https://docs.netmaker.org/egress-gateway.html

pquan commented 2 years ago

Sorry for the delay. I'm defining it with multiple ranges, but this does not appear to be relevant. With 0.0.0.0/0 or not, the nodes will have multiple outbound routes towards the internet via the egress nodes, and this will break connectivity to them (and the WireGuard setup of the interface too). The egress route logic works only for egress to specific private subnets; it breaks when we define an internet-wide egress route, whether via 0.0.0.0/0 or by listing the public ranges.
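For context, "listing the public ranges" amounts to splitting 0.0.0.0/0 into CIDR blocks that skip the private networks. The sketch below shows one way to derive such a list with Python's `ipaddress` module. It assumes only the three RFC 1918 networks are excluded, so the result is not necessarily the list given in the Netmaker docs.

```python
import ipaddress

# Assumption: only the RFC 1918 networks are carved out of the address space.
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def public_ranges():
    """Split 0.0.0.0/0 into CIDR blocks that avoid the RFC 1918 ranges."""
    remaining = [ipaddress.ip_network("0.0.0.0/0")]
    for private in PRIVATE_NETS:
        next_remaining = []
        for net in remaining:
            if private.subnet_of(net):
                # Carve the private block out of the block that encloses it.
                next_remaining.extend(net.address_exclude(private))
            else:
                next_remaining.append(net)
        remaining = next_remaining
    return sorted(remaining)


ranges = public_ranges()
for net in ranges:
    print(net)
```

The first block in the sorted result is 0.0.0.0/5, the same prefix that appears in the failing `ip -4 route add` commands in the log, which shows that the split ranges are still internet-wide routes as far as the server is concerned.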

mattkasun commented 2 years ago

PRs #1455 and #1467 fix this.

pquan commented 2 years ago

Not fixed. The behaviour in 0.15.0 is the same as in 0.14.6. Shall I use 0.0.0.0/0 as the egress?

afeiszli commented 2 years ago

Yes, this is fixed specifically for 0.0.0.0/0. When you add an egress with 0.0.0.0/0, the route will not be added to the server.