gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

limitations for the environment in nodes #149

Closed changchichung closed 3 years ago

changchichung commented 3 years ago

Here is my test architecture for Netmaker:

Let me explain.

So now, all nodes are connected to the Netmaker server and all show as healthy (except the server itself...).

And here is the ping test table after all nodes connected:

This messy architecture (4G router, double NAT, nodes in the same private subnet) comes with its own messy problems; you can find something "interesting" in the table.

Nodes 1, 2, 5, and 6 can communicate with each other.

But the other nodes are a different story.

What a mess!

So maybe there are some limitations on the environment a node can run in? It seems you might need more than just a node that can connect to the Internet.

afeiszli commented 3 years ago

Hi @changchichung, thank you for this detailed issue, it is very helpful.

  1. Double NAT is known to have issues and we likely won't have a solution for a few iterations.

  2. I am not familiar with 4G LTE routers, so that will be more challenging. Have you manually configured WireGuard interfaces for this device in the past? I would be interested to know how it must be set up for a manual configuration. The node is also using IPv4 by default with Netmaker, but you could edit this to be the IPv6 address to see if that helps (see the sketch after this list).

  3. For the private LAN, have you tried setting up site-to-site? That may be more appropriate for this layout.

  4. It is very strange that 4,8,9 can only go to node 2. If they can go to node 2, then 5 and 6 should be reachable as well. I think the issue with 8 and 9 is likely related to the private lan scenario. I am guessing that 7 gets all the connectivity because it was set up first, and then some backend issue is causing 8 and 9 to not be configured. It likely is a bug. But for 4, I really don't know.
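
Regarding the IPv6 idea in point 2, a minimal sketch of a local test (assuming the interface is nm-testvpn; the peer public key and IPv6 address below are placeholders, and netclient may overwrite the endpoint on its next check-in):

# point one peer at an IPv6 endpoint directly on the WireGuard interface
sudo wg set nm-testvpn peer <peer-public-key> endpoint '[<peer-ipv6-address>]:51821'
sudo wg show nm-testvpn    # confirm the endpoint changed, then retry the ping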

I will do a deeper dive into your issues later on and report back, but this is very helpful information, so thank you for your post and please follow up with any additional findings.

changchichung commented 3 years ago

For the 4G LTE router, I think it's a kind of double NAT: the 4G router gets the IP 10.96.x.x from the ISP, and my desktop gets 192.168.5.x from the router. I will try to learn how to configure IPv6 on the router.

I am guessing that 7 gets all the connectivity because it was set up first

That's also what I was thinking. And configuring many nodes in the same private LAN with WireGuard does not seem appropriate, so let's forget about the private LAN part; I'm going to remove the extra nodes to see if that helps.

changchichung commented 3 years ago

So I removed nodes 3, 7, 8, and 9, and rejoined node 7 as a new node. The architecture is now this:

Let's see the ping result of the new node3 (originally node 7):

$ for i in {1,2,4,5,6} ; do ping 10.1.0.$i -c 2 ;done
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.

--- 10.1.0.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1029ms

PING 10.1.0.2 (10.1.0.2) 56(84) bytes of data.

--- 10.1.0.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1005ms

PING 10.1.0.4 (10.1.0.4) 56(84) bytes of data.

--- 10.1.0.4 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1005ms

PING 10.1.0.5 (10.1.0.5) 56(84) bytes of data.

--- 10.1.0.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1005ms

PING 10.1.0.6 (10.1.0.6) 56(84) bytes of data.

--- 10.1.0.6 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1005ms

What a surprise! node3 is unable to communicate with any node!

Here is the routing table of node3:

chchang@hqdc039:~$ ip r
default via 192.168.11.253 dev eno1 proto static metric 100 
10.1.0.0/24 dev nm-testvpn proto kernel scope link src 10.1.0.3 
10.10.0.1 dev pihole scope link 
10.66.66.0/24 dev wg1 proto kernel scope link src 10.66.66.4 
104.31.0.0/16 dev wg1 scope link 
140.112.0.0/16 dev wg1 scope link 
169.254.0.0/16 dev virbr0 scope link metric 1000 linkdown 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
172.18.0.0/16 dev br-abfd08f1cb0c proto kernel scope link src 172.18.0.1 
172.19.0.0/16 dev br-55126d973997 proto kernel scope link src 172.19.0.1 linkdown 
172.20.0.0/16 dev br-23d2976842c9 proto kernel scope link src 172.20.0.1 linkdown 
172.22.1.0/24 dev br-mailcow proto kernel scope link src 172.22.1.1 linkdown 
172.23.0.0/16 dev br-23aac6e82c78 proto kernel scope link src 172.23.0.1 linkdown 
172.24.0.0/16 dev br-708969ef8879 proto kernel scope link src 172.24.0.1 linkdown 
172.29.0.0/16 dev br-3e850f5bfca6 proto kernel scope link src 172.29.0.1 linkdown 
178.62.0.0/16 dev wg0 scope link 
180.149.0.0/16 dev wg1 scope link 
192.168.2.0/24 dev excen scope link 
192.168.10.0/24 dev wg0 proto kernel scope link src 192.168.10.3 
192.168.11.0/24 dev eno1 proto kernel scope link src 192.168.11.39 metric 100 
192.168.16.0/20 dev br-c5c16b4efdd0 proto kernel scope link src 192.168.16.1 linkdown 
192.168.32.0/20 dev br-b95a20ac2c8f proto kernel scope link src 192.168.32.1 linkdown 
192.168.64.0/20 dev br-c6016e75841c proto kernel scope link src 192.168.64.1 linkdown 
192.168.96.0/20 dev br-8dafc6996742 proto kernel scope link src 192.168.96.1 linkdown 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown 
192.168.128.0/20 dev br-c4ae16429e66 proto kernel scope link src 192.168.128.1 linkdown 
192.168.176.0/20 dev br-9858112139ef proto kernel scope link src 192.168.176.1 linkdown 
chchang@hqdc039:~$

And wg show nm-testvpn:

chchang@hqdc039:~$ sudo wg show nm-testvpn
interface: nm-testvpn
  public key: c2198uu8D4H5RUW
  private key: (hidden)
  listening port: 51821

peer: 9jGNS8tkx/2gEUO+9F2LQ
  endpoint: 61.64.147.184:51821
  allowed ips: 10.1.0.1/32
  transfer: 0 B received, 13.44 KiB sent
  persistent keepalive: every 20 seconds

peer: Mfyfr9rkEZN48ya5Pk5fzkty
  endpoint: 45.77.98.9:51821
  allowed ips: 10.1.0.2/32
  transfer: 0 B received, 14.02 KiB sent
  persistent keepalive: every 20 seconds

peer: 1jCs/udk0uiEb4scZra3FdR
  endpoint: 180.217.5.45:51821
  allowed ips: 10.1.0.4/32
  transfer: 0 B received, 13.59 KiB sent
  persistent keepalive: every 20 seconds

peer: 9kU8QxgUGEqDxeIUIq0xK
  endpoint: 34.74.153.47:51821
  allowed ips: 10.1.0.5/32
  transfer: 0 B received, 13.59 KiB sent
  persistent keepalive: every 20 seconds

peer: FxUaHZymc+HRcqLAuaHf
  endpoint: 8.210.138.12:51821
  allowed ips: 10.1.0.6/32
  transfer: 0 B received, 13.88 KiB sent
  persistent keepalive: every 20 seconds
chchang@hqdc039:~$ 

And the journal log:

May 11 16:22:15 hqdc039 systemd[1]: Started Regularly checks for updates in peers and local config.
May 11 16:22:15 hqdc039 netclient[1117457]: Beginning node check in for network testvpn
May 11 16:22:15 hqdc039 netclient[1117457]: Checking into server: ws.cowbay.org:50051
May 11 16:22:15 hqdc039 netclient[1117457]: Checking to see if public addresses have changed
May 11 16:22:15 hqdc039 netclient[1117457]: Addresses have not changed.
May 11 16:22:15 hqdc039 netclient[1117457]: Authenticating with GRPC Server
May 11 16:22:15 hqdc039 netclient[1117457]: Authenticated
May 11 16:22:15 hqdc039 netclient[1117457]: Checking In.
May 11 16:22:16 hqdc039 netclient[1117457]: Checked in.
May 11 16:22:16 hqdc039 netclient[1117457]: Command checkin Executed Successfully
May 11 16:22:16 hqdc039 systemd[1]: netclient@testvpn.service: Succeeded.

So, any suggestions to solve this? Or any more logs I can provide?

afeiszli commented 3 years ago

Was node3 able to get the correct public IP (of the router)? Other than that, my best guess is that the router is blocking udp traffic on port 51821, so maybe check your router firewall settings (as well as the local machine firewall).
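
A few quick checks that might help narrow it down (assuming the default port 51821 and common Linux tooling; adjust names to your setup):

# on node3: confirm the listening port and check the local firewall
sudo wg show nm-testvpn | grep listening
sudo ufw status verbose        # or inspect the iptables/nftables rules directly

# from another node: probe the port through the router
nc -vz -u <node3-public-ip> 51821
# caveat: a UDP "succeeded" from nc is not conclusive, since UDP is connectionless and
# nc only reports failure when it gets an ICMP port-unreachable back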

changchichung commented 3 years ago

1. Was node3 able to get the correct public IP?

OK, I'm on node3:

chchang@hqdc039:~$ ifconfig nm-testvpn
nm-testvpn: flags=209<UP,POINTOPOINT,RUNNING,NOARP>  mtu 1420
        inet 10.1.0.3  netmask 255.255.255.0  destination 10.1.0.3
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)
        RX packets 4751  bytes 246068 (246.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 106822  bytes 15352776 (15.3 MB)
        TX errors 480  dropped 5212 overruns 0  carrier 0  collisions 0

And getting the public IP:

chchang@hqdc039:~$ curl https://ifconfig.me
219.85.234.104

So yes, I can get the correct public IP.

2. "The router is blocking UDP traffic on port 51821": this was my first guess too, let's check!

Router NAT config:

And testing the UDP port from node2 to node3:

2021-05-12 17:03:24 [chchang@ws ~]$ nc -vz -u 219.85.234.104 51821
Connection to 219.85.234.104 51821 port [udp/*] succeeded!
2021-05-13 10:25:10 [chchang@ws ~]$

So the firewall is not blocking UDP traffic on 51821.

And guess what, ping works now! Even though I did not change any settings.

chchang@hqdc039:~$ for i in {1,2,4,5,6} ; do ping 10.1.0.$i -c 2 ;done
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.
64 bytes from 10.1.0.1: icmp_seq=1 ttl=64 time=6.71 ms
64 bytes from 10.1.0.1: icmp_seq=2 ttl=64 time=6.95 ms

--- 10.1.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 6.705/6.827/6.950/0.122 ms
PING 10.1.0.2 (10.1.0.2) 56(84) bytes of data.
64 bytes from 10.1.0.2: icmp_seq=1 ttl=64 time=199 ms
64 bytes from 10.1.0.2: icmp_seq=2 ttl=64 time=201 ms

--- 10.1.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 199.453/200.109/200.766/0.656 ms
PING 10.1.0.4 (10.1.0.4) 56(84) bytes of data.
64 bytes from 10.1.0.4: icmp_seq=1 ttl=64 time=30.0 ms
64 bytes from 10.1.0.4: icmp_seq=2 ttl=64 time=38.7 ms

--- 10.1.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 30.031/34.382/38.734/4.351 ms
PING 10.1.0.5 (10.1.0.5) 56(84) bytes of data.
64 bytes from 10.1.0.5: icmp_seq=1 ttl=64 time=189 ms
64 bytes from 10.1.0.5: icmp_seq=2 ttl=64 time=190 ms

--- 10.1.0.5 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 189.207/189.589/189.971/0.382 ms
PING 10.1.0.6 (10.1.0.6) 56(84) bytes of data.
64 bytes from 10.1.0.6: icmp_seq=1 ttl=64 time=28.3 ms
64 bytes from 10.1.0.6: icmp_seq=2 ttl=64 time=25.9 ms

--- 10.1.0.6 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 25.887/27.089/28.292/1.202 ms

The weirdest part is that node 3 can ping node 4!!

chchang@hqdc039:~$ ping 10.1.0.4 -c 4
PING 10.1.0.4 (10.1.0.4) 56(84) bytes of data.
64 bytes from 10.1.0.4: icmp_seq=1 ttl=64 time=25.0 ms
64 bytes from 10.1.0.4: icmp_seq=2 ttl=64 time=32.8 ms
64 bytes from 10.1.0.4: icmp_seq=3 ttl=64 time=29.0 ms
64 bytes from 10.1.0.4: icmp_seq=4 ttl=64 time=30.0 ms

--- 10.1.0.4 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 24.990/29.199/32.799/2.797 ms
chchang@hqdc039:~$

And from node 4 to node 3:

chchang@chchang-Aspire-M3920:~$ ping 10.1.0.3 -c4
PING 10.1.0.3 (10.1.0.3) 56(84) bytes of data.
64 bytes from 10.1.0.3: icmp_seq=1 ttl=64 time=38.1 ms
64 bytes from 10.1.0.3: icmp_seq=2 ttl=64 time=31.9 ms
64 bytes from 10.1.0.3: icmp_seq=3 ttl=64 time=31.0 ms
64 bytes from 10.1.0.3: icmp_seq=4 ttl=64 time=28.7 ms

--- 10.1.0.3 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 28.702/32.461/38.124/3.482 ms
chchang@chchang-Aspire-M3920:~$ 

But node 4 cannot reach node 1:

chchang@chchang-Aspire-M3920:~$ ping 10.1.0.1 -c 4
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.

--- 10.1.0.1 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3081ms

chchang@chchang-Aspire-M3920:~$ 

That's really weird.

afeiszli commented 3 years ago

That solves one problem then! How long did you wait between making a change to the environment and running your ping test? The netclient updates on a 30 second timer, and if there is any problem with the connection to the server, it could take longer. The network may not have updated before you ran the test.
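
If you want to rule out timing, one option might be to trigger a check-in by hand and watch the log, using the unit name from your journal output above (this assumes the service is safe to start manually):

sudo systemctl start netclient@testvpn
journalctl -u netclient@testvpn -n 20 --no-pager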

Node 4 cannot reach Node 1, but can Node 1 reach Node 4?

You may need to change either the PersistentKeepAlive time or the MTU.

https://wiki.archlinux.org/title/WireGuard#Troubleshooting
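
For reference, a rough sketch of how those two knobs can be adjusted on a live interface (the peer key is a placeholder, and netclient may reapply its own settings on the next check-in):

# lower the tunnel MTU, e.g. to leave room for extra 4G/PPPoE encapsulation
sudo ip link set dev nm-testvpn mtu 1280

# shorten the keepalive for a peer so its NAT mapping stays open
sudo wg set nm-testvpn peer <peer-public-key> persistent-keepalive 15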

changchichung commented 3 years ago

How long did you wait between making a change to the environment and running your ping test?

Not sure, but it was definitely longer than 30 seconds.

Node 1 can reach everywhere except node 4.

Ping result from node 1:

chchang@administrator-ThinkPad-T470:~$ for i in {1..6};do ping -c 2 10.1.0.$i;done
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.
64 bytes from 10.1.0.1: icmp_seq=1 ttl=64 time=0.041 ms
64 bytes from 10.1.0.1: icmp_seq=2 ttl=64 time=0.084 ms

--- 10.1.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1019ms
rtt min/avg/max/mdev = 0.041/0.062/0.084/0.021 ms
PING 10.1.0.2 (10.1.0.2) 56(84) bytes of data.
64 bytes from 10.1.0.2: icmp_seq=1 ttl=64 time=201 ms
64 bytes from 10.1.0.2: icmp_seq=2 ttl=64 time=202 ms

--- 10.1.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 201.018/201.553/202.088/0.535 ms
PING 10.1.0.3 (10.1.0.3) 56(84) bytes of data.
64 bytes from 10.1.0.3: icmp_seq=1 ttl=64 time=7.18 ms
64 bytes from 10.1.0.3: icmp_seq=2 ttl=64 time=7.61 ms

--- 10.1.0.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 7.176/7.393/7.610/0.217 ms
PING 10.1.0.4 (10.1.0.4) 56(84) bytes of data.

--- 10.1.0.4 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1009ms

PING 10.1.0.5 (10.1.0.5) 56(84) bytes of data.
64 bytes from 10.1.0.5: icmp_seq=1 ttl=64 time=190 ms
64 bytes from 10.1.0.5: icmp_seq=2 ttl=64 time=192 ms

--- 10.1.0.5 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 190.470/191.190/191.910/0.720 ms
PING 10.1.0.6 (10.1.0.6) 56(84) bytes of data.
64 bytes from 10.1.0.6: icmp_seq=1 ttl=64 time=29.6 ms
64 bytes from 10.1.0.6: icmp_seq=2 ttl=64 time=30.2 ms

--- 10.1.0.6 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 29.638/29.894/30.150/0.256 ms
chchang@administrator-ThinkPad-T470:~$

Node 4 can reach nodes 2, 3, and 4 (itself).

Ping result from node 4:

chchang@chchang-Aspire-M3920:~$ for i in {1..6} ; do ping -c 2 10.1.0.$i;done
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.

--- 10.1.0.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1021ms

PING 10.1.0.2 (10.1.0.2) 56(84) bytes of data.
64 bytes from 10.1.0.2: icmp_seq=1 ttl=64 time=227 ms
64 bytes from 10.1.0.2: icmp_seq=2 ttl=64 time=247 ms

--- 10.1.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 227.726/237.801/247.877/10.087 ms
PING 10.1.0.3 (10.1.0.3) 56(84) bytes of data.
64 bytes from 10.1.0.3: icmp_seq=1 ttl=64 time=48.5 ms
64 bytes from 10.1.0.3: icmp_seq=2 ttl=64 time=34.5 ms

--- 10.1.0.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 34.541/41.561/48.581/7.020 ms
PING 10.1.0.4 (10.1.0.4) 56(84) bytes of data.
64 bytes from 10.1.0.4: icmp_seq=1 ttl=64 time=0.032 ms
64 bytes from 10.1.0.4: icmp_seq=2 ttl=64 time=0.042 ms

--- 10.1.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1020ms
rtt min/avg/max/mdev = 0.032/0.037/0.042/0.005 ms
PING 10.1.0.5 (10.1.0.5) 56(84) bytes of data.

--- 10.1.0.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1021ms

PING 10.1.0.6 (10.1.0.6) 56(84) bytes of data.

--- 10.1.0.6 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1005ms

chchang@chchang-Aspire-M3920:~$ 

afeiszli commented 3 years ago

To confirm the configuration is correct, I would compare the WireGuard public/private keys using "wg show" on node 4 and on a different node where connections are working. I have not seen it happen, but maybe the keys got misconfigured?

If the configuration is correct, it may be worth trying tcpdump, for instance from node 4 to node 1:

node1: tcpdump -i nm-testvpn
node4: ping -c1 10.1.0.1
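
Spelled out a little more (assuming the interface name nm-testvpn and the default port; run the two tcpdump commands in separate terminals on node 1):

# does any encrypted WireGuard traffic arrive from node 4 at all?
sudo tcpdump -ni any udp port 51821

# does the decrypted ICMP echo make it onto the tunnel interface?
sudo tcpdump -ni nm-testvpn icmp

# then, on node 4:
ping -c 1 10.1.0.1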

Also, just to confirm, check for conflicting address ranges on the other interfaces (ip a); that could also be an issue.
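
For example, something like this makes an overlapping range easy to spot:

ip -br -4 addr show                   # one line per interface with its IPv4 addresses
ip route show | grep '^10\.1\.0\.'    # anything routing 10.1.0.0/24 other than nm-testvpn would be suspicious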

Other than that, you may need to go through some WireGuard troubleshooting docs: https://wiki.archlinux.org/title/WireGuard#Troubleshooting

Again, it may be an MTU or KeepAlive issue.

changchichung commented 3 years ago

The keys are consistent on every node.

And tcpdump on node1 shows nothing while running ping -c1 10.1.0.1 on node4. I might just ignore node4 for now, since it's a "double NAT" environment; otherwise I can't go on to test further.

I will close this issue.

But I have to clarify: on node 4 there are some other WireGuard VPN tunnels, and they do not have this issue.

afeiszli commented 3 years ago

Have you set up a tunnel manually between node 1 and node 4? I would be interested to see if that works. That would be a good test. If a manually configured wireguard tunnel between any two points works, and the same connection does not work with a netmaker tunnel, that would be good to examine. You may keep this issue open if you'd like. This is very good information for learning the limitations.
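
For reference, a minimal manual point-to-point test might look like the sketch below (keys, addresses, and the port are placeholders; node 4 initiates, since it sits behind the double NAT):

# node 1: /etc/wireguard/wgtest.conf
[Interface]
PrivateKey = <node1-private-key>
Address = 10.99.0.1/24
ListenPort = 51899

[Peer]
PublicKey = <node4-public-key>
AllowedIPs = 10.99.0.4/32

# node 4: /etc/wireguard/wgtest.conf
[Interface]
PrivateKey = <node4-private-key>
Address = 10.99.0.4/24

[Peer]
PublicKey = <node1-public-key>
Endpoint = <node1-public-ip>:51899
AllowedIPs = 10.99.0.1/32
PersistentKeepalive = 25

# bring it up on both nodes, then ping from node 4:
sudo wg-quick up wgtest
ping -c 2 10.99.0.1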

afeiszli commented 3 years ago

@changchichung as a note, in the new version, nodes 7, 8, and 9 should be able to reach each other, as long as the WireGuard port on the server is not blocked by a firewall.

wfchair commented 2 years ago

in the new version, 7,8,9 should be able to reach each other

I use the latest version, v0.9.2, but nodes in the same private LAN cannot ping each other.