[Bug]: Relayed nodes cannot reach peers behind second relay

kellervater commented 2 years ago

Contact Details

patrick.poetz@voo.aero

What happened?

This bug was discussed in the Discord channel Netmaker Support - #general with @afeiszli.

Here's the summary:

I've finally got a Netmaker network up and running but I'm facing some strange behavior in connectivity between peers.

My setup is the following:

2 physical servers in different data centers running Proxmox VE. Both servers are running netclient.
Each server hosts 3 VMs, which should be able to talk to each other to form an HA k3s cluster. The VMs run behind NAT in subnet 192.168.0.0
1 Netmaker server node running publicly on EC2

With a basic mesh setup, only the 2 bare-metal host servers can reach every peer. The VMs can only reach other peers within the same data center. (Yes, I activated UDP hole punching.)

So I thought I would make each host server a relay node for the VMs running on it. This ended up looking like the first picture.

Thing is... now only the 2 relay servers were able to reach each other, and that's it.

Now the fun part... If I make only one of the 2 servers a relay server, as in the second picture, everything works like a charm.

And it doesn't matter if aio1 or aio2 is the relay for their VMs. It's kinda mutually exclusive. Both as relay don't work. None as relay doesn't work but a single one works. It's a bit odd.

Has anyone faced something similar, or am I missing something obvious here?

Private (Netmaker) addresses of each host:

aio1: 10.236.196.7

master1: 10.236.196.4
worker1: 10.236.196.5
worker2: 10.236.196.6

aio2: 10.236.196.8

master2: 10.236.196.1
worker3: 10.236.196.2
worker4: 10.236.196.3

The 2-relay setup allows connectivity between the nodes behind each relay but no connectivity between the 2 data centers, except for the relay servers themselves: aio1 can reach aio2 and vice versa.
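To reproduce the reachability picture described above from any one node, something like the following minimal sketch could be used. It assumes a Linux `ping` (iputils) on the node; the helper names `ping_once` and `check_peers` are made up for illustration and are not part of Netmaker or netclient. The addresses are the ones listed in this report.

```python
import subprocess

# Netmaker addresses as listed in this report.
PEERS = {
    "aio1": "10.236.196.7",
    "master1": "10.236.196.4",
    "worker1": "10.236.196.5",
    "worker2": "10.236.196.6",
    "aio2": "10.236.196.8",
    "master2": "10.236.196.1",
    "worker3": "10.236.196.2",
    "worker4": "10.236.196.3",
}


def ping_once(address, timeout_s=2):
    """Return True if a single ICMP echo to `address` is answered.

    Assumption: Linux iputils flags (-c count, -W reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_peers(peers, ping=ping_once):
    """Map each peer name to True/False depending on whether it answered."""
    return {name: ping(address) for name, address in peers.items()}
```

Calling `check_peers(PEERS)` on each node in turn and comparing the results should show which pairs fail. In the 2-relay setup described here, a VM behind aio1 would be expected to report the peers behind aio2 as unreachable.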

No showstopper for me atm, as long as I don't add a 3rd datacenter node with "natted" vms, I assume.

If I can provide further input, pls let me know!

Besides this: thanks for this great product and your comprehensive documentation. This was EXACTLY what I've been looking for for weeks now.

Version

v0.11.1

What OS are you using?

Linux

Relevant log output

No response

Contributing guidelines

[X] Yes, I did.

afeiszli commented 2 years ago

Investigating. In the meantime, if you can test with 0.12.1 to confirm it is still an issue, that would be appreciated.

kellervater commented 2 years ago

Yes, with 0.12.1 the issue is still there. The network seems to recover quite quickly after removing the 2nd relay. At least my k8s cluster started working again after a while.

kellervater commented 2 years ago

FYI: Just tried 0.12.2 and the issue persists. Dig the dark mode though 👍

One strange finding here: if I undo the 2nd relay, the network doesn't recover. None of the remaining relayed nodes can reach anything anymore. I need to recreate the entire network with a single relay for it to be functional again.

mattkasun commented 2 years ago

Just tried this with v0.16.0 and the issue has been resolved.

In this scenario, node relayed is relayed by node relay, and node lxc is relayed by node node1 ... lxc is also an egress gateway with a gateway range of 10.0.3.0/24. relayed can ping lxc and hosts in the egress range. lxc can ping relayed.

Note: it does take some time after creating a relay before the ping will go through (on the order of 30-60 seconds).
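Because of that delay, a one-off ping right after creating a relay can give a false negative. A minimal sketch of a polling check, assuming a Linux `ping` (iputils); `wait_until_reachable` and `ping_once` are hypothetical helper names, and the 90-second deadline is just a margin chosen above the 30-60 seconds mentioned here:

```python
import subprocess
import time


def ping_once(address, timeout_s=2):
    """Return True if a single ICMP echo to `address` is answered (iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_reachable(address, deadline_s=90, interval_s=5,
                         ping=ping_once, sleep=time.sleep, clock=time.monotonic):
    """Poll `address` until it answers or the deadline passes.

    Returns the elapsed seconds at the first successful ping, or None if the
    peer never answered within `deadline_s`.
    """
    start = clock()
    while True:
        if ping(address):
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

Running `wait_until_reachable(<peer address>)` right after creating the relay gives a rough measure of how long the peer update takes to propagate. A `None` result suggests a real connectivity problem rather than the startup delay.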
