gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: Endpoint Peer Detection causing issues #2212

Closed · frayos closed this 1 year ago

frayos commented 1 year ago

Contact Details

younes@sgx.fr

What happened?

Same as / similar to issue 3457.

@afeiszli mentioned at the time:

> Thank you. The issue is not the hostname, it is the local interface calculator. The two peers "think" that they are on the same local network because they both have 172.19.0.1. We are working on a fix for this and should have an update soon.

I believe this was supposed to be fixed in 0.18.6, so I upgraded and rejoined my nodes, and everything went well... until I started my K3S instance, which creates a 10.42.X.0 interface on each node. Netclient/Netmaker again decides the peers are on the same local network, which effectively breaks WireGuard, as the peers now try to communicate over that interface.

Technically K3S can use this interface for its own communication, but WireGuard shouldn't change its approach because of it.

Would it be possible to disable this feature somehow? A command-line argument on netclient, for instance? Or binding to a specific NIC?
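For context, the per-node pod-CIDR addresses K3S (flannel) creates can be listed with a standard iproute2 command; the 10.42. prefix below is the K3S default cluster CIDR (an assumption; adjust if yours differs):

```
ip -4 -brief addr show | grep '10\.42\.'
```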

Log excerpt attached.

Let me know if you need more verbose logs, etc.

Thank you for your help! Younes

Version

v0.18.6

What OS are you using?

Linux

Relevant log output

```
Apr 15 18:56:46 xxxx-srv netclient[2933]: [netclient] 2023-04-15 18:56:46 determined new endpoint for peer Ryw0SxxxxscRgVvTmap9Jq80plHWJFLlX5AK0k= - 10.42.0.0:35760
Apr 15 18:56:47 xxxx-srv netclient[2933]: [netclient] 2023-04-15 18:56:47 determined new endpoint for peer U2o94xxxxvO24uNo0PwDrn4r8hgJxfxv57vhRk= - 10.42.3.0:51840
Apr 15 18:58:09 xxxx-srv netclient[2933]: [netclient] 2023-04-15 18:58:09 determined new endpoint for peer NTQnxxxxRqiLL4/uCFatw840MqX/JZL+lJ/qWk99w8= - 10.42.1.0:55510
```


TKaxv-7S commented 1 year ago

My network environment is almost the same as yours. I also hit this problem in 0.18.5, and it continued in 0.18.6; the client log matches yours, and both versions are completely unusable for me. I have even started reading the source code in the hope that I can fix it, though of course I would prefer an official release with a fix.

Finally, thanks to the developers for their hard work on this project! :)

frayos commented 1 year ago

Indeed, thank you for this great project :)

BTW, the netclient source containing that log message: https://github.com/gravitl/netclient/blob/7dc61b0e6dbcbbe2c5b1ed34af187f6fb2f2608f/networking/server-pong.go

Based on my reading, the interface detection causes it. I'm sure there is a valid reason for doing it, such as detecting a local peer, but would it be possible to exclude some interface names from detection?
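For illustration only, a minimal sketch of what name-based exclusion might look like, using Go's standard net package; the prefix list and function names here are hypothetical, not netclient's actual API:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// excludedPrefixes lists interface-name prefixes to skip during local
// endpoint detection. The list is hypothetical; K3S/flannel typically
// creates interfaces named cni0, flannel.1, and veth pairs.
var excludedPrefixes = []string{"cni", "flannel", "veth", "docker"}

// candidateAddrs returns the addresses of every interface that is up
// and whose name does not match an excluded prefix.
func candidateAddrs() ([]net.Addr, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var addrs []net.Addr
	for _, iface := range ifaces {
		if iface.Flags&net.FlagUp == 0 {
			continue // skip interfaces that are down
		}
		skip := false
		for _, prefix := range excludedPrefixes {
			if strings.HasPrefix(iface.Name, prefix) {
				skip = true
				break
			}
		}
		if skip {
			continue
		}
		ifaceAddrs, err := iface.Addrs()
		if err != nil {
			continue // unreadable interface; ignore it
		}
		addrs = append(addrs, ifaceAddrs...)
	}
	return addrs, nil
}

func main() {
	addrs, err := candidateAddrs()
	if err != nil {
		panic(err)
	}
	for _, addr := range addrs {
		fmt.Println(addr)
	}
}
```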

TKaxv-7S commented 1 year ago

> Indeed, thank you for this great project :)
>
> BTW, the netclient source containing that log message: https://github.com/gravitl/netclient/blob/7dc61b0e6dbcbbe2c5b1ed34af187f6fb2f2608f/networking/server-pong.go
>
> Based on my reading, the interface detection causes it. I'm sure there is a valid reason for doing it, such as detecting a local peer, but would it be possible to exclude some interface names from detection?

Yes, this is exactly what I am trying to understand. I believe crudely removing that part of the code would make the client work right away, but that is not what we want: we all want to use the fastest path, so the feature itself may need to be fixed.

0xdcarns commented 1 year ago

> My network environment is almost the same as yours. I also hit this problem in 0.18.5, and it continued in 0.18.6; the client log matches yours, and both versions are completely unusable for me. I have even started reading the source code in the hope that I can fix it, though of course I would prefer an official release with a fix.
>
> Finally, thanks to the developers for their hard work on this project! :)

I'm surprised to hear this is still an issue in 0.18.6. I am wondering if you installed your new clients over the old ones? Also, if you want to stop or block the detection, just block TCP access with a local firewall (like ufw) on the host's configured proxy port (default 51722).
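For example, with ufw that might look like the following (a sketch, assuming the default proxy port of 51722; adjust if you have changed it):

```
sudo ufw deny 51722/tcp
```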

frayos commented 1 year ago

Indeed, I did "apt upgrade"-style updates, plus wget/chmod on the ARM machines.

I will try to 1) replace the clients after a clean install and 2) block the proxy port locally on each machine to see if that helps (spoiler: I'm in a NATed environment, so the proxy port was NOT opened anyway, just the WireGuard listen port).

Will report back.

Thank you!

TKaxv-7S commented 1 year ago

> I'm surprised to hear this is still an issue in 0.18.6. I am wondering if you installed your new clients over the old ones? Also, if you want to stop or block the detection, just block TCP access with a local firewall (like ufw) on the host's configured proxy port (default 51722).

Yes, there was a problem when installing over the old client, and after uninstalling and doing a fresh install the problem still existed (PS: I am using a Windows client). Later I set up a brand-new environment, including the Netmaker server, a Linux Docker netclient, and a Windows netclient, and the problem appeared again.

I'll try the detection-blocking method you suggested. Thank you!

frayos commented 1 year ago

So indeed, blocking the port with ufw has had some success, but things still feel unstable. I'm also going to remove every client and server, start fresh, and see where that leads.

But ideally, if possible, allow disabling this feature, or allow environment variables or netclient arguments to force the interface to be used (for the cases where we know better than autodetection what to use).

TKaxv-7S commented 1 year ago

> So indeed, blocking the port with ufw has had some success, but things still feel unstable. I'm also going to remove every client and server, start fresh, and see where that leads.
>
> But ideally, if possible, allow disabling this feature, or allow environment variables or netclient arguments to force the interface to be used (for the cases where we know better than autodetection what to use).

I think it's a good idea. In many cases, a stable connection to the server is more important than low latency, although it would be great to be able to keep a stable connection and switch to a lower-latency interface without noticing it! That may be difficult, so perhaps the next best thing is to let the user choose.

frayos commented 1 year ago

@0xdcarns, to sum up:

A) ufw works as a workaround but isn't ideal: I'm already in a NATed environment, and adding firewalls has unexpected side effects I'd like to avoid.

B) I confirm the issue still exists. I started fresh with a new server and new clients. Each client machine is on a different public IP/location, and one of the clients has the same public IP as the server. I created a network, added 3 nodes, started K3S, and the "common network" caused the issue:

```
Apr 22 12:21:32 resmusica-srv netclient[1494]: [netclient] 2023-04-22 12:21:32 determined new endpoint for peer 0DFEuKu16Eu3QfDdprKSpWKv2NsveVi6AWIcPEPG7gY= - 10.42.0.0:35760
Apr 22 12:21:39 resmusica-srv netclient[1494]: [netclient] 2023-04-22 12:21:39 publishing global host update for port changes
Apr 22 12:22:19 resmusica-srv netclient[1494]: [netclient] 2023-04-22 12:22:19 determined new endpoint for peer tuEkx4LflVJmsLcywOLaYybOCz8lGPZA9L/jG5t4sXo= - 10.42.1.0:55510
```

C) Any means of disabling this feature would be great for the next release!

TKaxv-7S commented 1 year ago

> @0xdcarns, to sum up:
>
> A) ufw works as a workaround but isn't ideal: I'm already in a NATed environment, and adding firewalls has unexpected side effects I'd like to avoid.
>
> B) I confirm the issue still exists. I started fresh with a new server and new clients. Each client machine is on a different public IP/location, and one of the clients has the same public IP as the server. I created a network, added 3 nodes, started K3S, and the "common network" caused the issue.
>
> C) Any means of disabling this feature would be great for the next release!

Hi, friend!

I fixed this problem in my release v0.20.3; you just need to replace the Docker images gravitl/netmaker, gravitl/netclient, and gravitl/netmaker-ui with tkaxv7s/netmaker, tkaxv7s/netclient, and tkaxv7s/netmaker-ui. That's all.
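For a typical docker-compose deployment, that swap might look like the excerpt below (a sketch only; the service names and compose layout are assumptions, so adjust to your actual file):

```
services:
  netmaker:
    image: tkaxv7s/netmaker:v0.20.3       # was: gravitl/netmaker
  netmaker-ui:
    image: tkaxv7s/netmaker-ui:v0.20.3    # was: gravitl/netmaker-ui
```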

I modified the Windows netclient to ensure the problem does not recur: the Netmaker server's automatic peer-endpoint updates no longer take effect on the Windows netclient, and I added a "PULL" button on the Networks page to manually update peer endpoints, as in v0.17.x (screenshot: Networks page with the "PULL" button).

You can find it in my release, Netclient v0.20.3.

If any issues are found, feedback is welcome, and we can continue our discussion.

frayos commented 1 year ago

Thank you! Please make sure this gets documented, of course.

To be noted: I lost so many hours on this that, for now, I have switched to an Ansible-based WireGuard deployment, where I was surprised to observe a SIMILAR issue, so just letting you know: Persistent Keepalive causes a similar problem in native WireGuard, since to maintain the connection it tracks the source IP of the traffic that comes back. After disabling Persistent Keepalive, I have a working setup!
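For anyone reproducing this with plain WireGuard, the setting in question lives in the [Peer] section of a wg-quick config; a minimal sketch (the key, address, and endpoint below are placeholders):

```
[Peer]
PublicKey = <peer-public-key>
AllowedIPs = 10.10.0.2/32
Endpoint = peer.example.com:51820
# PersistentKeepalive = 25   # left disabled per the workaround above
```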

I'll get back to Netmaker later.

TKaxv-7S commented 1 year ago

> Thank you! Please make sure this gets documented, of course.
>
> To be noted: I lost so many hours on this that, for now, I have switched to an Ansible-based WireGuard deployment, where I was surprised to observe a SIMILAR issue, so just letting you know: Persistent Keepalive causes a similar problem in native WireGuard, since to maintain the connection it tracks the source IP of the traffic that comes back. After disabling Persistent Keepalive, I have a working setup!
>
> I'll get back to Netmaker later.

Thank you for the heads-up. This is very useful, and I will continue to pay attention to similar issues.