gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: some clients does not checkin properly #999

Closed FaintGhost closed 6 months ago

FaintGhost commented 2 years ago

Contact Details

zhang.yaowei@live.com

What happened?

I built a small mesh network with about 8 nodes. Some of the nodes (in China, perhaps behind the GFW) can join the network without any problem, and all nodes can ping each other. But after a while, the status of all the Chinese nodes first changes to warning and then to error. I looked at the netclient.service logs on an error node, and they differ from those on a normal node. If I manually run netclient pull, the error node becomes healthy again, but after a while it goes back to warning and then error. I don't know what the problem is. For now I have written a shell loop that runs netclient pull, but that's not a good solution. Could someone help me solve this problem?
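
The shell loop mentioned in the report is not shown, so the following is only a minimal sketch of that kind of stopgap, not the reporter's actual script. The function name `pull_loop` and its `interval` and `max_runs` parameters are made up for illustration; the `netclient pull -n all` command in the usage comment is taken from the systemd-based fix posted later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly with a pause between runs.
#   $1 = seconds to sleep between runs
#   $2 = maximum number of runs (0 means loop forever)
#   remaining arguments = the command to run
pull_loop() {
    interval="$1"
    max_runs="$2"
    shift 2
    runs=0
    while :; do
        # Run the command; on failure, log to stderr and keep looping.
        "$@" || echo "pull failed (exit $?)" >&2
        runs=$((runs + 1))
        if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
            break
        fi
        sleep "$interval"
    done
}

# Example usage (assumes netclient is installed and on PATH):
# pull_loop 60 0 netclient pull -n all
```

A loop like this has to be kept alive by something (a terminal, a service, a cron `@reboot` entry), which is one reason a timer-based approach tends to be preferred.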

logs of normal working nodes:

● netclient.service - Netclient Daemon
     Loaded: loaded (/etc/systemd/system/netclient.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-03-30 01:46:40 CEST; 12h ago
       Docs: https://docs.netmaker.org
             https://k8s.netmaker.org
   Main PID: 1007729 (netclient)
      Tasks: 10 (limit: 9509)
     Memory: 18.1M
        CPU: 11.231s
     CGroup: /system.slice/netclient.service
             └─1007729 /sbin/netclient daemon

Mar 30 14:04:42 debian11 netclient[1007729]: [netclient] 2022-03-30 14:04:42 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:04:43 debian11 netclient[1007729]: [netclient] 2022-03-30 14:04:43 received peer update for node hard-zombie E3UAQeqA
Mar 30 14:08:45 debian11 netclient[1007729]: [netclient] 2022-03-30 14:08:45 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:09:48 debian11 netclient[1007729]: [netclient] 2022-03-30 14:09:48 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:11:50 debian11 netclient[1007729]: [netclient] 2022-03-30 14:11:50 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:14:51 debian11 netclient[1007729]: [netclient] 2022-03-30 14:14:51 received peer update for node hard-zombie E3UAQeqA
Mar 30 14:14:53 debian11 netclient[1007729]: [netclient] 2022-03-30 14:14:53 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:15:54 debian11 netclient[1007729]: [netclient] 2022-03-30 14:15:54 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:16:57 debian11 netclient[1007729]: [netclient] 2022-03-30 14:16:57 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:17:58 debian11 netclient[1007729]: [netclient] 2022-03-30 14:17:58 received peer update for node de-pve-debian11 wg-mesh

Version

v0.12.2

What OS are you using?

Linux

Relevant log output

logs of error nodes:

● netclient.service - Netclient Daemon
     Loaded: loaded (/etc/systemd/system/netclient.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-03-30 19:29:40 CST; 48min ago
       Docs: https://docs.netmaker.org
             https://k8s.netmaker.org
   Main PID: 78184 (netclient)
      Tasks: 9 (limit: 9510)
     Memory: 18.1M
        CPU: 1.569s
     CGroup: /system.slice/netclient.service
             └─78184 /sbin/netclient daemon

Mar 30 19:29:40 debian11 systemd[1]: Started Netclient Daemon.
Mar 30 19:29:40 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:40 pulling latest config for  E3UAQeqA
Mar 30 19:29:45 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:45 waiting for interface...
Mar 30 19:29:45 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:45 interface ready - netclient.. ENGAGE
Mar 30 19:29:47 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:47 pulling latest config for  wg-mesh
Mar 30 19:29:53 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:53 waiting for interface...
Mar 30 19:29:53 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:53 interface ready - netclient.. ENGAGE
Mar 30 19:29:55 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:55 started comms network daemon,  E3UAQeqA
Mar 30 19:29:55 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:55 netclient daemon started for network:  E3UAQeqA


goldsoft8888 commented 2 years ago

I have the same fault.

cx9208 commented 2 years ago

Same problem: netclient does not pull the config automatically in v0.12.2.

si458 commented 2 years ago

I've discovered a similar issue with one of our 'Server 2012 R2' machines. What I have found is that whenever the node loses internet access and disconnects from MQTT, it does not reconnect properly when the internet returns, so the node shows as offline even though you can still ping it. Simply restarting the netclient service returns it to normal.

ElectronicElephant commented 2 years ago

Some of the nodes (in China, perhaps behind the GFW) can join the network without any problem, and all nodes can ping each other. But after a while, the status of all the Chinese nodes first changes to warning and then to error. I looked at the netclient.service logs on an error node, and they differ from those on a normal node.

Can confirm. I met the same problem.

jacobped commented 2 years ago

I ended up just adding a systemd timer, similar to how it was done in v0.9.x; for some reason it is no longer present. The commit that removed it, as part of #645: https://github.com/gravitl/netmaker/commit/443ed80e4d27d208134795e603aa8f166f7af017

Fix:

sudo nano /etc/systemd/system/netclient-pull.service

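[Unit]
Description=Network Check
# Starting this service also starts the timer if it is not already running.
Wants=netclient.timer
[Service]
Type=simple
# Pull the latest config for all networks this node has joined.
ExecStart=/usr/sbin/netclient pull -n all
[Install]
WantedBy=multi-user.target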

sudo nano /etc/systemd/system/netclient.timer

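[Unit]
Description=Calls the Netmaker Mesh Client Service
# Starting the timer also starts the main netclient daemon.
Requires=netclient.service
[Timer]
# The unit this timer triggers.
Unit=netclient-pull.service
# Fire every 15 seconds (at :00, :15, :30 and :45 of every minute).
OnCalendar=*:*:0/15
[Install]
WantedBy=timers.target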

sudo systemctl enable netclient.timer

sudo systemctl start netclient.timer
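
To confirm that the timer is actually firing after it has been enabled, something like the following could be run. This is a sketch using standard systemd tooling (`systemctl is-active`, `systemctl list-timers`, `journalctl -u`); the function name `check_pull_timer` is made up for illustration, and the unit names are the ones from the files above.

```shell
#!/bin/sh
# Hypothetical helper: inspect the timer and the pull service it triggers.
check_pull_timer() {
    # Print the timer's state ("active", "inactive", ...).
    systemctl is-active netclient.timer
    # Show when the timer last fired and when it will fire next.
    systemctl list-timers --no-pager netclient.timer
    # Show the 20 most recent log lines from the pull service.
    journalctl --no-pager -n 20 -u netclient-pull.service
}

# Example usage (reading the journal may require root or membership
# in the systemd-journal group):
# check_pull_timer
```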

jacobped commented 2 years ago

#841 might be related, but I didn't have the logs mentioned there with "invalid message from broker".

Nexxus-LMT commented 2 years ago

Same with Netmaker server 0.14.1 running on Docker. It worked perfectly after adding 4 nodes. Issues began when I added a Windows 10 node (it caused a severe network slowdown on that machine, so it had to be removed). Since then, almost every node I add brings this issue. Restarting and reinstalling the client does not help. I will try reinstalling the server if the issues persist, worsen, or inhibit my use case.

abhishek9686 commented 6 months ago

Please try it on the latest version.