gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

Unsuccessful `netclient daemon` pull process #862

Closed vitex closed 2 years ago

vitex commented 2 years ago

While experimenting with v0.11.0, I ended up with a client node that had configuration files for two comms networks: OxHztZZg, associated with the server managing the active network Test, and vIKbXptc, associated with an inactive server. `netclient daemon` contacted OxHztZZg only once and then went into a loop waiting for vIKbXptc to become active.

It appears that the current implementation of `netclient daemon` blocks if a node is a member of several Netmaker networks and at least one server goes down.

Mar 06 15:10:51 a1 netclient[354871]: 2022/03/06 15:10:51 [netclient] pulling latest config for Test
Mar 06 15:10:51 a1 netclient[354871]: 2022/03/06 15:10:51 [netclient] pulling latest config for OxHztZZg
Mar 06 15:10:51 a1 netclient[354871]: 2022/03/06 15:10:51 [netclient] pulling latest config for vIKbXptc
Mar 06 15:10:51 a1 netclient[354871]: 2022/03/06 15:10:51 [netclient] failed to pull for network vIKbXptc
Mar 06 15:10:51 a1 netclient[354871]: 2022/03/06 15:10:51 [netclient] waiting 2 seconds to retry...
Mar 06 15:10:54 a1 netclient[354871]: 2022/03/06 15:10:54 [netclient] failed to pull for network vIKbXptc
Mar 06 15:10:54 a1 netclient[354871]: 2022/03/06 15:10:54 [netclient] waiting 4 seconds to retry...
...
Mar 06 15:27:56 a1 netclient[354871]: 2022/03/06 15:27:56 [netclient] waiting 1024 seconds to retry...
Mar 06 15:45:00 a1 netclient[354871]: 2022/03/06 15:45:00 [netclient] failed to pull for network vIKbXptc
Mar 06 15:45:00 a1 netclient[354871]: 2022/03/06 15:45:00 [netclient] waiting 2048 seconds to retry...
Mar 06 16:19:09 a1 netclient[354871]: 2022/03/06 16:19:09 [netclient] failed to pull for network vIKbXptc
Mar 06 16:19:09 a1 netclient[354871]: 2022/03/06 16:19:09 [netclient] waiting 3600 seconds to retry...
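The retry intervals in the log double on every failure (2, 4, ... 1024, 2048) and then jump to 3600 rather than 4096, which suggests a doubling backoff capped at one hour. A minimal Go sketch of that schedule follows. The names `maxWait`, `nextWait`, and `schedule` are hypothetical and the cap is inferred from the log, not taken from the netclient source.

```go
package main

import (
	"fmt"
	"time"
)

// maxWait is an assumption inferred from the log, where a 3600 second wait
// follows the 2048 second wait. It is not taken from the netclient source.
const maxWait = 3600 * time.Second

// nextWait doubles the current wait and caps the result at maxWait.
func nextWait(current time.Duration) time.Duration {
	next := current * 2
	if next > maxWait {
		return maxWait
	}
	return next
}

// schedule returns the first n waits, starting at 2 seconds as in the log.
func schedule(n int) []time.Duration {
	waits := make([]time.Duration, 0, n)
	wait := 2 * time.Second
	for i := 0; i < n; i++ {
		waits = append(waits, wait)
		wait = nextWait(wait)
	}
	return waits
}

func main() {
	var total time.Duration
	for _, w := range schedule(12) {
		total += w
		fmt.Printf("wait %v (cumulative %v)\n", w, total)
	}
}
```

Under these assumptions the eleventh wait is 2048 seconds and every wait after it is one hour, so any work that sits behind this retry loop is delayed by at least that long between attempts.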

vitex commented 2 years ago

Version v0.12.1 does not resolve the problem.

Create a Netmaker client node that is connected to two networks, each of which is installed on a separate Netmaker server node.

Use `docker-compose down` to temporarily disable the second server node, and then use `systemctl` to restart netclient on the client node.

After the restart, the client log shows that `netclient daemon` stops receiving configuration updates from the first server node for as long as the daemon keeps polling the second server node, waiting for it to come back online.

vitex commented 2 years ago

Version v0.13.0 does not resolve this problem.

vitex commented 2 years ago

Version v0.13.1 does not resolve this problem.

afeiszli commented 2 years ago

Hi, the comms network no longer exists on version 0.13 or 0.13.1, which would be why pull fails for this network. You should leave the comms network in v0.13. You need to follow the upgrade guide for 0.12-0.13. Specifically step 11 for removing comms: https://gist.github.com/afeiszli/f53f34eb4c5654d4e16da2919540d0eb

vitex commented 2 years ago

On Sat, May 7, 2022 at 11:19 AM Alex Feiszli wrote:

> Hi, the comms network no longer exists on version 0.13 or 0.13.1, which would be why pull fails for this network. You should leave the comms network in v0.13. You need to follow the upgrade guide for 0.12-0.13. Specifically step 11 for removing comms: https://gist.github.com/afeiszli/f53f34eb4c5654d4e16da2919540d0eb

The issue was first encountered on v0.11 and did involve a comms network, but I reproduced it using fresh installs of v0.13 that had no remnants of comms networks.

While `netclient daemon` waits for the second server to come back up, it never checks in with the first server to see whether any configuration changes are needed. Thus a problem with one server blocks access to another, which is bad design.
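One way to avoid this coupling is to give each network its own polling loop, so that a failing server only lengthens its own retry interval. The Go sketch below illustrates that idea. It is not the netclient implementation or the fix that was eventually shipped: `pullFunc`, `pullLoop`, `runAll`, and `countPulls` are hypothetical names, and the short intervals are chosen only so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// pullFunc is a hypothetical stand-in for whatever fetches a network's config.
type pullFunc func(ctx context.Context, network string) error

// pullLoop polls a single network until ctx is cancelled. A failed pull
// doubles only this network's wait (capped at maxBackoff); a successful
// pull resets the wait to the normal interval.
func pullLoop(ctx context.Context, network string, pull pullFunc, interval, maxBackoff time.Duration) {
	wait := interval
	for {
		if err := pull(ctx, network); err != nil {
			wait *= 2
			if wait > maxBackoff {
				wait = maxBackoff
			}
		} else {
			wait = interval
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(wait):
		}
	}
}

// runAll starts one independent loop per network and blocks until all exit.
func runAll(ctx context.Context, networks []string, pull pullFunc, interval, maxBackoff time.Duration) {
	var wg sync.WaitGroup
	for _, n := range networks {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			pullLoop(ctx, name, pull, interval, maxBackoff)
		}(n)
	}
	wg.Wait()
}

// countPulls simulates one unreachable server ("down") for duration d and
// returns the number of successful pulls recorded for each network.
func countPulls(networks []string, down string, d time.Duration) map[string]int {
	var mu sync.Mutex
	counts := make(map[string]int)
	pull := func(_ context.Context, network string) error {
		if network == down {
			return errors.New("server unreachable")
		}
		mu.Lock()
		counts[network]++
		mu.Unlock()
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	runAll(ctx, networks, pull, 10*time.Millisecond, 80*time.Millisecond)
	return counts
}

func main() {
	counts := countPulls([]string{"Test", "vIKbXptc"}, "vIKbXptc", 200*time.Millisecond)
	fmt.Println("successful pulls for Test:", counts["Test"])
}
```

In this simulation the reachable network keeps being polled at its normal interval while the unreachable one backs off on its own, which is the behavior the report asks for.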

Ed

vitex commented 2 years ago

This issue was resolved with v0.14.2.