gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: netclient cannot fully join network #1155

Closed: catcake closed this issue 2 years ago

catcake commented 2 years ago

Contact Details

No response

What happened?

Attempting to join a network consistently fails on one particular host (bare metal, Ubuntu 22.04 LTS). The node shows up in netmaker-ui, but no WireGuard interfaces are created on the host. The same issue occurs when using the provided docker command (from the Access Keys menu).

I can consistently reproduce the error with the following:

sudo netclient leave -n [network]
sudo netclient join -vvv -t [token]

Using the leave command does successfully remove the broken node from Netmaker.

I have also tried:

sudo netclient leave -n [network]
sudo netclient uninstall
sudo apt reinstall netclient
sudo netclient join -vvv -t [token]

I have additionally tried completely uninstalling and then reinstalling netclient.

Other hosts can join this network, using the same command, without issue.

Version

v0.14.1

What OS are you using?

Linux

Relevant log output

$ sudo netclient join -vvv -t [token]
[netclient] 2022-05-31 03:15:15 joining [network] at api.[domain]:443
[netclient] 2022-05-31 03:15:16 node created on remote server...updating configs
[netclient] 2022-05-31 03:15:16 turn on UDP hole punching (dynamic port setting)? no
[netclient] 2022-05-31 03:15:16 could not read CA file  open /etc/netclient/broker.[domain]/root.pem: no such file or directory
[netclient] 2022-05-31 03:15:16 failed to append cert
2022/05/31 03:15:16 could not read client cert/key open /etc/netclient/broker.[domain]/client.pem: no such file or directory


mattkasun commented 2 years ago

Can you check that wireguard-tools is installed? The apt install should pull in wireguard-tools in addition to netclient, but perhaps that failed for some reason.

The absence of a "starting wireguard" line in your logs leads me to suspect this.

A successful join produces logs like the following:

$ sudo netclient join -vvv -t [token]
[netclient] 2022-05-31 07:18:31 joining virt-net at api.example.com:443 
[netclient] 2022-05-31 07:18:32 node created on remote server...updating configs 
[netclient] 2022-05-31 07:18:32 turn on UDP hole punching (dynamic port settings? no 
[netclient] 2022-05-31 07:18:32 starting wireguard 
[netclient] 2022-05-31 07:18:34 waiting for interface... 
[netclient] 2022-05-31 07:18:34 interface ready - netclient.. ENGAGE 
[netclient] 2022-05-31 07:18:34 register at https://api.example.com:443/api/server/register 
[netclient] 2022-05-31 07:18:34 certificates/key saved  
[netclient] 2022-05-31 07:18:34 local port has changed from  0  to  51821 
[netclient] 2022-05-31 07:18:36 sent a node update to server for node <hostname> ,  42c44cd5-37c3-4b3b-a181-070bda9738b3 
[netclient] 2022-05-31 07:18:37 restarting netclient.service 
[netclient] 2022-05-31 07:18:38 joined  virt-net 
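
The diagnosis above rests on two checks: whether wireguard-tools is actually installed, and whether the join log ever reaches the "starting wireguard" line. A minimal shell sketch of both follows. The function names check_wireguard_tools and classify_join_log are hypothetical helpers, not netclient commands; the package check assumes a Debian/Ubuntu host with dpkg, and the log classification only greps for message text quoted in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: check_wireguard_tools and classify_join_log are hypothetical
# helpers, not part of netclient.

# Succeeds if the wireguard-tools package is installed (Debian/Ubuntu dpkg
# assumed) and the wg binary is on PATH.
check_wireguard_tools() {
  dpkg -s wireguard-tools >/dev/null 2>&1 && command -v wg >/dev/null 2>&1
}

# Prints a rough classification of a saved netclient join log, based only on
# the message text quoted in this thread.
classify_join_log() {
  local log="$1"
  if grep -q 'starting wireguard' "$log"; then
    echo "wireguard-started"
  elif grep -q 'could not read CA file' "$log"; then
    echo "missing-broker-certs"
  else
    echo "unknown"
  fi
}

# Example usage (capture stderr too, since some lines above lack the
# [netclient] prefix and may go to a different stream):
#   check_wireguard_tools || echo "wireguard-tools missing"
#   sudo netclient join -vvv -t [token] 2>&1 | tee join.log
#   classify_join_log join.log
```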

catcake commented 2 years ago

wireguard-tools appears to have already been installed:

c@0v-tails:~/xyz.catw.tank/docker/netmaker$ sudo apt search wireguard-tools
Sorting... Done
Full Text Search... Done
wireguard-tools/jammy,now 1.0.20210914-1ubuntu2 amd64 [installed]
fast, modern, secure kernel VPN tunnel (userland utilities)

I removed it (which also removed netclient), then reinstalled both:

$ sudo apt remove wireguard-tools
$ sudo apt install wireguard-tools netclient

I then attempted to join the network again, which resulted in a different error. I think this might have been caused by the "ghost" node left behind by the join attempt in the original post:

$ sudo netclient join -t [token]
[netclient] 2022-05-31 16:27:32 joining [network] at api.[domain]:443 
[netclient] 2022-05-31 16:27:33 unable to authenticate: failed to authenticate 400 Bad Request {"Code":400,"Message":"W1R3: ID can't be empty"} 
[netclient] 2022-05-31 16:27:34 removed systemd remnants if any existed 
[netclient] 2022-05-31 16:27:35 removed systemd remnants if any existed 

I then ran:

$ sudo netclient leave -n [network]
[netclient] 2022-05-31 16:34:03 used backup file for network:  [network]
2022/05/31 16:34:03 open /etc/netclient/config/netconfig-[network]: no such file or directory
$ sudo netclient leave
[netclient] 2022-05-31 16:34:08 used backup file for network:  all 
2022/05/31 16:34:08 open /etc/netclient/config/netconfig-all: no such file or directory
$ sudo netclient uninstall
[netclient] 2022-05-31 16:34:13 uninstalling netclient... 
[netclient] 2022-05-31 16:34:13 uninstalled netclient

This did not remove the ghost node from netmaker-ui, so I deleted it manually. I then reinstalled wireguard-tools and netclient and made another join attempt, which failed with the same error as in the original post:

$ sudo apt remove wireguard-tools
$ sudo apt install wireguard-tools netclient
$ sudo netclient join -vvv -t [token]
[netclient] 2022-05-31 16:35:08 joining [network] at api.[domain]:443
[netclient] 2022-05-31 16:35:08 node created on remote server...updating configs
[netclient] 2022-05-31 16:35:08 turn on UDP hole punching (dynamic port setting)? no
[netclient] 2022-05-31 16:35:08 could not read CA file  open /etc/netclient/broker.[domain]/root.pem: no such file or directory
[netclient] 2022-05-31 16:35:08 failed to append cert
2022/05/31 16:35:08 could not read client cert/key open /etc/netclient/broker.[domain]/client.pem: no such file or directory

The error from the original post now reproduces reliably with this sequence:

$ sudo netclient leave -n [network]
$ sudo netclient leave
$ sudo netclient uninstall
$ sudo apt remove wireguard-tools netclient
$ sudo apt install wireguard-tools netclient
$ sudo netclient join -vvv -t [token]

I also tried rebooting between the apt remove and apt install steps, but the same issue persisted.

btw, netclient version:

$ sudo netclient --version
Netclient version v0.14.1

mattkasun commented 2 years ago

When you do a join, do you get a log line with "starting wireguard"?

catcake commented 2 years ago

No. For every command whose output I showed, I've included the complete output, not snippets.

mattkasun commented 2 years ago

Have you tried the MQ troubleshooting steps?

https://gist.github.com/mattkasun/face2a7c1f32031a2126ff7243caad12

catcake commented 2 years ago

Yes.

  1. docker-compose.yaml & mosquitto.conf match the given examples.
  2. port 8883 is open and publicly reachable; broker.[NETMAKER_BASE_DOMAIN] resolves to the correct host.
  3. mq certs generated properly:
    $ docker logs mq
    1653980105: mosquitto version 2.0.11 starting
    1653980105: Config loaded from /mosquitto/config/mosquitto.conf.
    1653980105: Opening ipv4 listen socket on port 8883.
    1653980105: Opening ipv6 listen socket on port 8883.
    1653980105: Opening ipv4 listen socket on port 1883.
    1653980105: Opening ipv6 listen socket on port 1883.
    1653980105: mosquitto version 2.0.11 running
  4. and onward are N/A. The server certs are fine (all other hosts can join without issue). The problematic host does not receive any certs; /etc/netclient contains only config, which is empty.
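
The checklist items that apply here (broker DNS, port 8883, and certificates on the client) can be sketched as two shell functions. Both are hypothetical helpers, not part of Netmaker; the cert file names root.pem and client.pem come from the error logs above, broker.example.com is a placeholder, and the reachability check assumes the OpenBSD nc shipped with Ubuntu. A plain TCP check does not verify the TLS handshake itself.

```shell
#!/usr/bin/env bash
# Sketch only: missing_broker_certs and check_broker_reachable are
# hypothetical helpers, not part of Netmaker or netclient.

# Prints the broker cert files (named as in the error logs) that are absent
# from DIR; prints an empty line if both are present.
missing_broker_certs() {
  local dir="$1" f missing=""
  for f in root.pem client.pem; do
    [ -f "$dir/$f" ] || missing="$missing $f"
  done
  echo "${missing# }"
}

# Checks that HOST resolves and that TCP port 8883 (the port from step 2 of
# the checklist) accepts connections. Assumes OpenBSD netcat (-z, -w).
check_broker_reachable() {
  local host="$1"
  getent hosts "$host" >/dev/null || { echo "DNS lookup failed for $host"; return 1; }
  nc -z -w 5 "$host" 8883 || { echo "port 8883 unreachable on $host"; return 1; }
  echo "ok"
}

# Example usage (broker.example.com is a placeholder):
#   check_broker_reachable broker.example.com
#   missing_broker_certs /etc/netclient/broker.example.com
```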

n-able-consulting commented 2 years ago

I have the same issue on a new Netmaker install (done today)! Clearly a new feature... Same configuration: Ubuntu 22.04, docker-compose. The setup is now 3 VMs for Netmaker (compose solution) with HA Postgres at the back. It worked in the old setup; this issue arises with image gravitl/netmaker:v0.14.1. I will test some older images to see whether it is image/version related, as long as this DB configuration works with older images.

n-able-consulting commented 2 years ago

I did not have the latest image. Confirmed working with image v0.14.2. Issue solved for me.

catcake commented 2 years ago

Yeah, v0.14.2 seems to fix whatever it was. I changed only the netmaker & netmaker-ui image versions in the compose file from v0.14.1 to v0.14.2 and everything is working now.
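
The fix described above amounts to changing two image tags in the compose file. A minimal sketch, assuming GNU sed and a compose file named docker-compose.yaml that references the images as gravitl/netmaker:v0.14.1 and gravitl/netmaker-ui:v0.14.1; bump_netmaker_images is a hypothetical helper, and other images in the file are left untouched.

```shell
#!/usr/bin/env bash
# Sketch only: bump_netmaker_images is a hypothetical helper.
# Rewrites gravitl/netmaker:v0.14.1 and gravitl/netmaker-ui:v0.14.1 to v0.14.2
# in place (GNU sed -i assumed); other image lines are not touched.
bump_netmaker_images() {
  local file="$1"
  sed -i -E 's#(gravitl/netmaker(-ui)?):v0\.14\.1#\1:v0.14.2#g' "$file"
}

# Example usage, run from the directory holding the compose file:
#   bump_netmaker_images docker-compose.yaml
#   docker-compose pull && docker-compose up -d
```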