gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: Node needs to "repull" config after restart #1393

Closed. kellervater closed this issue 2 years ago

kellervater commented 2 years ago

Contact Details

patrick.poetz@voo.aero

What happened?

Since upgrading from v0.14.0 to v0.14.5, my nodes need to pull their config again to rejoin the network. I run 3 bare-metal nodes with netclient installed via apt; they will later form a k8s cluster. The nodes don't rejoin the network after a restart or crash, so I would need either an extra startup script that runs netclient pull or manual intervention.

Version

v0.14.5

What OS are you using?

Linux

Relevant log output

There's no explicit log about netclient when node is restarted. This one is just the log for the "rejoin/repull":

root@aio3:~# netclient pull
[netclient] 2022-07-17 06:13:51 No network selected. Running Pull for all networks.
[netclient] 2022-07-17 06:13:51 error remove interface nm-rke exit status 1
[netclient] 2022-07-17 06:13:51 UDP hole punching enabled for node aio3
[netclient] 2022-07-17 06:13:53 certificates/key saved


mattkasun commented 2 years ago

How are you installing/updating netclient? If you are using your distro package manager, the latest packages should enable the netclient service so that it starts after a reboot.
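A quick way to check the point above is to ask systemd whether the unit is enabled. This is a minimal sketch: the unit name `netclient.service` is an assumption (the thread only says "the netclient service"), and `check_enabled` is a hypothetical helper name.

```shell
# Report whether a systemd unit is enabled to start at boot.
# Assumption: the package installs a unit called netclient.service.
check_enabled() {
  unit=$1
  # `systemctl is-enabled` prints the state (enabled, disabled, static, ...)
  # and exits non-zero for disabled or unknown units.
  state=$(systemctl is-enabled "$unit" 2>/dev/null || true)
  if [ "$state" = "enabled" ]; then
    echo "$unit is enabled"
  else
    echo "$unit is not enabled (state: ${state:-unknown})"
    return 1
  fi
}

# Usage (on a node):
#   check_enabled netclient.service || sudo systemctl enable netclient.service
```

If the unit turns out to be disabled, enabling it may be enough to make the node rejoin after a reboot without a manual pull.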

ferreirocm commented 2 years ago

I have the same issue on an RPi and an Ubuntu 18.04 VM, using a clean install of netclient v0.14.5 (the latest version at the moment).

kellervater commented 2 years ago

I did a fresh install on the weekend, but it's still the same. First I uninstalled netclient on all nodes and then reinstalled it with ansible playbooks:

# ansible tasks
- hosts: nodes:rancher
  tasks:
    - name: uninstall netclient
      shell: |
        netclient uninstall
      become: yes
      register: _result
      failed_when: "_result.rc != 0 and 'netclient: not found' not in _result.stderr"
      changed_when: "'uninstalled netclient' in _result.stderr"
    - name: uninstall apt dependency
      ansible.builtin.apt:
        pkg: netclient
        state: absent
      become: yes

The above tasks translate to:

sudo netclient uninstall
sudo apt remove netclient

Then the fresh installation via ansible (excerpt):

# netmaker installation comes first
...
- hosts: nodes:rancher
  any_errors_fatal: true
  pre_tasks:
    - name: Netclient prerequisites
      shell: |
        curl -sL 'https://apt.netmaker.org/gpg.key' | tee /etc/apt/trusted.gpg.d/netclient.asc
        curl -sL 'https://apt.netmaker.org/debian.deb.txt' | tee /etc/apt/sources.list.d/netclient.list
      become: yes 
    - name: Install packages
      apt:
        pkg:
          - netclient={{ netclient_version }} # netclient_version: 0.14.5-2
      become: yes
  tasks:
    - name: Join Network
      shell: |
        netclient join -t {{ hostvars['netmaker']['network_access_token'] }} {% if netmaker_ip|length > 0 %}--address {{ netmaker_ip }}{% endif %}
      register: join_result
      changed_when: "'ALREADY_INSTALLED' not in join_result.stdout"
      become: yes
...
# then make all nodes static via API
...
    - name: Pull latest config
      shell: |
        netclient pull -n {{ hostvars['netmaker']['network']['id'] }}
      become: yes
    - name: ping all peers (including self)
      shell: |
        ping "{{ item }}" -c 1
      register: result
      retries: 5
      delay: 5
      until: result.rc == 0
      loop: "{{ nodes.json|map(attribute='address') }}"

After this, everything works fine until a reboot. After a reboot I'd expect my peers to be pingable, but even after 10 minutes nothing happens. With a netclient pull it works instantly.

Since the WireGuard network forms the base of my k8s cluster, nodes can no longer recover automatically. Right now, though, it's more of an annoyance than an issue.
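The reboot behaviour described above (peers unreachable until a pull) can be checked from a shell with the same retry pattern the playbook uses (5 retries, 5 seconds apart). This is a rough sketch; `retry` is a hypothetical helper and the peer address in the usage line is a placeholder.

```shell
# Run a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Only sleep if another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (placeholder peer address): fall back to a pull if the peer stays unreachable.
#   retry 5 5 ping -c 1 10.0.0.2 || sudo netclient pull
```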

ghgeiger commented 2 years ago

I'd like to add that I'm also having this problem on a fresh install of 0.14.6. My initial install (0.12.2) did not have this problem. I kept upgrading with each release, and sometime around when the netclient repository became available, this problem began. After updating to 0.14.3 I waited a few releases and then attempted a fresh install of Netmaker on the VPS, assuming the problem might have to do with upgrading from an old config file, but the problem is still there.

I have Netmaker installed on a Hetzner VPS running Ubuntu 22.04. I have netclient installed on Raspberry Pi OS (Debian Bullseye) and Ubuntu Desktop 22.04, and the problem exists on both boxes. Further, on the Ubuntu box, after I run 'netclient pull' my outbound connection to the internet breaks and I have to manually disable and re-enable the wired connection to get everything working again. Not very convenient when I'm away from the boxes.

martinkeat commented 2 years ago

Did you ever resolve this issue?

kellervater commented 2 years ago

I haven't tested on v0.15.0 so far. On the version mentioned above, I just created a startup script that runs netclient pull.

martinkeat commented 2 years ago

@ppoetz Don't suppose you have a copy of that script, do you?

Tivin-i commented 2 years ago

@martinkeat

sudo nano /etc/systemd/system/netclientpull.service

Service script:

[Unit]
Description=Run a netclient pull

[Service]
Type=oneshot
User=root
Group=root
UMask=1000
ExecStart=/usr/sbin/netclient pull
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target


Then run:

sudo systemctl daemon-reload
sudo systemctl enable netclientpull.service
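One possible refinement of this workaround is to order the unit after `network-online.target`, on the assumption that the pull needs to reach the Netmaker server and could otherwise run before the network is up. The sketch below writes such a unit to a path of your choosing so it can be inspected first; `write_unit` is a hypothetical helper, and the ordering lines are an addition not present in the original post.

```shell
# Write a oneshot unit that runs `netclient pull` once the network is online.
# $1 is the destination path of the unit file.
write_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=Run a netclient pull
# Assumption: the pull needs network access, so wait for network-online.
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/netclient pull
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}

# Usage: write to a scratch file, review it, then install as root:
#   write_unit ./netclientpull.service
#   sudo install -m 0644 ./netclientpull.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable netclientpull.service
```

Note that `network-online.target` only delays the unit meaningfully if a wait-online service (for example from NetworkManager or systemd-networkd) is enabled on the node.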

mattkasun commented 2 years ago

@ppoetz is this still an issue?

ghgeiger commented 2 years ago

I was able to resolve it by manually editing the .service file after raising the issue on discord (this was before the solution was posted above). I haven't had any problems with updates since then, but I can't say whether it would've resolved itself in an update if I hadn't intervened.

kellervater commented 2 years ago

@mattkasun I will upgrade my cluster on Saturday to the latest version and then give you an update on this. If the issue persists, I'll try the workaround @ghgeiger mentioned.

kellervater commented 2 years ago

So... I performed an upgrade from v0.14.5 to v0.16.0 on Netmaker as well as an update from v0.14.5-2 to v0.16.0-2 on all netclients (apt) in our networks.

And my issue is resolved! I can now reboot any instance and instantly ping other nodes! Thank you very much!