tonarino / innernet

A private network system that uses WireGuard under the hood.
https://blog.tonari.no/introducing-innernet
MIT License

Timeout when redeeming invites simultaneously #163

Open amarao opened 2 years ago

amarao commented 2 years ago

I found that if two or more clients (servers) try to redeem invites at the same time, one of them gets a timeout. 'Simultaneous' is much easier to reproduce than it sounds, because Ansible playbooks normally run tasks in parallel on all servers in a group.

Here is a simple playbook to configure servers (not the innernet server, other servers):

- hosts: innernet
  tasks:
    - name: Create invite for the server
      delegate_to: '{{ innernet_auth_server }}'
      become: true
      command:
        innernet add-peer
        --name '{{ inventory_hostname }}'
        --ip '{{ innernet_ip }}'
        --cidr '{{ innernet_servers_cidr_name }}'
        --invite-expires {{ innernet_server_invite_expiration_time }}
        --save-config '{{ innernet_invite_path }}'
        --admin false
        --yes
        '{{ innernet_network_name }}'
      register: res
      changed_when: res.rc==0

    - name: Fetch invite
      become: true
      delegate_to: '{{ innernet_auth_server }}'
      shell: |
        cat '{{ innernet_invite_path }}'
        rm '{{ innernet_invite_path }}'
      register: invite
      changed_when: invite.rc==0

    - name: Save invite
      become: true
      copy:
        content: '{{ invite.stdout }}'
        dest: '{{ innernet_invite_path }}'
        owner: root
        group: root
        mode: '0600'

    - name: Accept invite
      become: true
      command:
        innernet
          install
          --default-name
          --delete-invite
          '{{ innernet_invite_path }}'
      register: res
      changed_when: res.rc==0

    - name: Activate systemd unit
      become: true
      systemd:
        name: innernet@{{ innernet_network_name }}

The 'Accept invite' task succeeds for the first host and fails for all others. Adding throttle: 1 or retries helps.
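
A minimal sketch of both workarounds applied to the 'Accept invite' task from the playbook above. The retries and delay values are assumptions chosen for illustration, not tested numbers, and the retry loop assumes that a failed attempt leaves the invite file in place so the next attempt can reuse it.

```yaml
# Sketch only: 'Accept invite' with throttle and retries added.
- name: Accept invite
  become: true
  throttle: 1          # run this task on one host at a time
  command:
    innernet
      install
      --default-name
      --delete-invite
      '{{ innernet_invite_path }}'
  register: res
  retries: 5           # assumed value: up to 5 attempts per host
  delay: 10            # assumed value: seconds to wait between attempts
  until: res.rc == 0   # stop retrying once the install succeeds
  changed_when: res.rc == 0
```

With throttle: 1 the hosts redeem their invites one after another, which avoids the simultaneous case entirely; retries only covers a redeem that still times out.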

janikvonrotz commented 2 years ago

Hi @amarao, I am working on an Ansible implementation as well: https://github.com/tonarino/innernet/discussions/166. Any hints on my problem?

amarao commented 2 years ago

@janikvonrotz, there is too little information to say anything definitive, but here is what I found:

  1. There is an issue with the outgoing interface used for the invitation (I think it needs fixing in the ureq library).
  2. Simultaneous invites do not work (this issue). If you use Ansible, use throttle: 1 on the task.
  3. I found that calling innernet fetch on the innernet-server interface breaks a lot (don't do it).

You may also want to try pausing the redeeming process for debugging (press Ctrl-Z and look at the temporary wg interface created by innernet; I found the wrong-interface issue by doing this).

janikvonrotz commented 2 years ago

@amarao I assume the issue with the wrong interface is #141. I will try to reproduce the issue. Thanks for your initiative and well-written reports.

mcginty commented 2 years ago

Interestingly, I added a quick section in the docker tests that spins up two peers redeeming invitations at about the same time, and it didn't error out. Will have to dig more into this. Thanks for reporting, and I'm sorry I've been away for so long!

amarao commented 2 years ago

I got permission from the company I work for to open-source our innernet playbooks/roles, which I'll do shortly. For testing I use libvirt VMs, and when Ansible does the redeeming, it does so in parallel, so the failure is pretty reproducible. I'm afraid GH Actions does not allow nested virtualization, so to run it one would need a normal Linux machine.

mcginty commented 2 years ago

@amarao that would be great, thanks! I started working on dropping the docker tests in favor of just using netns on linux directly, using https://git.zx2c4.com/wireguard-linux/tree/tools/testing/selftests/wireguard/netns.sh as the base, but if the ansible playbooks are more readable that could be another possible way to do "integration" testing.

janikvonrotz commented 2 years ago

Hey @amarao, have you published your Ansible playbooks on gh/galaxy?

I am working on:

The roles are not as good as they should be.

amarao commented 2 years ago

I'm working on open-sourcing them; it's a bit harder than I expected (mostly ripping out internal stuff and making it self-contained). My plan is to publish playbooks, not roles, as I don't believe roles can work here (there is too much delegation and cross-host orchestration for a role). So far I've got Molecule working. I think it will take a couple of days (well, evenings).

amarao commented 2 years ago

Finally, I got everything streamlined and permitted for publishing. My playbook is here: https://github.com/lidofinance/innernet-playbooks

It's full of 'retries', 'throttle: 1', etc.; nevertheless, there is a 30-50% chance that one of the nodes fails to redeem its invite. Each of those hacks points to an issue with automation.

DanielJoyce commented 2 years ago

Does the server use async, and if so, is it using the multi-threaded runner or just the default single-threaded one? Another place to look is SQLite (I think innernet uses it for storing assigned IPs), which is single-threaded. But it shouldn't block long enough to back up.

mcginty commented 2 years ago

@DanielJoyce yep, it uses async with a multithreaded runner (tokio).