netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Client 0.30.0 Exit Node not working as 0.29.4 #2707

Open rkleivel opened 1 month ago

rkleivel commented 1 month ago

Problem: After upgrading clients to 0.30.0, nodes in an exit node distribution group lose their internet connection if the exit node is restarted.

To Reproduce
1) Create 2 groups, is_exit_node and uses_exit_node
2) Add a node to each group (preferably behind different public IPs for easier testing)
3) Create a policy that allows the groups to communicate (unless the Default policy All <-> All is active)
4) Add an exit node under network routes that makes the node in is_exit_node the exit node of the node in uses_exit_node
5) Run curl ipinfo.io on each node and verify that the public IPs are identical
6) Run netbird down && netbird up on the exit node
7) Wait a minute to allow settings to be updated
8) Run curl ipinfo.io on each node
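The comparison in steps 5 and 8 can be scripted. This is only a minimal sketch: the helper names are hypothetical, and it assumes that ipinfo.io/ip returns the caller's public IP as plain text (the steps above use plain curl ipinfo.io).

```shell
#!/bin/sh
# Sketch of the check in steps 5 and 8: do both nodes show the same public IP?
# Assumption: ipinfo.io/ip returns the caller's public IP as plain text.

# Print this node's public IP (run it on each node).
public_ip() {
    curl -s --max-time 10 ipinfo.io/ip
}

# Succeed only if both arguments are non-empty and identical.
same_public_ip() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" = "$2" ]
}

# Report whether the client still egresses via the exit node.
# $1 = public IP seen on the exit node, $2 = public IP seen on the client
check_exit_node() {
    if same_public_ip "$1" "$2"; then
        echo "OK: client egresses via exit node ($1)"
    else
        echo "FAIL: exit node=$1 client=$2"
        return 1
    fi
}
```

Run public_ip on each node (for example over ssh) and pass both results to check_exit_node, once before and once after step 6.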

Expected behavior: Each node should still appear to be behind the same public IP.

Are you using NetBird Cloud? Yes

NetBird version: 0.30.0 (failing) and 0.29.4 (working)

Additional info: Both nodes are running Ubuntu 24.04 server

As this has been fairly easy to reproduce, I have not attached any logs at this stage. Please let me know if they are needed, and I'll happily provide them :)

mlsmaycon commented 1 month ago

Hello @rkleivel, can you please share the output of nft list ruleset from the exit node?

rkleivel commented 1 month ago

Thanks @mlsmaycon! Here is the ruleset after netbird down / up on the exit node:

table ip filter {
    chain INPUT {
        type filter hook input priority filter; policy accept;
    }

    chain OUTPUT {
        type filter hook output priority filter; policy accept;
    }

    chain FORWARD {
        type filter hook forward priority filter; policy accept;
        oifname "wt0" ct state established,related counter packets 0 bytes 0 accept
        iifname "wt0" counter packets 0 bytes 0 accept
    }
}
table ip nat {
    chain POSTROUTING {
        type nat hook postrouting priority srcnat; policy accept;
    }
}
table ip netbird {
    set nb0000001 {
        type ipv4_addr
        flags dynamic
        elements = { 100.93.17.98 }
    }

    set nb0000002 {
        type ipv4_addr
        flags dynamic
        elements = { 100.93.17.98 }
    }

    chain netbird-rt-fwd {
        ct state established,related accept
        counter packets 0 bytes 0 accept
    }

    chain netbird-rt-nat {
        type nat hook postrouting priority srcnat - 1; policy accept;
        iifname "wt0" counter packets 1 bytes 176 masquerade
        oifname "wt0" counter packets 0 bytes 0 masquerade
    }

    chain netbird-acl-input-rules {
        ct state established,related accept
        ip saddr @nb0000001 accept
    }

    chain netbird-acl-output-rules {
        ct state established,related accept
        ip daddr @nb0000002 accept
    }

    chain netbird-acl-input-filter {
        type filter hook input priority filter; policy accept;
        iifname "wt0" jump netbird-acl-input-rules
        iifname "wt0" drop
    }

    chain netbird-acl-output-filter {
        type filter hook output priority filter; policy accept;
        oifname "wt0" ip daddr != 100.93.0.0/16 accept
        oifname "wt0" jump netbird-acl-output-rules
        oifname "wt0" drop
    }

    chain netbird-acl-forward-filter {
        type filter hook forward priority filter; policy accept;
        iifname "wt0" jump netbird-rt-fwd
        iifname "wt0" drop
    }
}
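Dumps like the one above embed live packet and byte counters, so a plain diff between two captures is dominated by counter noise. A minimal sketch for comparing two saved dumps structurally follows; the function names are hypothetical and the file names in the usage note are just examples.

```shell
#!/bin/sh
# Strip nftables counter values so that only rule changes show up in a diff.

# Replace "counter packets N bytes M" with a bare "counter".
# Reads the named files, or stdin when no file is given.
normalize_ruleset() {
    sed -E 's/counter packets [0-9]+ bytes [0-9]+/counter/g' "$@"
}

# Diff two saved rulesets while ignoring counter values.
# Returns diff's exit status: 0 = same rules, 1 = rules differ.
ruleset_diff() {
    a=$(mktemp) && b=$(mktemp) || return 2
    normalize_ruleset "$1" > "$a"
    normalize_ruleset "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

For example, ruleset_diff exit_node_working.txt exit_node_not_working.txt prints nothing when the two dumps differ only in their counters.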

The diff from when it was working looks like this and does not seem significant (only the packet counters differ):

diff exit_node_working.txt exit_node_not_working.txt 
12,13c12,13
<       oifname "wt0" ct state established,related counter packets 1048 bytes 2078521 accept
<       iifname "wt0" counter packets 909 bytes 54393 accept
---
>       oifname "wt0" ct state established,related counter packets 0 bytes 0 accept
>       iifname "wt0" counter packets 0 bytes 0 accept
36c36
<       counter packets 25 bytes 1580 accept
---
>       counter packets 0 bytes 0 accept
41c41
<       iifname "wt0" counter packets 18 bytes 1160 masquerade
---
>       iifname "wt0" counter packets 1 bytes 176 masquerade

mgarces commented 1 month ago

hi there; can you try our latest release v0.30.1 please?

rkleivel commented 1 month ago

> hi there; can you try our latest release v0.30.1 please?

Sure! Unfortunately, I cannot see any improvement over 0.30.0.

mlsmaycon commented 1 month ago

Hello @rkleivel can you please run the following commands?

On exit node:

sysctl net.ipv4.ip_forward
sudo tcpdump -i any -nn host 1.1.1.1 and port 443 # keep this running while testing on client

On client:

ip route get 1.1.1.1
nc -vw 5 -z 1.1.1.1 443

Then, share the output with us.
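To make the client-side test easier to line up with the tcpdump capture, the probe can be wrapped so that every attempt is timestamped. This is a minimal sketch with hypothetical helper names; it runs whatever probe command it is given and only reports success or failure.

```shell
#!/bin/sh
# Timestamped reachability probe (hypothetical helpers, not part of NetBird).

# Run the given probe command once; print "<UTC time> OK" or "<UTC time> FAIL".
probe_once() {
    if "$@" >/dev/null 2>&1; then
        echo "$(date -u +%H:%M:%S) OK"
    else
        echo "$(date -u +%H:%M:%S) FAIL"
    fi
}

# Repeat the probe COUNT times, sleeping INTERVAL seconds between attempts.
# usage: probe_loop COUNT INTERVAL CMD...
probe_loop() {
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        probe_once "$@"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, probe_loop 60 5 nc -w 5 -z 1.1.1.1 443 repeats the nc check above 60 times with a 5 second pause between attempts.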

rkleivel commented 1 month ago

Hi @mlsmaycon,

Contrary to my comment of Oct 11 at 10.11 GMT, I am currently not able to reproduce the issue. Below is the output from your commands.

However, High Availability with 2 exit nodes still does not seem to work. I did not mention it earlier because I had not had time to test it thoroughly, but I noticed it last week while debugging the issue in this thread and have a feeling it might be related. I will provide similar outputs in another comment.

Here is the logging for a single exit node that goes down and up, showing that it now works as expected. (I added some timestamps to make it easier to relate the two outputs.) Both nodes are on 0.30.1:

Exit node:

admin@exit1:~$ date && sudo tcpdump -i any -nn host 1.1.1.1 and port 443
Mon Oct 14 07:46:36 UTC 2024
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
07:46:48.250437 wt0   In  IP 100.93.158.105.32836 > 1.1.1.1.443: Flags [S], seq 1080105978, win 64480, options [mss 1240,sackOK,TS val 4127900581 ecr 0,nop,wscale 7], length 0
07:46:48.250465 ens18 Out IP 192.168.7.26.32836 > 1.1.1.1.443: Flags [S], seq 1080105978, win 64480, options [mss 1240,sackOK,TS val 4127900581 ecr 0,nop,wscale 7], length 0
07:46:48.259774 ens18 In  IP 1.1.1.1.443 > 192.168.7.26.32836: Flags [S.], seq 3227839981, ack 1080105979, win 65535, options [mss 1460,sackOK,TS val 2170826905 ecr 4127900581,nop,wscale 13], length 0
07:46:48.259793 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.32836: Flags [S.], seq 3227839981, ack 1080105979, win 65535, options [mss 1460,sackOK,TS val 2170826905 ecr 4127900581,nop,wscale 13], length 0
07:46:48.321426 wt0   In  IP 100.93.158.105.32836 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4127900651 ecr 2170826905], length 0
07:46:48.321442 ens18 Out IP 192.168.7.26.32836 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4127900651 ecr 2170826905], length 0
07:46:48.322012 wt0   In  IP 100.93.158.105.32836 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4127900651 ecr 2170826905], length 0
07:46:48.322025 ens18 Out IP 192.168.7.26.32836 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4127900651 ecr 2170826905], length 0
07:46:48.331567 ens18 In  IP 1.1.1.1.443 > 192.168.7.26.32836: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 2170826977 ecr 4127900651], length 0
07:46:48.331598 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.32836: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 2170826977 ecr 4127900651], length 0
07:46:48.377526 wt0   In  IP 100.93.158.105.32836 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4127900708 ecr 2170826977], length 0
07:46:48.377546 ens18 Out IP 192.168.7.26.32836 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4127900708 ecr 2170826977], length 0
^C
12 packets captured
14 packets received by filter
0 packets dropped by kernel
admin@exit1:~$ netbird down && netbird up
Disconnected
Connected
admin@exit1:~$ date && sudo tcpdump -i any -nn host 1.1.1.1 and port 443
Mon Oct 14 07:47:47 UTC 2024
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
07:47:51.625745 wt0   In  IP 100.93.158.105.56110 > 1.1.1.1.443: Flags [S], seq 2392515285, win 64480, options [mss 1240,sackOK,TS val 4127963956 ecr 0,nop,wscale 7], length 0
07:47:51.625771 ens18 Out IP 192.168.7.26.56110 > 1.1.1.1.443: Flags [S], seq 2392515285, win 64480, options [mss 1240,sackOK,TS val 4127963956 ecr 0,nop,wscale 7], length 0
07:47:51.635630 ens18 In  IP 1.1.1.1.443 > 192.168.7.26.56110: Flags [S.], seq 1033369258, ack 2392515286, win 65535, options [mss 1460,sackOK,TS val 3968834669 ecr 4127963956,nop,wscale 13], length 0
07:47:51.635670 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.56110: Flags [S.], seq 1033369258, ack 2392515286, win 65535, options [mss 1460,sackOK,TS val 3968834669 ecr 4127963956,nop,wscale 13], length 0
07:47:51.680578 wt0   In  IP 100.93.158.105.56110 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4127964011 ecr 3968834669], length 0
07:47:51.680599 ens18 Out IP 192.168.7.26.56110 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4127964011 ecr 3968834669], length 0
07:47:51.680832 wt0   In  IP 100.93.158.105.56110 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4127964011 ecr 3968834669], length 0
07:47:51.680852 ens18 Out IP 192.168.7.26.56110 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4127964011 ecr 3968834669], length 0
07:47:51.691850 ens18 In  IP 1.1.1.1.443 > 192.168.7.26.56110: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 3968834725 ecr 4127964011], length 0
07:47:51.691877 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.56110: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 3968834725 ecr 4127964011], length 0
07:47:51.738217 wt0   In  IP 100.93.158.105.56110 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4127964069 ecr 3968834725], length 0
07:47:51.738254 ens18 Out IP 192.168.7.26.56110 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4127964069 ecr 3968834725], length 0
^C
12 packets captured
13 packets received by filter
0 packets dropped by kernel
admin@exit1:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
admin@exit1:~$

Client:

admin@client:~$ date && ip route get 1.1.1.1
Mon Oct 14 07:46:38 UTC 2024
1.1.1.1 dev wt0 table netbird src 100.93.158.105 uid 1000 
    cache 
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:46:48 UTC 2024
Connection to 1.1.1.1 443 port [tcp/https] succeeded!
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:47:07 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && ip route get 1.1.1.1
Mon Oct 14 07:47:13 UTC 2024
1.1.1.1 dev wt0 table netbird src 100.93.158.105 uid 1000 
    cache 
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:47:16 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:47:22 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:47:29 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:47:35 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:47:42 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:47:51 UTC 2024
Connection to 1.1.1.1 443 port [tcp/https] succeeded!
admin@client:~$ 
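The length of the outage can be read off the timestamps. In the client log above, the last successful probe before the restart is stamped 07:46:48 and the next success 07:47:51. A small sketch for the arithmetic follows (the helper name is hypothetical, and both timestamps are assumed to fall on the same day):

```shell
#!/bin/sh
# Seconds elapsed between two same-day HH:MM:SS timestamps.
# usage: elapsed_seconds START END
elapsed_seconds() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, ":")
        split(b, y, ":")
        print (y[1] * 3600 + y[2] * 60 + y[3]) - (x[1] * 3600 + x[2] * 60 + x[3])
    }'
}

elapsed_seconds 07:46:48 07:47:51   # prints 63
```

So the client was without connectivity for at most about a minute around the netbird down && netbird up on the exit node.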

mlsmaycon commented 1 month ago

Thanks, @rkleivel, for sharing the outputs.

OK, to confirm what we see with the timestamps: the failure occurred after you restarted the connection and is probably related to the time it took for the peers to connect. Right?

With the previous release, we fixed an issue with forwarding rules caused by the number of peers in an access control rule, which shouldn't affect nodes with exit nodes and no access control groups set in any of the routing peer routes. So it may not have affected you unless you had an access control group for a network route.

We will wait for your check with HA as well.

rkleivel commented 1 month ago

As promised, here follow the outputs for 1 client with 2 exit nodes. Initially both exit nodes are up. I confirm that routing through exit node 1 is OK, then I take exit node 1 down. Routing does not switch to exit node 2 until I manually deactivate and reactivate the exit node entry in the web GUI.

Client:

admin@client:~$ date && ip route get 1.1.1.1
Mon Oct 14 07:58:39 UTC 2024
1.1.1.1 dev wt0 table netbird src 100.93.158.105 uid 1000 
    cache 
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:59:10 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:59:19 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 07:59:35 UTC 2024
Connection to 1.1.1.1 443 port [tcp/https] succeeded!
admin@client:~$ 
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:00:11 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:00:27 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:00:44 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:01:10 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:01:41 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:03:04 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && ip route get 1.1.1.1
Mon Oct 14 08:04:25 UTC 2024
1.1.1.1 dev wt0 table netbird src 100.93.158.105 uid 1000 
    cache 
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:04:42 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:04:56 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:05:12 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:05:35 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:06:00 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:06:26 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ 
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:06:57 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:07:32 UTC 2024
nc: connect to 1.1.1.1 port 443 (tcp) timed out: Operation now in progress
admin@client:~$ date && ip route get 1.1.1.1
Mon Oct 14 08:08:28 UTC 2024
1.1.1.1 dev wt0 table netbird src 100.93.158.105 uid 1000 
    cache 
admin@client:~$

EXIT Node deactivated and activated in GUI at this point

admin@client:~$ date && nc -vw 5 -z 1.1.1.1 443
Mon Oct 14 08:08:30 UTC 2024
Connection to 1.1.1.1 443 port [tcp/https] succeeded!
admin@client:~$ date
Mon Oct 14 08:08:54 UTC 2024
admin@client:~$

Exit Node 1:

admin@exit1:~$ date && sysctl net.ipv4.ip_forward
Mon Oct 14 07:58:46 UTC 2024
net.ipv4.ip_forward = 1
admin@exit1:~$ date && sudo tcpdump -i any -nn host 1.1.1.1 and port 443 
Mon Oct 14 07:59:01 UTC 2024
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
07:59:39.508343 wt0   In  IP 100.93.158.105.47854 > 1.1.1.1.443: Flags [S], seq 2382693363, win 64480, options [mss 1240,sackOK,TS val 4128671838 ecr 0,nop,wscale 7], length 0
07:59:39.508381 ens18 Out IP 192.168.7.26.47854 > 1.1.1.1.443: Flags [S], seq 2382693363, win 64480, options [mss 1240,sackOK,TS val 4128671838 ecr 0,nop,wscale 7], length 0
07:59:39.517940 ens18 In  IP 1.1.1.1.443 > 192.168.7.26.47854: Flags [S.], seq 2277416614, ack 2382693364, win 65535, options [mss 1460,sackOK,TS val 1861873 ecr 4128671838,nop,wscale 13], length 0
07:59:39.517982 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.47854: Flags [S.], seq 2277416614, ack 2382693364, win 65535, options [mss 1460,sackOK,TS val 1861873 ecr 4128671838,nop,wscale 13], length 0
07:59:39.563320 wt0   In  IP 100.93.158.105.47854 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4128671893 ecr 1861873], length 0
07:59:39.563339 ens18 Out IP 192.168.7.26.47854 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4128671893 ecr 1861873], length 0
07:59:39.564156 wt0   In  IP 100.93.158.105.47854 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4128671893 ecr 1861873], length 0
07:59:39.564175 ens18 Out IP 192.168.7.26.47854 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4128671893 ecr 1861873], length 0
07:59:39.573752 ens18 In  IP 1.1.1.1.443 > 192.168.7.26.47854: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 1861929 ecr 4128671893], length 0
07:59:39.573792 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.47854: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 1861929 ecr 4128671893], length 0
07:59:39.618865 wt0   In  IP 100.93.158.105.47854 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4128671948 ecr 1861929], length 0
07:59:39.618890 ens18 Out IP 192.168.7.26.47854 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4128671948 ecr 1861929], length 0
^C
12 packets captured
14 packets received by filter
0 packets dropped by kernel
admin@exit1:~$ date && netbird down
Mon Oct 14 08:00:03 UTC 2024
Disconnected
admin@exit1:~$ date && sudo tcpdump -i any -nn host 1.1.1.1 and port 443 
Mon Oct 14 08:04:40 UTC 2024
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
^C
0 packets captured
2 packets received by filter
0 packets dropped by kernel
admin@exit1:~$ date
Mon Oct 14 08:08:58 UTC 2024
admin@exit1:~$

Exit Node 2:

admin@exit2:~$ date && sysctl net.ipv4.ip_forward
Mon Oct 14 07:58:49 UTC 2024
net.ipv4.ip_forward = 1
admin@exit2:~$ date && sudo tcpdump -i any -nn host 1.1.1.1 and port 443 
Mon Oct 14 07:59:02 UTC 2024
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
08:08:30.973700 wt0   In  IP 100.93.158.105.35580 > 1.1.1.1.443: Flags [S], seq 1509625333, win 64480, options [mss 1240,sackOK,TS val 4129203321 ecr 0,nop,wscale 7], length 0
08:08:30.973717 eth0  Out IP 172.17.0.4.35580 > 1.1.1.1.443: Flags [S], seq 1509625333, win 64480, options [mss 1240,sackOK,TS val 4129203321 ecr 0,nop,wscale 7], length 0
08:08:30.985416 eth0  In  IP 1.1.1.1.443 > 172.17.0.4.35580: Flags [S.], seq 1235011614, ack 1509625334, win 65535, options [mss 1460,sackOK,TS val 560577716 ecr 4129203321,nop,wscale 13], length 0
08:08:30.985430 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.35580: Flags [S.], seq 1235011614, ack 1509625334, win 65535, options [mss 1460,sackOK,TS val 560577716 ecr 4129203321,nop,wscale 13], length 0
08:08:30.996615 wt0   In  IP 100.93.158.105.35580 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4129203343 ecr 560577716], length 0
08:08:30.996627 eth0  Out IP 172.17.0.4.35580 > 1.1.1.1.443: Flags [.], ack 1, win 504, options [nop,nop,TS val 4129203343 ecr 560577716], length 0
08:08:30.996632 wt0   In  IP 100.93.158.105.35580 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4129203344 ecr 560577716], length 0
08:08:30.996636 eth0  Out IP 172.17.0.4.35580 > 1.1.1.1.443: Flags [F.], seq 1, ack 1, win 504, options [nop,nop,TS val 4129203344 ecr 560577716], length 0
08:08:31.009041 eth0  In  IP 1.1.1.1.443 > 172.17.0.4.35580: Flags [.], ack 2, win 8, options [nop,nop,TS val 560577740 ecr 4129203344], length 0
08:08:31.009041 eth0  In  IP 1.1.1.1.443 > 172.17.0.4.35580: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 560577740 ecr 4129203344], length 0
08:08:31.009062 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.35580: Flags [.], ack 2, win 8, options [nop,nop,TS val 560577740 ecr 4129203344], length 0
08:08:31.009081 wt0   Out IP 1.1.1.1.443 > 100.93.158.105.35580: Flags [F.], seq 1, ack 2, win 8, options [nop,nop,TS val 560577740 ecr 4129203344], length 0
08:08:31.030793 wt0   In  IP 100.93.158.105.35580 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4129203377 ecr 560577740], length 0
08:08:31.030800 eth0  Out IP 172.17.0.4.35580 > 1.1.1.1.443: Flags [.], ack 2, win 504, options [nop,nop,TS val 4129203377 ecr 560577740], length 0
^C
14 packets captured
16 packets received by filter
0 packets dropped by kernel
admin@exit2:~$ date
Mon Oct 14 08:08:50 UTC 2024
admin@exit2:~$
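To see at a glance which exit node carried a connection, and which source address it was masqueraded to, the outgoing SYN packets can be filtered out of a saved capture. This is a rough sketch that assumes the tcpdump -i any line layout shown above (time, interface, direction, "IP", source, ...); the helper name is hypothetical.

```shell
#!/bin/sh
# Print "<interface> <source addr.port>" for every outgoing SYN
# found in `tcpdump -i any -nn` text output read from stdin.
egress_syns() {
    awk '$3 == "Out" && /Flags \[S\],/ { print $2, $5 }'
}
```

Fed with the Exit Node 2 capture above, this would print eth0 172.17.0.4.35580, the client's SYN leaving eth0 with a translated source address; on Exit Node 1 the equivalent lines show ens18 and 192.168.7.26.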

rkleivel commented 1 month ago

> Thanks, @rkleivel, for sharing the outputs.
>
> OK, to confirm what we see with the timestamps: the failure occurred after you restarted the connection and is probably related to the time it took for the peers to connect. Right?
>
> With the previous release, we fixed an issue with forwarding rules caused by the number of peers in an access control rule, which shouldn't affect nodes with exit nodes and no access control groups set in any of the routing peer routes. So it may not have affected you unless you had an access control group for a network route.
>
> We will wait for your check with HA as well.

I can confirm that the failure occurred after netbird down && netbird up on the exit node. Now it just takes up to a couple of minutes until the connection is restored. Last week, when creating this issue, I could easily wait for half an hour and the connection was still not restored. I can also confirm that Access Control Groups (optional) in the web GUI exit node definition was empty during my tests.

mgarces commented 2 weeks ago

hi there, we have released 0.31.1, which addresses some of the issues described here; can you please test with this version?

rkleivel commented 9 hours ago

My apologies for the late response on this. I have tested the initial scenario, as well as the high availability aspect, on 0.33.0, and all now seems to work as expected. So thanks a lot for that!

I do, however, see some strange effects on Docker containers that run on a NetBird node in the uses_exit_node group (see the test system setup in my initial post):

the request times out. This does not happen if the exit node in use is hosted elsewhere, or if I do not use an exit node. If I run the same curl https://... on the Docker host, it is also always fine, no matter where the exit node is hosted. HTTP endpoints are always OK. All nodes involved in the test have the same setup: Ubuntu 24.04 with NetBird 0.33.0.

I do not necessarily expect this to be a NetBird issue, but I would be very thankful for any thoughts on where NetBird could possibly interact with a VM in Azure to cause such an effect.