netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

unstable connection #1694

Closed pascal456 closed 6 months ago

pascal456 commented 7 months ago

Describe the problem

I am trying out NetBird as a VPN and I am really satisfied with the setup process. Everything went fine connecting

Now in the office everything is fine. Connection is stable. Actually the machines are on the same local network in that case.

Then, when using the connection from home, it is extremely unstable.

To Reproduce

  1. Scenario: ping for simple connection check

    • Connect from remote / home network
    • run ping; timeouts occur; somehow it looks like the connection drops for about three ping attempts at a time
    • sometimes however, the pings seem to be constantly working / the connection seems to be stable over a longer period of time; then again, when I start using ssh (see scenario 2 below) the connection gets unstable again
  2. Scenario: Use Case remote development

    • I use the connection to login via ssh
    • all in all ssh is functioning as expected: I can connect, I can manage the system etc.
    • the connection drops sporadically
    • I use VS Code to connect remotely via ssh and develop on the remote machine. Now when the connection drops, the connection cannot be re-established correctly anymore, resulting in loss of progress (worst case)
    • therefore, NetBird is not a practical option for us for remote development at the moment

Expected behavior

Are you using NetBird Cloud?

NetBird version

netbird version

on connecting machine / dev-pc1 (windows client):

PS C:\Windows\System32> netbird version
0.26.2

on remote-workstation / ubuntu:

➜  ~ netbird version
0.26.2

NetBird status -d output:

on connecting machine / dev-pc1 (windows client):

```
PS C:\Windows\System32> netbird status -d
Peers detail:
 .netbird.cloud:
  NetBird IP: 100.85.98.15/32
  Public key: sP0Ik/u/rzaGU65ueve8UnvI3rAgFhA9BbtFgR3o9wQ=
  Status: Disconnected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): srflx/relay
  ICE candidate endpoints (Local/Remote): 84.160.59.148:51820/3.73.3.142:51820
  Last connection update: 2024-03-12 19:51:00
  Last WireGuard handshake: 2024-03-12 20:46:43
  Transfer status (received/sent) 5.7 KiB/1.9 KiB
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.20.201
  Public key: drkPNR4V8ndQ535lmU+41zHpfwsq1/GS8gUIyc2uyEQ=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: true
  ICE candidate (Local/Remote): host/relay
  ICE candidate endpoints (Local/Remote): 127.0.0.1:51820/3.73.3.142:51820
  Last connection update: 2024-03-12 20:43:30
  Last WireGuard handshake: 2024-03-12 20:45:36
  Transfer status (received/sent) 216 B/648 B
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.106.204
  Public key: hpM/znyAKnHUJ40V6s6mrQJsMjHXVrdiJ6+cGl6v7SE=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: true
  ICE candidate (Local/Remote): srflx/relay
  ICE candidate endpoints (Local/Remote): 84.160.59.148:51820/18.157.58.205:51820
  Last connection update: 2024-03-12 19:51:03
  Last WireGuard handshake: 2024-03-12 20:47:26
  Transfer status (received/sent) 2.7 KiB/9.4 KiB
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.144.221
  Public key: DfjaBkVLNA9XiJpdqSpEO9//lKGwuHdkG1WVtAQpODo=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: true
  ICE candidate (Local/Remote): srflx/relay
  ICE candidate endpoints (Local/Remote): 84.160.59.148:51820/3.73.3.142:51820
  Last connection update: 2024-03-12 20:44:30
  Last WireGuard handshake: 2024-03-12 20:46:43
  Transfer status (received/sent) 5.7 KiB/1.9 KiB
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.162.129
  Public key: klrTfHLpQvDEQWppjJKVlC6skkT+I9ccGjkVnxG93GA=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: true
  ICE candidate (Local/Remote): host/relay
  ICE candidate endpoints (Local/Remote): 172.29.240.1:51820/18.157.58.205:51820
  Last connection update: 2024-03-12 19:51:02
  Last WireGuard handshake: 2024-03-12 20:46:30
  Transfer status (received/sent) 3.5 KiB/6.2 KiB
  Quantum resistance: false

Daemon version: 0.26.2
CLI version: 0.26.2
Management: Connected to https://api.netbird.io:443
Signal: Connected to https://signal.netbird.io:443
Relays:
  [stun:stun.netbird.io:5555] is Available
  [turns:turn.netbird.io:443?transport=tcp] is Available
FQDN: .netbird.cloud
NetBird IP: 100.85.215.26/16
Interface type: Userspace
Quantum resistance: false
Peers count: 4/5 Connected
```

on remote-workstation / ubuntu:

```
➜  ~ netbird status -d
Peers detail:
 .netbird.cloud:
  NetBird IP: 100.85.98.15/32
  Public key: sP0Ik/u/rzaGU65ueve8UnvI3rAgFhA9BbtFgR3o9wQ=
  Status: Disconnected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/srflx
  ICE candidate endpoints (Local/Remote): 3.73.3.142:12814/84.160.59.148:12814
  Last connection update: 2024-03-12 19:44:49
  Last WireGuard handshake: 2024-03-12 20:43:05
  Transfer status (received/sent) 968 B/1.2 KiB
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.20.201
  Public key: drkPNR4V8ndQ535lmU+41zHpfwsq1/GS8gUIyc2uyEQ=
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 192.26.175.25:51820/192.26.175.24:51820
  Last connection update: 2024-03-12 18:05:48
  Last WireGuard handshake: 2024-03-12 20:42:50
  Transfer status (received/sent) 15.1 KiB/13.8 KiB
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.106.204
  Public key: hpM/znyAKnHUJ40V6s6mrQJsMjHXVrdiJ6+cGl6v7SE=
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.26.175.25:51820/192.26.175.26:51820
  Last connection update: 2024-03-12 17:31:40
  Last WireGuard handshake: 2024-03-12 20:42:14
  Transfer status (received/sent) 20.1 KiB/14.7 KiB
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.162.129
  Public key: klrTfHLpQvDEQWppjJKVlC6skkT+I9ccGjkVnxG93GA=
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.26.175.25:51820/192.26.175.27:51820
  Last connection update: 2024-03-11 15:44:29
  Last WireGuard handshake: 2024-03-12 20:42:23
  Transfer status (received/sent) 172.6 KiB/139.4 KiB
  Quantum resistance: false

 .netbird.cloud:
  NetBird IP: 100.85.215.26
  Public key: i/mo6lL/Q5OJQqtDItLtYSOQTrsAtC3DwP78aayMX28=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/srflx
  ICE candidate endpoints (Local/Remote): 3.73.3.142:12814/84.160.59.148:12814
  Last connection update: 2024-03-12 20:43:00
  Last WireGuard handshake: 2024-03-12 20:43:05
  Transfer status (received/sent) 968 B/1.2 KiB
  Quantum resistance: false

Daemon version: 0.26.2
CLI version: 0.26.2
Management: Connected to https://api.netbird.io:443
Signal: Connected to https://signal.netbird.io:443
Relays:
  [stun:stun.netbird.io:5555] is Unavailable, reason: stun request: context deadline exceeded
  [turns:turn.netbird.io:443?transport=tcp] is Available
FQDN: .netbird.cloud
NetBird IP: 100.85.144.221/16
Interface type: Kernel
Quantum resistance: false
Peers count: 4/5 Connected
```
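For anyone comparing the two status dumps above, a quick way to see which peers are relayed rather than P2P is to parse the `Key: value` lines. This is only a rough sketch in Python: `parse_peers` is a made-up helper name, the sample text is a shortened copy of the output above, and it assumes each peer block starts with a `NetBird IP` line and that the daemon summary starts at `Daemon version`. It is not part of the NetBird CLI.

```python
def parse_peers(text):
    """Parse `Key: value` lines into one dict per peer.

    Assumption: a peer block begins at its 'NetBird IP' line and the
    peer list ends where the 'Daemon version' line starts.
    """
    peers, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Daemon version"):
            break  # everything after this describes the local daemon
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # headings such as 'Peers detail:' or '-- detail --'
        if key == "NetBird IP":
            current = {}
            peers.append(current)
        if current is not None:
            current[key] = value
    return peers


# Shortened sample taken from the status output above.
sample = """
Peers detail:
 .netbird.cloud:
  NetBird IP: 100.85.98.15/32
  Status: Disconnected
  Connection type: Relayed
 .netbird.cloud:
  NetBird IP: 100.85.20.201
  Status: Connected
  Connection type: P2P
Daemon version: 0.26.2
NetBird IP: 100.85.144.221/16
"""

peers = parse_peers(sample)
relayed = [p["NetBird IP"] for p in peers if p.get("Connection type") == "Relayed"]
print(relayed)
```

Running the same function over the full dumps would show that every peer is relayed from the Windows client's point of view, while the Ubuntu workstation reaches its office neighbours via P2P.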

Screenshots

stability of pings (ping from dev-pc1 to remote-workstation):

```
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 100.85.144.221: bytes=32 time=1796ms TTL=64
Reply from 100.85.144.221: bytes=32 time=32ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=33ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=50ms TTL=64
Reply from 100.85.144.221: bytes=32 time=73ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 100.85.144.221: bytes=32 time=1870ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=32ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=32ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=31ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 100.85.144.221: bytes=32 time=1024ms TTL=64
Reply from 100.85.144.221: bytes=32 time=42ms TTL=64
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Reply from 100.85.144.221: bytes=32 time=31ms TTL=64
Reply from 100.85.144.221: bytes=32 time=27ms TTL=64
Reply from 100.85.144.221: bytes=32 time=36ms TTL=64
```

screenshot: ![grafik](https://github.com/netbirdio/netbird/assets/21061035/dc55dc44-cb30-464e-abd6-b99cf4eadd25)
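To put a number on the "drops for about three pings" pattern instead of eyeballing it, the ping log can be summarized into bursts of consecutive timeouts. The following Python sketch assumes the Windows `ping` wording shown above (`Reply from ... time=NNms` and `Request timed out.`); `summarize_ping` is a hypothetical helper and the sample is a short excerpt from the log.

```python
import re

# Matches the round-trip time in a Windows ping reply line, e.g. "time=29ms" or "time<1ms".
REPLY = re.compile(r"time[=<](\d+)ms")


def summarize_ping(lines):
    """Return (replies, timeouts, bursts).

    `bursts` lists the length of every run of consecutive timeouts,
    in the order the runs appear in the log.
    """
    replies, timeouts, bursts, run = 0, 0, [], 0
    for line in lines:
        line = line.strip()
        if line.startswith("Request timed out"):
            timeouts += 1
            run += 1
        elif REPLY.search(line):
            replies += 1
            if run:
                bursts.append(run)  # a reply ends the current timeout run
                run = 0
    if run:
        bursts.append(run)  # log ended in the middle of an outage
    return replies, timeouts, bursts


# Short excerpt from the ping output above.
sample = """\
Reply from 100.85.144.221: bytes=32 time=26ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 100.85.144.221: bytes=32 time=1796ms TTL=64
Reply from 100.85.144.221: bytes=32 time=32ms TTL=64
""".splitlines()

replies, timeouts, bursts = summarize_ping(sample)
print(replies, timeouts, bursts)  # prints: 3 3 [3]
```

Saving the output of `ping -t` to a file and feeding its lines to the function makes it easy to compare burst lengths before and after changing a setting.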

Additional context

karstennilsen commented 7 months ago

Did you already try:

[Environment]::SetEnvironmentVariable("NB_ICE_DISCONNECTED_TIMEOUT_SEC", "10", "Machine")

This fixes the issue described above in our setup with Windows clients. I believe we are using 13 instead of 10 as the value.

See: https://github.com/netbirdio/netbird/issues/1195#issuecomment-1962321207

support-tt commented 7 months ago

We had the same problem with 1 client out of 8 and 2 servers. After setting the environment variable, the problem was solved. Very strange. Some more information: the client with the problem has no timeouts to the internet or when we use OpenVPN instead of NetBird. The client runs Windows 10.

pascal456 commented 7 months ago

thanks @support-tt and @karstennilsen for the input. I tried it out.

I set this on the Windows client by adding the setting to C:\ProgramData\Netbird\config.json, but it made the situation much worse:

PS C:\Windows\System32> ping zm2ws1 -t -4

Pinging zm2ws1.netbird.cloud [100.85.144.221] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 100.85.144.221: bytes=32 time=783ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=28ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=33ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=31ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 100.85.144.221: bytes=32 time=635ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=28ms TTL=64
Reply from 100.85.144.221: bytes=32 time=28ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Reply from 100.85.144.221: bytes=32 time=28ms TTL=64
Reply from 100.85.144.221: bytes=32 time=30ms TTL=64
Reply from 100.85.144.221: bytes=32 time=28ms TTL=64
Reply from 100.85.144.221: bytes=32 time=28ms TTL=64
Reply from 100.85.144.221: bytes=32 time=29ms TTL=64
Request timed out.
Request timed out.

Did you set this on the connecting client only, or also on the remote host? And as far as I understood, this is only an issue between Windows clients; is that correct in your case? I am connecting to an Ubuntu Server machine and am wondering whether that makes a difference.

Fantu commented 7 months ago

A connection is established for each peer, so when a connection between 2 peers has continuous disconnections you need to try to increase the timeout on the 2 peers. However, if the internet connection of one or both peers is too unstable or there are routers/firewalls in the route that close the connections, I don't think this setting can help. On the contrary, the time before reconnection may increase in some cases.

pascal456 commented 7 months ago

When you do it right, it works. Sorry for the confusion. The setting did not worsen the situation: I had actually put the setting into the settings file, and doing that with the leading NB prefix seems to be wrong. At first the client did not start at all because the settings file was corrupted. Then I set it with

[Environment]::SetEnvironmentVariable("NB_ICE_DISCONNECTED_TIMEOUT_SEC", "10", "Machine")

but forgot to restart the service (I confused the service with the client / UI agent). Restarting the UI agent is not sufficient. So running

netbird service restart

on the Windows machine did the trick.
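Since the confusion here was whether the variable had really been applied, a small sanity check can help before restarting the service. The Python sketch below only inspects the environment of the process it runs in, so it should be started from a shell opened after the machine-scope variable was set. `read_timeout` is a hypothetical helper, and treating only positive integers as valid is an assumption, not documented NetBird behavior.

```python
import os

VAR = "NB_ICE_DISCONNECTED_TIMEOUT_SEC"


def read_timeout(env=os.environ):
    """Return the timeout in seconds, or None if unset or not a positive integer.

    Assumption: only positive whole seconds are meaningful values.
    """
    raw = env.get(VAR, "").strip()
    try:
        seconds = int(raw)
    except ValueError:
        return None  # unset, empty, or not a number
    return seconds if seconds > 0 else None


value = read_timeout()
print(f"{VAR} = {value if value is not None else 'not set (or invalid)'}")
```

This does not prove that the running service has picked the value up. A process only sees environment changes made before it started, which is why restarting the service (not just the UI agent) was needed.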

I am wondering @karstennilsen, where did you get that setting from? I cannot find anything on it in the docs?

mlsmaycon commented 7 months ago

Thanks for the support @Fantu and @karstennilsen.

@pascal456 the flag is not part of the agent's command, but we will add this as default in the client and have it as a flag too.

Fantu commented 7 months ago

I think it would probably be good to change the ICE disconnect timeout to 10 sec by default. The pros seem to outweigh the cons to me, or am I wrong?

pascal456 commented 7 months ago

For my part, this ticket can be closed because I have a solution to the problem for now. Do you want it to stay open for tracking or do you want to open a separate one @mlsmaycon ?