netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License
10.5k stars 472 forks source link

high packet loss from continuous peer reconnections in some cases with windows to windows connections #1195

Open Fantu opened 11 months ago

Fantu commented 11 months ago

Hi, from a windows 10 pro device with netbird 0.23.7 connecting to a windows 10 pro I saw frequent freeze of rdp session and a ping packet lost during these problems. I updated destination device to netbird 0.23.7 but problem persist and 12% of packet lost

Statistiche Ping per 100.117.66.12:
    Pacchetti: Trasmessi = 49, Ricevuti = 43,
    Persi = 6 (12% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
    Minimo = 16ms, Massimo =  1717ms, Medio =  59ms

I tried connetting to a linux device and no issue or packet lost detected

I tried connecting to windows 2019 standard, netbird 0.23.7 and same issue, even if less frequest still not usable and 7% packet lost

Statistiche Ping per 100.117.119.201:
    Pacchetti: Trasmessi = 107, Ricevuti = 99,
    Persi = 8 (7% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
    Minimo = 15ms, Massimo =  3390ms, Medio =  75ms

I tried to check the client.log but there are only these lines on the test (and issues) time range:

windows 10 pr source

2023-10-05T15:43:51+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:54500
2023-10-05T15:45:41+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:57599
2023-10-05T15:46:01+02:00 INFO client/internal/peer/conn.go:337: connected to peer dbrzwmy0grg6A18e4h4QPG32b/g2B68boB5SZ1lwPEo=, endpoint address: 127.0.0.1:55932
2023-10-05T15:46:17+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:54166
2023-10-05T15:46:49+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:53421
2023-10-05T15:48:08+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:58581
2023-10-05T15:48:44+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:58032
2023-10-05T15:49:39+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:50940
2023-10-05T15:50:04+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:64547
2023-10-05T15:50:47+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:52384
2023-10-05T15:51:50+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:64640
2023-10-05T15:52:22+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:56654
2023-10-05T15:52:52+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:60795
2023-10-05T15:53:45+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:54856
2023-10-05T15:54:05+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:57528
2023-10-05T15:54:44+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:56627
2023-10-05T15:55:04+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:56431
2023-10-05T15:55:24+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:59815
2023-10-05T15:58:08+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:62078
2023-10-05T15:58:46+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:53470
2023-10-05T16:00:08+02:00 INFO client/internal/peer/conn.go:337: connected to peer gTjUAU9z/zwMkmd9iaIY/6UDPL3NgyYIZSq9Bb8kNGY=, endpoint address: 127.0.0.1:51862

windows 2019 destination

2023-10-05T15:43:49+02:00 INFO client/internal/peer/conn.go:337: connected to peer dbrzwmy0grg6A18e4h4QPG32b/g2B68boB5SZ1lwPEo=, endpoint address: 192.168.1.170:51820
2023-10-05T15:43:51+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50782
2023-10-05T15:44:04+02:00 INFO client/internal/peer/conn.go:337: connected to peer QEpV+NKRdFU82Sw/wBvCVn2AOxyEVUilicveHQ6kqgc=, endpoint address: 192.168.1.180:50248
2023-10-05T15:45:42+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50205
2023-10-05T15:46:18+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50826
2023-10-05T15:46:50+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50349
2023-10-05T15:48:08+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50572
2023-10-05T15:48:45+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49612
2023-10-05T15:49:39+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49890
2023-10-05T15:50:05+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50210
2023-10-05T15:50:47+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50968
2023-10-05T15:51:50+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49554
2023-10-05T15:52:22+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49404
2023-10-05T15:52:53+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50642
2023-10-05T15:53:46+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49490
2023-10-05T15:54:05+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50317
2023-10-05T15:54:44+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50758
2023-10-05T15:55:04+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49770
2023-10-05T15:55:24+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50484
2023-10-05T15:58:09+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:50704
2023-10-05T15:58:46+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49220
2023-10-05T16:00:08+02:00 INFO client/internal/peer/conn.go:337: connected to peer 659jTAbdpL8qPoe8nwvmu3Z5zJRdUd46KFr+t0FOyUA=, endpoint address: 192.168.1.180:49619

I don't know if there is something useful for you if you need more info tell me how to debug and provide useful data

thanks for any reply and sorry for my bad english

mlsmaycon commented 11 months ago

Hello @Fantu can you share more details about your NetBird configuration? Do you have network routes involved and what are the networks the two peers are part of?

Fantu commented 11 months ago

@mlsmaycon thanks for reply, netbird server is selfhosted, there are no network routes on netbird configuration and the devices of the tests have 2 groups ("All" and "M2R") on access control page I see: "M2R"<->"M2R" protocol ALL ports ALL the Default related to "All"<->"All" is disabled instead sorry if I have almost no knowledge of netbird and have only done a few quick things when my boss told me to install and then 2 times (including this one) where he reported problems to fix tell me if the info provided is wrong or incomplete

mlsmaycon commented 11 months ago

Thanks @Fantu , we see a lot of reconnection in your logs. Can you share more details about the network of both machines? Are they in the same office, data center, or cloud?

Regarding the connection, they also seem to be using relay, can you ensure that the server where you are running NetBird services, have the following ports accessible from the internet:

Fantu commented 11 months ago

netbird server and destination devices are in the same LAN on office and here my first question: why these reconnections on LAN connections which in theory should have no problems? source device is in different network on my home and use internet, anyway both internet connection works correctly and using a normal wireguard vpn to office network instead of netbird I can work on same devices without any issue and 0 packet lost

netbird server seems have all ports ok FWIK and with port forwarding on NAT of gateway. there was an issue on custom signal port on start but was solved with https://github.com/netbirdio/netbird/commit/8d18190c9412268aa0279a4e89d4e02cbd1bdb46 about coturn range is custom but range in coturn configuration is correct and udp ports list viewed with "ss" command are all in the correct range I don't know a way to test and check are really working correctly, is there?

anyway I have another question: connection with coturn should be only the relay one when direct connection is not possible or on start when direct connection are not started and after should use direct connections between devices or am I wrong? at least take a quick look at the documentation but from the logs I don't seem to see direct connections between the devices and I also don't understand why in the source device logs there is always the loopback IP as an endpoint unlike the devices in the LAN which have the IP of the netbird server or of other devices. shouldn't there be the internet IP of the netbird server instead on the source device? is there a cli command to view all devices, the active connections and their details (or if are missed connection) or some other useful command for debugging? for example of have a small experience with zerotier and on it there is zerotier-cli peers useful for similar thing

thanks for any reply and sorry for my bad english

Fantu commented 11 months ago

I found the cli command to check peers connection status netbird status -d --filter-by-ips ... and I did some tests with both devices (source and destination) on LAN (or connected with other stable VPN) there is P2P connetion stable

Peers detail:
 administration.netbird.selfhosted:
  NetBird IP: 100.117.66.12
  Public key: xxx
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): srflx/host
  Last connection update: 2023-10-09 12:22:54

connection from remote instead have a relay connection that continue to disconnect/reconnect:

Peers detail:
 administration.netbird.selfhosted:
  NetBird IP: 100.117.66.12
  Public key: xxx
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/host
  Last connection update: 2023-10-09 14:38:31

...

Peers detail:
 administration.netbird.selfhosted:
  NetBird IP: 100.117.66.12
  Public key: xxx
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/host
  Last connection update: 2023-10-09 14:39:06

...

Peers detail:
 administration.netbird.selfhosted:
  NetBird IP: 100.117.66.12
  Public key: xxx
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/host
  Last connection update: 2023-10-09 14:41:19

for the brief moments it remains connected it seems works well and packets seems are lost only during disconnections Is it perhaps that it keeps trying to connect in P2P but fails and goes back to relay or are there other problems? If you need further information/tests let me know

EDIT: from a fast look netbird use ice for manage peers connections, update ice from 2.3.1 to 2.3.11 which include many bugfixes can be useful to fix something and probably also this?

Fantu commented 10 months ago

I did some tests with custom build with updated ICE (https://github.com/M2Rbiz/netbird/commit/3b00ee9a4241b26f313a6cdfcef8d7891a4a7809) but issue persist :( issue seems happen only when one or both peer is inside a network with symmetrical nat (pfsense, freebsd don't support fullcone for now https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219803) and double nat (the router is fritzbox that have fullcone nat)

I suppose other users have symmetrical nat, nobody saw this issue before?

on major of cases netbird is more stable that zerotier and faster to connect/reconnect between peers except this case where with netbird is unusable (for reconnect "loop") and with zerotier even if with packet loss, and occasional rdp disconnect is enough usable

at the moment I don't know what else to do and my boss seems willing to abandon the use of netbird any advice is appreciated thanks for any reply and sorry for my bad english

Fantu commented 10 months ago

here a client log debug on a peer with the issue: https://paste.debian.net/1295455/

netbird status -d --filter-by-ips 100.117.66.12
Peers detail:
 administration.netbird.selfhosted:
  NetBird IP: 100.117.66.12
  Public key: dbrzwmy0grg6A18e4h4QPG32b/g2B68boB5SZ1lwPEo=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/host
  Last connection update: 2023-10-18 18:44:52 

peer dbrzwmy0grg6A18e4h4QPG32b/g2B68boB5SZ1lwPEo= is one with reconnect loop issue

I don't understand what the problem could be looking the logs, maybe I don't have enough experience, does anyone have any ideas/advice please?

Fantu commented 10 months ago

From a fast test with managed netbird service on same device the issue don't happen So seems happen only when one windows peer in lan with netbird server connect to external peer. I also saw the other cases, using the same selfhosted netbird server, where windows peer of one external network connect to peers of other external network don't have the issue and even if stay in relay is stable. I'll probably do more tests next week to be sure.

small question OT: I tried managed netbird for this fast test and I saw only peers list and is only possible add peers with login, I didn't saw these feature listed in free plan: Access controls, Private DNS, Network routes, Management activity logging I suppose the cause is my user result with role "user" and not possible to change to admin should not be set "admin", as the first user of the selfhosted?

mlsmaycon commented 10 months ago

Hello @Fantu it should be set as admin, but is very likely that there is an admin account for your domain already.

Can you reach out on slack so we can discuss your case in detail?

Fantu commented 10 months ago

thanks for the reply, there is another user with the same domain who had signed up before and there were no peers because he only had that signup/login by mistake. I would recommend adding something in the managed service interface to make sure that cases like these are seen/understood

returning in topic: from the first tests this morning with the same selfhost server adding peers from another external network with pfsense (therefore nat symmetrical and double like the office) and was in relay without problems it seemed confirmed that the problem was with windows peer in lan with netbird server. instead then trying from other peers from 2 other networks (both always w10) towards an office peer (with which we had problems) in those cases there were no problems and one also has a "worse" wifi connection instead of vdsl or FTTH but he had no disconnections, no packet loss but only some rare packets with latency of about 1 second so at the moment I have no idea what other tests to do and under what conditions this problem occurs

teoder commented 9 months ago

Having the exact same issue. A bunch of Windows PCs scattered behind various Firewalls and they are disconnecting/reconnecting frequently. We basically only use one Peer as exit node for a bunch of services and have a self-hosted Management/Admin server with a Public IP and the required port-openings.

If I take my laptop and use LTE-connection without any NAT and get status "Direct: true" it works without any issues. I even get P2P/True to some devices, so the issues seems to be related to having NAT on both source and destination.

I'll keep troubleshooting and update if I have any findings that might help you out.

Fantu commented 8 months ago

About the issue in my office it seems to be related to the netbird selfhost server (more exactly issue seems related to coturn) in a network with pfsense behind another nat and without access to a public ip. using an external coturn with single nat with public IP and the managed service there were no problems.

After my employer has started to put netbird using the managed service on some devices of some customers (I think he will want to completely replace zerotier with that) and problems of frequent disconnections have started on some windows devices even using the managed service.

After some debugging attempts without being able to find the cause and in which the netbird debug logs did not give any details regarding the disconnection or possible problems to work to try to resolve or improve yesterday thanks to the help of Pascal Fisher I finally managed to find something which could be useful. Initially try to enable ICE debug logs was difficult because trying $env:NB_LOG_LEVEL="debug"; $env:PIONS_LOG_DEBUG="all"; netbird service run | Tee-Object -FilePath "disconnect.logs" was not working, as seems don't possible run netbird in foreground like with linux. I tried to add env. variable also in the windows service with regedit and in other way but netbird client.log was always without ICE debug. Pascal Fisher found a solution hardcoding it in custom build I after used for debug on peers with disconnect issue.

Here a cut from one ICE debug log with strange thing spotted after disconnect:

ice WARNING: 2023/12/21 15:54:07 Discard message from (udp4 relay 35.234.101.101:58033 related 192.168.168.110:54256), unknown TransactionID 0x8f16f56aba969acc059bffeb
turnc DEBUG: 15:54:07.929646 allocation.go:132: Refresh timer 1 expired
turnc DEBUG: 15:54:07.945809 allocation.go:173: Set new nonce with 16 bytes
turnc DEBUG: 15:54:07.945809 allocation.go:56: Refresh allocation: 438, got new nonce.
turnc DEBUG: 15:54:07.971915 allocation.go:127: Refresh permissions successful
ice WARNING: 2023/12/21 15:54:11 Discard message from (udp4 relay 35.234.101.101:58033 related 192.168.168.110:54256), unknown TransactionID 0x13a8d2430fae1880ea81004d
ice INFO: 2023/12/21 15:54:17 Setting new connection state: Disconnected
turnc DEBUG: 15:54:17.797803 allocation.go:77: Send refresh request (dontWait=true)
ice WARNING: 2023/12/21 15:54:17 Failed to read from candidate udp4 relay 35.234.101.101:37089 related 192.168.178.27:64981: read udp 35.234.101.101:37089: use of closed network connection
turnc DEBUG: 15:54:17.798374 allocation.go:84: Refresh request sent
turnc DEBUG: 15:54:17.798374 client.go:177: Failed to read: read tcp4 192.168.178.27:64981->35.234.101.101:443: use of closed network connection. Exiting loop
ice INFO: 2023/12/21 15:54:17 Setting new connection state: Closed
ice DEBUG: 15:54:25.123955 agent.go:397: Started agent: isControlling? true, remoteUfrag: "OicdTxJdOSmLLGcq", remotePwd: "mwlAbQuRQYPTcrZReOiJwwAWuPJZXLzw"
ice INFO: 2023/12/21 15:54:25 Setting new connection state: Checking
ice WARNING: 2023/12/21 15:54:25 Failed to ping without candidate pairs. Connection is not possible yet.
ice DEBUG: 15:54:25.147639 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.147639 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.180758 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.180758 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.182850 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.182850 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.221630 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.221630 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.262682 gather.go:403: Failed to resolve STUN host: turn.netbird.io:443: lookup turn.netbird.io: no such host
ice DEBUG: 15:54:25.262682 gather.go:403: Failed to resolve STUN host: stun.netbird.io:5555: lookup stun.netbird.io: no such host
ice DEBUG: 15:54:25.284471 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.284471 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.284471 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.284471 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true
ice DEBUG: 15:54:25.315599 agent.go:817: Ignore duplicate candidate: udp4 srflx <hiddenip>:51820 related 0.0.0.0:51820
turnc DEBUG: 15:54:25.364673 client.go:119: Resolved TURN server turn.netbird.io:443 to 35.198.143.10:443
turnc DEBUG: 15:54:25.411742 udp_conn.go:66: Initial lifetime: 600 seconds
turnc DEBUG: 15:54:25.411742 udp_conn.go:81: Started refresh allocation timer
turnc DEBUG: 15:54:25.411742 udp_conn.go:84: Started refresh permission timer
turnc DEBUG: 15:54:25.462244 udp_conn.go:438: Channel binding successful: 192.168.168.115:51820 16384
ice INFO: 2023/12/21 15:54:25 Failed to send packet: CreatePermission error response (error 403: Forbidden IP)
turnc DEBUG: 15:54:25.509529 udp_conn.go:438: Channel binding successful: 192.168.168.110:51820 16385
turnc DEBUG: 15:54:25.533151 udp_conn.go:438: Channel binding successful: <hiddenip>:61335 16386
ice DEBUG: 15:54:25.533698 agent.go:1072: Inbound STUN message: isControlling && a.isControlling == true

the strange thing found is:

ice DEBUG: 15:54:25.262682 gather.go:403: Failed to resolve STUN host: turn.netbird.io:443: lookup turn.netbird.io: no such host
ice DEBUG: 15:54:25.262682 gather.go:403: Failed to resolve STUN host: stun.netbird.io:5555: lookup stun.netbird.io: no such host

resolution on windows works correctly, I also tried to change windows dns server and restart netbird but the error still happen. I tried for example to keep only 8.8.8.8 (google) as dns server and keep active ping on it and turn.netbird.io and was both without packet lost when netbird peer disconnect and have same error in ICE debug log. On linux peers disconnect issue is rare, but I tried to debug also it with: sudo bash -c "NB_LOG_LEVEL=debug PIONS_LOG_DEBUG=all netbird service run" | tee disconnect.logs and on one ubuntu 20.04 after a disconnect I had the same resolve error.

I don't know if this problem is linked to that of disconnections, which are a significant problem on some Windows peers, or just causes a possible delay in the connection/reconnection between the peers. Is there an issue with netbird internal dns resolution?

Any answer and advice are welcome. If there are other debugging possibilities that might be useful that I don't know about let me know please.

Fantu commented 8 months ago

Working ICE debug method on Windows, thanks to @mlsmaycon :

So the netbird.out.log will contain netbird debug log and ICE trace logs

shyer commented 7 months ago

I have the same problem, I change NB_ICE_DISCONNECTED_TIMEOUT_SEC to 13,solved this problem. on centos: vi /etc/sysconfig/netbird

# NB_LOG_LEVEL=debug
# PION_LOG_DEBUG=all
NB_ICE_DISCONNECTED_TIMEOUT_SEC=13

i think iceDisconnectedTimeoutDefault should be more than iceKeepAliveDefault * 2:

if iceDisconnectedTimeoutDefault < iceKeepAliveDefault * 2

ping -> lost, -> disconnected -> netbird clean peer and reconnect

if iceDisconnectedTimeoutDefault > iceKeepAliveDefault * 2:

ping -> lost, ping -> lost, -> disconnected -> netbird reconnect ping -> lost, ping -> pong: no disconnected, netbird ok <--

https://github.com/netbirdio/netbird/blob/main/client/internal/peer/env_config.go#L14

    envICEKeepAliveIntervalSec   = "NB_ICE_KEEP_ALIVE_INTERVAL_SEC"
    envICEDisconnectedTimeoutSec = "NB_ICE_DISCONNECTED_TIMEOUT_SEC"

https://github.com/netbirdio/netbird/blob/main/client/internal/peer/conn.go#L27

    iceKeepAliveDefault           = 4 * time.Second
    iceDisconnectedTimeoutDefault = 6 * time.Second # <-- may be 10  * time.Second ( 2.5 * iceKeepAliveDefault)

https://github.com/pion/ice/blob/master/agent_config.go#L21

    // keepaliveInterval used to keep candidates alive
    defaultKeepaliveInterval = 2 * time.Second

    // defaultDisconnectedTimeout is the default time till an Agent transitions disconnected
    defaultDisconnectedTimeout = 5 * time.Second

https://github.com/pion/ice/blob/master/agent.go#L631

    case a.disconnectedTimeout != 0 && disconnectedTime > a.disconnectedTimeout:
        a.updateConnectionState(ConnectionStateDisconnected)
Fantu commented 7 months ago

Thanks for the information, I did some tests with keepalive lower without result, but I hadn't thought carefully about the disconnect timeout. I tried today, starting with NB_ICE_DISCONNECTED_TIMEOUT_SEC to 30 on one case with continuous nearby disconnections, setting it on both "source" and "destination" peer and solved. After I tried NB_ICE_DISCONNECTED_TIMEOUT_SEC to 10 which seems more reasonable to not increase the reconnection time too much in case the connection actually needs to be re-established and also with that 0 disconnection and 0 packet lost in for the test period. Unfortunately, however, I have not yet been able to test with the ongoing problem on some networks which are occasionally very problematic and I have some doubts whether increasing the disconnection timeout will resolve.

Grimbart commented 6 months ago

i got the same problem... any news how to fix it?

or can anyone discribe me how to add the "NB_ICE_DISCONNECTED_TIMEOUT_SEC" Var?

Fantu commented 6 months ago

For example, if you have frequent disconnections between 2 peer windows, on both peers open PowerShell as administrator and do: [Environment]::SetEnvironmentVariable("NB_ICE_DISCONNECTED_TIMEOUT_SEC", "10", "Machine") After restart netbird service:

netbird service stop
netbird service start

After you will try, can you write if increase disconnect timeout to 10 seconds has solved your problem, please?

Grimbart commented 6 months ago

Only the Period between the interrupts increase, but its not completely gone. I tried multiple Values on both Side from 5 to 30. Also I tried a Debian 12 that connect to the Windows Server and there is no Interruption.

My current Setup: I use Windows 11 (22H2) that connects to a Windows Server 2202 (21H2) over RDP. The Netbird Instance is Self-Hosted on an ESXi-Hosts that got an OPNSense as a Firewall to NAT the Inbound needed Ports.

Fantu commented 6 months ago

Probably is the same other issue I have on my office using netbird selfhosted server in LAN behind pfsense, the problem seems double nat and with only 1 public ipv4 I can't assign one on pfsense. Increase timeout decreased the issue, specify external-ip in the turnserver.conf seems also helped.

Grimbart commented 6 months ago

its not only 1 public IP-Address, but in the same subnet range...

the office network send the information e.g. 12.23.34.55 (my windows 11 machine + Netbird Instance) over a Palo Alto Firewall to the e.g. 12.23.34.56 (windows server) that has the OPNSense... i will try later in the evening to modify the turnserver.conf.

webash commented 1 month ago

@Fantu did you make any progress here? I'm having some terribly annoying inconsistencies trying to stay connected to a remote Linux box behind a NAT'd 4G connection from my Windows 11 PC. Are we saying that the connection lifetime is too short on Windows to properly handle navigating NAT delays?

kgiannandrea commented 1 month ago

@Fantu did you make any progress here? I'm having some terribly annoying inconsistencies trying to stay connected to a remote Linux box behind a NAT'd 4G connection from my Windows 11 PC. Are we saying that the connection lifetime is too short on Windows to properly handle navigating NAT delays?

Hey @webash , we have had good results setting the following environment variable on the client machines

Variable: NB_ICE_DISCONNECTED_TIMEOUT_SEC Value: 10

This can be set in powershell using

webash commented 1 month ago

Do you know how to set this on Linux, too? I think its the sporadic nature of the 4G connection, but being the target that is the issue; so I think its disconnecting too often when actually it should just be a bit more patient. In some cases I seem to be able to keep an SSH session open to do a bunch of stuff, other times it drops rapidly.

Fantu commented 1 month ago

Set disconnect timeout to 10 seconds solve the problem? For Windows device, see here: https://github.com/netbirdio/netbird/issues/1195#issuecomment-1962321207 On Linux device I didn't remember if export of the environment variable and restart of netbird is working, I only remember I used env. variable only on manual foreground testing that can be different from the service. You can do a test in foreground if the issue is fast to reproduce, take also more detailed log using PION_LOG_TRACE=all probably can help developers in case of different issue.

About change it by default in netbird (from 6 to 10 https://github.com/netbirdio/netbird/blob/main/client/internal/peer/conn.go#L29) need to be sure it does not introduce worsening but only improvements in some cases as seen

webash commented 1 month ago

Thanks for the pointer to the part in the code that shows the defaults! That also lead me to this part of the documentation that describes how to set the environment variables for the Linux netbird daemon: https://docs.netbird.io/how-to/troubleshooting-client#on-linux-with-systemd

Which I did such, which seems to have improved reliability so far, but I think the 4G connection is less constrained at this time of night.

sudo mkdir -p /etc/sysconfig
echo 'NB_ICE_KEEP_ALIVE_INTERVAL_SEC=1' | sudo tee -a /etc/sysconfig/netbird
echo 'NB_ICE_DISCONNECTED_TIMEOUT_SEC=10' | sudo tee -a /etc/sysconfig/netbird
sudo systemctl restart netbird