netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Incorrect `status` reported with `netbird status -d` #1810

Open soakes opened 7 months ago

soakes commented 7 months ago

Today I experienced a strange issue with netbird version 0.27.0 on Debian 12. During the night several of the VPN links went offline (as can be seen below), and this morning the links were dead. What's interesting, however, is that netbird status -d shows them as connected when they are actually offline.

The time right now is Sun Apr 7 07:42:49 UTC 2024, which is when the snapshot below was taken. As you can see, the last WireGuard handshake was around 2024-04-07 01:06:23. Considering that the keepalive interval should be around 10-60 seconds, if I recall correctly, the status of the link should say it is down. I confirmed that the VPN endpoint (netbird.cloud address) could not be reached.
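To make the gap concrete, here is a minimal Python sketch that computes how old that handshake is at the time of the snapshot. The function name and the timestamp format string are assumptions based on how the timestamps appear in this report, not part of netbird itself.

```python
from datetime import datetime

# Timestamp layout as it appears in the `netbird status -d` output in this report
# (assumed, not taken from netbird documentation).
STATUS_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def handshake_age_seconds(last_handshake: str, now: str) -> float:
    """Return the number of seconds between the last handshake and `now`."""
    then = datetime.strptime(last_handshake, STATUS_TIME_FORMAT)
    current = datetime.strptime(now, STATUS_TIME_FORMAT)
    return (current - then).total_seconds()


age = handshake_age_seconds("2024-04-07 01:06:23", "2024-04-07 07:42:49")
print(f"handshake age: {age:.0f} s (~{age / 3600:.1f} h)")
# prints: handshake age: 23786 s (~6.6 h)
```

A handshake more than six hours old is far outside any keepalive interval in the 10-60 second range, which is why the Connected status looks wrong.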

Restarting netbird brought all the links up correctly, and the status is no longer lying (it says connected, and that is now true).

I believe netbird sent an update to the client which, for whatever reason, it couldn't carry out, and then gave up trying without even attempting to restart the link.

I believe that if the endpoint addresses (netbird IP) can't be reached, then the status should change from connected to something else (offline maybe?). This will allow monitoring to be done on the links and restart accordingly.

It might also be an idea to adjust netbird so that it tries to restart links that have been down for a while, as in most cases this is all that's required.

To Reproduce

I am not sure how you could reproduce this, as I am not sure why the links failed. My theory is that you could probably set up a firewall rule to drop the packets, simulating the line being down, and then check whether the status has updated.

Expected behavior

I would expect the status of the links to appear as down (offline). Ideally they would also be restarted, up to X retries, as by the looks of it netbird doesn't even try when there is a glitch in the network.

Are you using NetBird Cloud?

I am using NetBird Cloud control plane which is deployed as of Sun 7th April 2024.

NetBird version

netbird version 0.27.0

NetBird status -d output:

Peers detail:
 xxxx.netbird.cloud:
  NetBird IP: 100.xx.xx.130
  Public key: DRh4eyiGdfy**********************
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 10.xx.xx.254:51820/159.xxx.xxx.10:51820
  Last connection update: 2024-04-04 01:08:51
  Last WireGuard handshake: 2024-04-07 01:06:23
  Transfer status (received/sent) 222.1 GiB/80.2 GiB
  Quantum resistance: true
  Routes: 10.xx.xx.xx/16
  Latency: 6.120067ms

 xxx.netbird.cloud:
  NetBird IP: 100.xx.xx.255
  Public key: PVcBjP7Wro*******************
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 10.xx.xx.254:51820/185.xxx.xxx.218:51820
  Last connection update: 2024-04-06 19:09:51
  Last WireGuard handshake: 2024-04-07 01:04:01
  Transfer status (received/sent) 5.7 MiB/6.8 MiB
  Quantum resistance: true
  Routes: -
  Latency: 8.975109ms

 xxxx.netbird.cloud:
  NetBird IP: 100.xx.xx.156
  Public key: TLj1K0BtAV************************
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): srflx/srflx
  ICE candidate endpoints (Local/Remote): 45.xx.xx.213:37386/90.xx.xx.142:37386
  Last connection update: 2024-04-06 06:48:53
  Last WireGuard handshake: 2024-04-06 08:41:53
  Transfer status (received/sent) 7.9 MiB/13.0 MiB
  Quantum resistance: true
  Routes: 10.xx.xx.0/16
  Latency: 8.165065ms

I have more peers but the above snippet should be enough for an example (two offline, one online). All should be online, several are, several are not.
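As a rough illustration of the check I have in mind, the Python sketch below parses output shaped like the snippet above and lists peers that are reported as Connected even though their last handshake is old. The function names, the 180-second threshold, and the assumption that every peer header ends in `.netbird.cloud:` are mine and chosen only for illustration; an idle tunnel without keepalives could legitimately show an older handshake.

```python
from datetime import datetime

STATUS_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the output in this report

# Trimmed copy of two peers from the snapshot above.
SAMPLE = """\
Peers detail:
 xxxx.netbird.cloud:
  NetBird IP: 100.xx.xx.130
  Status: Connected
  Last WireGuard handshake: 2024-04-07 01:06:23

 xxx.netbird.cloud:
  NetBird IP: 100.xx.xx.255
  Status: Connected
  Last WireGuard handshake: 2024-04-07 01:04:01
"""


def parse_peers(status_text):
    """Turn the 'Peers detail' section into a list of dicts, one per peer."""
    peers = []
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.endswith(".netbird.cloud:"):
            # A new peer block starts; drop the trailing colon from the name.
            peers.append({"name": line[:-1]})
        elif peers and ": " in line:
            key, value = line.split(": ", 1)
            peers[-1][key] = value
    return peers


def stale_connected_peers(peers, now, max_age_seconds=180):
    """Names of peers shown as Connected whose handshake is older than the limit."""
    flagged = []
    for peer in peers:
        if peer.get("Status") != "Connected":
            continue
        last = datetime.strptime(peer["Last WireGuard handshake"], STATUS_TIME_FORMAT)
        if (now - last).total_seconds() > max_age_seconds:
            flagged.append(peer["name"])
    return flagged


snapshot_time = datetime(2024, 4, 7, 7, 42, 49)
print(stale_connected_peers(parse_peers(SAMPLE), snapshot_time))
```

At the snapshot time both sample peers are flagged, which matches the mismatch described in this report.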

Screenshots

No screenshot necessary; it's the netbird status -d output that needs adjusting.

soakes commented 7 months ago

Update

I have done some testing, and if you shut down netbird on the other side of the link, the status does get updated. However, if you simulate a drop using iptables -I OUTPUT -d 100.xxx.xxx.155/32 -j DROP and confirm it's down with a ping, the status doesn't get updated.

Status is updated correctly if the client on the other side of the tunnel has been shut down

 xxxx.netbird.cloud:
  NetBird IP: 100.xxx.xxx.155
  Public key: eUH/Juw9vnLD**************
  Status: Disconnected
  -- detail --
  Connection type: P2P
  Direct: false
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 10.xxx.xxx.254:51820/90.xxx.xxx.142:51820
  Last connection update: 2024-04-07 08:41:40
  Last WireGuard handshake: 2024-04-07 08:41:12
  Transfer status (received/sent) 404.0 MiB/25.2 MiB
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 8.461006ms

Status of client with iptables drop rule added to simulate link down. Client status doesn't get updated.

 xxx.netbird.cloud:
  NetBird IP: 100.xxx.xxx.155
  Public key: eUH/Juw9vn*****
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/srflx
  ICE candidate endpoints (Local/Remote): 10.xx.xx.254:51820/195.xx.xx.155:51820
  Last connection update: 2024-04-07 08:42:18
  Last WireGuard handshake: 2024-04-07 08:48:59
  Transfer status (received/sent) 41.8 KiB/30.1 KiB
  Quantum resistance: true
  Routes: -
  Latency: 8.302137ms
# date
Sun Apr  7 08:52:50 UTC 2024

So it seems the netbird client only updates the status if the client on the other side is actually offline (i.e. machine offline); if there are connection issues, it's not updated correctly.

IMHO, the status should be updated if you can't reach the other side (icmp to the netbird ip).

I believe the keepalive ICMP packet response could be used to determine the status: if the keepalive hasn't been received or replied to, then the link must be down.

Another interesting thing I've just noticed is that after you remove the iptables drop rule, the VPN doesn't fix itself. I would have expected it to recover on its own. The only way to recover it is to restart the netbird client.

I believe two issues are in play here: first, the netbird status command isn't updated to reflect the true status of the client, and second, netbird isn't trying to restart the links when connectivity recovers.

# ping 100.xx.xx.155
PING 100.xx.xx.155 (100.xx.xx.155) 56(84) bytes of data.
^C
--- 100.xx.xx.155 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1027ms

# systemctl restart netbird

# ping 100.xx.xx.155
PING 100.xx.xx.155 (100.xx.xx.155) 56(84) bytes of data.
64 bytes from 100.xx.xx.155: icmp_seq=1 ttl=64 time=8.13 ms
64 bytes from 100.xx.xx.155: icmp_seq=2 ttl=64 time=16.7 ms
^C
--- 100.92.209.155 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 8.128/12.437/16.746/4.309 ms

mlsmaycon commented 7 months ago

Hello @soakes, thanks for sharing the status command outputs. It seems like we have a bug preventing status updates. As we got a successful WireGuard handshake, something must be influencing the connectivity between the nodes.

To troubleshoot the case, can you confirm the following?

  1. From your output I can see that rosenpass is enabled on the remote node. Can you confirm whether it is enabled on the local node too, and whether rosenpass is running in permissive mode on either side of the connection?
  2. Do you have any exit routes (0.0.0.0/0) enabled?
  3. Can you confirm the version of all nodes? Can you test with the latest 0.27.1?
  4. Is there any difference between the nodes you can't connect to and the ones you can?

soakes commented 7 months ago

Hi @mlsmaycon,

Interestingly, the rosenpass error only shows if the endpoint is offline (i.e. the netbird client is shut down). But yes, it's enabled on both sides, and if the client is online the rosenpass error isn't present, as the outputs show it to be true.

As for point two, no, I don't have a default gateway route (0.0.0.0/0) set.

As for 0.27.1, I apologise, I didn't notice that it had been released. I will update all nodes shortly and post an update after I've redone the tests.

On the last point, everything is fine. As I mentioned, something network-related happened late last night. I suspect internet routing knocked off some of the nodes but not all, as I have machines/networks in different countries. But when connectivity recovered, the netbird links didn't reconnect on their own and required a manual restart of the netbird process.

You can test this yourself by dropping the packets between a couple of nodes, waiting a few minutes, and then removing the block. You will see that the netbird link stays down and that the status still shows it as connected when it clearly isn't, as you can no longer see the other side even with the drop removed.

soakes commented 7 months ago

Hi @mlsmaycon,

As promised, here are the test results after upgrading to 0.27.1.

I now can confirm that all 10 clients have been upgraded to 0.27.1.

I have just run the test again now that all clients have been updated, and have also written up step-by-step instructions so you can reproduce it.

You can reproduce both the failure to reconnect and the status update failure using the steps below:

Check if LINK is alive (alive as expected)

(PROD JSHN)[root@ra01 ~]# ping 100.xx.xx.155
PING 100.xx.xx.155 (100.xx.xx.155) 56(84) bytes of data.
64 bytes from 100.xx.xx.155: icmp_seq=1 ttl=64 time=8.13 ms
64 bytes from 100.xx.xx.155: icmp_seq=2 ttl=64 time=8.39 ms
^C
--- 100.xx.xx.155 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 8.132/8.261/8.391/0.129 ms

Check if netbird status -d shows that the link is up (yes, it should be up, as we confirmed PING is working)

 xxx.netbird.cloud:
  NetBird IP: 100.xx.xx.155
  Public key: eUH/Juw9vnL******
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 10.xx.xx.254:51820/195.xx.xx.155:51820
  Last connection update: 2024-04-07 09:34:38
  Last WireGuard handshake: 2024-04-07 09:36:39
  Transfer status (received/sent) 3.8 KiB/30.7 KiB
  Quantum resistance: true
  Routes: -
  Latency: 8.35792ms

Add a DROP rule on the firewall so the machine can't see the other side, which will also affect the other side as there will be no reply/response

(PROD JSHN)[root@ra01 ~]# iptables -I OUTPUT -d 100.xx.xx.155/32 -j DROP

Confirm LINK is down (yes this should now not respond as we added a DROP rule above)

(PROD JSHN)[root@ra01 ~]# ping 100.xx.xx.155
PING 100.xx.xx.155 (100.xx.xx.155) 56(84) bytes of data.
^C
--- 100.92.209.155 ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9119ms

Confirm netbird status -d has updated to show disconnected/offline (this should say disconnected as the link is down, however it still shows connected)

 xxx.netbird.cloud:
  NetBird IP: 100.xx.xx.155
  Public key: eUH/Juw9vnLDA******
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 10.xx.xx.254:51820/195.xx.xxx.155:51820
  Last connection update: 2024-04-07 09:34:38
  Last WireGuard handshake: 2024-04-07 09:41:16
  Transfer status (received/sent) 45.6 KiB/32.8 KiB
  Quantum resistance: true
  Routes: -
  Latency: 8.230129ms

Now restore the link, which has been down a good 5+ minutes, by removing the iptables drop rule

(PROD JSHN)[root@ra01 ~]# iptables -D OUTPUT -d 100.xx.xx.155/32 -j DROP

Confirm link has been restored (this should now recover and should now ping, however as seen below, it doesn't and the link stays down)

(PROD JSHN)[root@ra01 ~]# ping 100.xxx.xxx.155
PING 100.xx.xx.155 (100.xx.xx.155) 56(84) bytes of data.
^C
--- 100.xx.xx.155 ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9108ms

Only solution now is to restart netbird

(PROD JSHN)[root@ra01 ~]# systemctl restart netbird

Now confirm if link is online via PING

(PROD JSHN)[root@ra01 ~]# ping 100.92.209.155
PING 100.xx.xx.155 (100.xx.xx.155) 56(84) bytes of data.
64 bytes from 100.xx.xx.155: icmp_seq=1 ttl=64 time=8.23 ms
^C
--- 100.xx.xx.155 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 8.226/8.226/8.226/0.000 ms

This should not be happening: on the internet there are always times when ISPs update routing and links can drop for a little while (1-2 minutes) while they re-establish.

So I would say there are two issues: first, the status is not being updated, and second, netbird is not trying to reconnect the links after a network blip.
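For anyone who wants to repeat the steps above without typing them by hand, here is a minimal Python sketch of the same sequence. The `reproduce` function, its parameters, and the injectable `runner` are hypothetical helpers written for illustration; the commands it issues (ping, iptables, netbird status -d) are the ones used in this report, and it has to run as root on a test machine.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return (exit code, stdout)."""
    done = subprocess.run(cmd, capture_output=True, text=True)
    return done.returncode, done.stdout


def reproduce(peer_ip, runner=run, pause=time.sleep, outage_seconds=300):
    """Block traffic to one peer for a while, then unblock it and record what was seen."""
    ping = ["ping", "-c", "2", peer_ip]
    rule = ["OUTPUT", "-d", f"{peer_ip}/32", "-j", "DROP"]
    seen = {}

    # Step 1: the link should be alive before the test starts.
    seen["up_before"] = runner(ping)[0] == 0

    # Step 2: simulate the outage with a DROP rule.
    runner(["iptables", "-I"] + rule)
    try:
        seen["up_during"] = runner(ping)[0] == 0
        pause(outage_seconds)
        # Step 3: capture what netbird reports while the link is blocked.
        seen["status_during"] = runner(["netbird", "status", "-d"])[1]
    finally:
        # Step 4: always remove the rule again, even if something failed.
        runner(["iptables", "-D"] + rule)

    # Step 5: in a healthy setup the link should recover on its own here.
    seen["up_after"] = runner(ping)[0] == 0
    return seen


# Example call (placeholder address, replace with the peer's NetBird IP):
# print(reproduce("100.64.0.2"))
```

With the behaviour described in this issue, `up_before` is true, `up_during` and `up_after` are both false, and `status_during` still contains Status: Connected for the blocked peer.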

mlsmaycon commented 7 months ago

Can you confirm the OS of the peers involved? I know one is Linux, but what is the other?

soakes commented 7 months ago

Apologies, they are all Linux (Debian 12) machines. Most are routing blocks, but a few are just standalone clients. However, all are running Debian 12.

mlsmaycon commented 7 months ago

Thanks for all the information. Recently we have updated our ICE library and routing logic. We will troubleshoot and attempt to reproduce the issue. If needed I will ask for some debug logs later.

soakes commented 7 months ago

No worries. Just ping me when you need some further testing.

soakes commented 7 months ago

@mlsmaycon, FYI, I have found that if the peer is on a dynamic public address and this changes (which happens with a lot of broadband connections over here), netbird status and the UI still show the endpoint as connected, yet the link is completely down (can't ping the private netbird IP). The only resolution is still to restart the netbird process, which then restores the connection.

The interesting thing is that with netmaker, which also uses the kernel version of WireGuard, this problem isn't present.

This isn't just happening on dynamic links; it's happening on everything, including leased lines with static public addresses and even datacenters. I am seeing at least a node or two vanish every couple of days or so, and the only solution is a restart of the client.

I am tempted to write something to directly monitor the links and restart if required, but as there's no way to bring up just a specific link, a restart will affect all of them, which is not what I want. Apart from these issues netbird is awesome; it's just not suitable for site-to-site links in its current state.
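The monitor I have in mind would look roughly like the Python sketch below. It is only a workaround idea, not something netbird provides: the function names, the threshold of three consecutive failed probes, and the peer addresses are all placeholders I made up. It probes each peer's NetBird IP and, as a last resort, runs systemctl restart netbird, which restarts every link and not just the broken one.

```python
import subprocess


def ping_ok(ip):
    """Send one ICMP echo with a 2 second timeout; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_netbird():
    """Restart the whole client; this drops and re-establishes ALL links."""
    subprocess.run(["systemctl", "restart", "netbird"], check=True)


def watch_once(peer_ips, failures, ping=ping_ok, restart=restart_netbird, threshold=3):
    """Probe every peer once and update the per-peer failure counters.

    Restarts netbird when any peer has failed `threshold` probes in a row.
    Returns True if a restart was triggered.
    """
    for ip in peer_ips:
        failures[ip] = 0 if ping(ip) else failures.get(ip, 0) + 1
    if any(count >= threshold for count in failures.values()):
        restart()
        failures.clear()  # start counting afresh after the restart
        return True
    return False


# Example loop (placeholder addresses, probe interval chosen arbitrarily):
# import time
# counters = {}
# while True:
#     watch_once(["100.64.0.2", "100.64.0.3"], counters)
#     time.sleep(60)
```

Requiring several failed probes in a row avoids restarting on a single lost packet, but a short real outage on one link would still take every other link down briefly, which is exactly the drawback I mentioned.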