netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Windows Client: Network Routes disappear after Peer Login Expiration and don't reappear #2637

Open Rob787 opened 1 month ago

Rob787 commented 1 month ago

Describe the problem

We noticed the following behaviour, which is currently preventing us from deploying NetBird in production:

The network routes seem to disappear from Windows peers after they are asked to log in again (the login expired per the default setting). After reauthentication, the network routes are not updated and remain empty. The only way to get them back is to trigger a change to the network routes on the server side (e.g. deactivating a route and quickly reactivating it).

To Reproduce

Steps to reproduce the behavior:

  1. Reauthenticate after the NetBird login has expired
  2. Reconnect
  3. Check Network Routes of client and see they are empty
  4. Server-side: Deactivate and immediately activate a network route
  5. Client: check Network Routes and observe they are now repropagated

Expected behavior

Network Routes to immediately propagate after reauthentication.

Are you using NetBird Cloud?

Self-hosted for testing purposes, might move to cloud in future for production

NetBird version

0.29.4; behaviour also observed on 0.29.3

NetBird status -dA output:

Do you face any (non-mobile) client issues?

Please provide the file created by netbird debug for 1m -AS. We advise reviewing the anonymized files for any remaining PII.

netbird.debug.256237502.zip

Screenshots

N/A

Additional context

N/A

mgarces commented 1 month ago

@Rob787 can you confirm the version of Windows on which you are experiencing these issues? I tested this scenario on a fresh install of Windows Server 2019 Base with a default expiry of 1 hour. Once I went through steps 1 and 2, I got the routes back after a while (both subnet and DNS routes). I only have 2 routes configured; your case seems a bit more complicated. It took 20-30 seconds for the routes to (re)propagate, but they did.

Rob787 commented 1 month ago

@mgarces Windows 11 on all clients, which are basically team members' laptops.

Thanks for looking into this :)

We've now also seen this happen independently of reauthentication: the routes sometimes disappear from the client at random and don't repropagate. We have to restart the NetBird client and then trigger a change on the server for them to come back.

Routes are required because we run many IPsec tunnels to our Azure infrastructure for different clients. For these clients we need to connect directly to DBs and/or software clients on their end, hence the large number of routes :)

mlsmaycon commented 1 month ago

@Rob787, the routes are tied to the connection between this peer and the routing peer. When the issue happens, can you validate that this node is connected to the routing peer? You can do that with the following steps:

# you should see a route list with the remote peer details and its connection status
netbird status -d --filter-by-names <routing peer name>
# ping the routing peer if an access control policy allows that
ping 100.x.x.x
# get the route and confirm the next hop as the NetBird address
Find-NetRoute -RemoteIPAddress "<Routed IP>" | Select-Object ifIndex,InterfaceAlias,DestinationPrefix,NextHop,RouteMetric -Last 1
# get the system routes
route print -4

Rob787 commented 1 month ago

@mlsmaycon Thanks for jumping in on this.

My Windows client is once again in the state where all Network Routes have disappeared, so this is a good time to test the steps above :)

Funnily enough, netbird status -d shows only a few of the other laptop peers, not the peers used for routing. Those are Docker peers in one network and Kubernetes peers in Azure. Server-side they appear online, but when I run the command above on Windows, only other Windows (laptop) peers are detected.

Does this give you some direction where to troubleshoot further?

mlsmaycon commented 1 month ago

@Rob787 ok.

We need to enable more logging to debug this properly. Can you run the following commands in a privileged PowerShell session to enable debug logs and increase the maximum log file size?

# enable debug logs via machine environment variables
[Environment]::SetEnvironmentVariable("NB_LOG_LEVEL", "debug", "Machine")
# set max log size to 100MB via environment variables
[Environment]::SetEnvironmentVariable("NB_LOG_MAX_SIZE_MB", "100", "Machine")
# restart the agent
netbird service restart
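
To confirm the variables were stored before trying to reproduce the issue, they can be read back from the machine scope. This is a minimal sketch, assuming the same privileged PowerShell session; `GetEnvironmentVariable` is the standard .NET counterpart of the call used above, and the final `netbird status` is only a quick check that the agent came back up after the restart:

```powershell
# Read back the machine-level variables; each call should print the value set above.
[Environment]::GetEnvironmentVariable("NB_LOG_LEVEL", "Machine")
[Environment]::GetEnvironmentVariable("NB_LOG_MAX_SIZE_MB", "Machine")

# Quick check that the agent is reachable again after the restart.
netbird status
```

If either call prints nothing, the variable was not saved at machine scope, which usually means the session was not elevated.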

If the issue happens again, please share the log file from C:\ProgramData\netbird\client.log (compressing it first is a good idea), then run the following steps and share the output of each command. If you would rather not share them here, please use Slack or email us at support@netbird.io.

# you should see a route list with the remote peer details and its connection status
netbird status -d --filter-by-names <routing peer name>
# ping the routing peer if an access control policy allows that
ping 100.x.x.x
# get the route and confirm the next hop as the NetBird address
Find-NetRoute -RemoteIPAddress "<Routed IP>" | Select-Object ifIndex,InterfaceAlias,DestinationPrefix,NextHop,RouteMetric -Last 1
# get the system routes
route print -4
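
The outputs requested above can be collected into one file, and the log compressed, in a single pass. This is only a sketch: the peer name, routed IP, and output paths are placeholders chosen for illustration, and the log is copied first on the assumption that the running service may still hold the original open:

```powershell
# Placeholders (not from the thread): set these to your routing peer and a routed IP.
$peer     = "<routing peer name>"
$routedIp = "<Routed IP>"

# Hypothetical output locations under the current user's temp folder.
$report  = Join-Path $env:TEMP "netbird-diagnostics.txt"
$logCopy = Join-Path $env:TEMP "netbird-client.log"
$archive = Join-Path $env:TEMP "netbird-client-log.zip"

# Save the command outputs in order, appending to one report file.
netbird status -d --filter-by-names $peer | Out-File -FilePath $report
Find-NetRoute -RemoteIPAddress $routedIp |
    Select-Object ifIndex,InterfaceAlias,DestinationPrefix,NextHop,RouteMetric -Last 1 |
    Out-File -FilePath $report -Append
route print -4 | Out-File -FilePath $report -Append

# Copy the log before compressing it, then create (or overwrite) the archive.
Copy-Item -Path "C:\ProgramData\netbird\client.log" -Destination $logCopy -Force
Compress-Archive -Path $logCopy -DestinationPath $archive -Force
```

Both the report and the log can contain addresses and peer names, so it is worth reviewing them before sharing.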

mgarces commented 5 days ago

Hi there, can you please test release [0.31.1](https://github.com/netbirdio/netbird/releases/tag/v0.31.1)?