Windows 10 selfhosted tunnel connected but can see no other peers in a group - other clients work fine

m3v4 commented 3 months ago

In our self-hosted deployment we are experiencing problems with a single client (out of a few dozen). So far we have not been able to reproduce the problem anywhere else in our infrastructure, but we are struggling to resolve this one case.

OS: Windows 10 Pro 10.0.19045 x64
Client version: latest (0.27.10 AMD x64)

Installed with elevated user rights (main Administrator account). CLI command used for installation:

msiexec /i netbird.msi /quiet /l netbird.log

netbird.log:

=== Logging started: 19.06.2024 11:42:15 ===
Action start 11:42:15: INSTALL.
Action start 11:42:15: FindRelatedProducts.
Action ended 11:42:15: FindRelatedProducts. Return value 1.
Action start 11:42:15: LaunchConditions.
Action ended 11:42:15: LaunchConditions. Return value 1.
Action start 11:42:15: ValidateProductID.
Action ended 11:42:15: ValidateProductID. Return value 1.
Action start 11:42:15: CostInitialize.
Action ended 11:42:15: CostInitialize. Return value 1.
Action start 11:42:15: FileCost.
Action ended 11:42:15: FileCost. Return value 1.
Action start 11:42:15: CostFinalize.
Action ended 11:42:15: CostFinalize. Return value 1.
Action start 11:42:15: MigrateFeatureStates.
Action ended 11:42:15: MigrateFeatureStates. Return value 0.
Action start 11:42:15: InstallValidate.
Action ended 11:42:15: InstallValidate. Return value 1.
Action start 11:42:15: RemoveExistingProducts.
Action ended 11:42:15: RemoveExistingProducts. Return value 1.
Action start 11:42:15: InstallInitialize.
Action ended 11:42:15: InstallInitialize. Return value 1.
Action start 11:42:15: ProcessComponents.
Action ended 11:42:15: ProcessComponents. Return value 1.
Action start 11:42:15: UnpublishFeatures.
Action ended 11:42:15: UnpublishFeatures. Return value 1.
Action start 11:42:15: StopServices.
Action ended 11:42:15: StopServices. Return value 1.
Action start 11:42:15: DeleteServices.
Action ended 11:42:15: DeleteServices. Return value 1.
Action start 11:42:15: RemoveShortcuts.
Action ended 11:42:15: RemoveShortcuts. Return value 1.
Action start 11:42:15: RemoveEnvironmentStrings.
Action ended 11:42:15: RemoveEnvironmentStrings. Return value 1.
Action start 11:42:15: RemoveFiles.
Action ended 11:42:15: RemoveFiles. Return value 0.
Action start 11:42:15: RemoveFolders.
Action ended 11:42:15: RemoveFolders. Return value 0.
Action start 11:42:15: CreateFolders.
Action ended 11:42:15: CreateFolders. Return value 0.
Action start 11:42:15: InstallFiles.
Action ended 11:42:15: InstallFiles. Return value 1.
Action start 11:42:15: CreateShortcuts.
Action ended 11:42:15: CreateShortcuts. Return value 1.
Action start 11:42:15: WriteEnvironmentStrings.
Action ended 11:42:15: WriteEnvironmentStrings. Return value 1.
Action start 11:42:15: InstallServices.
Action ended 11:42:15: InstallServices. Return value 1.
Action start 11:42:15: StartServices.
Action ended 11:42:15: StartServices. Return value 1.
Action start 11:42:15: RegisterUser.
Action ended 11:42:15: RegisterUser. Return value 1.
Action start 11:42:15: RegisterProduct.
Action ended 11:42:15: RegisterProduct. Return value 1.
Action start 11:42:15: PublishFeatures.
Action ended 11:42:15: PublishFeatures. Return value 1.
Action start 11:42:15: PublishProduct.
Action ended 11:42:15: PublishProduct. Return value 1.
Action start 11:42:15: InstallFinalize.
Action ended 11:42:18: InstallFinalize. Return value 1.
Action ended 11:42:18: INSTALL. Return value 1.
MSI (s) (5C:28) [11:42:18:425]: Product: NetBird -- Installation completed successfully.

MSI (s) (5C:28) [11:42:18:425]: Windows Installer installed the product. Product Name: NetBird. Product Version: 0.27.10. Product Language: 1033. Manufacturer: Wiretrustee UG (haftungsbeschreankt). Installation success or error status: 0.

=== Logging stopped: 19.06.2024 11:42:18 ===

After the successful installation I ran netbird up with the URL parameters. Here is the debug log bundle:

2024-06-19T11:42:16+02:00 INFO client/cmd/service_controller.go:24: starting Netbird service
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:140: generating new config C:\ProgramData\Netbird\config.json
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:202: using default Management URL https://api.netbird.io:443
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:226: using default Admin URL https://api.netbird.io:443
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:244: generated new Wireguard key
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:250: generated new SSH key
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:266: using default Wireguard port 51820
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:277: using default Wireguard interface wt0
2024-06-19T11:42:16+02:00 INFO client/internal/config.go:321: filling in interface blacklist with defaults: [ wt0 wt utun tun0 zt ZeroTier wg ts Tailscale tailscale docker veth br- lo ]
2024-06-19T11:42:16+02:00 INFO client/cmd/service_controller.go:64: started daemon server: 127.0.0.1:41731
2024-06-19T11:43:27+02:00 INFO client/internal/config.go:209: new Management URL provided, updated to "https://net.anon-ST92p.domain:33073" (old value "https://api.netbird.io:443")
2024-06-19T11:43:27+02:00 INFO client/internal/config.go:347: enabling SSH server
2024-06-19T11:43:28+02:00 WARN client/server/server.go:259: failed login: rpc error: code = InvalidArgument desc = invalid setup-key or no sso information provided, err: invalid UUID length: 0
2024-06-19T11:44:10+02:00 INFO client/internal/login.go:130: peer has been successfully registered on Management Service
2024-06-19T11:44:10+02:00 INFO client/internal/connect.go:119: starting NetBird client version 0.27.10 on windows/amd64
2024-06-19T11:44:11+02:00 INFO client/internal/routemanager/manager.go:93: Routing setup complete
2024-06-19T11:44:13+02:00 INFO signal/client/grpc.go:158: connected to the Signal Service stream
2024-06-19T11:44:13+02:00 INFO client/internal/engine.go:1405: Network monitor is disabled, not starting
2024-06-19T11:44:13+02:00 INFO client/internal/connect.go:265: Netbird engine started, the IP is: 100.103.53.99/16
2024-06-19T11:44:13+02:00 INFO management/client/grpc.go:147: connected to the Management Service stream
2024-06-19T11:44:13+02:00 INFO client/internal/dns/host_windows.go:149: added 1 match domains to the state. Domain list: [.netbird.selfhosted]
2024-06-19T11:44:13+02:00 INFO client/internal/dns/host_windows.go:176: updated the search domains in the registry with 1 domains. Domain list: [netbird.selfhosted]
2024-06-19T11:44:13+02:00 INFO client/internal/acl/manager.go:52: ACL rules processed in: 0s, total rules count: 0
2024-06-19T11:46:31+02:00 INFO client/internal/acl/manager.go:52: ACL rules processed in: 564.3µs, total rules count: 20

I authenticated successfully on the first attempt, but the logs above still show a failed attempt. Below is a screenshot of the successful authentication screen from Authentik (our SSO tool).

And status:

Peers detail:
 laptop-dell-mariusza.netbird.selfhosted:
  NetBird IP: 100.103.44.19
  Public key: YvVW8g9sDDcUNhigOOW2SlIZBHj5Lj//mfMP2WAgzkg=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 5 seconds ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

 komputer-oskara.netbird.selfhosted:
  NetBird IP: 100.103.86.100
  Public key: juqFrcIdeYGFwAUYxei6SU0SiRRsJ8JbXfKMWurUphs=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 1 second ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

 milena-laptop.netbird.selfhosted:
  NetBird IP: 100.103.207.48
  Public key: /hJXb8Z7N3//Dmtbx40u2iUa1aGWHgkX1pDjAeClXlA=
  Status: Connecting
  -- detail --
  Connection type:
  Direct: false
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Last connection update: 1 second ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

OS: windows/amd64
Daemon version: 0.27.10
CLI version: 0.27.10
Management: Connected to https://net.anon-ZTM28.domain:33073
Signal: Connected to http://net.anon-ZTM28.domain:10000
Relays:
  [stun:net.anon-ZTM28.domain:3478] is Available
  [turn:net.anon-ZTM28.domain:3478?transport=udp] is Unavailable, reason: allocate: attribute not found
Nameservers:
FQDN: desktop-n07vu1e.netbird.selfhosted
NetBird IP: 100.103.53.99/16
Interface type: Userspace
Quantum resistance: false
Routes: -
Peers count: 0/3 Connected

The status on the other machines in the same group shows "Peers count" as "2/3 Connected", meaning that this single machine does not connect properly, and it cannot reach any of the other machines either.

In our policies we have port 3389 open and that kind of traffic allowed inside the aforementioned group. This one PC is unable to access our server, though.

Previously we used OpenVPN and WireGuard on all of the aforementioned machines, but only this one has problems. I have tried to find any remaining "tun/tap" adapters, but none were identified, not even hidden ones in Device Manager. I have also activated the Administrator account and installed using that, but no joy either. We have dozens of other computers in other groups with the exact same policies and all seems fine elsewhere; only this one PC is making a fuss about the change of VPN platform.

What else can I try and diagnose?

pascal-fischer commented 3 months ago

Hi @m3v4,

The warning might be a bit misleading. Without knowing why the first attempt failed, it seems the client was able to authenticate on retry, connect to management successfully, and receive information about the other peers in the network, so that's good so far.

The issue seems to occur during connection establishment. Is this machine in a physically different location than the working ones, or maybe behind a different firewall? The status output shows issues connecting to TURN, which means the peer can only connect to other peers using P2P, because relay (which would be the fallback) is not possible. If P2P is not possible (which can happen for multiple reasons) and the peer is unable to fall back to relay, the connections can get stuck in the Connecting state.
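As a minimal sketch of how to spot this condition automatically, the snippet below scans the text of a status output for relay entries and lists the ones reported as Unavailable. It assumes the entry wording seen in this thread ("[uri] is Available" / "[uri] is Unavailable, reason: ..."); the function names and the sample host net.example.domain are hypothetical, and other client versions may format this section differently.

```python
import re

# Assumed entry format, taken from the status output in this thread:
#   [turn:host:3478?transport=udp] is Unavailable, reason: allocate: attribute not found
RELAY_RE = re.compile(
    r"\[(?P<uri>[^\]]+)\]\s+is\s+(?P<state>Available|Unavailable)"
    r"(?:,\s*reason:\s*(?P<reason>[^\n\[]*?)(?=\s*(?:\[|Nameservers:|$)))?",
    re.MULTILINE,
)


def parse_relays(status_text):
    """Return one dict per relay entry found in the status text."""
    return [
        {
            "uri": m["uri"],
            "available": m["state"] == "Available",
            "reason": m["reason"],  # None when no reason was printed
        }
        for m in RELAY_RE.finditer(status_text)
    ]


def unavailable_relays(status_text):
    """Relays the client could not use (for example a failing TURN server)."""
    return [r for r in parse_relays(status_text) if not r["available"]]


# Hypothetical sample in the single-line form; multi-line output also works.
sample = (
    "Relays: [stun:net.example.domain:3478] is Available "
    "[turn:net.example.domain:3478?transport=udp] is Unavailable, "
    "reason: allocate: attribute not found Nameservers: FQDN: host.netbird.selfhosted"
)

for relay in unavailable_relays(sample):
    print(relay["uri"], "->", relay["reason"])
```

If the list is non-empty for a TURN entry, the peer has no relay fallback and depends entirely on P2P succeeding.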

m3v4 commented 3 months ago

Hi @pascal-fischer, thanks for the very quick response. That is right, the machine is in a physically different location. The 3 that are working fine are in the same physical location, and one other that is somewhere else was working yesterday, but today it does not...

All 4 machines are in a group with policy allowing all protocols bidirectionally like so:

group <=> group

I've tried removing all peers from the group and adding them again, but nothing changed.

What else can I do?

EDIT: we have deleted that other formerly working peer and reauthorised it with SSO - now it works again. I have tried to use same method on that problematic peer, but this time the issue persists. So again we have 3 local and 1 remote working fine, and one remote not working.

pascal-fischer commented 3 months ago

I am pretty sure this is unrelated to the configuration within NetBird itself. It is related to the setup of the self-hosted management, specifically the TURN and STUN servers, in combination with the physical network of the peer that does not work. You need to make sure that both STUN and TURN are reachable and working from that location.

Relays:
  [stun:net.anon-ZTM28.domain:3478] is Available
  [turn:net.anon-ZTM28.domain:3478?transport=udp] is Unavailable, reason: allocate: attribute not found

If either one of them is not reachable this will cause issues.
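For a quick reachability check from the affected location, a STUN Binding request can be sent by hand. The sketch below builds the 20-byte request header defined in RFC 5389 and waits for a Binding success response over UDP. It only covers the STUN half: a TURN allocation needs credentials and is not exercised here, so an unavailable TURN server would not be detected by this probe. The function names and the host net.example.domain are hypothetical.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442   # fixed value from RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(transaction_id=None):
    """Build a STUN Binding request with no attributes (header only)."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction id must be 12 bytes")
    # Header: message type, message length (0, no attributes), magic cookie.
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def is_binding_success(response, transaction_id):
    """Check that a datagram is a Binding success response to our request."""
    if len(response) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", response[:8])
    return (
        msg_type == BINDING_SUCCESS
        and cookie == MAGIC_COOKIE
        and response[8:20] == transaction_id
    )


def stun_reachable(host, port=3478, timeout=3.0):
    """Return True if the server answers a Binding request within the timeout."""
    transaction_id = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_binding_request(transaction_id), (host, port))
            data, _addr = sock.recvfrom(2048)
        except OSError:  # timeout, DNS failure, unreachable network, ...
            return False
    return is_binding_success(data, transaction_id)


# Example call (hypothetical host name):
# print(stun_reachable("net.example.domain", 3478))
```

Running the probe from both a working and a non-working location narrows down whether the network path to port 3478/UDP is the problem or the server configuration is.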

You said that the other machines are in the same physical location, which means they are most likely connected P2P. So there might even be a general issue with the TURN server.

EDIT: Regarding the peer that was previously not working but works after reauthentication: can you send the netbird status -d output from that peer?

m3v4 commented 3 months ago

sure, here you go:

Peers detail:
 laptop-dell-mariusza.netbird.selfhosted:
  NetBird IP: 100.103.44.19
  Public key: YvVW8g9sDDcUNhigOOW2SlIZBHj5Lj//mfMP2WAgzkg=
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): srflx/srflx
  ICE candidate endpoints (Local/Remote): 198.51.100.0:51820/198.51.100.1:51820
  Last connection update: 20 minutes, 47 seconds ago
  Last WireGuard handshake: 41 seconds ago
  Transfer status (received/sent) 2.5 KiB/3.2 KiB
  Quantum resistance: false
  Routes: -
  Latency: 14.9644ms

 desktop-n07vu1e.netbird.selfhosted:
  NetBird IP: 100.103.50.131
  Public key: E8Y1I9u7F8ntuKE6QU6RNPwiFXrkBadNN1HO2djDPBc=
  Status: Disconnected
  -- detail --
  Connection type: P2P
  Direct: false
  ICE candidate (Local/Remote): srflx/srflx
  ICE candidate endpoints (Local/Remote): 198.51.100.0:51820/198.51.100.1:51820
  Last connection update: Now
  Last WireGuard handshake: 41 seconds ago
  Transfer status (received/sent) 2.5 KiB/3.2 KiB
  Quantum resistance: false
  Routes: -
  Latency: 0s

 desktop-t16jv8o.netbird.selfhosted:
  NetBird IP: 100.103.117.3
  Public key: 4KaK7VcDiimhmCe67feJQZhLGKKKLju5Vs4vxCmxN2U=
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): srflx/srflx
  ICE candidate endpoints (Local/Remote): 198.51.100.0:51820/198.51.100.1:56033
  Last connection update: 20 minutes, 47 seconds ago
  Last WireGuard handshake: 54 seconds ago
  Transfer status (received/sent) 3.3 KiB/2.7 KiB
  Quantum resistance: false
  Routes: -
  Latency: 15.2351ms

 desktop-rglgsc3.netbird.selfhosted:
  NetBird IP: 100.103.228.235
  Public key: /hJXb8Z7N3//Dmtbx40u2iUa1aGWHgkX1pDjAeClXlA=
  Status: Disconnected
  -- detail --
  Connection type: P2P
  Direct: false
  ICE candidate (Local/Remote): srflx/srflx
  ICE candidate endpoints (Local/Remote): 198.51.100.0:51820/198.51.100.1:51820
  Last connection update: -
  Last WireGuard handshake: 41 seconds ago
  Transfer status (received/sent) 2.5 KiB/3.2 KiB
  Quantum resistance: false
  Routes: -
  Latency: 0s

OS: windows/amd64
Daemon version: 0.27.10
CLI version: 0.27.10
Management: Connected to https://net.anon-JBpli.domain:33073
Signal: Connected to http://net.anon-JBpli.domain:10000
Relays:
  [stun:net.anon-JBpli.domain:3478] is Available
  [turn:net.anon-JBpli.domain:3478?transport=udp] is Unavailable, reason: allocate: attribute not found
Nameservers:
FQDN: desktop-3o2mcrk.netbird.selfhosted
NetBird IP: 100.103.61.103/16
Interface type: Userspace
Quantum resistance: false
Routes: -
Peers count: 2/4 Connected

EDIT: I have now realised that TURN is not available on any of our peers, or at least not on the ones that I have checked.

pascal-fischer commented 3 months ago

Ah, perfect! So here you have the same result. It also shows TURN as unavailable:

[turn:net.anon-JBpli.domain:3478?transport=udp] is Unavailable, reason: allocate: attribute not found

And even though it is in a different physical location, it still manages to establish a P2P connection, which is why it is working.

So the issue lies with your TURN server setup in general. Check the TURN server's logs to see whether it is a general issue with the server or a configuration issue.
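The difference between the two status outputs in this thread can also be extracted mechanically. The sketch below is an illustration only: it pulls each peer's status, connection type, and ICE candidate pair out of the detailed status text, and lists the peers that are not connected and never gathered a candidate pair ("-/-"). The field labels are the ones shown in this thread; the function names, sample peer names, and keys are hypothetical.

```python
import re

# Field labels as they appear in the status output quoted in this thread.
# \s+ is used between fields so both one-line and multi-line text match.
PEER_RE = re.compile(
    r"(?P<name>\S+):\s+NetBird IP:\s*(?P<ip>\S+)\s+Public key:\s*\S+\s+"
    r"Status:\s*(?P<status>\w+)\s+--\s*detail\s*--\s+"
    r"Connection type:\s*(?P<conn>\S*?)\s*Direct:\s*(?:true|false)\s+"
    r"ICE candidate \(Local/Remote\):\s*(?P<local>[^/\s]+)/(?P<remote>\S+)"
)


def summarize_peers(status_text):
    """Return name, ip, status, conn, local and remote for every peer block."""
    return [m.groupdict() for m in PEER_RE.finditer(status_text)]


def stuck_without_candidates(status_text):
    """Peers that are not Connected and show no ICE candidate pair ("-/-")."""
    return [
        p["name"]
        for p in summarize_peers(status_text)
        if p["status"] != "Connected" and p["local"] == "-" and p["remote"] == "-"
    ]


# Hypothetical two-peer sample modelled on the outputs above.
sample = (
    "Peers detail: peer-a.netbird.selfhosted: NetBird IP: 100.103.44.19 "
    "Public key: AAAA= Status: Connecting -- detail -- Connection type: "
    "Direct: false ICE candidate (Local/Remote): -/- "
    "ICE candidate endpoints (Local/Remote): -/- "
    "peer-b.netbird.selfhosted: NetBird IP: 100.103.117.3 "
    "Public key: BBBB= Status: Connected -- detail -- Connection type: P2P "
    "Direct: true ICE candidate (Local/Remote): srflx/srflx "
    "ICE candidate endpoints (Local/Remote): 198.51.100.0:51820/198.51.100.1:56033"
)

for peer in summarize_peers(sample):
    print(peer["name"], peer["status"], peer["conn"] or "-", peer["local"], peer["remote"])
```

A peer that connects with srflx/srflx candidates got through via P2P and never needed the relay, which is consistent with one remote machine working while TURN is unavailable for everyone.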

m3v4 commented 3 months ago

Thanks @pascal-fischer, our admin is on it and testing. Could it be as simple as the UDP port range 49152-65535 being closed?

pascal-fischer commented 3 months ago

Yes, this could be the reason.

Pshemas commented 2 months ago

In the end it was the coturn config (the ports were open, so it was not a firewall issue). I prepared the config for it (turnserver.conf) manually and used tools like trickle-ice to test whether it was working correctly, until it did :)

netbirdio / netbird #2153