netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Random Disconnections on Windows Server 2019 #2608

Open arobinsongit opened 1 month ago

arobinsongit commented 1 month ago

Describe the problem

We have a single machine in our group of peers that seems to randomly disconnect and will not reconnect until we uninstall and reinstall netbird.

Last time we saw the issue was September 13 at around 8:30 AM, when a user tried to connect.

To Reproduce

Steps to reproduce the behavior:

  1. Install netbird with a fixed key
  2. Confirm host shows up in peers and I can connect via RDP
  3. Wait for some number of days (I have not been able to establish any kind of pattern)
  4. Attempt to connect via RDP to netbird, connection is not possible and it shows offline in the peers view of the website

Expected behavior

The machine should always be connected to netbird

Are you using NetBird Cloud?

Using netbird cloud

NetBird version

0.29.2

NetBird status -dA output:

-- Replaced hostnames with 001, 002, 003, etc. except for the host in question, test-001

Peers detail:
 andy-x1-01.netbird.cloud:
  NetBird IP: 100.83.216.253/32
  Public key: xwHuTn+vUNJWuxHeUNYIVMwEHCNMie38MNRrwthK9FQ=
  Status: Disconnected
  -- detail --
  Connection type:
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address:
  Last connection update: 5 minutes, 10 seconds ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

 001.netbird.cloud:
  NetBird IP: 100.83.134.124/32
  Public key: zVxs5wF1zCHh7OoaW11vDUo6HCKdZAgBvGwHoUbxKWQ=
  Status: Disconnected
  -- detail --
  Connection type:
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address:
  Last connection update: 5 minutes, 10 seconds ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

 002.netbird.cloud:
  NetBird IP: 100.83.24.87
  Public key: S/wflOzdL/HmBPMDO7nlSrzT2cIxG02L5Ez/UxxWnhg=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/srflx
  ICE candidate endpoints (Local/Remote): 10.20.30.102:51820/198.51.100.0:51820
  Relay server address:
  Last connection update: 5 minutes, 9 seconds ago
  Last WireGuard handshake: 59 seconds ago
  Transfer status (received/sent) 276 B/924 B
  Quantum resistance: false
  Routes: 10.0.0.0/24
  Latency: 27.3266ms

 003.netbird.cloud:
  NetBird IP: 100.83.24.199
  Public key: p+ou9OsQHDZvsHAM3P5yMrnYnD4svnhXMHJY9lTLgF8=
  Status: Disconnected
  -- detail --
  Connection type:
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address:
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

 004.netbird.cloud:
  NetBird IP: 100.83.89.86
  Public key: kspYz8Y+g6e9XhjrUO0tiulQ9KR2te2g651FUnsasGI=
  Status: Disconnected
  -- detail --
  Connection type:
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address:
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false
  Routes: -
  Latency: 0s

 005.netbird.cloud:
  NetBird IP: 100.83.137.194
  Public key: AhXLX4DchwYz/dZ11JFRyHZ9lwD5ywy7D2j28ZS7r10=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/srflx
  ICE candidate endpoints (Local/Remote): 127.0.0.1:51820/198.51.100.1:51820
  Relay server address:
  Last connection update: 5 minutes, 8 seconds ago
  Last WireGuard handshake: 1 minute, 7 seconds ago
  Transfer status (received/sent) 409.3 KiB/1.8 MiB
  Quantum resistance: false
  Routes: -
  Latency: 32.6807ms

 006.netbird.cloud:
  NetBird IP: 100.83.253.195
  Public key: aiGNqgUZXnlM7DrmCnMKXVJutgtY/3MwDcILbdyzoxA=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 192.168.10.176:51820/192.168.10.164:51820
  Relay server address:
  Last connection update: 5 minutes, 9 seconds ago
  Last WireGuard handshake: 59 seconds ago
  Transfer status (received/sent) 308 B/924 B
  Quantum resistance: false
  Routes: -
  Latency: 1.0118ms

OS: windows/amd64
Daemon version: 0.29.2
CLI version: 0.29.2
Management: Connected to https://api.netbird.io:443
Signal: Connected to https://signal.netbird.io:443
Relays:
  [stun:stun.netbird.io:5555] is Available
  [turns:turn.netbird.io:443?transport=tcp] is Available
Nameservers:
FQDN: test-001.netbird.cloud
NetBird IP: 100.83.221.102/16
Interface type: Userspace
Quantum resistance: false
Routes: -
Peers count: 3/7 Connected

Do you face any (non-mobile) client issues?

no

Please provide the file created by netbird debug for 1m -AS.

Screenshots image

Additional context

We are running both netbird and zerotier on these machines. We previously ran zerotier but are swapping over to netbird. We don't have this issue on any other machine, so we suspect it is something unique to conditions on this single machine.

mlsmaycon commented 1 month ago

@arobinsongit, thanks for sharing the logs. Unfortunately, it doesn't show anything around 8:30; the only logs seem to be related to a system reboot:

2024-09-13T10:44:35-04:00 INFO client/cmd/root.go:191: shutdown signal received
2024-09-13T10:44:35-04:00 INFO client/internal/engine.go:252: Network monitor: stopped
...
2024-09-13T10:44:36-04:00 INFO client/internal/routemanager/manager.go:170: Routing cleanup complete
2024-09-13T10:44:37-04:00 INFO client/internal/engine.go:275: stopped Netbird Engine
2024-09-13T10:44:37-04:00 INFO client/internal/connect.go:281: stopped NetBird client
2024-09-13T10:44:42-04:00 INFO client/cmd/service_controller.go:80: stopped Netbird service
2024-09-13T13:25:20-04:00 INFO client/cmd/service_controller.go:24: starting Netbird service
2024-09-13T13:25:20-04:00 INFO client/cmd/service_controller.go:66: started daemon server: 127.0.0.1:41731
2024-09-13T13:25:20-04:00 INFO client/internal/connect.go:117: starting NetBird client version 0.29.2 on windows/amd64

The issue might be related to the daemon stopping. Can you please check the system logs in the event viewer for events related to the NetBird process?

mlsmaycon commented 1 month ago

@arobinsongit we've released version 0.29.3. You can also upgrade your client to this version to check whether the issue was related to the network-change problems that were fixed in this release.

arobinsongit commented 1 month ago

I've upgraded to 0.29.3

I'll write a small script to dump the status every minute and ping a few other peers, so if it goes down I can isolate the issue better. The 08:30 time I gave before was just when we found the issue, when a user tried to connect.

Regards Andy
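A minimal sketch of the kind of monitoring script described above, assuming Python is available on the host and `netbird` is on the PATH. It is not the script the reporter actually used. The function names, the log file name and the 60-second interval are hypothetical; the only NetBird call is `netbird status -d`, and the `Peers count:` line it parses is the summary line shown in the status output earlier in this issue.

```python
import datetime
import platform
import re
import subprocess
import time

# Matches the summary line from the status output, e.g. "Peers count: 3/7 Connected".
PEERS_RE = re.compile(r"Peers count:\s*(\d+)/(\d+)")


def parse_peers_count(status_text):
    """Return (connected, total) from the status summary, or None if it is absent."""
    match = PEERS_RE.search(status_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def ping_command(host, system=None):
    """Build a one-packet ping; Windows counts with -n, most other systems with -c."""
    system = system or platform.system()
    return ["ping", "-n" if system == "Windows" else "-c", "1", host]


def snapshot(peers, run=subprocess.run):
    """Collect one timestamped line for the daemon status and one per pinged peer."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    status = run(["netbird", "status", "-d"], capture_output=True, text=True)
    lines = [f"{stamp} peers={parse_peers_count(status.stdout)}"]
    for host in peers:
        result = run(ping_command(host), capture_output=True, text=True)
        outcome = "ok" if result.returncode == 0 else "FAILED"
        lines.append(f"{stamp} ping {host} {outcome}")
    return lines


def monitor(peers, log_path="netbird-monitor.log", interval=60):
    """Append a snapshot to log_path every `interval` seconds until interrupted."""
    while True:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write("\n".join(snapshot(peers)) + "\n")
        time.sleep(interval)
```

Calling `monitor(["100.83.24.87", "100.83.137.194"])` would log against two of the peers that were connected in the status output above. A `peers=None` entry means the summary line was missing from the command output, which is itself a useful signal that the daemon was not answering.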

mlsmaycon commented 1 month ago

Ok, I imagined that the time wasn't exactly when the issue happened, but looking back in the logs, I don't see any other abnormal event that could explain it.

arobinsongit commented 1 month ago

Ok thanks - another question - is there a reliable address (IP or hostname) that I could ping on the netbird side that would confirm netbird connectivity? I can use another one of my peers but that might go up and down. Also if it reboots I don't know what the leases on the addresses look like so it might not come back up at the same address.

-andy

arobinsongit commented 1 month ago

Looks like the service is not starting up successfully on reboots. The service is set to start automatically but it's timing out. I can start the service after I have logged on and it runs with no issues.

Debug file netbird.debug.1969967324.zip

Event Logs and Service configuration services-events.zip

Multiple reboots today with the last one being around 20:27

I do see this line after I start up the service

74235 Sep 17 20:25 Error Microsoft-Windows... 1023 Name resolution policy table has been corrupted. DNS resolution will fail until it is fixed. Contact your network administrator. For more information: read policy table for rule NetBird-Match failed...

Although that might not have anything to do with the service not starting on reboots

mlsmaycon commented 1 month ago

@arobinsongit I got this from one of the event logs:

   74077 Sep 17 16:31  Error       Service Control M...   3221232472 The NetBird service failed to start due to the following error: ...

can you share more details about it?

arobinsongit commented 1 month ago

Dang, didn't realize powershell truncated that - that's kinda worthless :-)

Here are two messages back to back

Log Name: System
Source: Service Control Manager
Date: 9/17/2024 4:31:06 PM
Event ID: 7009
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: TEST-001
Description: A timeout was reached (30000 milliseconds) while waiting for the NetBird service to connect.

Log Name: System
Source: Service Control Manager
Date: 9/17/2024 4:31:06 PM
Event ID: 7000
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: TEST-001
Description: The NetBird service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.
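Since the service starts fine when launched by hand after logon, one possible stopgap (not a fix proposed by the maintainers) is a task that runs after boot, checks the service state and retries the start. The sketch below assumes the service is registered under the name `NetBird`, as the event text suggests, and that it runs from an elevated context. The function names, retry count and delay are hypothetical; `sc.exe query` and `sc.exe start` are the standard Windows service control commands.

```python
import re
import subprocess
import time

SERVICE_NAME = "NetBird"  # assumption: taken from the event log text, not verified

# "sc.exe query" prints a line such as "STATE              : 4  RUNNING".
STATE_RE = re.compile(r"STATE\s*:\s*\d+\s+(\w+)")


def service_state(sc_output):
    """Return the state word (RUNNING, STOPPED, ...) from sc.exe query output, or None."""
    match = STATE_RE.search(sc_output)
    return match.group(1) if match else None


def ensure_running(run=subprocess.run, sleep=time.sleep, attempts=5, delay=30):
    """Start the service if needed; return the attempt on which it was seen running, else None."""
    for attempt in range(1, attempts + 1):
        query = run(["sc.exe", "query", SERVICE_NAME], capture_output=True, text=True)
        if service_state(query.stdout) == "RUNNING":
            return attempt
        # Not running yet: request a start, then wait before checking again.
        run(["sc.exe", "start", SERVICE_NAME], capture_output=True, text=True)
        sleep(delay)
    return None
```

A return value of `None` means the service never reached `RUNNING` within the configured attempts, which would be worth logging alongside the Service Control Manager events 7009 and 7000 shown above.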

arobinsongit commented 1 month ago

Question, do you know if you specifically test on Server 2019 or just a standard current Windows box like 10 or 11?

What's curious is I have another 2019 box that is exhibiting what I would categorize as similar, but maybe not the same, symptoms. This one also has trouble after reboots. For this one, what I tried was to go through

  1. Service Stop
  2. Service Uninstall
  3. Service Install
  4. Service Start

and it seemed to get things working again.
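The stop, uninstall, install, start sequence above can be scripted. This is a sketch assuming the steps map to the `netbird service` subcommands of the same names and that the script runs from an elevated prompt. The helper names are hypothetical, and a failing `stop` is tolerated on the assumption that the service may already be stopped.

```python
import subprocess

# The four steps from the comment above, in order.
STEPS = ("stop", "uninstall", "install", "start")


def reinstall_commands(binary="netbird"):
    """Build the command line for each step of the reinstall sequence."""
    return [[binary, "service", step] for step in STEPS]


def reinstall_service(run=subprocess.run, binary="netbird"):
    """Run the steps in order; return the first failing command, or None on success.

    A non-zero exit from "stop" is ignored, since the service may not be running.
    """
    for cmd in reinstall_commands(binary):
        result = run(cmd)
        if result.returncode != 0 and cmd[-1] != "stop":
            return cmd
    return None
```

Returning the failing command rather than raising keeps the sketch easy to call from a larger maintenance script, which can then decide whether to retry or alert.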

I haven't seen similar symptoms on my Windows 10 boxes.