netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Netbird Clients drop and never return #1853

Open lazerusrm opened 5 months ago

lazerusrm commented 5 months ago

**Describe the problem**

Self-hosted; the server is up to date as of today. After installing NetBird clients (primarily on Debian 12), they disconnect and never return. This is happening across many machines (more than 10 so far). If I remote in another way, I get this status, which is the same across most devices:

Daemon version: 0.27.3
CLI version: 0.27.3
Management: Disconnected, reason: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
Signal: Disconnected, reason: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
Relays: 0/2 Available
Nameservers: 0/0 Available
FQDN:
NetBird IP:
Interface type: Kernel
Quantum resistance: false
Routes: -
Peers count: 0/8 Connected

:~# netbird up
Already connected
~# netbird status
Daemon version: 0.27.3
CLI version: 0.27.3
Management: Disconnected, reason: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
Signal: Disconnected, reason: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
Relays: 0/2 Available
Nameservers: 0/0 Available
FQDN:
NetBird IP:
Interface type: Kernel
Quantum resistance: false
Routes: -
Peers count: 0/8 Connected

**To Reproduce**

Install on Debian, connect using an auth key, and wait. The system goes offline and never returns. `netbird up` reports "Already connected". If I run `netbird down` followed by `netbird up`, the client returns. I need NetBird to be more persistent about attempting to reconnect; it doesn't help if I place a device at a remote location and it disconnects forever without manual intervention. I assume this is what I should expect?

**Expected behavior**

Stay connected, always, no matter what, forever.

**Are you using NetBird Cloud?**

No. Self-hosted.

**NetBird version**

0.27.3

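The manual recovery described above (`netbird down` followed by `netbird up`) could be automated as a stopgap until the underlying cause is found. The following is only a sketch under stated assumptions: it assumes `netbird status` prints the `Management:` and `Signal:` fields at the start of a line, as in the output above, and the function names and the `NETBIRD_BIN` override are hypothetical helpers, not part of NetBird.

```shell
#!/bin/sh
# Hypothetical watchdog sketch. Only `netbird status`, `netbird down` and
# `netbird up` come from the report; all other names are illustrative.

# Reads `netbird status` output on stdin. Succeeds (exit 0) if the
# Management or Signal line reports "Disconnected".
control_plane_down() {
    grep -Eq '^[[:space:]]*(Management|Signal): Disconnected'
}

# Runs one check and bounces the client if the control plane is down.
# NETBIRD_BIN is an assumed override so the logic can be exercised
# without a real client installed.
check_once() {
    nb="${NETBIRD_BIN:-netbird}"
    if "$nb" status 2>/dev/null | control_plane_down; then
        "$nb" down
        "$nb" up
    fi
}
```

Calling `check_once` from a cron job or systemd timer every few minutes would approximate the manual fix. A single failed check also triggers a restart, so a production version would probably want to require several consecutive failures first.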
shotgun-octopus commented 1 month ago

I'm fairly certain I'm observing the same issue or one very similar. We have several Debian 12 hosts in our environment and all of them connect just fine but eventually time out and disappear.

Keeping these peers connected to the mesh is critical, as they provide routes into several parts of our enterprise. This issue first showed up recently, when we disabled masquerading and added a static route to the relevant gateways. Disabling masquerading and adding the static route provides access to the hosts in these networks, but the Debian hosts drop off the mesh after a period of time.

When masquerading is enabled, the hosts do not drop off the mesh. I suspect this is because, with masquerade rewriting the IP header, enough traffic originates from the host itself, whereas with masquerade disabled the traffic is only routed/forwarded through it. From the WireGuard docs:

> When it's not being asked to send packets, it stops sending packets until it is asked again.

Because the host is routing the packets rather than sending them itself, perhaps this allows the connection to die. That would also imply that WireGuard is ignoring the persistent keepalive, which is set and passed in the wgtypes.PeerConfig struct: https://github.com/netbirdio/netbird/blob/3ed90728e64187191e78144554c7ce060bc2f52f/client/internal/peer/conn.go#L33
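One way to test the keepalive hypothesis is to ask WireGuard which peers actually have a keepalive interval configured. The sketch below assumes the standard `wg show <interface> persistent-keepalive` output (one line per peer: public key, a tab, then the interval in seconds or `off`); the helper name is hypothetical.

```shell
# Reads `wg show <interface> persistent-keepalive` output on stdin and
# prints the public keys of peers that have no keepalive configured.
peers_without_keepalive() {
    awk '$2 == "off" { print $1 }'
}
```

For example, `wg show wt0 persistent-keepalive | peers_without_keepalive`, where `wt0` is assumed to be the NetBird interface name on the host. Empty output would suggest the keepalive is configured for every peer and the cause lies elsewhere.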

What confuses me further is that if I explicitly ping (ICMP) a host in the network that the Debian host provides a route to, the connection to the mesh is restored.

Relevant logs from a Debian host:

2024-08-20T16:09:28-04:00 WARN management/client/grpc.go:169: disconnected from the Management service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-08-20T16:09:44-04:00 INFO management/client/grpc.go:154: connected to the Management Service stream
2024-08-20T16:09:44-04:00 WARN client/internal/engine.go:602: running SSH server is not permitted
2024-08-20T16:09:44-04:00 INFO client/internal/acl/manager.go:52: ACL rules processed in: 1.159215ms, total rules count: 2
2024-08-20T16:37:23-04:00 INFO client/internal/peer/conn.go:362: connected to peer [REDACTED], endpoint address: 3.133.249.218:43386
2024-08-20T16:58:53-04:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR
2024-08-20T16:58:54-04:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-08-20T17:03:23-04:00 INFO client/internal/peer/conn.go:362: connected to peer [REDACTED], endpoint address: [REDACTED]
2024-08-20T17:55:33-04:00 INFO client/internal/peer/conn.go:362: connected to peer [REDACTED], endpoint address: [REDACTED]
2024-08-20T17:56:02-04:00 WARN client/internal/engine.go:602: running SSH server is not permitted
2024-08-20T17:56:02-04:00 INFO client/internal/acl/manager.go:52: ACL rules processed in: 920.112µs, total rules count: 2
2024-08-20T18:06:37-04:00 INFO client/internal/peer/conn.go:362: connected to peer [REDACTED], endpoint address: [REDACTED]
shotgun-octopus commented 1 month ago

After additional investigation, my issue turned out to be caused by asymmetric routing. The route into the network was different from the route out, which caused the gRPC connection to time out, likely because the responses didn't come back to the same socket.

If your NetBird bridge is not also your default gateway and you're not using masquerade, do yourself a favor and push a static route so other hosts in the network use the bridge as the next hop for the NetBird subnet.

ip route add 100.103/16 via 192.168.1.42

Or use DHCP Option 121 to push the additional route to your DHCP clients.
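For the DHCP route, option 121 carries each route in the compact encoding defined by RFC 3442: the prefix length, then only the destination octets that contain prefix bits, then the four router octets. The helper below is a minimal sketch of that encoding; the function name is hypothetical, and it reads the `100.103/16` above as `100.103.0.0/16`. How the resulting bytes are entered depends on the DHCP server, so check its documentation.

```shell
# Encode one route as an RFC 3442 classless static route (DHCP option 121).
# Usage: encode_opt121 <destination>/<prefix length> <router>
encode_opt121() {
    dest=${1%/*}                  # e.g. 100.103.0.0
    plen=${1#*/}                  # e.g. 16
    router=$2                     # e.g. 192.168.1.42
    nsig=$(( (plen + 7) / 8 ))    # destination octets that carry prefix bits

    out=$(printf '%02x' "$plen")
    i=0
    for octet in $(printf '%s' "$dest" | tr '.' ' '); do
        if [ "$i" -lt "$nsig" ]; then
            out="$out:$(printf '%02x' "$octet")"
        fi
        i=$((i + 1))
    done
    for octet in $(printf '%s' "$router" | tr '.' ' '); do
        out="$out:$(printf '%02x' "$octet")"
    done
    printf '%s\n' "$out"
}
```

With the values from the command above, `encode_opt121 100.103.0.0/16 192.168.1.42` prints `10:64:67:c0:a8:01:2a`: `10` is the prefix length 16, `64:67` is 100.103, and `c0:a8:01:2a` is 192.168.1.42.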