netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Regression: Frequent disconnects with version 0.30.2 #2765

Open christian-schlichtherle opened 1 day ago

christian-schlichtherle commented 1 day ago

Describe the problem

We are running an IoT project in which some Linux-based K3s nodes at the edge are located at customer premises and communicate with other Linux-based K3s nodes in the cloud. This project has been running for almost three years now. Previously, we connected and managed all nodes via OpenVPN (so we can SSH into every node, even when it's located at customer premises) and then installed K3s on each node. A few months ago we replaced OpenVPN with NetBird because of its many advantages like performance, peer-to-peer topology with central management, etc.

Ever since, we have applied updates as soon as possible. We started at 0.27.10 and now we are (or were) at 0.30.2. Unfortunately, somewhere between versions 0.28.4 and 0.30.2 we began to observe frequent network partitions (disconnects). They happen randomly after some hours, mostly several times a day and at least once per day, following no particular pattern. I checked many potential causes, including the IP address changes that happen to the CPE every night (according to the Internet provider's plan), but none of these was the root cause.

Recently I decided to downgrade the network from version 0.30.2 to version 0.28.4, and since then we haven't had a single network partition / disconnect.

To Reproduce

Set up a bunch of nodes and run them 24/7. If you set up the nodes using Ansible, you can discover network partitions like this:

ansible netbird_client -m shell -a 'netbird status --filter-by-status disconnected | grep netbird.cloud | grep -v FQDN || true'

If there is no network partition, each node produces empty output; otherwise it lists the nodes it cannot connect to.
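Since the disconnects happen at random times, it may help to record when each one is seen. The following is only a minimal sketch, assuming a POSIX shell on each node; it reuses the exact pipeline from the command above, and the function names `filter_disconnected` and `check_once` are hypothetical helpers, not part of NetBird.

```shell
#!/bin/sh
# Sketch only: wraps the status pipeline from the Ansible command above.
# The function names are illustrative and not part of the netbird CLI.

# Same filter as in the one-liner: keep lines mentioning netbird.cloud,
# drop the FQDN line, and never fail when nothing matches.
filter_disconnected() {
    grep netbird.cloud | grep -v FQDN || true
}

# Run the check once and prefix each reported peer with a UTC timestamp.
# Prints nothing when no peer is reported as disconnected.
check_once() {
    netbird status --filter-by-status disconnected | filter_disconnected |
    while IFS= read -r line; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done
}
```

After sourcing this file, something like `while true; do check_once; sleep 60; done >> partitions.log` would build a rough timeline of partitions; the 60-second interval and the log file name are arbitrary choices.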

Expected behavior

These Linux nodes should stay connected 24/7, real Internet outages aside.

Are you using NetBird Cloud?

Yes

NetBird version

see above

NetBird status -dA output:

n/a

Do you face any (non-mobile) client issues?

n/a

Screenshots

n/a

Additional context

n/a

wiiun commented 13 hours ago

I also encountered the same situation.

mgarces commented 10 hours ago

Hello @christian-schlichtherle, thank you for reporting this issue; we will investigate it. Thank you also for the commands you provided. It would be helpful to have debug logs from those clients. Is it possible to turn them on and let the clients run for a long period of time? You can do this by following these docs. Thanks

mgarces commented 9 hours ago

Hi again! We've found a bug in our reconnection logic, and we are working on improvements for it. We currently have a pull request in progress. Would you be willing to test it out before we release it?