Open christian-schlichtherle opened 1 day ago
I have also run into the same situation.
Hello @christian-schlichtherle, thank you for the issue and for the commands you provided. We will investigate this. It would be helpful to have debug logs for those clients. Is it possible to turn them on and let the clients run for a longer period of time? You can do this by following these docs. Thanks
Describe the problem
We are running an IoT project where some Linux-based K3s nodes on the edge are located at customer premises and communicate with other Linux-based K3s nodes in the cloud. This project has been running for almost three years now. Previously, we connected and managed all nodes via OpenVPN (so we can SSH into every node, even when it is located at customer premises) and then installed K3s on each node. A few months ago we replaced OpenVPN with NetBird because of its many advantages, such as performance and a peer-to-peer topology with central management.
Ever since, we have applied updates as soon as possible. We started at 0.27.10 and now we are (or were) at 0.30.2. Unfortunately, somewhere between version 0.28.4 and 0.30.2 we started to observe frequent network partitions (disconnects). They happen randomly after some hours, mostly several times a day and at least once per day, following no particular pattern. I checked many potential causes, including the IP address changes which happen to the CPE every night (according to the Internet provider's plan), but none of these was the root cause.
Recently I decided to downgrade the network from version 0.30.2 to version 0.28.4, and since then we have not had a single network partition / disconnect.
To Reproduce
Set up a bunch of nodes and run them 24/7. If you set up the nodes using Ansible, you can discover network partitions like this:
If there is no network partition, then each node produces empty output; otherwise it lists the nodes it cannot connect to.
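For anyone who wants to run a comparable check without Ansible, here is a minimal Python sketch of the same idea. It is an illustration only, not the command used in this report: the function names `ping_once` and `unreachable_peers` are hypothetical, and it assumes a Linux `ping` (iputils) on the node and a list of peer addresses supplied by the caller.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request with the Linux `ping` tool.

    Assumption: iputils `ping`, where -c is the packet count and
    -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_peers(
    peers: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> List[str]:
    """Return the peers that the probe could not reach, in input order.

    An empty list means no partition was seen from this node.
    """
    return [peer for peer in peers if not probe(peer)]
```

Calling `unreachable_peers(...)` on every node with the addresses of all other nodes gives the behavior described above: an empty result when everything is connected, otherwise the list of nodes that cannot be reached. The `probe` parameter exists only so the check can be swapped out or tested without real network traffic.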
Expected behavior
These Linux nodes should stay connected 24/7, real Internet outages aside.
Are you using NetBird Cloud?
Yes
NetBird version
see above
NetBird status -dA output:
n/a
Do you face any (non-mobile) client issues?
n/a
Screenshots
n/a
Additional context
n/a