netbirdio / netbird

Connect your devices into a single secure private WireGuard®-based mesh network with SSO/MFA and simple access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License
9.81k stars 426 forks

When leveraging highly available routing peers in the same location, we experience route flapping periodically. #2150

Open briemann opened 2 weeks ago

briemann commented 2 weeks ago

Describe the problem

We're periodically seeing an issue in our ecosystem, where we've set up our routing peers as a high availability pair for each environment. During the issue window, random clients experience route flapping: assets behind the RP nodes are accessible but extremely slow (due to the route flaps).

We have a variety of 0.27 Windows clients in the fleet that have all been able to replicate this behavior, and I experienced it myself this morning.

Generally the remedy is to shut the Windows service down for a few minutes and start it back up. However, this is less than ideal long term: we want non-power users to use this solution, and asking them to perform technical steps will fall on deaf ears.

As I suspect we're more of an edge case because of this HA setup, we're going to stop the netbird client on one of the two nodes in each location to try and isolate the issue to see if it's related to the HA pair or if it's client side. I just wanted to open this issue to see what we could supply in the meantime to get clarity from other avenues.

To Reproduce

Due to its nature, it's not reproducible on command. Generally it happens on the first connection of the day: 9 out of 10 days logging in will be fine, but on that one day it won't.

Expected behavior

For the client not to flap when attempting to create a route.

Are you using NetBird Cloud?

No. Self Hosted.

NetBird version

0.27.10

NetBird status -d output:

See attached.

Screenshots

See attached output.log

Additional context

It looks almost like the routemanager has an issue with one of the RPs, sees the RP come online, and flops the routes over to the other node because that peer has a slightly better score. I guess the fix would be to make peers aware that they are right next to each other, and to tolerate a slightly different score within the same environment so that it doesn't flap? Not sure, maybe I'm off-base.
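To make the idea concrete, here is a minimal sketch of tolerating small score differences (hysteresis). It is not NetBird's route manager code: the function name `choose_routing_peer`, the `SWITCH_MARGIN` value, and the peer names are all hypothetical and only illustrate the behavior being suggested.

```python
from typing import Dict, Optional

# Hypothetical margin: a challenger must beat the current routing peer's
# score by more than this before the route is moved. Not a NetBird setting.
SWITCH_MARGIN = 0.1


def choose_routing_peer(
    scores: Dict[str, float],
    current: Optional[str],
    margin: float = SWITCH_MARGIN,
) -> Optional[str]:
    """Return the peer to route through.

    The highest-scoring peer wins, except that the current peer is kept
    when the best challenger is ahead by no more than `margin`.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Stay put if the current peer is still available and nearly as good.
    if current in scores and scores[best] - scores[current] <= margin:
        return current
    return best
```

With this rule, two peers in the same location scoring 2.90 and 2.92 would not trade places, while a large gap would still trigger a switch.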

netbird_output.log netbird_status_output.txt

pascal-fischer commented 2 weeks ago

Hi @briemann,

I've checked the logs and the scores are noticeably different: 0.93... versus 2.92..., which means the flapping is not due to similar latency to the routing peers. To me it looks like one of the routing peers might be reconnecting (or attempting to), so the connection type is changing, e.g. between P2P and relay, which causes the difference of 2 (the rest is due to latency).
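As a rough illustration of that reasoning, the toy model below splits a score into a connection-type part and a latency part. This is an assumption made for illustration, not the actual routemanager formula: `P2P_BONUS`, `toy_score`, and the latency values are hypothetical, chosen only so the results land near the figures quoted above.

```python
# Assumed from the "difference of 2" mentioned above; not taken from the code.
P2P_BONUS = 2.0


def toy_score(is_p2p: bool, latency_seconds: float) -> float:
    """Toy score: a fixed bonus for a P2P link plus a latency part in [0, 1]."""
    # Lower latency gives a latency part closer to 1; it is clamped at 0.
    latency_part = max(0.0, 1.0 - latency_seconds)
    connection_part = P2P_BONUS if is_p2p else 0.0
    return connection_part + latency_part


# Same routing peer, before and after a connection-type change.
p2p = toy_score(is_p2p=True, latency_seconds=0.08)
relayed = toy_score(is_p2p=False, latency_seconds=0.07)
print(f"P2P: {p2p:.2f}, relayed: {relayed:.2f}, gap: {p2p - relayed:.2f}")
```

In this model, latency alone can move the score by less than 1, so a jump of about 2 has to come from the connection-type part.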

Once you have the issue again, could you gather some debug logs (follow Debug for a specific time) to figure out what is causing the score to change so frequently? It might also be possible to catch whatever is happening by running the netbird status command multiple times over one of the flaps.
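One possible way to take those repeated status snapshots is a small script like the sketch below. It assumes `netbird` is on the PATH and uses the `netbird status -d` command referenced in the issue template; the function names, the log file name, the snapshot count, and the interval are arbitrary choices for illustration.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, Sequence


def run_status(cmd: Sequence[str] = ("netbird", "status", "-d")) -> str:
    """Run the status command once and return its combined output."""
    result = subprocess.run(list(cmd), capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr


def capture_snapshots(
    log_path: str,
    count: int = 60,
    interval_s: float = 5.0,
    fetch: Callable[[], str] = run_status,
) -> int:
    """Append `count` timestamped status snapshots to `log_path`.

    Returns the number of snapshots written.
    """
    written = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for i in range(count):
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"===== snapshot {i + 1}/{count} at {stamp} =====\n")
            log.write(fetch())
            log.write("\n")
            log.flush()  # keep the file usable even if the run is interrupted
            written += 1
            if i < count - 1:
                time.sleep(interval_s)
    return written
```

Calling `capture_snapshots("netbird_status_snapshots.log")` while a flap is in progress would record one snapshot every 5 seconds for about 5 minutes, which can then be attached to the issue alongside the debug logs.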

briemann commented 1 week ago

@pascal-fischer sure thing. We turned off one node in each of the high availability pairs this week to do some other testing and will bring them back online next week. At that point I'll try to replicate the issue, or see how long it takes to appear, and will report back when we're able to.