netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License
11.13k stars 512 forks source link

New peers are not propagated to already existing peers #2114

Open glaeqen opened 5 months ago

glaeqen commented 5 months ago

Describe the problem

After adding a new peer to the network, this peer is not getting propagated to the already existing peers. At best only peers that were added very recently (last 24h?) seem to get the list of peers updated. The only thing that helps is to netbird down and netbird up on all the affected, exisiting peers machines which forces a peer refresh. Existing peers seem to show a following error after a new peer was added

2024-06-10T15:57:29+02:00 ERRO signal/client/grpc.go:410: error while handling message of Peer [key: <KEY>] error: [wrongly addressed message <KEY>]

This feels so fundamentally wrong that it's hard to believe that it's a bug and not just my fault. I doubt it's a networking issue, affected peers are located both in the cloud and self-hosted networks so it does not seem probable.

Expected behavior

Automatic peer propagation

Are you using NetBird Cloud?

self-hosted NetBird's control plane

NetBird version

0.27.10

mlsmaycon commented 5 months ago

This is a very odd indeed. Assuming your connection with the management is fine this should never happen unless there is an manual update of the network map in the database or a async update causing the serial number to be lower.

If you are running on 0.27.5+, can you enable debug logs on one of the existing clients with: netbird debug log level debug

then add a new peer and check if there is a log similar to:

received outdated NetworkMap with serial XXXX, ignoring
glaeqen commented 5 months ago

It occured twice earlier this week but I cannot reproduce it since 🙄 When it happened I happened to test that when I updated ACL (I added an affected peer to some new, arbitrary group), netbird client of that peer noticed the ACL update request and as a side-effect it also did update the peer list. If it happens again I'll try to gather more detailed logs of all parties involved.

mlsmaycon commented 5 months ago

sounds great, thanks @glaeqen

mohamed-essam commented 3 months ago

@mlsmaycon Yes, the logs contain outdated NetworkMap, excerpt from the logs here:

2024-07-24T08:22:33Z DEBG client/internal/peer/conn.go:288: connection offer sent to peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0=, waiting for the confirmation
2024-07-24T08:22:33Z DEBG client/internal/peer/conn.go:256: trying to connect to peer XD5VjsfLgs3RNijJ7D+yqJFeA07sEaJ24MCfk10m3no=
2024-07-24T08:22:33Z DEBG client/internal/peer/conn.go:288: connection offer sent to peer XD5VjsfLgs3RNijJ7D+yqJFeA07sEaJ24MCfk10m3no=, waiting for the confirmation
2024-07-24T08:22:38Z DEBG management/client/grpc.go:251: got an update message from Management Service
2024-07-24T08:22:38Z WARN client/internal/engine.go:602: running SSH server is not permitted
2024-07-24T08:22:38Z DEBG client/internal/engine.go:751: received outdated NetworkMap with serial 260, ignoring
2024-07-24T08:22:38Z DEBG management/client/grpc.go:251: got an update message from Management Service
2024-07-24T08:22:38Z WARN client/internal/engine.go:602: running SSH server is not permitted
2024-07-24T08:22:38Z DEBG client/internal/engine.go:751: received outdated NetworkMap with serial 260, ignoring
2024-07-24T08:22:43Z DEBG client/internal/peer/conn.go:496: trying to cleanup 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY=
2024-07-24T08:22:43Z DEBG iface/iface.go:101: Removing peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY= from interface wt0 
2024-07-24T08:22:43Z DEBG client/internal/peer/conn.go:639: peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY= ICE ConnectionState has changed to Closed
2024-07-24T08:22:43Z DEBG client/internal/peer/conn.go:554: cleaned up connection to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY=
2024-07-24T08:22:43Z DEBG client/internal/engine.go:990: connection to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY= failed: connection to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY= timed out after 41.52s
--
2024-07-24T08:30:54Z DEBG client/internal/peer/conn.go:639: peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY= ICE ConnectionState has changed to Closed
2024-07-24T08:30:54Z DEBG client/internal/peer/conn.go:554: cleaned up connection to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY=
2024-07-24T08:30:54Z DEBG client/internal/engine.go:990: connection to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY= failed: connection to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY= timed out after 41.52s
2024-07-24T08:30:54Z DEBG management/client/grpc.go:251: got an update message from Management Service
2024-07-24T08:30:54Z WARN client/internal/engine.go:602: running SSH server is not permitted
2024-07-24T08:30:54Z DEBG client/internal/engine.go:751: received outdated NetworkMap with serial 260, ignoring
2024-07-24T08:30:55Z DEBG client/internal/peer/conn.go:256: trying to connect to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY=
2024-07-24T08:30:55Z DEBG client/internal/peer/conn.go:288: connection offer sent to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY=, waiting for the confirmation
2024-07-24T08:31:06Z DEBG client/internal/peer/conn.go:496: trying to cleanup V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0=
2024-07-24T08:31:06Z DEBG iface/iface.go:101: Removing peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0= from interface wt0 
2024-07-24T08:31:06Z DEBG client/internal/peer/conn.go:639: peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0= ICE ConnectionState has changed to Closed
--
2024-07-24T08:48:10Z DEBG client/internal/engine.go:990: connection to peer juyzNoQtxuvMYPKGs6+oXXRfLzyjAzHvwNzGZlSdmy4= failed: connection to peer juyzNoQtxuvMYPKGs6+oXXRfLzyjAzHvwNzGZlSdmy4= timed out after 38.621s
2024-07-24T08:48:12Z DEBG client/internal/peer/conn.go:256: trying to connect to peer juyzNoQtxuvMYPKGs6+oXXRfLzyjAzHvwNzGZlSdmy4=
2024-07-24T08:48:12Z DEBG client/internal/peer/conn.go:288: connection offer sent to peer juyzNoQtxuvMYPKGs6+oXXRfLzyjAzHvwNzGZlSdmy4=, waiting for the confirmation
2024-07-24T08:48:15Z DEBG management/client/grpc.go:251: got an update message from Management Service
2024-07-24T08:48:15Z WARN client/internal/engine.go:602: running SSH server is not permitted
2024-07-24T08:48:15Z DEBG client/internal/engine.go:751: received outdated NetworkMap with serial 260, ignoring
2024-07-24T08:48:15Z DEBG client/internal/peer/conn.go:496: trying to cleanup V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0=
2024-07-24T08:48:15Z DEBG iface/iface.go:101: Removing peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0= from interface wt0 
2024-07-24T08:48:15Z DEBG client/internal/peer/conn.go:639: peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0= ICE ConnectionState has changed to Closed
2024-07-24T08:48:15Z DEBG client/internal/peer/conn.go:554: cleaned up connection to peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0=
2024-07-24T08:48:15Z DEBG client/internal/engine.go:990: connection to peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0= failed: connection to peer V1uRp/w/p/b5eB8ulqlkzbPLaxzf8C2CyLkx1EkTzG0= timed out after 36.573s
--
2024-07-24T08:48:45Z DEBG client/internal/peer/conn.go:639: peer Ys9lRgrg1jEcXR7bQ/6ar2hQDNz2cxxdZl8Pjh0DETs= ICE ConnectionState has changed to Closed
2024-07-24T08:48:45Z DEBG client/internal/peer/conn.go:554: cleaned up connection to peer Ys9lRgrg1jEcXR7bQ/6ar2hQDNz2cxxdZl8Pjh0DETs=
2024-07-24T08:48:45Z DEBG client/internal/engine.go:990: connection to peer Ys9lRgrg1jEcXR7bQ/6ar2hQDNz2cxxdZl8Pjh0DETs= failed: connection to peer Ys9lRgrg1jEcXR7bQ/6ar2hQDNz2cxxdZl8Pjh0DETs= timed out after 44.38s
2024-07-24T08:48:45Z DEBG management/client/grpc.go:251: got an update message from Management Service
2024-07-24T08:48:45Z WARN client/internal/engine.go:602: running SSH server is not permitted
2024-07-24T08:48:45Z DEBG client/internal/engine.go:751: received outdated NetworkMap with serial 260, ignoring
2024-07-24T08:48:46Z DEBG client/internal/peer/conn.go:256: trying to connect to peer Ys9lRgrg1jEcXR7bQ/6ar2hQDNz2cxxdZl8Pjh0DETs=
2024-07-24T08:48:46Z DEBG client/internal/peer/conn.go:288: connection offer sent to peer Ys9lRgrg1jEcXR7bQ/6ar2hQDNz2cxxdZl8Pjh0DETs=, waiting for the confirmation
2024-07-24T08:48:46Z DEBG client/internal/peer/conn.go:256: trying to connect to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY=
2024-07-24T08:48:46Z DEBG client/internal/peer/conn.go:288: connection offer sent to peer 6wS58SsML1Ute6LkSYVdvbYuTXBcTX3Lb0h2UVLNbkY=, waiting for the confirmation
2024-07-24T08:48:49Z ERRO signal/client/grpc.go:399: error while handling message of Peer [key: jfCtRe6wchprw6xXmhQxkPlNRx4EOgZCq2JVTnWVCR8=] error: [wrongly addressed message jfCtRe6wchprw6xXmhQxkPlNRx4EOgZCq2JVTnWVCR8=]

I also have debug logs enabled on the server, let me know if there are any specific logs I should look for

mohamed-essam commented 3 months ago

I deployed a custom version of the client on the affected server with added logs in client/internal/engine.go:750

    log.Debugf("received NetworkMap with serial %d", serial)
    if e.networkSerial > serial {
        log.Debugf("received outdated NetworkMap with serial %d, ignoring, current latest serial is %d", serial, e.networkSerial)
        return nil
    }

I'm suspecting a malformed message sends a high serial number (e.g INT MAX) and causes all valid messages that come after it to be less than the stored serial number, will keep this client running and come back to you with any results

mlsmaycon commented 3 months ago

@mohamed-essam Recently, we switched the sync method to use read locks, which, for cases where a peer is often reconnecting in a small interval, could cause the network map serial number to be the same or lower in case of updates at the same time.

the check and log is fine as we expect this to happen in some cases, but not often. To debug, maybe it is better to start printing the serial number received to confirm the max int case.

Furthermore, we are improving the network map sync to prevent extra maps from being sent in these cases. See https://github.com/netbirdio/netbird/pull/2236

mohamed-essam commented 3 months ago

@mlsmaycon I can see the issue sometimes occurring where a client would print wrongly addressed message without outdated NetworkMap logs, I don't understand the exact behavior behind this but I can see in some instances the client thinks another client is offline when it's online and won't update the network map on its own

I made a custom version for the time being so it would restart the management connection every time it received and outdated NetworkMap or receives a wrongly addressed message when there's a networkmap locally (so it wouldn't restart management connection when it's not initialized yet), and it has been working perfectly for the past couple of days

My custom patch is pushed here, not sure if this belongs in the mainstream yet, but it works.

A more elegant way to do this would be implementing a manual Sync request so that it refreshes its network map forcibly if there's a chance it's out of sync, restarting the management connection does that but also does more than needed.