netbirdio / netbird


Postmortem: Invalid Network Maps Causing Peer Disconnection Issue (May 24, 2024) #2090

Open mlsmaycon opened 4 months ago

Summary

On May 24th, 2024, at 7 AM UTC, we received reports that some customer peers were not connecting to other peers in their NetBird networks for an extended period, despite being connected to our management system. Upon investigation, we identified an issue that affected a small number of peers connected to our cloud services. These nodes incorrectly received an invalid network map due to a database lock issue. The fix was deployed to our servers at 2:30 PM UTC on the same day, resolving the problem.

Details

On May 24th, 2024, at 7 AM UTC, a customer reported that one of their nodes wasn't connecting to other peers in their NetBird network. The issue had first occurred on May 23rd and lasted until 7 PM UTC, then recurred and lasted until 7 AM UTC the following day. During this period, the node was connected to the management service, but its logs showed that it was receiving signal messages from peers that were not registered in its local network map. These signal messages are crucial for establishing peer-to-peer (P2P) communication.

Our investigation revealed an issue with a database lock that lasted around 30 milliseconds while the management server was being updated. This database lock was not handled properly by one of the integrated checks in our cloud services, causing the system to assume that around 200 reconnecting peers were not allowed to receive the full network map. Instead, these peers were sent an invalid network map. The network map is a peer configuration that dictates which peers can connect to a node, and it also carries the node's IP address, routes, and DNS configurations. An invalid network map prevents the peer from connecting to any other node in its NetBird network and forces the closure of all existing connections.
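The failure mode described above, where a transient error is read as "peer not allowed", can be sketched in Go as follows. This is a minimal illustration and not NetBird's actual code: `NetworkMap`, `checkFunc`, `errDBLocked`, `buildMapUnsafe`, and `buildMapSafe` are all hypothetical names introduced here.

```go
package main

import "errors"

// errDBLocked is a hypothetical error standing in for a transient database lock.
var errDBLocked = errors.New("database is locked")

// NetworkMap is a simplified, hypothetical stand-in for a peer's network map.
type NetworkMap struct {
	Peers []string // peers this node is allowed to connect to
}

// checkFunc reports whether a peer passes an integrated check.
// A non-nil error means the check could not be evaluated at all.
type checkFunc func(peerID string) (allowed bool, err error)

// buildMapUnsafe discards the check error, so "could not evaluate"
// becomes indistinguishable from "not allowed" and the peer is sent
// an empty map.
func buildMapUnsafe(peerID string, full NetworkMap, check checkFunc) NetworkMap {
	allowed, _ := check(peerID) // error silently dropped
	if !allowed {
		return NetworkMap{}
	}
	return full
}

// buildMapSafe propagates the error instead, so the caller can retry
// or leave the peer's current map in place rather than sending an
// invalid one.
func buildMapSafe(peerID string, full NetworkMap, check checkFunc) (NetworkMap, error) {
	allowed, err := check(peerID)
	if err != nil {
		return NetworkMap{}, err
	}
	if !allowed {
		return NetworkMap{}, nil
	}
	return full, nil
}
```

With a check that returns `errDBLocked`, `buildMapUnsafe` yields an empty map, while `buildMapSafe` returns the error and leaves the decision to the caller.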

The affected nodes remained in this state until they received a new, valid network map, either by reconnecting to the management system or through updates in their NetBird network. Our logs confirmed that all peers had returned to the correct state by May 24th, 2024, 7 AM UTC.

Actions taken

  1. Investigated the source of the database lock and optimized the system to prevent such locks from occurring.
  2. Introduced a cache mechanism to reduce the load and dependency on the database for integrated checks.
  3. Added proper error handling for the integrated check to prevent sending invalid network maps to peers.
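As a rough illustration of the cache mentioned in the second action, the Go sketch below puts a small TTL cache in front of a check function. It is written under stated assumptions and does not reflect NetBird's real implementation: `CheckCache`, `NewCheckCache`, `checkFunc`, and `errCheckUnavailable` are hypothetical names, and the TTL is an arbitrary parameter.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errCheckUnavailable is a hypothetical error a backing check might
// return when the database cannot be queried.
var errCheckUnavailable = errors.New("integrated check unavailable")

// checkFunc is a hypothetical integrated check that may hit the database.
type checkFunc func(peerID string) (allowed bool, err error)

type cacheEntry struct {
	allowed   bool
	expiresAt time.Time
}

// CheckCache memoizes check results for a fixed TTL so that most
// lookups do not touch the database.
type CheckCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	check   checkFunc
	entries map[string]cacheEntry
}

// NewCheckCache wraps check with a cache whose entries live for ttl.
func NewCheckCache(ttl time.Duration, check checkFunc) *CheckCache {
	return &CheckCache{ttl: ttl, check: check, entries: make(map[string]cacheEntry)}
}

// Allowed returns the cached result while it is fresh. On a miss it
// calls the backing check; a failed check is returned as an error and
// is never cached, so a transient failure is not remembered as a denial.
func (c *CheckCache) Allowed(peerID string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.entries[peerID]; ok && time.Now().Before(e.expiresAt) {
		return e.allowed, nil
	}
	allowed, err := c.check(peerID)
	if err != nil {
		return false, err
	}
	c.entries[peerID] = cacheEntry{allowed: allowed, expiresAt: time.Now().Add(c.ttl)}
	return allowed, nil
}
```

For simplicity the sketch holds the mutex while the backing check runs, which serializes lookups. A production version would more likely use per-key locking or request coalescing.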