netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Relay server bug - Connections not closed after EOF exception #2880

Closed mathiash98 closed 3 days ago

mathiash98 commented 1 week ago

Solved by #2879

Potential relay server bug: the number of open connections saturates after EOF websocket errors. Hopefully I am just misunderstanding something.

Scenario:

Many peers with high ping and low internet speed

Background:

Hosting relay server behind nginx reverse proxy

Reproduction

  1. lsof count inside docker container before error: sudo docker exec netbird-relay-0-1 lsof | wc -> 800
  2. Whenever the relay server gets the error relay/server/peer.go:61: failed to read message: failed to get reader: failed to read frame header: EOF, it returns from the Work code in https://github.com/netbirdio/netbird/blob/b4d7605147e6ddbd22c214b42ef43267bc78ce80/relay/server/peer.go#L48-L64, which in turn results in the peer being deleted from the store in relay.go (https://github.com/netbirdio/netbird/blob/b4d7605147e6ddbd22c214b42ef43267bc78ce80/relay/server/relay.go#L134-L139) and in the log line relay/server/relay.go:137: relay connection closed
  3. The client also gets an error when sending data, and the same peer then reconnects right away (I have not found the code for this yet), leading to the log message: relay/server/relay.go:129: peer connected from: 172.23.0.1:39226
  4. The lsof count has now increased by one: sudo docker exec netbird-relay-0-1 lsof | wc -> 801
  5. When the lsof count of the docker container reaches 1000, the relay stops accepting new connections and the docker container must be restarted

Potential fix? -> Call p.conn.Close() inside https://github.com/netbirdio/netbird/blob/b4d7605147e6ddbd22c214b42ef43267bc78ce80/relay/server/peer.go#L48-L64 whenever an error is encountered?

Full log example showing that the peer_id is the same:

netbird-relay-0-1  | 2024-11-11T12:54:32Z ERRO [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/peer.go:61: failed to read message: failed to get reader: failed to read frame header: EOF
netbird-relay-0-1  | 2024-11-11T12:54:32Z DEBG [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/relay.go:137: relay connection closed
netbird-relay-0-1  | 2024-11-11T12:54:32Z DEBG [peer_id: sha-0Uu8KZZs8rTiypKGr9PaxkKtTFWqnttiYaCZRwMFuHE=] relay/server/peer.go:196: peer not found: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=
netbird-relay-0-1  | 2024-11-11T12:54:34Z INFO [peer_id: sha-Pfq9J4xZuOoCYsq6OMQCUZ7j0GOemtq/28tDkCvWGJc=] relay/server/relay.go:129: peer connected from: 172.23.0.1:43988

I can also mention that whenever a client is shut down gracefully, or a healthcheck times out, the number of connected sockets goes down.

mathiash98 commented 3 days ago

Can confirm that https://github.com/netbirdio/netbird/pull/2879 has solved this issue