gabrielmer commented 1 month ago

Description

Once we started promptly disconnecting from excess in connections, we began seeing our nodes significantly exceeding their out connections targets.

The root cause was a race condition in our keep alive loop https://github.com/waku-org/nwaku/blob/643ab20fc67f251987d594cfb5aa4abb60ccc5b2/waku/node/waku_node.nim#L1241-L1258

The case is the following:

A node receives incoming connections beyond its target, and it takes a small amount of time from the moment nim-libp2p accepts the connection until our peer manager notices that it's beyond our in target and disconnects
In the time while we're connected to this in connection, we start running the keep alive loop and have that peer in the list of connected peers that we should ping
While we're pinging other peers, we disconnect from the in connection as we noticed it's beyond our target
Because the list of the nodes to ping was generated before we disconnected from the node, we ping the node. As there's no existing connection, we end up creating a new out connection towards the node

The proposed change to avoid this race condition is to delegate the responsibility of the periodic ping to the node that originally initiated the connection. Or in other words, whoever initiated a connection is the one responsible to ping periodically to maintain it open - there's no need to have both nodes pinging each other.

Changes

[x] extended connectedPeers() to allow to get connected peers from all protocols
[x] modified keepaliveLoop so that we only ping nodes in our out connections list

Issue

closes #3063

github-actions[bot] commented 1 month ago

This PR may contain changes to database schema of one of the drivers.

If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues.

Please make sure the label release-notes is added to make sure upgrade instructions properly highlight this change.

github-actions[bot] commented 1 month ago

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:3077

Built from 3deaacf5a778e7a870b7acc0bd66eddb9f6da601

gabrielmer commented 1 month ago

LGTM

Do we need to revisit how missed pings are handled? If only one side pings maybe we should be more lenient before disconnecting.

It may not be a problem in practice, IDK.

Great point! I see that the connection should timeout after 4-5 missed pings (~10 minutes without being reachable)

https://github.com/waku-org/nwaku/blob/e406673c4618f95b8c2536e99eb9b22adbbd2149/waku/node/waku_node.nim#L1261

I think it looks reasonable? Don't think it should give issues, lmk what you think :)

waku-org / nwaku

fix: out connections leak #3077

Description

Changes

Issue