Closed gabrielmer closed 1 month ago
This PR may contain changes to database schema of one of the drivers.
If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues.
Please make sure the label release-notes is added to make sure upgrade instructions properly highlight this change.
You can find the image built from this PR at quay.io/wakuorg/nwaku-pr:3077
Built from 3deaacf5a778e7a870b7acc0bd66eddb9f6da601
LGTM
Do we need to revisit how missed pings are handled? If only one side pings maybe we should be more lenient before disconnecting.
It may not be a problem in practice, IDK.
Great point! I see that the connection should time out after 4-5 missed pings (~10 minutes without being reachable).
I think it looks reasonable? Don't think it should give issues, lmk what you think :)
Description
Once we started promptly disconnecting from excess `in` connections, we began seeing our nodes significantly exceeding their `out` connection targets.
The root cause was a race condition in our keep alive loop: https://github.com/waku-org/nwaku/blob/643ab20fc67f251987d594cfb5aa4abb60ccc5b2/waku/node/waku_node.nim#L1241-L1258
The case is the following:
- A peer connects to us, and `nim-libp2p` accepts the connection until our peer manager notices that it's beyond our `in` target and disconnects.
- While we still hold the `in` connection, we start running the keep alive loop and have that peer in the list of connected peers that we should ping.
- We close the `in` connection as we noticed it's beyond our target.
- The keep alive loop still pings that peer, which opens an `out` connection towards the node.

The proposed change to avoid this race condition is to delegate the responsibility of the periodic ping to the node that originally initiated the connection. In other words, whoever initiated a connection is responsible for pinging periodically to keep it open; there's no need to have both nodes pinging each other.
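
The race and the proposed fix can be illustrated with a small toy model. This is only a sketch in Python, not the nwaku or nim-libp2p code: the `Node` class, its methods, and the peer names are hypothetical, and it assumes that pinging a peer we are no longer connected to dials it and therefore creates an `out` connection.

```python
from dataclasses import dataclass, field
from enum import Enum


class Direction(Enum):
    IN = "in"    # the remote peer dialed us
    OUT = "out"  # we dialed the remote peer


@dataclass
class Node:
    """Toy connection table (hypothetical, not the nwaku implementation)."""
    max_in: int
    conns: dict = field(default_factory=dict)  # peer id -> Direction

    def accept(self, peer: str) -> None:
        # The transport accepts first; the in-target check happens later.
        self.conns[peer] = Direction.IN

    def prune_excess_in(self) -> None:
        # Drop every inbound peer beyond the configured target.
        in_peers = [p for p, d in self.conns.items() if d is Direction.IN]
        for peer in in_peers[self.max_in:]:
            del self.conns[peer]

    def ping(self, peer: str) -> None:
        # Assumption: pinging a peer we are not connected to dials it,
        # which creates a new outbound connection.
        self.conns.setdefault(peer, Direction.OUT)

    def keepalive_targets(self, only_out: bool) -> list:
        # only_out=True models the proposed change: ping only peers we dialed.
        return [p for p, d in self.conns.items()
                if not only_out or d is Direction.OUT]


def run_race(only_out: bool) -> dict:
    node = Node(max_in=1)
    node.accept("peer-a")
    node.accept("peer-b")                        # exceeds the in target
    targets = node.keepalive_targets(only_out)   # snapshot taken by the loop
    node.prune_excess_in()                       # peer manager drops peer-b
    for peer in targets:                         # loop pings its stale snapshot
        node.ping(peer)
    return node.conns
```

With `only_out=False` the pruned peer comes back as an `out` connection; with `only_out=True` the inbound peers are never pinged by us, so the pruned peer stays disconnected.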
Changes
- Update `connectedPeers()` to allow getting connected peers from all protocols.
- Update `keepaliveLoop` so that we only ping nodes in our `out` connections list.
Issue
closes #3063
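
As a rough illustration of the two changes listed above, the following Python sketch shows a peer lookup whose protocol filter is optional and a keep-alive target list built from the outbound peers only. The `Conn` record, the function names, and the protocol labels are hypothetical stand-ins; the actual Nim signatures in nwaku may differ.

```python
from typing import NamedTuple, Optional


class Conn(NamedTuple):
    peer: str
    direction: str  # "in" or "out"
    protocol: str   # hypothetical protocol label


def connected_peers(conns, protocol: Optional[str] = None):
    """Return (in_peers, out_peers); protocol=None matches every protocol."""
    in_peers, out_peers = [], []
    for c in conns:
        if protocol is not None and c.protocol != protocol:
            continue
        bucket = in_peers if c.direction == "in" else out_peers
        if c.peer not in bucket:  # a peer may appear under several protocols
            bucket.append(c.peer)
    return in_peers, out_peers


def keepalive_targets(conns):
    # Only peers we dialed ourselves are pinged by the keep-alive loop.
    _, out_peers = connected_peers(conns)
    return out_peers
```

Because the filter defaults to all protocols, the keep-alive loop does not need to know which protocol a peer was dialed for.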