Closed vpavlin closed 1 month ago
Did some investigating of this issue. Replicated locally and noticed some patterns that lead to connections closed or peers deleted from store:
libp2p:connection:outbound:4zjz071714024970277:error could not create new stream for protocols /ipfs/ping/1.0.0 +2s Error: unexpected end of input
libp2p:ping:error error while pinging 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +34s Error: unexpected end of input
waku:error:keep-alive Ping failed for peer (16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE). Next ping will be attempted in 300 seconds. +20m
libp2p:gossipsub connection ended 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +924ms
common.js:113
libp2p:gossipsub delete peer 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +0ms
This one is the most common:
waku:error:connection-manager Error dialing peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU - The dial request has no valid addresses +1ms
connection_manager.js:201 waku:info:connection-manager Deleting undialable peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU from peer store. Reason: The dial request has no valid addresses +1ms
libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
at http://localhost:3000/dist/bundle.js:55349:41
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms
Eventually I reached a state where there was a single peer left that I can send a message to using light push. The peer is NOT in the keepAliveManager
.
The connection gets reset and the peer is removed....
libp2p:gossipsub:error CodeError: stream reset at MplexStream.reset (http://localhost:3000/dist/bundle.js:39511:21) at MplexStreamMuxer._handleIncoming (http://localhost:3000/dist/bundle.js:39919:28) at MplexStreamMuxer.sink (http://localhost:3000/dist/bundle.js:39826:36) at async Promise.all (index 0) +10m
common.js:113 libp2p:gossipsub connection ended 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +10m
common.js:113 libp2p:gossipsub delete peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
but the same peer quickly reconnects:
libp2p:mplex initiator stream with id 31 and protocol /vac/waku/relay/2.0.0 ended +2ms
common.js:113 libp2p:mplex new receiver stream 8 +188ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 incoming stream opened on /vac/waku/relay/2.0.0 +3m
common.js:113 libp2p:gossipsub new peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +191ms
common.js:113 libp2p:gossipsub Added peer has no IP in current address 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV /dns4/node-01.ac-cn-hongkong-c.waku.sandbox.status.im/tcp/8000/wss/p2p/16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:gossipsub create inbound stream 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 starting new stream for protocols /vac/waku/relay/2.0.0 +1ms
and I can keep sending messages. It's been a little while and this last peer still hasn't been lost. I'm curious if the peer being absent from waku.connectionManager.keepAliveManager
has something to do with it.
I while leave the light-js example running overnight and check on it again. Let me know if any of these logs are helpful @danisharora099 @weboko
I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:
[
"The dial request has no valid addresses",
"The dial request has no valid addresses",
...
"The dial request has no valid addresses",
"The dial request has no valid addresses"
]
This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.
I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:
[ "The dial request has no valid addresses", "The dial request has no valid addresses", ... "The dial request has no valid addresses", "The dial request has no valid addresses" ]
This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.
It seems that most peers are not running wss, in which case this is expected. I have a PR open for this but I want to do some additional testing before merging: https://github.com/waku-org/js-waku/pull/1983
Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding
Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding
I believe I have found the culprit, but like the PR above it probably needs more testing https://github.com/waku-org/js-waku/pull/1984
I believe I identified the prime suspect causing this issue (and dropping connections in general):
libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
at http://localhost:3000/dist/bundle.js:55349:41
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms
It seems what happens is:
I believe I identified the prime suspect causing this issue (and dropping connections in general):
libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 at http://localhost:3000/dist/bundle.js:55349:41 libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms
It seems what happens is:
- peer tries to ping js-waku
- there are already maximum amount of inbound streams opened from the peer to the node for the ping protocol
- libp2p closes the entire connection?
A mitigation for this that seems to work is increasing the maximum number of incoming connections for the ping protocol to 10 from the default of 2. I tried this with the light-js example and it's been running for 40 minutes without dropping below 9 peers, sending a light push message every 10 seconds. I'm not sure what are the performance implications of increasing this number but it seems preferable to the node losing peer connections and not recovering them.
The most important fixes for this have been included in the latest release of js-waku. The remaining PR for filtering wss connections is not as relevant. @weboko can this be considered resolved?
Agree that this task is complete. If we observe any issues - let's action in the scope of some other bug / task. Thank you for documenting investigation well!