Open lexnv opened 1 month ago
Does this happen when you run the node with --profile testnet
/--release
? I remember when I was testing litep2p with higher load, the debug build really started to struggle. FWIW, I only saw these issues emerge when I switched from tokio to Substrate's task executor so I think there might be something there but I didn't see any issues after running polkadot in release mode
I started two nodes from a release build and left them running for ~2weeks:
./target/release/polkadot -d /home/ubuntu/workspace/kusama-db --chain kusama --pruning=1000 --sync=warp --network-backend litep2p --detailed-log-output
./target/release/polkadot -d /home/ubuntu/workspace/kusama-db-full --chain kusama --port 30355 --pruning=1000 --network-backend litep2p --detailed-log-output
I'm happy to announce that both are still producing blocks. Good point about the executor, it might be that my VM was overloaded. I'm running 2 nodes on the same machine and doing dev-work at the same time (compile substrate / subxt, running test nodes etc) which might put extra stress on the system
I think the connection<->protocol interface could definitely be better and I believe there are race conditions there right now, especially with sudden connection closing/redialing.
What could've potentially happened is that a connection was opened but Identify was not polled for long enough time that the keep-alive timeout expired, remote had opened a substream over that connection while it was still open, and when finally Identify is polled, the connection had already been closed (open_substream()
fails), peer is not added to peers
but the inbound substream event is still read from the event channel which now belongs to a peer who is not registered to Identify because it doesn't have any open connections.
There seems to be a state mismatch in the identify protocol implementation.
The following warning appeared 44 times on a warp sync node and 8 times on a full sync node, both nodes were running for almost 2 weeks.
The warning is coming from:
https://github.com/paritytech/litep2p/blob/42ed1c1463a8b11eac5344af2871ba752335b06e/src/protocol/libp2p/identify.rs#L255-L266
This may happen when:
on_connection_established
fails to open a new substream, either because the peerID isn't connected or the connection is cloggedon_connection_closed
is called beforeon_inbound_substream
on_inbound_substream
is called beforeon_connection_established
cc @paritytech/networking