paritytech / litep2p

Peer-to-peer networking library

identify: Open inbound substream for unknown peer #103

lexnv opened this issue 1 month ago

lexnv commented 1 month ago

There seems to be a state mismatch in the identify protocol implementation.

The following warning appeared 44 times on a warp sync node and 8 times on a full sync node; both nodes had been running for almost 2 weeks.

2024-05-06 03:16:41.456  WARN tokio-runtime-worker litep2p::ipfs::identify: inbound identify substream opened for peer who doesn't exist peer=PeerId("12D3KooWA5taCbB5CEcz1VS126XjcbwBYqm9GmsASX5mBZU8611v") protocol=/ipfs/id/1.0.0

The warning is coming from:

https://github.com/paritytech/litep2p/blob/42ed1c1463a8b11eac5344af2871ba752335b06e/src/protocol/libp2p/identify.rs#L255-L266

This may happen when (a sketch of these orderings follows the list):

  1. on_connection_established fails to open a new substream, either because the peer is no longer connected or the connection is clogged
  2. on_connection_closed is called before on_inbound_substream
  3. on_inbound_substream is called before on_connection_established
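
A minimal, self-contained sketch of how each of these orderings can leave the protocol without an entry for the peer when the inbound substream event arrives. PeerId is simplified to an integer and none of the names below are litep2p's actual API; this only mirrors the handlers mentioned above.

```rust
use std::collections::HashMap;

type PeerId = u64;

#[derive(Default)]
struct Identify {
    /// Peers for which `on_connection_established` completed successfully.
    peers: HashMap<PeerId, ()>,
}

impl Identify {
    /// `substream_opened` stands in for the attempt to open the outbound
    /// identify substream, which can fail if the connection is clogged or
    /// the peer is no longer connected (case 1).
    fn on_connection_established(&mut self, peer: PeerId, substream_opened: bool) {
        if substream_opened {
            self.peers.insert(peer, ());
        }
    }

    fn on_connection_closed(&mut self, peer: PeerId) {
        self.peers.remove(&peer);
    }

    fn on_inbound_substream(&mut self, peer: PeerId) {
        if !self.peers.contains_key(&peer) {
            // The state mismatch that produces the warning.
            eprintln!("inbound identify substream opened for unknown peer={peer}");
        }
        // ... otherwise, read the remote's identify payload ...
    }
}

fn main() {
    let mut identify = Identify::default();

    // Case 1: opening the outbound substream failed, so the peer was never registered.
    identify.on_connection_established(1, false);
    identify.on_inbound_substream(1); // warns

    // Case 2: the close event is processed before the queued inbound substream event.
    identify.on_connection_established(2, true);
    identify.on_connection_closed(2);
    identify.on_inbound_substream(2); // warns

    // Case 3: the inbound substream event overtakes connection-established.
    identify.on_inbound_substream(3); // warns
    identify.on_connection_established(3, true);
}
```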

cc @paritytech/networking

altonen commented 1 month ago

Does this happen when you run the node with --profile testnet/--release? I remember that when I was testing litep2p under higher load, the debug build really started to struggle. FWIW, I only saw these issues emerge when I switched from tokio to Substrate's task executor, so I think there might be something there, but I didn't see any issues after running polkadot in release mode.

lexnv commented 1 month ago

I started two nodes from a release build and left them running for ~2 weeks:

./target/release/polkadot -d /home/ubuntu/workspace/kusama-db --chain kusama --pruning=1000 --sync=warp --network-backend litep2p --detailed-log-output

./target/release/polkadot -d /home/ubuntu/workspace/kusama-db-full --chain kusama --port 30355 --pruning=1000 --network-backend litep2p --detailed-log-output

I'm happy to announce that both are still producing blocks. Good point about the executor; it might be that my VM was overloaded. I'm running two nodes on the same machine and doing dev work at the same time (compiling substrate/subxt, running test nodes, etc.), which might put extra stress on the system.

altonen commented 1 month ago

I think the connection<->protocol interface could definitely be better and I believe there are race conditions there right now, especially with sudden connection closing/redialing.

What could've potentially happened: a connection was opened, but Identify was not polled for long enough that the keep-alive timeout expired and the connection was closed. The remote had already opened a substream over that connection while it was still open. When Identify is finally polled, open_substream() fails because the connection is gone, so the peer is never added to peers, yet the inbound substream event is still read from the event channel. That event now belongs to a peer who is not registered with Identify because it doesn't have any open connections.
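
A self-contained illustration of that ordering, assuming the connection pushes events onto a channel that Identify drains when it is polled. The names and types are simplified stand-ins, not litep2p's actual internals:

```rust
use std::collections::HashMap;
use std::sync::mpsc;

type PeerId = u64;

enum Event {
    InboundSubstream(PeerId),
}

fn main() {
    // Stand-in for the per-connection event channel that Identify reads from.
    let (tx, rx) = mpsc::channel();
    let mut peers: HashMap<PeerId, ()> = HashMap::new();

    // Connection is established and the remote immediately opens an identify
    // substream over it; the event sits in the channel until Identify is polled.
    tx.send(Event::InboundSubstream(1)).unwrap();

    // Meanwhile the keep-alive timeout expires and the connection is closed.
    let connection_open = false;

    // Identify is finally polled. Registering the peer fails because opening a
    // substream on the now-closed connection fails, so `peers` stays empty...
    if connection_open {
        peers.insert(1, ());
    }

    // ...but the queued inbound-substream event is still read from the channel,
    // and it now refers to a peer unknown to Identify, producing the warning.
    while let Ok(Event::InboundSubstream(peer)) = rx.try_recv() {
        if !peers.contains_key(&peer) {
            eprintln!("inbound identify substream opened for unknown peer={peer}");
        }
    }
}
```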