paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

litep2p: Update network backend to v0.7.0 #5609

Closed lexnv closed 2 weeks ago

lexnv commented 3 weeks ago

This release introduces several new features, improvements, and fixes to the litep2p library. Key updates include enhanced error handling, configurable connection limits, and a new API for managing public addresses.

For a detailed set of changes, see litep2p changelog.

This PR makes use of:

Warp sync time improvement

Warp sync time is hard to measure precisely: the network is not deterministic, and a run may end up using faster peers (peers with more resources to handle our requests). Even so, I did not see warp sync times of 16 minutes; they have roughly stabilized between 8 and 10 minutes.

For measuring warp sync time, I've used sub-triage-logs.

Litep2p

| Phase | Time |
| --- | --- |
| Warp | 426.999999919s |
| State | 99.000000555s |
| Total | 526.000000474s |

Libp2p

| Phase | Time |
| --- | --- |
| Warp | 731.999999837s |
| State | 71.000000882s |
| Total | 803.000000719s |
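To put the two tables side by side, here is a minimal sketch that computes the relative reduction from the figures above, rounded to whole seconds. The `SyncTimes` type and `reduction_percent` helper are names made up for this illustration. The numbers come from a single run each, so the result is indicative only.

```rust
/// Per-phase warp sync timings in seconds, copied from the tables above
/// and rounded to whole seconds. Hypothetical helper type for illustration.
struct SyncTimes {
    warp: f64,
    state: f64,
}

impl SyncTimes {
    /// Total sync time: warp phase plus state phase.
    fn total(&self) -> f64 {
        self.warp + self.state
    }
}

/// Percentage by which `new` is shorter than `old` (negative if it is longer).
fn reduction_percent(old: f64, new: f64) -> f64 {
    (old - new) / old * 100.0
}

fn main() {
    let litep2p = SyncTimes { warp: 427.0, state: 99.0 };
    let libp2p = SyncTimes { warp: 732.0, state: 71.0 };

    println!(
        "total reduction: {:.1}%",
        reduction_percent(libp2p.total(), litep2p.total())
    );
    println!(
        "warp phase reduction: {:.1}%",
        reduction_percent(libp2p.warp, litep2p.warp)
    );
    println!(
        "state phase reduction: {:.1}%",
        reduction_percent(libp2p.state, litep2p.state)
    );
}
```

On these figures the total drops by roughly a third, driven by the warp phase; the state phase was slower in the litep2p run, which shows up as a negative reduction.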

Closes: https://github.com/paritytech/polkadot-sdk/issues/4986

Low peer count

After exposing the litep2p::public_addresses interface, we can report confirmed external addresses to litep2p. This should mitigate or at least improve: https://github.com/paritytech/polkadot-sdk/issues/4925. I will keep the issue open to confirm this.

Improved metrics

We are one step closer to exposing similar metrics as libp2p: https://github.com/paritytech/polkadot-sdk/issues/4681.

cc @paritytech/networking

Next Steps

paritytech-cicd-pr commented 3 weeks ago

The CI pipeline was cancelled due to the failure of one of the required jobs. Job name: test-linux-stable 3/3 Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7276240

lexnv commented 2 weeks ago

I've changed two things since the last review:

We got around 4.7k warnings from hickory:

WARN tokio-runtime-worker hickory_proto::xfer::dns_exchange: failed to associate send_message response to the sender

The fix is similar to https://github.com/paritytech/substrate/pull/12253: disable logging for this crate.
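Conceptually, disabling logging for one crate means adding a per-target directive whose maximum level is "off". The sketch below is a toy model of that idea using only the standard library; `Level`, `Directive`, and `is_enabled` are hypothetical names for illustration and are not the node's actual logging code or any logging crate's API.

```rust
/// Toy log levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Off,
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Toy model of a directive such as `hickory_proto=off`.
struct Directive {
    target_prefix: &'static str,
    max_level: Level,
}

/// Returns true if a message at `level` from `target` should be emitted.
/// The longest matching prefix wins; `default` applies when nothing matches.
fn is_enabled(directives: &[Directive], default: Level, target: &str, level: Level) -> bool {
    let max = directives
        .iter()
        .filter(|d| target.starts_with(d.target_prefix))
        .max_by_key(|d| d.target_prefix.len())
        .map(|d| d.max_level)
        .unwrap_or(default);
    level != Level::Off && level <= max
}

fn main() {
    // Silence everything from the hickory_proto crate, keep the default elsewhere.
    let directives = [Directive { target_prefix: "hickory_proto", max_level: Level::Off }];
    let noisy = "hickory_proto::xfer::dns_exchange";
    let kept = "litep2p::transport-manager";

    println!("{}: {}", noisy, is_enabled(&directives, Level::Info, noisy, Level::Warn));
    println!("{}: {}", kept, is_enabled(&directives, Level::Info, kept, Level::Warn));
}
```

With this directive in place, warnings from the `hickory_proto` target are dropped while warnings from other targets still pass the default filter.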

Local node triaging (litep2p)

| Count | Level | Triage report |
| --- | --- | --- |
| 232 | warn | 🥩 ran out of peers to request justif #. from num_cache=. num_live=. err=. |
| 4 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Peer disconnected with inflight after backoffs. Banned, disconnecting. ) |
| 2 | warn | ❌ Error while dialing .: . |
| 1 | warn | 💔 Error importing block .: . ( Parent block of 0xd7a9…f573 has no associated weight ) |
| 1 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Same block request multiple times. Banned, disconnecting. ) |
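A triage table like the one above groups log messages that differ only in their variable parts and counts each group. The following is a minimal sketch of that grouping, assuming that any token containing a digit (block numbers, hashes, counters) is variable and can be collapsed to `.`. It is not how sub-triage-logs is implemented, and the sample messages in `main` are invented placeholders, not real node output.

```rust
use std::collections::BTreeMap;

/// Collapse variable tokens (anything containing an ASCII digit) to "." so that
/// lines differing only in numbers or hashes share one key.
fn normalize(message: &str) -> String {
    message
        .split_whitespace()
        .map(|tok| if tok.chars().any(|c| c.is_ascii_digit()) { "." } else { tok })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Count messages per normalized pattern, sorted by descending count
/// (ties broken alphabetically by pattern).
fn triage(messages: &[&str]) -> Vec<(String, usize)> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for m in messages {
        *counts.entry(normalize(m)).or_insert(0) += 1;
    }
    let mut rows: Vec<_> = counts.into_iter().collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows
}

fn main() {
    // Invented sample messages, for illustration only.
    let messages = ["request 12 failed", "peer gone", "request 7 failed"];
    for (pattern, count) in triage(&messages) {
        println!("{} warn {}", count, pattern);
    }
}
```

A real tool would more likely match against a curated list of patterns, since a digit-based heuristic cannot collapse variable tokens that contain no digits.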

Other warnings:

    "2024-09-06 15:39:22.641  WARN tokio-runtime-worker litep2p::ipfs::identify: inbound identify substream opened for peer who doesn't exist peer=PeerId(\"12D3KooWRHaoLvJuJptSUgsc1bzXsKToRUR6qS2KW1MVgnJqLpKx\") protocol=/ipfs/id/1.0.0",

   "2024-09-07 20:01:07.952  WARN tokio-runtime-worker hickory_proto::xfer::dns_exchange: failed to associate send_message response to the sender",

    "2024-09-07 21:43:54.157  WARN tokio-runtime-worker litep2p::transport-manager: unknown connection opened as secondary connection, discarding peer=PeerId(\"12D3KooWGTnNXimfyieaZAeyRDvZLQpFF7Nr9a8bS3oN4yMPQExZ\") connection_id=ConnectionId(2347697) address=\"/ip4/212.224.112.221/tcp/49054/ws/p2p/12D3KooWGTnNXimfyieaZAeyRDvZLQpFF7Nr9a8bS3oN4yMPQExZ\" dial_record=AddressRecord { score: 100, address: \"/ip4/212.224.112.221/tcp/30333/p2p/12D3KooWGTnNXimfyieaZAeyRDvZLQpFF7Nr9a8bS3oN4yMPQExZ\", connection_id: Some(ConnectionId(2347695)) }",

The "litep2p::transport-manager: unknown connection opened as secondary connection" warning from https://github.com/paritytech/litep2p/issues/172 has resurfaced. I have created a new issue for it: https://github.com/paritytech/litep2p/issues/242

Local node triaging (libp2p)

| Count | Level | Triage report |
| --- | --- | --- |
| 683 | warn | Notification block pinning limit reached. Unpinning block with hash = .* |
| 11 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Not requested block data. Banned, disconnecting. ) |
| 4 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Unsupported protocol. Banned, disconnecting. ) |
| 2 | warn | Can't listen on . because: . |
| 1 | warn | Re-finalized block #. (.) in the canonical chain, current best finalized is #.* |
| 1 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Same block request multiple times. Banned, disconnecting. ) |
| 1 | warn | ❌ Error while dialing .: . |
| 1 | warn | 💔 Error importing block .: . ( Parent block of 0xd7a9…f573 has no associated weight ) |

Other warnings:

   - "2024-09-06 14:13:20.673  WARN tokio-runtime-worker sc_network::service: 💔 The bootnode you want to connect to at `/dns/ksm14.rotko.net/tcp/33224/p2p/12D3KooWAa5THTw8HPfnhEei23HdL8P9McBXdozG2oTtMMksjZkK` provided a different peer ID `12D3KooWDTWSFqWQNqHdrAc2srGsqzK7GMw9RAjFTfUjcka5FEJN` than the one you expect `12D3KooWAa5THTw8HPfnhEei23HdL8P9McBXdozG2oTtMMksjZkK`.    ",
   - "2024-09-06 22:58:25.843 ERROR tokio-runtime-worker sc_utils::mpsc: The number of unprocessed messages in channel `mpsc-notification-to-protocol-2-beefy` exceeded 100000.",

Versi network testing

Manual triaging (until sub-triage-logs gains access to Loki):

WARN tokio-runtime-worker babe: 👶 Epoch(s) skipped: from 33226 to 33241    

2024-09-07 06:16:30.004  WARN tokio-runtime-worker parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported { runtime_api_name: "candidate_events" }))

2024-09-07 06:16:30.084  WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: Api called for an unknown Block: Header was not found in the database: 0x5aaa2a515394a2f9da57ab3ea792808f93822dcdcac76fd9de173776bd9d31ca api="candidate_events"

2024-09-07 06:16:59.828  WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: Api called for an unknown Block: Header was not found in the database: 0x5aaa2a515394a2f9da57ab3ea792808f93822dcdcac76fd9de173776bd9d31ca api="candidate_events"

The warnings appeared after versi-net was scaled down from 100 to 20 validators on Saturday, at roughly Sat Sep 7 06:09:42, and continued for around 1h. This was the first time we introduced scaling in our versi-net testing; I will keep an eye on this and check how libp2p behaves in comparison.

Grafana logs

Polkadot-Forum commented 2 weeks ago

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/litep2p-network-backend-updates/9973/1