paritytech / polkadot-sdk


Node keeps losing peers. #528

Open arkpar opened 2 years ago

arkpar commented 2 years ago

Start a new warp sync using the latest polkadot+substrate master with an empty db dir in GCP. The node keeps losing almost all peers every couple of minutes.

a@a-db:~/src/polkadot$ ./target/release/polkadot -d ~/db3 --sync=warp -l db=trace,sync=trace 2>warp.log
2022-09-08 12:45:31.522  INFO main sc_cli::runner: Parity Polkadot
2022-09-08 12:45:31.522  INFO main sc_cli::runner: ✌️  version 0.9.28-b2ff4c05ab3
2022-09-08 12:45:31.522  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2017-2022
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 📋 Chain specification: Polkadot
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 🏷  Node name: certain-bean-5721
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 👤 Role: FULL
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 💾 Database: RocksDb at /home/arkadiy/db3/chains/polkadot/db/full
2022-09-08 12:45:31.522  INFO main sc_cli::runner: ⛓  Native runtime: polkadot-9280 (parity-polkadot-0.tx13.au0)
2022-09-08 12:45:35.022  INFO main sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0x29d0…4e17, header-hash: 0x91b1…90c3)
2022-09-08 12:45:35.040  INFO main afg: 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.
2022-09-08 12:45:35.618  INFO main babe: 👶 Creating empty BABE epoch changes on what appears to be first startup.
2022-09-08 12:45:35.620  INFO main sub-libp2p: 🏷  Local node identity is: 12D3KooWL9R2Y3DuFdAueryD5nn6cL6BZs4RkMimC9cYkxyHE3Uh
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Operating system: linux
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 CPU architecture: x86_64
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Target environment: gnu
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 CPU cores: 8
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Memory: 16039MB
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Kernel: 4.19.0-21-cloud-amd64
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Linux distribution: Debian GNU/Linux 10 (buster)
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Virtual machine: yes
2022-09-08 12:45:35.710  INFO main sc_service::builder: 📦 Highest known block at #0
2022-09-08 12:45:35.710  INFO tokio-runtime-worker substrate_prometheus_endpoint: 〽️ Prometheus exporter started at 127.0.0.1:9615
2022-09-08 12:45:35.713  INFO                 main sc_rpc_server: Running JSON-RPC HTTP server: addr=127.0.0.1:9933, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-09-08 12:45:35.714  INFO                 main sc_rpc_server: Running JSON-RPC WS server: addr=127.0.0.1:9944, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 CPU score: 609MB/s
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 Memory score: 4530MB/s
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 Disk score (seq. writes): 475MB/s
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 Disk score (rand. writes): 204MB/s
2022-09-08 12:45:35.717  INFO tokio-runtime-worker libp2p_mdns::behaviour::iface: creating instance on iface 10.132.0.41
2022-09-08 12:45:38.867  INFO tokio-runtime-worker sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.78.133.145/tcp/30333/ws/p2p/12D3KooWL9R2Y3DuFdAueryD5nn6cL6BZs4RkMimC9cYkxyHE3Uh
2022-09-08 12:45:42.255  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading finality proofs, 15.97 Mib (19 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.7MiB/s ⬆ 26.9kiB/s
2022-09-08 12:45:49.053  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading finality proofs, 31.92 Mib (39 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.7MiB/s ⬆ 7.4kiB/s
2022-09-08 12:45:55.835  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading finality proofs, 47.89 Mib (45 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.7MiB/s ⬆ 16.7kiB/s
2022-09-08 12:45:55.836  WARN tokio-runtime-worker telemetry: ❌ Error while dialing /dns/telemetry.polkadot.io/tcp/443/x-parity-wss/%2Fsubmit%2F: Custom { kind: Other, error: Timeout }
2022-09-08 12:46:00.836  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 99.24 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 6.5MiB/s ⬆ 23.8kiB/s
2022-09-08 12:46:05.836  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 154.66 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 7.0MiB/s ⬆ 12.8kiB/s
2022-09-08 12:46:10.837  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 207.67 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 6.9MiB/s ⬆ 7.3kiB/s
2022-09-08 12:46:15.837  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 263.09 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 6.8MiB/s ⬆ 6.0kiB/s
2022-09-08 12:46:20.864  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 304.35 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 4.4MiB/s ⬆ 5.0kiB/s
2022-09-08 12:46:25.865  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 401.33 Mib (49 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 17.4MiB/s ⬆ 9.4kiB/s
2022-09-08 12:46:30.865  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 500.22 Mib (38 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 17.7MiB/s ⬆ 10.9kiB/s
2022-09-08 12:46:35.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (37 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 13.8MiB/s ⬆ 5.5kiB/s
2022-09-08 12:46:40.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (29 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.6kiB/s ⬆ 2.5kiB/s
2022-09-08 12:46:45.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (28 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 23.9kiB/s ⬆ 15.6kiB/s
2022-09-08 12:46:50.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (31 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 3.0kiB/s ⬆ 2.9kiB/s
2022-09-08 12:46:55.867  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (28 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 16.8kiB/s ⬆ 10.3kiB/s
2022-09-08 12:47:00.867  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (36 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 3.4kiB/s ⬆ 3.8kiB/s
2022-09-08 12:47:04.229  WARN    async-std/runtime trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2022-09-08 12:47:05.868  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (36 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 1.4kiB/s ⬆ 1.8kiB/s
2022-09-08 12:47:10.868  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (37 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 1.8kiB/s ⬆ 1.4kiB/s
2022-09-08 12:47:19.066  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (38 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 1.4kiB/s ⬆ 1.3kiB/s
2022-09-08 12:47:25.257  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (39 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.0kiB/s ⬆ 1.8kiB/s
2022-09-08 12:47:28.702  INFO tokio-runtime-worker db: Removing block #BlockId::Hash(0x91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3)
2022-09-08 12:47:30.412  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (40 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.8kiB/s ⬆ 2.7kiB/s
2022-09-08 12:47:35.574  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (29 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 5.2kiB/s ⬆ 5.9kiB/s
2022-09-08 12:47:40.762  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (32 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 3.8kiB/s ⬆ 3.8kiB/s
2022-09-08 12:47:48.754  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (28 peers), best: #11958735 (0x7bef…21c4), finalized #11958735 (0x7bef…21c4), ⬇ 9.0kiB/s ⬆ 5.6kiB/s
2022-09-08 12:47:50.786  INFO tokio-runtime-worker sync: Warp sync is complete (577 MiB), restarting block sync.
2022-09-08 12:47:52.460  INFO tokio-runtime-worker substrate: ✨ Imported #11958757 (0x130e…291d)
2022-09-08 12:47:53.758  INFO tokio-runtime-worker substrate: ⏩ Block history, #3648 (7 peers), best: #11958757 (0x130e…291d), finalized #11958735 (0x7bef…21c4), ⬇ 442.0kiB/s ⬆ 9.0kiB/s
2022-09-08 12:47:54.771  INFO tokio-runtime-worker substrate: ✨ Imported #11958758 (0xe724…8fbd)
2022-09-08 12:47:58.759  INFO tokio-runtime-worker substrate: ⏩ Block history, #14656 (21 peers), best: #11958758 (0xe724…8fbd), finalized #11958754 (0x3e53…0381), ⬇ 838.2kiB/s ⬆ 31.2kiB/s
2022-09-08 12:48:00.759  INFO tokio-runtime-worker substrate: ✨ Imported #11958759 (0x30e1…423e)
2022-09-08 12:48:00.822  INFO tokio-runtime-worker substrate: ✨ Imported #11958759 (0x75b0…42c0)
2022-09-08 12:48:03.760  INFO tokio-runtime-worker substrate: ⏩ Block history, #21504 (3 peers), best: #11958759 (0x30e1…423e), finalized #11958756 (0x4f92…0927), ⬇ 570.2kiB/s ⬆ 11.1kiB/s
2022-09-08 12:48:06.662  INFO tokio-runtime-worker substrate: ✨ Imported #11958760 (0x5779…1a5f)
2022-09-08 12:48:08.761  INFO tokio-runtime-worker substrate: ⏩ Block history, #30848 (23 peers), best: #11958760 (0x5779…1a5f), finalized #11958757 (0x130e…291d), ⬇ 1.0MiB/s ⬆ 87.0kiB/s
2022-09-08 12:48:13.310  INFO tokio-runtime-worker substrate: ✨ Imported #11958761 (0x4829…083b)
2022-09-08 12:48:13.434  INFO tokio-runtime-worker sc_informant: ♻️  Reorg on #11958761,0x4829…083b to #11958761,0xcf72…6052, common ancestor #11958760,0x5779…1a5f
2022-09-08 12:48:13.435  INFO tokio-runtime-worker substrate: ✨ Imported #11958761 (0xcf72…6052)
2022-09-08 12:48:13.763  INFO tokio-runtime-worker substrate: ⏩ Block history, #39424 (2 peers), best: #11958761 (0xcf72…6052), finalized #11958758 (0xe724…8fbd), ⬇ 784.1kiB/s ⬆ 60.7kiB/s

Notice how the number of peers jumps from 20+ to 2. Unfortunately, our network protocol does not tell us why other peers decide to disconnect. However, it looks like this happens after we announce the latest blocks to other peers. It could be that something added to the block announcement broke network protocol compatibility because of this code: https://github.com/paritytech/substrate/blob/ded44948e2d5a398abcb4e342b0513cb690961bb/primitives/consensus/common/src/block_validation.rs#L89
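
To quantify the drop, the peer count can be pulled out of the informant lines in warp.log. This is a minimal sketch, assuming the "(N peers)" field format shown in the log above; the function names and the drop threshold are hypothetical and not part of any Substrate tooling. The sample lines are abridged copies of lines from the log.

```python
import re

# Captures the timestamp and the "(N peers)" field of an informant line.
PEERS_RE = re.compile(r"^(\S+ \S+)\s+INFO .*\((\d+) peers\)")

def peer_counts(lines):
    """Yield (timestamp, peer_count) for every informant line."""
    for line in lines:
        m = PEERS_RE.match(line)
        if m:
            yield m.group(1), int(m.group(2))

def sharp_drops(samples, min_drop=10):
    """Return (timestamp, before, after) wherever the count fell by >= min_drop."""
    drops = []
    prev = None
    for ts, count in samples:
        if prev is not None and prev - count >= min_drop:
            drops.append((ts, prev, count))
        prev = count
    return drops

# Abridged informant lines from the log above.
sample = [
    "2022-09-08 12:47:58.759  INFO tokio-runtime-worker substrate: ⏩ Block history, #14656 (21 peers), best: #11958758 (0xe724…8fbd)",
    "2022-09-08 12:48:03.760  INFO tokio-runtime-worker substrate: ⏩ Block history, #21504 (3 peers), best: #11958759 (0x30e1…423e)",
    "2022-09-08 12:48:08.761  INFO tokio-runtime-worker substrate: ⏩ Block history, #30848 (23 peers), best: #11958760 (0x5779…1a5f)",
]
for ts, before, after in sharp_drops(peer_counts(sample)):
    print(f"{ts}: {before} -> {after} peers")
```

Running the same two functions over the full warp.log would give the times of each drop, which can then be lined up with the sync=trace output.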

bkchr commented 2 years ago

@dmitry-markin and @altonen, could both of you please look into this?

dmitry-markin commented 2 years ago

To share intermediate results: I was able to reproduce the issue on polkadot+substrate master, but also on polkadot-0.9.28 & polkadot-0.9.16. So, it looks like this is not something introduced by the recent changes. Also, a couple of times everything worked fine (on master), so I'm starting to be suspicious that the issue might be caused by some external factors.

ggwpez commented 2 years ago

> So, it looks like this is not something introduced by the recent changes. Also, a couple of times everything worked fine (on master), so I'm starting to be suspicious that the issue might be caused by some external factors.

I saw this quite a while ago on my PC as well, but thought it was normal. Like dropping peers from 30 to 10 or even less, and then slowly creeping back up.

nazar-pc commented 2 years ago

I didn't look into warping, but if it downloads significant chunks of data, it might have the same root cause as https://github.com/paritytech/substrate/issues/12105, namely keep alive timeout.

dmitry-markin commented 2 years ago

> I didn't look into warping, but if it downloads significant chunks of data, it might have the same root cause as paritytech/substrate#12105, namely keep alive timeout.

I tried reducing the keep alive timeout from 10 sec to 1 sec (locally) to see if it affects the peer count, and it looks like it doesn't. Overall, the peer count jitter seems purely random: sometimes the peer count is stable at 20-30 peers, and sometimes it starts dropping to almost 0.

melekes commented 1 year ago

@dmitry-markin curious if you saw any errors. Or it just silently disconnects? Surely libp2p has some "debug" flag.

dmitry-markin commented 1 year ago

> @dmitry-markin curious if you saw any errors. Or it just silently disconnects? Surely libp2p has some "debug" flag.

I don't recall anything suspicious in the logs, but it's been a while and I'm unsure whether I ran the node with something like -l sub-libp2p=trace. I probably did, but it makes sense to double-check.

melekes commented 1 year ago

Tried the latest substrate / polkadot master. Peer count went from 40 to 21 and then recovered.

Some of the errors I saw:

## unsupported address (4 times with same address at least)

2023-01-06 15:01:25.402 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Failed to reach PeerId("12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x"): Failed to negotiate transport protocol(s): [(/dns/polkadot-connect-3.parity.io/tcp/443/wss/p2p/12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x: : Unsupported resolved address: /ip4/34.89.193.251/tcp/443/wss/p2p/12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x: Unsupported resolved address: /ip4/34.89.193.251/tcp/443/wss/p2p/12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x)]

## timeout

2023-01-06 15:00:58.063 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Failed to reach PeerId("12D3KooWLC8iRopWZLnVw8wCKqER9JKU5g68LtbDVeBd4PDrgKGW"): Failed to negotiate transport protocol(s): [(/ip4/54.158.91.224/tcp/30333/ws/p2p/12D3KooWLC8iRopWZLnVw8wCKqER9JKU5g68LtbDVeBd4PDrgKGW: : Timeout has been reached)]

## YamuxError(Closed), BUT later connected

2023-01-06 15:00:58.146 DEBUG tokio-runtime-worker sub-libp2p: Libp2p => Disconnected(PeerId("12D3KooWJnpAkaRth5hb1giJEEa2rfkh2ekKZTWj8nkk2AdZbYWf"), Some(IO(Custom { kind: Other, error: A(YamuxError(Closed)) })))

Most disconnects appear to come from YamuxError(Closed). However, it's not clear why the remote nodes are terminating the connection. Maybe it's because of something we're sending them, as pointed out in the issue's description.

altonen commented 1 year ago

I've found two reasons as to why the peer count drops:

After a while the peer count drops to the 0-2 range and stays there. Genesis mismatch sounds like a bug in the ancestry search code, but I'm not sure why the requests are refused. From the logs it looks like at some point the node sends the same request a third time, soon after the other two, after which the connection is closed. This could explain the closed connection, because the syncing code treats three identical requests from a peer as a fatal error.
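
The suspected mechanism, a remote node treating a repeated identical request as fatal, can be sketched as follows. This is a hypothetical illustration, not the actual Substrate block request handler: the class name, the fields used as the request key, and the limit of three (taken from the observation above) are all assumptions.

```python
from collections import defaultdict

class DuplicateRequestGuard:
    """Hypothetical responder-side guard: counts identical requests per peer."""

    def __init__(self, max_identical=3):
        self.max_identical = max_identical
        # (peer, start_block, direction, max_blocks) -> times this request was seen
        self.seen = defaultdict(int)

    def on_request(self, peer, start_block, direction, max_blocks):
        """Return True to serve the request, False to drop the connection."""
        key = (peer, start_block, direction, max_blocks)
        self.seen[key] += 1
        return self.seen[key] < self.max_identical

guard = DuplicateRequestGuard()
results = [guard.on_request("peer-a", 16128, "descending", 64) for _ in range(3)]
print(results)  # [True, True, False]
```

With a guard like this on the remote side, the local node only needs to repeat one request twice for the connection to be closed, which would match the logs.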

Related issue: https://github.com/paritytech/polkadot-sdk/issues/531

dmitry-markin commented 1 year ago

It's quite strange that ancestry search even starts in case of a genesis mismatch: we now have the genesis hash hardcoded in the substrate & polkadot protocol names, so the nodes must not talk to each other at all.

altonen commented 1 year ago

They do talk and exchange genesis hashes and then proceed to sync. I think ancestry search has a bug that makes it incorrectly conclude that the nodes have different genesis hashes.

dmitry-markin commented 1 year ago

Maybe that's because of the legacy fallback protocol name, which doesn't contain the genesis hash: https://github.com/paritytech/substrate/blob/master/client/network/sync/src/block_request_handler.rs#L92-L95

Should we phase out the legacy names?

dmitry-markin commented 1 year ago

Apart from this, it's also a good idea to check the genesis hash in ancestry search, of course.

altonen commented 1 year ago

The genesis hash is also exchanged in the BlockAnnounces handshake, so either way they should be aware of each other's genesis hashes. I also got connections from peers which were actually on different chains and they were swiftly disconnected in on_sync_peer_connected().

altonen commented 1 year ago

I think the issue might be this function. It allows the peer to send the same request multiple times since it doesn't properly keep track of which blocks have been requested and keeps resending requests, sometimes in very rapid succession. This causes the remote node to close the connection when it notices that a peer has sent the same request multiple times.

The ancestor search bug is probably related to the fact that the peer state is tied to the latest sent request but there is no notion of a stale response. If the local node has sent ancestor search requests before without hearing any response, then sends one more, this time starting from genesis (here), and then receives a response to some previous request (i.e., the response is stale), the code can fail with a genesis mismatch.
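
The stale-response scenario can be made concrete with a small sketch. It is not the ChainSync code: PeerSearch, its fields, and the returned strings are hypothetical. It only illustrates how tagging each request with an ID lets the requester ignore a response to a superseded request instead of comparing it against the latest one.

```python
class PeerSearch:
    """Hypothetical per-peer ancestor-search state keyed by request ID."""

    def __init__(self, local_hashes):
        self.local_hashes = local_hashes  # block number -> local block hash
        self.next_id = 0
        self.pending = None               # (request_id, block_number)

    def send_request(self, block_number):
        # A new request supersedes whatever was pending before.
        self.next_id += 1
        self.pending = (self.next_id, block_number)
        return self.next_id

    def on_response(self, request_id, remote_hash, check_id=True):
        req_id, number = self.pending
        if check_id and request_id != req_id:
            return "stale, ignored"
        if remote_hash == self.local_hashes[number]:
            return "common ancestor found"
        return "genesis mismatch" if number == 0 else "keep searching"

local = {0: "0x91b1", 100: "0xaaaa"}
search = PeerSearch(local)
old_id = search.send_request(100)  # no response arrives in time
search.send_request(0)             # fall back to genesis

# The late response to the first request carries the hash of block 100.
without_check = search.on_response(old_id, "0xaaaa", check_id=False)
with_check = search.on_response(old_id, "0xaaaa")
print(without_check)  # genesis mismatch
print(with_check)     # stale, ignored
```

Without the ID check, the hash of block 100 is compared against the local genesis hash, which is one way a spurious genesis mismatch could arise between two nodes on the same chain.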

I noticed another issue with block requests: the local node requests blocks { X, ..., X + 64 } and if one of the received blocks is too far in the future, it is rejected (probably as it should be) with Verification failed for block ... and the connection is closed, but I don't know if closing the connection is the correct behavior.

arkpar commented 1 year ago

> I think the issue might be this function. It allows the peer to send the same request multiple times since it doesn't properly keep track of which blocks have been requested and keeps resending requests, sometimes in very rapid succession.

Care to elaborate? Is there a test or a log that demonstrates this? The whole purpose of this module is to keep track of what's being downloaded. Blocks returned by this function are marked as being downloaded. The only way it can return the same blocks for the same peer is after that status has been cleared, which may only happen either on timeout or when the block data has been received and drained.

> The ancestor search bug is probably related to the fact that the peer state is tied to the latest sent request but there is no notion of a stale response. If the local node has sent ancestor search requests before without hearing any response, then sends one more, this time starting from genesis (here), and then receives a response to some previous request (i.e., the response is stale), the code can fail with a genesis mismatch.

IIRC the request-response protocol should guarantee that each response is matched to a request with an ID. on_block_data in the sync module accepts the response along with the original request.

> I noticed another issue with block requests: the local node requests blocks { X, ..., X + 64 } and if one of the received blocks is too far in the future, it is rejected (probably as it should be) with Verification failed for block ... and the connection is closed, but I don't know if closing the connection is the correct behavior.

What do you mean by "too far in the future"? As in block timestamp is too far in the future?

altonen commented 1 year ago

> Care to elaborate? Is there a test or a log that demonstrates this? The whole purpose of this module is to keep track of what's being downloaded. Blocks returned by this function are marked as being downloaded. The only way it can return the same blocks for the same peer is after that status has been cleared, which may only happen either on timeout or when the block data has been received and drained.

I will try to reproduce it with a test but here is a log showing the bug:

2023-02-13 16:54:51.248 TRACE tokio-runtime-worker sync: New block request for 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC, (best:14231262, common:14231227) BlockRequest { id: 6205, fields: HEADER | BODY | JUSTIFICATION, from: Number(14592), direction: Descending, max: Some(64) }    
2023-02-13 16:54:51.248 DEBUG tokio-runtime-worker sync: start block request: BlockRequest { id: 6205, fields: HEADER | BODY | JUSTIFICATION, from: Number(14592), direction: Descending, max: Some(64) }, peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC    
2023-02-13 16:54:51.248 TRACE tokio-runtime-worker sync: send request, request id: 3319, peer id 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC    
2023-02-13 16:54:51.485 TRACE tokio-runtime-worker sync: BlockResponse 6205 from 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC with 64 blocks  (14592..14529)    
2023-02-13 16:54:51.485 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: Reversing incoming block list    
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 16641..16705    
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: Too far ahead for peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC (16641)    
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: try get gap block request for peer    
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 16065..16129    
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: New gap block request for 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC, (best:14231262, common:14231227) BlockRequest { id: 6228, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }    
2023-02-13 16:54:51.486 DEBUG tokio-runtime-worker sync: start block request: BlockRequest { id: 6228, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }, peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC    
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: send request, request id: 3337, peer id 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC    
2023-02-13 16:54:51.757 TRACE tokio-runtime-worker sync: BlockResponse 6228 from 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC with 64 blocks  (16128..16065)    
2023-02-13 16:54:51.757 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: Reversing incoming block list    
2023-02-13 16:54:51.757 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 16065..16129    
2023-02-13 16:54:51.758 TRACE tokio-runtime-worker sync: New block request for 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC, (best:14231262, common:14231227) BlockRequest { id: 6261, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }    
2023-02-13 16:54:51.758 DEBUG tokio-runtime-worker sync: start block request: BlockRequest { id: 6261, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }, peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC    
2023-02-13 16:54:51.758 TRACE tokio-runtime-worker sync: send request, request id: 3357, peer id 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC    
2023-02-13 16:54:52.044 TRACE tokio-runtime-worker sync: BlockResponse 6261 from 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC with 64 blocks  (16128..16065)    
2023-02-13 16:54:52.044 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: Reversing incoming block list    
2023-02-13 16:54:52.045 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 15361..15425    

My interpretation of this is that the node sends a gap sync request starting from block 16128 at 16:54:51.486, then a response to some previous block request is received and, as a result of that response, another block request is sent at 16:54:51.758 starting from block 16128 again while the previous request is still in flight.
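
Repeated requests like the one for block 16128 can be located in a larger trace log mechanically. The following is a rough sketch, assuming the "start block request" DEBUG line format shown above; the helper names are hypothetical, and line() only rebuilds abridged copies of those log lines for the example.

```python
import re
from collections import defaultdict

REQ_RE = re.compile(
    r"start block request: BlockRequest \{ id: (\d+), .*?"
    r"from: Number\((\d+)\), direction: (\w+), max: Some\((\d+)\) \}, peer (\S+)"
)

def duplicate_requests(lines):
    """Map (peer, from, direction, max) -> request IDs, keeping only repeats."""
    by_range = defaultdict(list)
    for text in lines:
        m = REQ_RE.search(text)
        if m:
            req_id, start, direction, max_blocks, peer = m.groups()
            by_range[(peer, int(start), direction, int(max_blocks))].append(int(req_id))
    return {key: ids for key, ids in by_range.items() if len(ids) > 1}

PEER = "12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC"

def line(req_id, start):
    # Abridged copy of the DEBUG lines from the log above.
    return ("DEBUG tokio-runtime-worker sync: start block request: BlockRequest { id: %d, "
            "fields: HEADER | BODY | JUSTIFICATION, from: Number(%d), direction: Descending, "
            "max: Some(64) }, peer %s" % (req_id, start, PEER))

log = [line(6205, 14592), line(6228, 16128), line(6261, 16128)]
dupes = duplicate_requests(log)
for (peer, start, direction, max_blocks), ids in dupes.items():
    print(f"{peer}: from {start} {direction} max {max_blocks} requested as ids {ids}")
```

For the three requests in the log, this reports that request IDs 6228 and 6261 ask the same peer for the same range.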

> IIRC the request-response protocol should guarantee that each response is matched to a request with an ID. on_block_data in the sync module accepts the response along with the original request.

That is true. But right now ChainSync does not utilize these request IDs which causes the genesis mismatch to happen.

> I noticed another issue with block requests: the local node requests blocks { X, ..., X + 64 } and if one of the received blocks is too far in the future, it is rejected (probably as it should be) with Verification failed for block ... and the connection is closed, but I don't know if closing the connection is the correct behavior.

> What do you mean by "too far in the future"? As in block timestamp is too far in the future?

The slot is too far in the future, as reported by Aura.

arkpar commented 1 year ago

> My interpretation of this is that the node sends a gap sync request starting from block 16128 at 16:54:51.486, then a response to some previous block request is received and, as a result of that response, another block request is sent at 16:54:51.758 starting from block 16128 again while the previous request is still in flight.

To me it looks like the response is actually correct and contains the blocks that were requested. Notice that the request is for blocks in descending order. The response ID also matches the request ID.

The interesting bit is that the first request was issued for the gap sync ("New gap block request for ..") and then another request was issued for the main sync ("New block request for ..") for the same blocks. So there's clearly a bug there, with the gap sync interfering with the main sync.

> That is true. But right now ChainSync does not utilize these request IDs which causes the genesis mismatch to happen.

ChainSync expects that only one request per peer is alive. Any previous requests should be canceled as soon as a new one is sent. Before the refactoring to use the libp2p request-response protocol, this was enforced explicitly. Now I believe it is handled here: https://github.com/paritytech/substrate/blob/master/client/network/src/request_responses.rs#L646 though I would not be surprised if there's some kind of race here.

> The slot is too far in the future, as reported by Aura.

Yeah, ideally this should be handled by the import queue.

altonen commented 1 year ago

> ChainSync expects that only one request per peer is alive.

I think this is the source of the problem. The peer is sending multiple requests and apparently the cancelling behavior (RequestFailure::Obsolete) is either not relayed to ChainSync or ignored by it somewhere. I also found this piece of code, which indicates that there is a mechanism for tracking which blocks are being requested so that duplicates would not be sent, but the HashMap is never actually used to check whether a request is a duplicate before sending it. The check was probably there at some point but may have been lost during refactoring.
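
The kind of check that seems to be missing can be sketched like this. It is a hypothetical tracker, not the existing HashMap in the sync code: the class, method names, and the (start_block, max_blocks) key are assumptions chosen for illustration.

```python
class InFlightRequests:
    """Hypothetical tracker of block ranges currently requested from each peer."""

    def __init__(self):
        self.in_flight = {}  # peer -> set of (start_block, max_blocks)

    def try_send(self, peer, start_block, max_blocks):
        """Record and allow the request unless the same range is already in flight."""
        ranges = self.in_flight.setdefault(peer, set())
        if (start_block, max_blocks) in ranges:
            return False  # duplicate: do not send it again
        ranges.add((start_block, max_blocks))
        return True

    def on_response(self, peer, start_block, max_blocks):
        """Clear the range once its response (or a timeout) has been handled."""
        self.in_flight.get(peer, set()).discard((start_block, max_blocks))

tracker = InFlightRequests()
first = tracker.try_send("peer-a", 16128, 64)   # gap sync request
second = tracker.try_send("peer-a", 16128, 64)  # main sync asks for the same range
tracker.on_response("peer-a", 16128, 64)
third = tracker.try_send("peer-a", 16128, 64)   # allowed again after the response
print(first, second, third)  # True False True
```

The point of the design is that the lookup happens before the request leaves the node, so both the gap sync and the main sync would go through the same gate.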

I'll start working on the fixes. Thanks for the information.

> Yeah, ideally this should be handled by the import queue.

So the import queue should hold on to the block if it's in a future slot and not disconnect the peer, right?

bkchr commented 1 year ago

> So the import queue should hold on to the block if it's in a future slot and not disconnect the peer, right?

If there isn't a huge drift in the local clock, this isn't possible (or should not happen). Where did you observe this behavior? The import queue should reject these peers, as they are trying to fool you, at least from our own local view. But this should also only happen when we are syncing at the tip of the chain.

altonen commented 1 year ago

I had some clock drift because NTP was not enabled. Enabling it fixed the importing issue.
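
Why clock drift matters here: slot-based consensus derives the current slot from wall-clock time, so a slow local clock makes valid recent blocks look as if they came from the future. This is a rough sketch, assuming a 6-second slot duration and zero tolerance for future slots; both values and the function names are illustrative and not taken from Aura's implementation.

```python
SLOT_DURATION_MS = 6_000  # assumed slot length; chain-specific in practice

def slot_at(unix_ms):
    """Slot number for a wall-clock time given in milliseconds."""
    return unix_ms // SLOT_DURATION_MS

def is_too_far_in_future(block_slot, local_unix_ms, tolerance_slots=0):
    """True if the block's slot is ahead of what the local clock allows."""
    return block_slot > slot_at(local_unix_ms) + tolerance_slots

network_now_ms = 1_676_307_291_000  # example time at which a block was authored
block_slot = slot_at(network_now_ms)

accurate_clock = is_too_far_in_future(block_slot, network_now_ms)
slow_clock = is_too_far_in_future(block_slot, network_now_ms - 30_000)  # 30 s behind
print(accurate_clock)  # False
print(slow_clock)      # True
```

With an accurate clock the block's slot is the current slot; with the clock 30 seconds behind, the same block sits several slots ahead and would be rejected under this check.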

altonen commented 1 year ago

In addition to sending duplicate requests, this bug is also related to paritytech/polkadot-sdk#556. The peer count is incorrectly reported as 40, and block requests work because the block request handler doesn't check whether the peer is accepted by the syncing code (which to me sounds like a security bug). Syncing works for the duration of warp sync, but when the node starts to download blocks and announce them, it notices that the block announcement substream has been closed and then terminates the substream on its end too. That issue is currently blocked by https://github.com/paritytech/substrate/pull/12828 but when it's merged, I'll continue the implementation of custom handshakes.

I think I also found the reason why the code sends duplicate requests even though there are measures in place to prevent that. If the peer sends a block request and then syncing gets restarted, the block request sent before the restart is not discarded but received normally, which then resets the peer to be available again and sends another request. Then the response to the request sent after the sync restart is received, which does the same thing, so two requests are constantly active. The block request scheduling can then request the same range in two different requests from the same peer, which causes the connection to be closed.
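
The restart scenario can be modelled in a few lines. This is a toy simulation under stated assumptions, not the ChainSync implementation: PeerSync, its two lists, and the clear_on_restart switch are hypothetical, and every handled response is assumed to trigger exactly one new request.

```python
class PeerSync:
    """Hypothetical model of one peer's block requests across a sync restart."""

    def __init__(self, clear_on_restart):
        self.clear_on_restart = clear_on_restart
        self.next_id = 0
        self.pending = []  # request IDs the sync code still expects answers for
        self.on_wire = []  # request IDs the remote peer has yet to answer

    def send_request(self):
        self.next_id += 1
        self.pending.append(self.next_id)
        self.on_wire.append(self.next_id)

    def restart(self):
        if self.clear_on_restart:
            self.pending.clear()  # forget requests sent before the restart
        self.send_request()       # the restarted sync asks again

    def deliver_next_response(self):
        req_id = self.on_wire.pop(0)
        if req_id not in self.pending:
            return                # answer to a forgotten request: discard it
        self.pending.remove(req_id)
        self.send_request()       # peer looks available, so request more

def steady_state_in_flight(clear_on_restart, responses=5):
    """Number of requests still in flight after a restart and some responses."""
    peer = PeerSync(clear_on_restart)
    peer.send_request()
    peer.restart()
    for _ in range(responses):
        peer.deliver_next_response()
    return len(peer.on_wire)

print(steady_state_in_flight(clear_on_restart=False))  # 2
print(steady_state_in_flight(clear_on_restart=True))   # 1
```

In the model, keeping the pre-restart request alive leaves two requests permanently in flight, while discarding its response brings the peer back to one.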

arkpar commented 1 year ago

> I will try to reproduce it with a test but here is a log showing the bug:

@altonen Is there a way to reproduce this log? The duplicate request sure looks suspicious but I can't reproduce it.

altonen commented 1 year ago

> > I will try to reproduce it with a test but here is a log showing the bug:

> @altonen Is there a way to reproduce this log? The duplicate request sure looks suspicious but I can't reproduce it.

I've been sidetracked with another issue that I think is the central issue of this bug report. The problem seems to be that nearly all full node slots on the network are occupied, but this is not noticed until the local node sends a block announcement. The reason for the duplicate requests was a recent change that moved block requests to ChainSync (made by myself), which didn't properly clear outbound requests upon sync restart. The clock drift I had on my computer exacerbated that issue, but after fixing that bug, I still saw the same behavior of losing peers after warp sync is complete (the original issue). We're currently working on a fix for that and I'll fix the duplicate request bug after this other thing is fixed.

ghost commented 1 year ago

> a recent change in moving block requests to ChainSync

this one? https://github.com/paritytech/substrate/pull/12739

felixshu commented 7 months ago

Hi, @altonen. We are experiencing similar issues and are looking for potential solutions to stabilize peer connectivity. Do we have any clue how to resolve the problem?

altonen commented 7 months ago

> Hi, @altonen. We are experiencing similar issues and are looking for potential solutions to stabilize peer connectivity. Do we have any clue how to resolve the problem?

@dmitry-markin @lexnv

dmitry-markin commented 7 months ago

> Hi, @altonen. We are experiencing similar issues and are looking for potential solutions to stabilize peer connectivity. Do we have any clue how to resolve the problem?

Hi @felixshu, could you post the logs from the moment when your node loses peers (plus a couple of minutes before), the command line that you use to start the node, and the version/commit ID you are running?