Open arkpar opened 2 years ago
@dmitry-markin and @altonen could both of you please both look into this?
To share intermediate results: I was able to reproduce the issue on polkadot+substrate master, but also on polkadot-0.9.28 & polkadot-0.9.16. So, it looks like this is not something introduced by the recent changes. Also, a couple of times everything worked fine (on master), so I'm starting to be suspicious that the issue might be caused by some external factors.
So, it looks like this is not something introduced by the recent changes. Also, a couple of times everything worked fine (on master), so I'm starting to be suspicious that the issue might be caused by some external factors.
I saw this quite a while ago on my PC as well, but thought it was normal. Like dropping peers from 30 to 10 or even less, and then slowly creeping back up.
I didn't look into warping, but if it downloads significant chunks of data, it might have the same root cause as https://github.com/paritytech/substrate/issues/12105, namely keep alive timeout.
I didn't look into warping, but if it downloads significant chunks of data, it might have the same root cause as paritytech/substrate#12105, namely keep alive timeout.
I tried reducing keep alive timeout from 10 sec to 1 sec (locally) to see if it affects the peer count, and it looks like it doesn't. Overall, the peer count jitter seems purely random: sometimes the peer count is stable on 20-30 peers, and sometimes it starts dropping to almost 0.
@dmitry-markin curious if you saw any errors. Or it just silently disconnects? Surely libp2p has some "debug" flag.
@dmitry-markin curious if you saw any errors. Or it just silently disconnects? Surely libp2p has some "debug" flag.
I don't recall anything suspicious in the logs, but it's been a while and I'm unsure if I run the node with something like -l sub-libp2p=trace
. Probably I did, but makes sense double checking.
Tried the latest substrate / polkadot master
. Peer count went from 40 to 21 and then recovered.
Some of the errors I saw:
## unsupported address (4 times with same address at least)
2023-01-06 15:01:25.402 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Failed to reach PeerId("12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x"): Faile
d to negotiate transport protocol(s): [(/dns/polkadot-connect-3.parity.io/tcp/443/wss/p2p/12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x: : Unsupported resolved address: /ip4/34.89.193.251/tcp/443/wss/p2p/12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x: Unsupported resolved address: /ip4/34.89.193.251/tcp/443/wss/p2p/12D3KooWEsPEadSjLAPyxckqVJkp54aVdPuX3DD6a1FTL2y5cB9x)]
## timeout
2023-01-06 15:00:58.063 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Failed to reach PeerId("12D3KooWLC8iRopWZLnVw8wCKqER9JKU5g68LtbDVeBd4PDrgKGW"): Faile
d to negotiate transport protocol(s): [(/ip4/54.158.91.224/tcp/30333/ws/p2p/12D3KooWLC8iRopWZLnVw8wCKqER9JKU5g68LtbDVeBd4PDrgKGW: : Timeout has been reached)]
## YamuxError(Closed)) BUT later connected
2023-01-06 15:00:58.146 DEBUG tokio-runtime-worker sub-libp2p: Libp2p => Disconnected(PeerId("12D3KooWJnpAkaRth5hb1giJEEa2rfkh2ekKZTWj8nkk2AdZbYWf"), Some(IO(Custom { kind: Other, error: A(YamuxError(Closed)) })))
The primary source of the majority of disconnects appears to be YamuxError(Closed))
. However it's not clear why the remote nodes are terminating the connection. Maybe it's because something we're sending them as pointed out in the issue's description.
I've found two reasons as to why the peer count drops:
After a while the peer count has dropped to 0-2 range and stays there. Genesis mismatch sounds like a bug in the ancestry search code but I'm not sure why the requests are refused. From the logs it looks like at some point node sends the same request a third time soon after the other two after which the connection is closed. This could explain why the connection is closed because syncing code considers three same requests from the peer as fatal error.
Related issue: https://github.com/paritytech/polkadot-sdk/issues/531
That's quite strange that ancestry search even starts in case of genesis mismatch — we now have genesis hash hardcoded in substrate & polkadot protocol names, so the nodes must not talk to each other at all.
They do talk and exchange genesis hashes and then proceed to sync. I think ancestry search has bug which incorrectly determines the nodes have different genesis hashes.
May be that's because of the legacy fallback protocol name which doesn't contain the genesis hash: https://github.com/paritytech/substrate/blob/master/client/network/sync/src/block_request_handler.rs#L92-L95
And we should phase-out the legacy names?
Apart from this it's also a good idea to check a genesis hash in ancestry search, of course.
The genesis hash is also exchanged in the BlockAnnounces
handshake so they should be either way aware of their genesis hashes. I also got connections from peers which were actually on different chains and they were swiftly disconnected in on_sync_peer_connected()
.
I think the issue might be this function. It allows the peer to send the same request multiple times since it doesn't properly keep track of which blocks have been requested and keeps resending requests, sometimes in very rapid succession. This causes the remote node to close the connection when it notices that a peer has sent the same request multiple times.
Ancestor search bug is probably related to the fact that the peer state is tied to the latest sent request but there is no notion of a stale response so if local node has sent ancestor search requests before without hearing any response and then sends one more, this time starting from genesis (here) and then receives receives a response to some previous request (i.e., the response is stale), the code can fail with genesis mismatch.
I noticed another issues with block requests: local node requests blocks { X, ..., X + 64 }
and if one of the received blocks is too far in the future, it is rejected (probably as it should) with Verification failed for block ...
and the connection is closed but I don't now if closing the connection is correct behavior.
I think the issue might be this function. It allows the peer to send the same request multiple times since it doesn't properly keep track of which blocks have been requested and keeps resending requests, sometimes in very rapid succession.
Care to elaborate? Is there a test or a log that demonstrate this? The whole purpose of this module is to keep track of what's begin downloaded. Blocks returned by this function are marked as being downloaded. The only way it can return the same blocks for the same peer is after that status have been cleared. Which may only happen either on timeout, or when the block data has been recevied and drained.
Ancestor search bug is probably related to the fact that the peer state is tied to the latest sent request but there is no notion of a stale response so if local node has sent ancestor search requests before without hearing any response and then sends one more, this time starting from genesis (here) and then receives receives a response to some previous request (i.e., the response is stale), the code can fail with genesis mismatch.
IIRC Request-respnse protocol should guarantee that each response is matched to a request with an ID. on_block_data
in the sync
module accepts the response along with the original request.
I noticed another issues with block requests: local node requests blocks
{ X, ..., X + 64 }
and if one of the received blocks is too far in the future, it is rejected (probably as it should) withVerification failed for block ...
and the connection is closed but I don't now if closing the connection is correct behavior.
What do you mean by "too far in the future"? As in block timestamp is too far in the future?
Care to elaborate? Is there a test or a log that demonstrate this? The whole purpose of this module is to keep track of what's begin downloaded. Blocks returned by this function are marked as being downloaded. The only way it can return the same blocks for the same peer is after that status have been cleared. Which may only happen either on timeout, or when the block data has been recevied and drained.
I will try to reproduce it with a test but here is a log showing the bug:
2023-02-13 16:54:51.248 TRACE tokio-runtime-worker sync: New block request for 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC, (best:14231262, common:14231227) BlockRequest { id: 6205, fields: HEADER | BODY | JUSTIFICATION, from: Number(14592), direction: Descending, max: Some(64) }
2023-02-13 16:54:51.248 DEBUG tokio-runtime-worker sync: start block request: BlockRequest { id: 6205, fields: HEADER | BODY | JUSTIFICATION, from: Number(14592), direction: Descending, max: Some(64) }, peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC
2023-02-13 16:54:51.248 TRACE tokio-runtime-worker sync: send request, request id: 3319, peer id 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC
2023-02-13 16:54:51.485 TRACE tokio-runtime-worker sync: BlockResponse 6205 from 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC with 64 blocks (14592..14529)
2023-02-13 16:54:51.485 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: Reversing incoming block list
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 16641..16705
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: Too far ahead for peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC (16641)
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: try get gap block request for peer
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 16065..16129
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: New gap block request for 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC, (best:14231262, common:14231227) BlockRequest { id: 6228, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }
2023-02-13 16:54:51.486 DEBUG tokio-runtime-worker sync: start block request: BlockRequest { id: 6228, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }, peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC
2023-02-13 16:54:51.486 TRACE tokio-runtime-worker sync: send request, request id: 3337, peer id 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC
2023-02-13 16:54:51.757 TRACE tokio-runtime-worker sync: BlockResponse 6228 from 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC with 64 blocks (16128..16065)
2023-02-13 16:54:51.757 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: Reversing incoming block list
2023-02-13 16:54:51.757 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 16065..16129
2023-02-13 16:54:51.758 TRACE tokio-runtime-worker sync: New block request for 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC, (best:14231262, common:14231227) BlockRequest { id: 6261, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }
2023-02-13 16:54:51.758 DEBUG tokio-runtime-worker sync: start block request: BlockRequest { id: 6261, fields: HEADER | BODY | JUSTIFICATION, from: Number(16128), direction: Descending, max: Some(64) }, peer 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC
2023-02-13 16:54:51.758 TRACE tokio-runtime-worker sync: send request, request id: 3357, peer id 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC
2023-02-13 16:54:52.044 TRACE tokio-runtime-worker sync: BlockResponse 6261 from 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC with 64 blocks (16128..16065)
2023-02-13 16:54:52.044 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: Reversing incoming block list
2023-02-13 16:54:52.045 TRACE tokio-runtime-worker sync: 12D3KooWCkRmWgJAc1UpbB2LrE8WneT6FPVckWo9fsceeEEw3DMC: gap sync 15361..15425
My interpretation of this is that node sends a gap sync request starting from block 16128, at 16:54:51.486, then a response to some previous block request is received and as a result of the response, another block request is sent at 16:54:51.758 starting from block 16128 again while the previous request is still in flight.
IIRC Request-respnse protocol should guarantee that each response is matched to a request with an ID.
on_block_data
in thesync
module accepts the response along with the original request.
That is true. But right now ChainSync
does not utilize these request IDs which causes the genesis mismatch to happen.
I noticed another issues with block requests: local node requests blocks
{ X, ..., X + 64 }
and if one of the received blocks is too far in the future, it is rejected (probably as it should) withVerification failed for block ...
and the connection is closed but I don't now if closing the connection is correct behavior.What do you mean by "too far in the future"? As in block timestamp is too far in the future?
The slot is too far in the future, as reported by Aura.
My interpretation of this is that node sends a gap sync request starting from block 16128, at 16:54:51.486, then a response to some previous block request is received and as a result of the response, another block request is sent at 16:54:51.758 starting from block 16128 again while the previous request is still in flight.
To me it looks like the response it actully correct and contains the blocks that were actually requested. Notice that the request for blocks in descending order. Response id also matches request id.
The interesting bit is that the first request was issued for the gap sync ("New gap block request for ..") and then another request was issued for the main sync ("New block request for ..") for the same blocks. So there's clearly a bug there with the gap sync interfering the the main sync.
That is true. But right now ChainSync does not utilize these request IDs which causes the genesis mismatch to happen.
ChainSync
expects that only one request per peer is alive. Any previous requests should be canceled as soon as new one is sent. Before the refactoring to use the libp2p request-response protocol, this was enforced explicitly. Now I believe it is handled here:
https://github.com/paritytech/substrate/blob/master/client/network/src/request_responses.rs#L646
though I would not be surprised if there's some kind of race here.
The slot is too far in the future, as reported by Aura.
Yeah, ideally this should be handled by the import queue.
ChainSync
expects that only one request per peer is alive.
I think this is source of the problem. Peer is sending multiple requests and apparently the cancelling behavior (RequestFailure::Obsolete
) is either not relayed to ChainSync
or it ignores it somewhere. I also found this piece of code which indicates that there is a mechanism for tracking which blocks are being requested so duplicates would not sent but the HashMap
is never actually used to check if the request is a duplicate before sending it. The check was probably there as some point but may have been lost during refactoring.
I'll start working on the fixes. Thanks for the information.
Yeah, ideally this should be handled by the import queue.
So the import queue should hold on to the block if it's in a future slot and not disconnect the peer, right?
So the import queue should hold on to the block if it's in a future slot and not disconnect the peer, right?
If there isn't a huge drift in the local clock, this isn't possible (or should not happen). Where did you observe this behavior? The import queue should reject these peers as they are trying too fool you at least from our own local view. But this should also only happen when we are syncing on the tip of the chain.
I had some clock drift because NTP was not enabled. Enabling it fixed the importing issue.
In addition to sending duplicate requests, this bug is also related to paritytech/polkadot-sdk#556. Peer count is reported incorrectly to be 40 and blocks requests work because the block request handler doesn't check if the peer is accepted by the syncing code (which to me sounds like a security bug) and syncing works for the duration of warp sync but when it starts to download blocks and announcing them, it notices that the block announcement substream has been closed and then terminates the substream on its end too. That issue is currently blocked by https://github.com/paritytech/substrate/pull/12828 but when it's merged, I'll continue the implementation of custom handshakes.
I think I also found reason the why the code sends duplicate requests even if there are measures in place to prevent that. If peer sends a block request and then syncing gets restarted, the block request sent before the restart is not discarded but received normally which will then reset the peer to be available again and send another request. Then the request after the sync restart is received which does the same thing which basically results two requests being constantly in active. The block request scheduling can then request the same range in two different requests from the same peer which causes the connection to be closed.
I will try to reproduce it with a test but here is a log showing the bug:
@altonen Is there a way to reproduce this log? The duplicate request sure looks suspicious but I can't reproduce it.
I will try to reproduce it with a test but here is a log showing the bug:
@altonen Is there a way to reproduce this log? The duplicate request sure looks suspicious but I can't reproduce it.
I've been sidetracked with another issue that I think is the central issue of this bug report. The problem seems to be that there nearly all full node slots are occupied on the network but this is not noticed until the local node sends a block announcement. The reason for duplicate requests was a recent change in moving block requests to ChainSync
(made by myself) which didn't properly clear outbound requests upon sync restart. The clock drift I had on my computer exacerbated that issue but after fixing that bug, I still had the same behavior of loosing peers after warp sync is complete (the original issue). We're currently working on a fix for that and I'll fix the duplicate request bug after this other thing is fixed.
a recent change in moving block requests to
ChainSync
this one? https://github.com/paritytech/substrate/pull/12739
Hi, @altonen. We are experiencing similar issues and are looking for potential solutions to stabilize peer connectivity. Do we have any clue how to resolve the problem?
Hi, @altonen. We are experiencing similar issues and are looking for potential solutions to stabilize peer connectivity. Do we have any clue how to resolve the problem?
@dmitry-markin @lexnv
Hi, @altonen. We are experiencing similar issues and are looking for potential solutions to stabilize peer connectivity. Do we have any clue how to resolve the problem?
Hi @felixshu could you post the logs of the moment when your node loses peers (+ a couple of minutes before), the command line that you use to start the node, and the version/commit-id you are running?
Start a new warp sync using latest polkadot+substrate master with an empty db dir in GCP. The nodes keeps loosing almost all peers every couple of minutes.
Notice how the number of peers jumps from 20+ to 2. Unfortunately our network protocol does not include reason for why other peers decide to disconnect. However it looks like this is happening after we announce latest blocks to other peers. It could be that something was added to the block announcement that broke network protocol compatibility because of this code here: https://github.com/paritytech/substrate/blob/ded44948e2d5a398abcb4e342b0513cb690961bb/primitives/consensus/common/src/block_validation.rs#L89