paritytech / polkadot-sdk

libp2p may be falsely re-establishing connections for reserved peers, or dropping messages #1499

Open rphmeier opened 1 year ago

rphmeier commented 1 year ago

In testing the Polkadot collator-protocol for asynchronous backing, we have noticed some anomalous behavior: collators are sometimes unable to broadcast their collations to validators because validators don't send them a ViewUpdate message (which they are supposed to send upon initiating the connection).

The behavior that we're seeing, repeatedly, is that collators initiating a connection immediately following a disconnection appear to open up a substream that is at least initially one-sided: the collator sends its view message, but receives nothing in return and therefore does not advertise its collation.

The underlying code is using ProtocolHandle::set_reserved_peers with some validators to connect to. The (collator) node is not a validator, and the validators do not put this node in their reserved_peers. The validators also disconnect the collator after some time, after which the collator is likely to initiate a reconnect. This reconnection is what appears tenuous. Versi is currently running #1410, which contains the debugging logs seen in the Grafana links here.
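For context, here is a minimal sketch of that pattern; the CollationHandle trait and function name are hypothetical stand-ins, with only set_reserved_peers taken from the issue:

```rust
use std::collections::HashSet;

/// Hypothetical opaque peer identifier, standing in for libp2p's PeerId.
#[derive(Clone, PartialEq, Eq, Hash)]
struct PeerId(String);

/// Hypothetical stand-in for the handle the collator drives; the real
/// code calls ProtocolHandle::set_reserved_peers on the Collation set.
trait CollationHandle {
    fn set_reserved_peers(&self, peers: HashSet<PeerId>);
}

/// The collator points the reserved set at the validators it wants to
/// reach. The validators never add the collator as reserved on their
/// side and disconnect it after some time, so the collator is the side
/// that re-dials, and that re-dial is where the anomaly shows up.
fn update_reserved_validators(handle: &impl CollationHandle, group: &[PeerId]) {
    handle.set_reserved_peers(group.iter().cloned().collect());
}
```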

https://grafana.teleport.parity.io/goto/X-qj7ViIR?orgId=1 This is the general query I have been using for debugging.

Between XX:48:00.083 and XX:48:06 in the following Grafana query you can see an example of this: the validators disconnect, the collator sees an immediate reconnect, but no view message is propagated: https://grafana.teleport.parity.io/goto/2K9d44mSR?orgId=1

Reproduced here:

```
2023-09-11T19:48:06-05:00 2023-09-12 00:48:06.006 DEBUG tokio-runtime-worker parachain::provisioner: bitfields count before selection bitfields_count=30 leaf_hash=0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.087 DEBUG tokio-runtime-worker parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=6 removed=0
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.087 DEBUG tokio-runtime-worker parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=6 failed_to_resolve=0
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.086 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Accepted collation, connecting to validators. para_id=501 candidate_relay_parent=0x3362…0e53 relay_parent_mode=Enabled { max_candidate_depth: 3, allowed_ancestry_len: 2 } candidate_hash=0x92387faecb5f1fcfa67e44d0c67369cf9f5e7e342c0e134ac8569dbb6ccbb669 pov_hash=0xc73fe48bc046ba53199c8faba99a752b20f8f6d647eec174db1e0182900f650a core=CoreIndex(0) current_validators=[Public(14fb42862c12d4676b60ce1e24b6c1ef814e0bf2845287ed6a0125aa061cbd3b (5CYDTKc9...)), Public(261f8cd2fe601c22a2792e5385ad2fb72d920954d0775007038b445d8e49121d (5Cvh3oUZ...)), Public(80cdef345604b0ed161800af8fdc1bde9248da3d4aeb498ad3beee00d826197d (5EybALCL...)), Public(363612c9b6a48afac272622c84876240324223caf9cd6b5f5dce807d1d56e305 (5DHnWJjE...)), Public(c2220ceedfb0bc06b2b66023beec32efb9ec2be9c1f17de34b66ed8188322e34 (5GTFG9KW...)), Public(c0cbfb352384c1986e702dfc24efa84717641628b8ae8740a3d84ff2ce056e74 (5GRVeRoY...))] traceID=194360645723336064205722654233368685007
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.086 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Received session info session_index=3330
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.067 DEBUG tokio-runtime-worker parachain::prospective-parachains: Creating fragment tree relay_parent=0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53 min_relay_parent=66473 para_id=Id(501)
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.064 DEBUG tokio-runtime-worker parachain::availability-store: Candidate backed candidate_hash=0x22d4d68bcc317e4f8b214d0fa25de9db8bcafb7271897501a35fb53ee3a03948 traceID=46298870302261069930402068708525468123
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.061 INFO tokio-runtime-worker substrate: ✨ Imported #66475 (0x3362…0e53)
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.044 DEBUG tokio-runtime-worker parachain::prospective-parachains: Creating fragment tree relay_parent=0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6 min_relay_parent=66473 para_id=Id(501)
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.042 DEBUG tokio-runtime-worker parachain::availability-store: Candidate backed candidate_hash=0x22d4d68bcc317e4f8b214d0fa25de9db8bcafb7271897501a35fb53ee3a03948 traceID=46298870302261069930402068708525468123
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.037 INFO tokio-runtime-worker substrate: ✨ Imported #66475 (0x28bb…cfa6)
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.082 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWLQNS8ULhha5LLTWnSV3DHeXBfUFqr8vjNPGM9SAzBdgu") view=View { heads: [], finalized_number: 0 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.082 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer connected peer_id=PeerId("12D3KooWLQNS8ULhha5LLTWnSV3DHeXBfUFqr8vjNPGM9SAzBdgu") observed_role=Authority maybe_authority=Some({Public(c2220ceedfb0bc06b2b66023beec32efb9ec2be9c1f17de34b66ed8188322e34 (5GTFG9KW...))})
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.082 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWLQNS8ULhha5LLTWnSV3DHeXBfUFqr8vjNPGM9SAzBdgu") local_view=View { heads: [0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6, 0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53], finalized_number: 66472 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.082 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWLQNS8ULhha5LLTWnSV3DHeXBfUFqr8vjNPGM9SAzBdgu") role=Authority
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.078 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu") view=View { heads: [], finalized_number: 0 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.078 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer connected peer_id=PeerId("12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu") observed_role=Authority maybe_authority=Some({Public(c0cbfb352384c1986e702dfc24efa84717641628b8ae8740a3d84ff2ce056e74 (5GRVeRoY...))})
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.077 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu") local_view=View { heads: [0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6, 0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53], finalized_number: 66472 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.077 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu") role=Authority
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.077 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWDs275UdeFaprVeRS4QyB43dJnLfZPrVWigryKi6Zzrp7") view=View { heads: [], finalized_number: 0 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.077 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWDs275UdeFaprVeRS4QyB43dJnLfZPrVWigryKi6Zzrp7") local_view=View { heads: [0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6, 0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53], finalized_number: 66472 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.077 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer connected peer_id=PeerId("12D3KooWDs275UdeFaprVeRS4QyB43dJnLfZPrVWigryKi6Zzrp7") observed_role=Authority maybe_authority=Some({Public(363612c9b6a48afac272622c84876240324223caf9cd6b5f5dce807d1d56e305 (5DHnWJjE...))})
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.077 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWHScCvPM5MeaF14t4DmP2RWhGoC6M1fmiinkZcVeTh8JG") view=View { heads: [], finalized_number: 0 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer connected peer_id=PeerId("12D3KooWHScCvPM5MeaF14t4DmP2RWhGoC6M1fmiinkZcVeTh8JG") observed_role=Authority maybe_authority=Some({Public(14fb42862c12d4676b60ce1e24b6c1ef814e0bf2845287ed6a0125aa061cbd3b (5CYDTKc9...))})
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWDs275UdeFaprVeRS4QyB43dJnLfZPrVWigryKi6Zzrp7") role=Authority
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWHScCvPM5MeaF14t4DmP2RWhGoC6M1fmiinkZcVeTh8JG") local_view=View { heads: [0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6, 0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53], finalized_number: 66472 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") view=View { heads: [], finalized_number: 0 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer connected peer_id=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") observed_role=Authority maybe_authority=Some({Public(80cdef345604b0ed161800af8fdc1bde9248da3d4aeb498ad3beee00d826197d (5EybALCL...))})
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWHScCvPM5MeaF14t4DmP2RWhGoC6M1fmiinkZcVeTh8JG") role=Authority
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") local_view=View { heads: [0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6, 0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53], finalized_number: 66472 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F") view=View { heads: [], finalized_number: 0 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer connected peer_id=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F") observed_role=Authority maybe_authority=Some({Public(261f8cd2fe601c22a2792e5385ad2fb72d920954d0775007038b445d8e49121d (5Cvh3oUZ...))})
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") role=Authority
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F") local_view=View { heads: [0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6, 0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53], finalized_number: 66472 }
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.076 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F") role=Authority
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.075 DEBUG tokio-runtime-worker parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=6 removed=0
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.075 DEBUG tokio-runtime-worker parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=6 failed_to_resolve=0
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.074 DEBUG tokio-runtime-worker aura::cumulus: [Parachain] Slot claimed. Building relay_parent=0x3362519ec9ae3ec65bc6f0b0cd7ee6676fc5cf828c32e9ad2928692f13be0e53 unincluded_segment_len=2
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.074 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer disconnected peer_id=PeerId("12D3KooWDs275UdeFaprVeRS4QyB43dJnLfZPrVWigryKi6Zzrp7") is_validator=true
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.074 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer disconnected peer_id=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F") is_validator=true
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.074 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer disconnected peer_id=PeerId("12D3KooWHScCvPM5MeaF14t4DmP2RWhGoC6M1fmiinkZcVeTh8JG") is_validator=true
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.074 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer disconnected peer_id=PeerId("12D3KooWLQNS8ULhha5LLTWnSV3DHeXBfUFqr8vjNPGM9SAzBdgu") is_validator=true
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.074 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer disconnected peer_id=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") is_validator=true
2023-09-11T19:48:00-05:00 2023-09-12 00:48:00.074 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer disconnected peer_id=PeerId("12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu") is_validator=true
```

For an example of how it should look, with logs on both sides:

```
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.196 DEBUG tokio-runtime-worker parachain::collator-protocol: Received advertise collation peer_id=PeerId("12D3KooWPL1JNeXKWFh8FNnhexqgPeC4kjKTwHP4EZ5vETXrt3WD") para_id=501 relay_parent=0x6b81b6f29bd8a3a65a1c3eb85ffc7a3c1a3e5064304f26e18018170ab7c679a9
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.196 DEBUG tokio-runtime-worker parachain::collator-protocol: Seconding is not allowed by backing, queueing advertisement relay_parent=0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e para_id=Id(501) candidate_hash=0x952c476acf9f6be90ef8fb3b3203afce3d747b7875bac55dd37c08d13657900d traceID=198284880947251209089663558824344399822
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.195 DEBUG tokio-runtime-worker parachain::collator-protocol: Declared as collator for current para peer_id=PeerId("12D3KooWPL1JNeXKWFh8FNnhexqgPeC4kjKTwHP4EZ5vETXrt3WD") collator_id=Public(7480aee452f1a733e0bb31951e4f52ffeacc1dc88c4cd33eccf029c5c5380c1f (13dknL1S...)) para_id=Id(501)
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.195 DEBUG tokio-runtime-worker parachain::network-bridge-rx: Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWPL1JNeXKWFh8FNnhexqgPeC4kjKTwHP4EZ5vETXrt3WD") local_view=View { heads: [0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e], finalized_number: 66592 }
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.195 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWPL1JNeXKWFh8FNnhexqgPeC4kjKTwHP4EZ5vETXrt3WD") role=Full
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.196 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] peer=PeerId("12D3KooWM26dTEt9hgY6PJB8hFCeFa11iUTaPbnUXbzTRZL2Pyma") old_view=View { heads: [], finalized_number: 0 } new_view=View { heads: [0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e], finalized_number: 66592 } action="NewViewForDebug"
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.196 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Advertising collation. relay_parent=0x6b81b6f29bd8a3a65a1c3eb85ffc7a3c1a3e5064304f26e18018170ab7c679a9 candidate_hash=0x7ee2e7e06010a62d3abf8c6851ba6b5a665bed23907cc83e522d803b1c188469 peer_id=12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu traceID=168660889572385806264142991431455894362
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.196 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Advertising collation. relay_parent=0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e candidate_hash=0x952c476acf9f6be90ef8fb3b3203afce3d747b7875bac55dd37c08d13657900d peer_id=12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu traceID=198284880947251209089663558824344399822
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.196 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change: new leaf, advertising for given relay parents leaf=0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e relay_parents=[0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e, 0x701e161a8745f4edb8fc2420d36f652c5bcc63acff60793279c99605d250a462, 0x6b81b6f29bd8a3a65a1c3eb85ffc7a3c1a3e5064304f26e18018170ab7c679a9]
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.196 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") view=View { heads: [0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e], finalized_number: 66592 }
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.195 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer view change peer_id=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") view=View { heads: [], finalized_number: 0 }
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.195 DEBUG tokio-runtime-worker parachain::network-bridge-rx: [Relaychain] Informing collation peer of our view action="PeerConnected" peer=PeerId("12D3KooWM26dTEt9hgY6PJB8hFCeFa11iUTaPbnUXbzTRZL2Pyma") local_view=View { heads: [0x6e358bbb5d766472550a791a4c9ae59f3e3bedec980b8d350916ee12bb0fcf6e], finalized_number: 66592 }
2023-09-11T20:00:00-05:00 2023-09-12 01:00:00.195 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Peer connected peer_id=PeerId("12D3KooWKHDhC7HiNndfaZxU7atamfUorsfeFJiw9tVQSE4qveSu") observed_role=Authority maybe_authority=Some({Public(80cdef345604b0ed161800af8fdc1bde9248da3d4aeb498ad3beee00d826197d (5EybALCL...))})
```

The expected behavior is that the NotificationStreamOpened event should not land on one side unless it is clear that it will land on both sides, and that no messages will be dropped.

bkchr commented 1 year ago

@paritytech/networking can you look into this with high priority? AFAIR you made some changes to substream handling recently?

altonen commented 1 year ago

@bkchr the last change to anything related to substreams was the Peerset refactoring in June. We haven't merged anything since.

This sounds a little like the finality-lag issue we were seeing earlier this year, where the substream was one-sided but the node was not able to detect it because it didn't have anything to send.
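A generic way to surface that failure mode would be a keep-alive probe: a dead substream only errors on the next write, so periodically forcing a write bounds how long the breakage stays invisible. A minimal sketch, assuming the protocol can tolerate a no-op message (this is not something Substrate does today):

```rust
use std::time::{Duration, Instant};

/// Tracks when we last wrote to a peer so we know when to send a no-op
/// probe; the probe's only purpose is to force a write error to surface
/// if the substream is secretly dead.
struct KeepAlive {
    last_sent: Instant,
    interval: Duration,
}

impl KeepAlive {
    fn new(interval: Duration) -> Self {
        Self { last_sent: Instant::now(), interval }
    }

    /// Call after every real (or probe) send.
    fn record_send(&mut self) {
        self.last_sent = Instant::now();
    }

    /// True if it's time to send a probe.
    fn probe_due(&self) -> bool {
        self.last_sent.elapsed() >= self.interval
    }
}
```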

altonen commented 1 year ago

@rphmeier @sandreim

I have a couple of questions:

This immediate reconnection after a disconnection is suspicious, because there should be a back-off mechanism for closed substreams. The back-off does not apply if the whole connection is closed, which would indicate that the validator/collator pair had only one protocol open and that the validator/collator was redialed immediately after the connection was closed.
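The invariant being described could be pinned down roughly as follows (a hypothetical sketch of the expected back-off, not the actual Peerset code):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical peer identifier.
#[derive(Clone, PartialEq, Eq, Hash)]
struct PeerId(String);

/// Records when each peer's substream was closed and refuses to re-dial
/// inside the back-off window. The logs above suggest this window is
/// skipped when the whole connection, not just the substream, goes down.
struct Backoff {
    closed_at: HashMap<PeerId, Instant>,
    window: Duration,
}

impl Backoff {
    fn new(window: Duration) -> Self {
        Self { closed_at: HashMap::new(), window }
    }

    fn on_substream_closed(&mut self, peer: PeerId) {
        self.closed_at.insert(peer, Instant::now());
    }

    /// A dial should only proceed once the window has elapsed.
    fn may_dial(&self, peer: &PeerId) -> bool {
        self.closed_at
            .get(peer)
            .map_or(true, |closed| closed.elapsed() >= self.window)
    }
}
```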

altonen commented 1 year ago

What I find weird is that authority discovery is doing something that doesn't make sense to me:

00:48:00.076 parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F") role=Authority
00:48:00.075 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=6 removed=0
00:48:00.075 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=6 failed_to_resolve=0
00:48:00.074 parachain::collator-protocol: [Relaychain] Peer disconnected peer_id=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F") is_validator=true
00:48:00.072 parachain::collator-protocol: [Relaychain] Advertising collation. relay_parent=0x28bb60d46950f54bc37edaeed644b785955b8988dd3d5aa2d6066b0966fbcfa6 candidate_hash=0x93056876796f64dd62aac272ea69d89f23b0e0d6dfbe349536990898a753b3ad peer_id=12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F traceID=195424595621779617186766630064172423327
00:48:00.072 parachain::network-bridge-rx: [Relaychain] action="PeerDisconnected" peer_set=Collation peer=PeerId("12D3KooWAx6E45vmY2ucom9Bd7A2rCPvcfFDLyvZa4xxNAZbJg8F")
00:48:00.072 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=0 removed=6
00:48:00.072 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=0 failed_to_resolve=0

It keeps disconnecting and then reconnecting to the same nodes. I'm not sure why it disconnects the peer in the first place, but reconnecting to it a few milliseconds later doesn't sound correct. Any comment on this?

Grafana for validator Grafana for collator

One issue in the current implementation of Notifications is that it doesn't inform the protocol that the inbound substream is closed (which is why we have the peer eviction hack in SyncingEngine), so the connection is detected to be closed only if the node attempts to send data to the remote peer.

What I think can happen is this: if the local node closes a connection to the remote and opens it again before the remote node sends any data, the new inbound substream (from the remote's PoV) is discarded, and the local node notices this only when it tries to send data. Since Notifications allows half-open connections, the inbound substream towards the local node is never opened, because ProtocolController is blissfully unaware that there is a new connection attempt; from its point of view the connection is already open. This can then result in a pair of substreams, neither of which is writable, and the closed connection is detected by both nodes only when they attempt to send something.

I need to check whether any of that is actually true, but I think this ping-pong in the connection state caused by the authority discovery is playing a role here.
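To make the suspected blind spot concrete, here is a toy model of it; this is my reading of the hypothesis above, not the real Notifications/ProtocolController code:

```rust
use std::collections::HashMap;

/// Hypothetical peer identifier.
#[derive(Clone, PartialEq, Eq, Hash)]
struct PeerId(String);

/// Toy model: the controller tracks only "is this peer connected", so a
/// re-dial that races with a close is indistinguishable from the stale
/// connection it replaces.
#[derive(Default)]
struct ControllerModel {
    connected: HashMap<PeerId, bool>,
}

impl ControllerModel {
    /// Remote opened (or re-opened) an inbound substream.
    fn on_open_desired(&mut self, peer: PeerId) {
        let entry = self.connected.entry(peer).or_insert(false);
        if *entry {
            // The blind spot: the entry still says "connected", so the
            // fresh inbound substream from the re-dial is discarded and
            // no NotificationStreamOpened is emitted for it.
            return;
        }
        *entry = true;
        // ... accept the substream and emit NotificationStreamOpened.
    }
}
```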

vstakhov commented 1 year ago

is it possible to get sub-libp2p=trace log target enabled?

I have enabled that, what particular log lines should I check?

altonen commented 1 year ago

PSM <= Dropped or OpenDesiredByRemote. I'll go check them as well.

dmitry-markin commented 1 year ago

Just a guess: is it possible that a collator removes validators for some reason from the reserved peers and this causes the disconnect?

EDIT: Maybe it's expected behavior; if not, we'll need to also enable peerset=trace and look for messages like "Disconnecting previously reserved node ..."

rphmeier commented 1 year ago

are the collators and validators only connected over the collation protocol or do they have some other protocols open as well, e.g., syncing or GRANDPA? Should they have some other protocol open in addition to collation?

For the periods they are connected, they are likely connected on sync and on Validation (the latter they aren't really supposed to be, but they are, and it's not a problem per se).

is it only the collator that opens substream to a validator over the collation protocol and never vice versa?

That's correct. Validators do not add collators to reserved peers on any protocol, and collators add validators to their reserved peers on the Collation protocol only.

ordian commented 1 year ago

There's one potential issue with the gossip-support subsystem running on collators and potentially removing all validators from the reserved Validation set: https://github.com/paritytech/polkadot-sdk/blob/ee6eeb74668502798d40f0c83e3710300fa9368e/polkadot/node/network/gossip-support/src/lib.rs#L256 I think that might also trigger a disconnect. We should probably not even run gossip-support on collators.

altonen commented 1 year ago

Just a guess: is it possible that a collator removes validators for some reason from the reserved peers and this causes the disconnect?

EDIT: Maybe it's expected behavior; if not, we'll need to also enable peerset=trace and look for messages like "Disconnecting previously reserved node ..."

I found this one from the logs

15:06:24.062  WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWRJpmAB6diptmuYe3JEEu3vqcm9psZVXFjZLM45pFACJV from SetId(3).    
15:06:24.062 DEBUG tokio-runtime-worker parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=3 removed=3
15:06:24.062 DEBUG tokio-runtime-worker parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=3 failed_to_resolve=0

@ordian would that affect the collation set though?

ordian commented 1 year ago

I don't know if the current code affects the collation peerset; I expect it doesn't: https://github.com/paritytech/polkadot-sdk/blob/ee6eeb74668502798d40f0c83e3710300fa9368e/polkadot/node/network/bridge/src/validator_discovery.rs#L95-L100

But yeah, the issue might be the collator side issuing disconnect requests: https://github.com/paritytech/polkadot-sdk/blob/ee6eeb74668502798d40f0c83e3710300fa9368e/polkadot/node/network/collator-protocol/src/collator_side/validators_buffer.rs#L95

altonen commented 1 year ago

I think we should try disabling those disconnection requests, at least when they target a node that will be added again a few moments later. I don't see the purpose of them, but I could be misunderstanding something.

If those are disabled and the issue still persists, it would give us some new information.
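A sketch of the kind of guard this suggests, with hypothetical names: diff the next reserved set against the current one and only issue disconnects for peers that are actually leaving, so a peer removed and re-added within the same update is never touched.

```rust
use std::collections::HashSet;

/// Hypothetical peer identifier.
#[derive(Clone, PartialEq, Eq, Hash)]
struct PeerId(String);

/// Returns (peers to disconnect, peers to dial). Peers present in both
/// sets are left alone instead of being disconnected and re-dialed a
/// few milliseconds later.
fn reserved_set_delta(
    current: &HashSet<PeerId>,
    next: &HashSet<PeerId>,
) -> (Vec<PeerId>, Vec<PeerId>) {
    let to_disconnect = current.difference(next).cloned().collect();
    let to_dial = next.difference(current).cloned().collect();
    (to_disconnect, to_dial)
}
```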

rphmeier commented 1 year ago

I can see how the underlying validators_buffer logic could trigger this issue, and we can add a patch to work around it.

But this does not fix the underlying issue: the connection is falsely reported as "opened" just because it was closed moments before. Is there any good fix for that?

altonen commented 1 year ago

I don't know a good fix off the top of my head, but I'll keep looking. Disappointing results in #1526 though. It looks like the immediate reconnections are gone, but now authority discovery only finds one node, keeps disconnecting from and reconnecting to it, and sometimes switches which node it connects to. The back-off mechanism does look to be working, though, since PeerDisconnected is not immediately followed by PeerConnected for the same node.

It's also advertising collation after each new connection.

dmitry-markin commented 1 year ago

I've performed some tests, and it looks like @altonen's hypothesis is correct. I've added a custom (dummy) protocol here: https://github.com/paritytech/polkadot-sdk/pull/1541 and made one node connect, then disconnect, then connect again to another node. What happens is that the "client" node sees the second connection as successful, while the "server" node only recognizes the first connection.

Here are the relevant logs for the "client":

2023-09-13 17:28:16.603  INFO tokio-runtime-worker dummy: Dummy protocol, adding reserved peer: 12D3KooWDT2LMUNxtNiJPKLdwryfpVL8u7DoBiKnXGThcTcy5WJG
2023-09-13 17:28:16.607  INFO tokio-runtime-worker dummy: Dummy protocol connected, counter: 0, inbound: false, peer: 12D3KooWDT2LMUNxtNiJPKLdwryfpVL8u7DoBiKnXGThcTcy5WJG    
2023-09-13 17:28:16.607  INFO tokio-runtime-worker dummy: Dummy protocol, remove reserved peer 12D3KooWDT2LMUNxtNiJPKLdwryfpVL8u7DoBiKnXGThcTcy5WJG
2023-09-13 17:28:16.608  INFO tokio-runtime-worker dummy: Dummy protocol disconnected, peer: 12D3KooWDT2LMUNxtNiJPKLdwryfpVL8u7DoBiKnXGThcTcy5WJG    
2023-09-13 17:28:16.608  INFO tokio-runtime-worker dummy: Dummy protocol, add reserved peer again 12D3KooWDT2LMUNxtNiJPKLdwryfpVL8u7DoBiKnXGThcTcy5WJG
2023-09-13 17:28:16.610  INFO tokio-runtime-worker dummy: Dummy protocol connected, counter: 1, inbound: false, peer: 12D3KooWDT2LMUNxtNiJPKLdwryfpVL8u7DoBiKnXGThcTcy5WJG

For the "server":

2023-09-13 17:28:16.609  INFO tokio-runtime-worker dummy: Dummy protocol connected, counter: 0, inbound: true, peer: 12D3KooWFEmunNHUFfMRgrwNxw9ZLSjRKojVjGqvEZzCKm45BxQN
<no more logs about dummy>

(If we wait longer, the nodes get disconnected on the default peer set and the dummy protocol connection logic gets screwed, but this is not relevant.)

This can explain why we don't receive a ViewUpdate message when we connect to the validator for the second time: the validator simply doesn't know about this connection.

dmitry-markin commented 1 year ago

Merging https://github.com/paritytech/substrate/pull/13396 (since reverted) would fix the issue, but when we merged it (and fixed the bug), syncing broke, because it relied on this bug being present under conditions of insufficient inbound slots on the network.

altonen commented 1 year ago

There is at least one problem in the handling of reserved peers that I can find. Here are the relevant logs.

What happens is that at 17:04:24.116, 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu is added as a reserved peer; it connects, the peers exchange collation messages, the peer view is updated correctly (I presume), and two collations are advertised to it.

Then there are a few changes in the validator set, and I assume one of these updates doesn't contain 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu, but there's no way to confirm it. That theory is corroborated by the fact that later on, when another change in the validator set removes some peers, an attempt is made to remove 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu, but Peerset gives the following warning:

2023-09-12 17:04:48.066  WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu from SetId(3).    

After these updates to reserved peers (but before this peerset warning is emitted), the connection to 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu is still kept open, probably because the remote node allows non-reserved inbound peers: https://github.com/paritytech/polkadot-sdk/blob/61be78c621ab2fa390cd3bfc79c8307431d0ea90/polkadot/node/network/protocol/src/peer_set.rs#L103-L110

After another change in the set at 17:04:24.159 the connection is still kept open, but after this change a collation message is received from 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu and the following log is printed:

2023-09-12 17:04:24.181 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] received an unexpected `CollationSeconded`: unknown statement candidate_hash=0x2d2a406b48f38eb7152e7e4aa49d5026cd2433f175f4ecaf4876f91ad8e3e674 origin=PeerId("12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu") traceID=60034642852585063408885082690613497894

Since this is a debug log it could be harmless, but after the latest update to the peer set at 17:04:24.159, 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu is still sent collations, and 3 of the 5 messages received back result in this same "unknown statement" debug while two result in NewViewForDebug.

At 17:04:48.066 there is another change in the validator set, this time removing peers, and now the peerset warning is also printed, indicating that 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu is not counted as a reserved peer even though some subsystem in Polkadot apparently thinks it is. After 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu is disconnected, a reconnection is issued basically immediately, but no message is received from them. At 17:05:00.05 another message is sent, and this time the node notices that the substream is closed and the peer is "disconnected", mainly by informing the NetworkBridge and the collation subsystem.

After that, another reconnection is made; this time it works, and the nodes send each other messages.

I think this immediate disconnection/reconnection causes the peers to have differing views of the connection state, which leaves the connecting node with an inbound substream that is not active, causing it to receive no messages.

Could someone who knows this collator/validator discovery code better go over this explanation and see if I've misunderstood something about the process? Right now, based on what's in the logs, it looks like the ProtocolController in Substrate and the collator subsystem/NetworkBridge have different views of the currently connected reserved peers. I also think this business of disconnecting and immediately reconnecting is on the Polkadot side, and it doesn't seem sensible to me. But it also highlights a problem on the Substrate side: it should detect closed connections better and apply a proper back-off to reconnections, which apparently is not happening now.


Really nice work, Dmitry! I would prefer fixing https://github.com/paritytech/polkadot-sdk/issues/508 instead, since I've already once got my ass handed to me because of closed inbound substreams, and AFAIK there isn't a single protocol in the Polkadot SDK which uses this feature. Regardless of which substream we use to detect whether the connection is closed, the end result will be the same: some people may have zero peers if they don't open their firewall. FWIW, https://github.com/paritytech/polkadot-sdk/pull/1370 will cause the same thing to happen, because the block announcement handshake will fail if the remote node is full.

rphmeier commented 1 year ago

Then there are a few changes in the validator set, and I assume one of these updates doesn't contain 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu, but there's no way to confirm it

In our scenario that is quite unlikely: eras are common, but the validator set is not changing.

Since this is a debug log it could be harmless, but after the latest update to the peer set at 17:04:24.159, 12D3KooWSj4FWbJSdtAQR1NhE8S7V5792Z6rxddvAgUiwZJ4bZcu is still sent collations, and 3 of the 5 messages received back result in this same "unknown statement" debug while two result in NewViewForDebug.

This is unrelated and is addressed in https://github.com/paritytech/polkadot-sdk/pull/1475

I think this immediate disconnection/reconnection causes the peers to have differing views of the connection state, which leaves the connecting node with an inbound substream that is not active, causing it to receive no messages. Could someone who knows this collator/validator discovery code better go over this explanation and see if I've misunderstood something about the process?

Yes, this sounds like what we observed, but the network bridge is likely not the issue here (it simply wraps NotificationStream* events from the underlying network service and does no handling of reserved peers except forwarding requests to the authority discovery service).

Is the recommendation to merge #1563 in favor of #1526, and should this fix the issue?