paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

rococo asset-hub and others don't always advertise their collations #3519

Open alexggh opened 7 months ago

alexggh commented 7 months ago

Looking at block times for rococo-asset-hub and others we can see that there are moments when they decide not to advertise their collations to validators; the logs point to that:


2024-02-21 22:29:11.154 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Advertising collation. relay_parent=0x4d5b87ffa8d9cba09462a460539fcedbdf2bb876ebfc736a283978a0383348c8 candidate_hash=0xeb25ac57d43f3eccad16b0a5cae23054e72207c669da69972665462e56ce5cd4 peer_id=12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn traceID=312564189526210266554051532702187925588
2024-02-21 22:29:11.154 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Advertising collation. relay_parent=0x244c7422a2e3e542fbe96e1d3d0ee3c53af16278516294dd0094e7f8b3bba9b2 candidate_hash=0xaad65ae4deaa25590db854854d95480b03118f0864e1b66fa11053b26cf3321c peer_id=12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn traceID=227081754360958899348741667046089377803
2024-02-21 22:29:06.168  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x0fdff979355ef2974910c0cdb9c1736dc2cab86e56e1fc7a760f0b5295fb1fde pov_hash=0x65327906ba3fbb2144ab19cb4eb9c91a9c09d39652f2a4e5f0ab5a83bd82f415 traceID=21101362059343468297627011593342645101
2024-02-21 22:29:00.160  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0xd0a234db3548163f63b8736ff0592e77a655c551249f4e2aa1e6ed35869ca893 pov_hash=0xe7d9cf62f1163a261a9cc5b455118394c51e8dc8d54c4b6c52d563268211a141 traceID=277321647267101977391502449035806584439
2024-02-21 22:28:54.162  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x10d8dbe219d596fe4179e98d133ae619a52ba5e4134fd5beee4b2b42d5b79db5 pov_hash=0x159b57ed3f55b321ddcd1eb0e5d3b219de08e3d46dcb6cdf7fe6e490878addf4 traceID=22393643815265444164061667538671167001
2024-02-21 22:28:48.140  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x476bc46903194d9d49c425875dbea01daa02183f921b0a4dc873dbe0ffd995c1 pov_hash=0x8046229d7c9b29f713a1e5a91a9a721853dd9ecc6d02302400911e2c70b9d711 traceID=94934747136790677956209421564648529949
2024-02-21 22:28:48.140  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0xdd491d0c0d7328968a652ba1f8d6557d3c173b590fbfcbb4fb8cd985b254abcc pov_hash=0x8dfc7ee468e94b0b6c4a8c44a0e57502c0416fb13521be4eaa492f97d5e530fd traceID=294139013883918429666814951162104927613
2024-02-21 22:28:47.996  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0xe29f3817fb510ea6eb22c1ec0720ce958c6d191a7deb0117e8897b8214556ef4 pov_hash=0xb4c770cf5ec29a2d85791dece202f00d445c15219ae24a1323f314a1a9751d85 traceID=301232239962862296839600241828773809813

On rococo the validator group size is 1, so there isn't a lot of redundancy; still, this seems to happen way too often to be ignored. Additionally, asset-hub has async backing enabled, but that does not seem to be related because other parachains seem to be affected as well.

Looking at the logs it seems like sometimes the collators can't connect to their assigned validators.

Next steps

alexggh commented 7 months ago

Checked a few occasions; it seems the collators can't dial the validator they are assigned to. Some examples:

Collator was assigned to validator rococo-validator-a-node-7: https://grafana.teleport.parity.io/goto/dmUVC9oIg?orgId=1
Collator was assigned to validator rococo-validator-h-node-5: https://grafana.teleport.parity.io/goto/nWjOjroIg?orgId=1

The validators seem to be receiving collations when they get assigned to other parachains, e.g: https://grafana.teleport.parity.io/goto/rM-xC9TIR?orgId=1

Reach failure for Some(PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")) through "/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH": Other(Custom { kind: Other, error: Custom { kind: Other, error: Timeout } }) 

@dmitry-markin @altonen @BulatSaif any thoughts on whether this is a bug in our connecting logic or just an infrastructure problem where we don't have full connectivity to some nodes?

It seems like rococo-rockmine-collator-experimental-node-0 https://grafana.teleport.parity.io/goto/BUtrj9TIg?orgId=1 can connect to the same validators while the collators producing blocks cannot, so this leads me to think that it is just a network partition problem.

altonen commented 7 months ago

I was looking into the rococo-rockmine-collator-node-0 logs and there's something in them that I don't understand (logs are in reverse order):

2024-02-29 10:28:06.117 parachain::network-bridge-rx: [Relaychain] action="PeerDisconnected" peer_set=Collation peer=PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d")
2024-02-29 10:28:06.117 peerset: [Relaychain] Reserved peer 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d (Outbound) dropped from SetId(3).    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] External API <= Closed(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] PSM <= Dropped(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) <= Close(SetId(3))    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) => CloseDesired(SetId(3))    
2024-02-29 10:28:04.481 sub-libp2p: [Relaychain] Libp2p => Disconnected(12D3KooWQELiM7oVu7tQPgj51rWN733a1FoDEg2XVHQhzJyUCXYR, SetId(3), ConnectionId(9856)): Disabled.    
2024-02-29 10:28:04.104 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=1 removed=0
2024-02-29 10:28:04.104 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=1 failed_to_resolve=0
2024-02-29 10:28:04.100 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWRsMJngwJwc614i8Rbik2nsm9d4NuKTcY8cvRJH1bupsE, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/23006" }, ConnectionId(9910)): Not requested by PSM, disabling.    
2024-02-29 10:28:03.022  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x0f105c1e361e7ddba3e93e02820f39d50b21f3a1c694cf3962bc9e8d94e1f9d8 pov_hash=0xf3d1343ed98b05a2b765556253d3ba7c64308920182b74103eeadeb59cc67c42 traceID=20023365061787759369622147911581120981
2024-02-29 10:28:01.555 sub-libp2p: [Relaychain] Libp2p => Disconnected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), ConnectionId(9907)): Disabled.    
2024-02-29 10:28:01.469 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/33269" }, ConnectionId(9907)): Secondary connection. Leaving closed.    
2024-02-29 10:28:01.469 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/62370" }, ConnectionId(9906)): Not requested by PSM, disabling.    
2024-02-29 10:28:01.150 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWGX1RBgFn1rqAJDr1X5gm1wBcFPziAqXR7eTHYhBTBBE2, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/46197" }, ConnectionId(9905)): Not requested by PSM, disabling.    
2024-02-29 10:28:00.829 sub-libp2p: [Relaychain] External API <= Message(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:00.829 sub-libp2p: [Relaychain] Handler(ConnectionId(9881)) => Notification(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3), 38 bytes)    
2024-02-29 10:28:00.828 parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d") role=Authority
2024-02-29 10:28:00.828 sub-libp2p: [Relaychain] External API <= Open(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:00.828 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) => OpenResultOk(SetId(3))    
2024-02-29 10:28:00.827 sub-libp2p: [Relaychain] Handler(PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d"), ConnectionId(9881)) <= Open(SetId(3))    
2024-02-29 10:28:00.827 sub-libp2p: [Relaychain] PSM => Connect(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3)): Enabling connections.    
2024-02-29 10:28:00.826 peerset: [Relaychain] Connecting to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d on SetId(3) (0/0 num_out/max_out).    
2024-02-29 10:28:00.152 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=1 removed=0
2024-02-29 10:28:00.152 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=1 failed_to_resolve=0

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

I looked at some of the dial failures and I don't think they're a node problem, except maybe https://github.com/paritytech/polkadot-sdk/issues/498, which I also saw in some of them. But I agree that they look more like a network problem.

BulatSaif commented 7 months ago
Reach failure for Some(PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")) through "/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH": Other(Custom { kind: Other, error: Custom { kind: Other, error: Timeout } }) 

@dmitry-markin @altonen @BulatSaif any thoughts on whether this is a bug in our connecting logic or just an infrastructure problem where we don't have full connectivity to some nodes?

The port is incorrect: it should be 32685, not 30333. Fixed ports are only used on bootnodes. Also, rococo validators are on spot instances and may restart and change VM (IP). We use these two flags to set the address:

 --public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

30333 is the internal port and should not be shared.

Full CLI:

polkadot --name=rococo-validator-a-node-7 --base-path=/chain-data --keystore-path=/keystore --chain=rococo --validator --database=paritydb --state-pruning=1000 --prometheus-external --prometheus-port 9615 --unsafe-rpc-external --rpc-port=9944 --rpc-cors=all --rpc-methods=unsafe --node-key-file /keystore/node-key --log=parachain=debug --insecure-validator-i-know-what-i-do --enable-offchain-indexing true --public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333
BulatSaif commented 7 months ago

These logs look interesting:

The node sees the correct port:

2024-02-28 15:58:08.055  INFO main sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.90.232.201/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH    

but later it gets a new external address with an incorrect port.

2024-02-28 15:58:08.652  INFO tokio-runtime-worker sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.90.232.201/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH    
alexggh commented 7 months ago

With subp2p-explorer this is what I get:

authority="5C5jWPr5ygq4VgwfgsnrcHjp8H11xMT1gYXqR9a4oDJ7CeWF" 
peer_id=PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH") 
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-a-node-7)" 
version={
"/ip4/34.91.211.207/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"} 

It seems it is advertising everything, and all of the addresses do seem to be tried:

Failed to reach PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"): Failed to negotiate transport protocol(s): 
[(/ip4/35.204.235.25/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH: : Timeout has been reached)
(/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH: : Timeout has been reached)]  
alexggh commented 7 months ago

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

Yeah, that seems to be a problem as well. So there are two patterns: one where we can't connect to the validator, and the other where even if the collator is connected we aren't able to advertise the collation.

alexggh commented 7 months ago

@BulatSaif could you also enable the following, thank you!

parachain=trace
sandreim commented 7 months ago

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

Yeah, that seems to be a problem as well. So there are two patterns: one where we can't connect to the validator, and the other where even if the collator is connected we aren't able to advertise the collation.

One reason for collation not being advertised in this case is a missing PeerView update from the validator.

alexggh commented 7 months ago

Regarding the connectivity, something definitely seems to be off. For example, I can telnet to some of the ports here:

authority="5C5HkAmbAGC7nqv7Uo2o3LFxBsFZ4Q62av6L7YDJnD56oGq1" 
peer_id=PeerId("12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3") 
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-9)" 
version={"/ip4/35.204.163.222/tcp/30333/p2p/12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3", 
"/ip4/35.204.163.222/tcp/30156/p2p/12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3"} 

But I can't telnet here:

authority="5C5jWPr5ygq4VgwfgsnrcHjp8H11xMT1gYXqR9a4oDJ7CeWF" 
peer_id=PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH") 
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-a-node-7)" 
version={"/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.91.211.207/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"
} 
BulatSaif commented 7 months ago

@BulatSaif could you also enable the following, thank you!

parachain=trace

Done, enabled on rococo-validator-a.*

alexggh commented 7 months ago

@BulatSaif could you also enable the following, thank you!

parachain=trace

Done, enabled on rococo-validator-a.*

My bad, I wanted it on the asset-hub collators; could you please add it there as well. Thank you!

BulatSaif commented 7 months ago

My bad, I wanted it on the asset-hub collators; could you please add it there as well. Thank you!

Done :heavy_check_mark:

alexggh commented 7 months ago

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

Yeah, that seems to be a problem as well. So there are two patterns: one where we can't connect to the validator, and the other where even if the collator is connected we aren't able to advertise the collation.

One reason for collation not being advertised in this case is a missing PeerView update from the validator.

This was spot-on: it seems the collator and validator are not in sync when it comes to PeerConnected events for each other. This is why, even when there is a connection between the collator and the validator, the collation is not advertised, e.g.:

You can see that the collator Connects and Disconnects every time it has something to send to the validator, but the validator bridge doesn't get the same signals.

The validator keeps thinking the collator is connected while the collator thinks it is disconnected from the validator; this lasts for a few seconds and recovers after a few blocks.

So, I think we've got two problems on this flow:

  1. Why are we disconnecting so often from the validators we got allocated to? The answer is this code: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/collator-protocol/src/collator_side/mod.rs#L1472 which gets called every 4 seconds; if the previous collation had already been advertised we emit ConnectToValidators with 0 peers, which disconnects the currently connected ones (see the sketch after this list).

  2. Even with 1, why are they getting out of sync? It should still work, shouldn't it?
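A minimal sketch of the mechanism in 1 (names are illustrative, not the actual collator-protocol code): every reconnect tick the collator recomputes the validators it still needs to be connected to; once all pending collations have been advertised that set is empty, and a ConnectToValidators request with zero peers tells the network layer to drop the currently connected validators.

/// Illustrative only: which validators the collator asks the network bridge
/// to stay connected to on the periodic reconnect tick.
fn validators_to_connect<Id: Clone>(
    all_pending_collations_advertised: bool,
    assigned_group: &[Id],
) -> Vec<Id> {
    if all_pending_collations_advertised {
        // Nothing left to advertise -> request 0 peers -> the network layer
        // disconnects the validators we are currently connected to.
        Vec::new()
    } else {
        assigned_group.to_vec()
    }
}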

sandreim commented 7 months ago

One possible fix would be to keep connections to the backing group open as long as the group doesn't change. From the validator's perspective we would only disconnect inactive collators after 4 blocks, so we should be fine.

CollatorEvictionPolicy {
    inactive_collator: Duration::from_secs(24),
    undeclared: Duration::from_secs(1),
}
alexggh commented 7 months ago

The disconnects on the collator side happen because of this: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/src/protocol_controller.rs#L520

let disconnect = self.reserved_only ||
    match direction {
        Direction::Inbound => self.num_in >= self.max_in,
        Direction::Outbound => self.num_out >= self.max_out,
    };

if disconnect {
    // Disconnect the node.
    trace!(
        target: LOG_TARGET,
        "Disconnecting previously reserved node {peer_id} ({direction:?}) on {:?}.",
        self.set_id,
    );
    self.drop_connection(peer_id);
}
Because the collator removes all peers from reserved_peers every RECONNECT_TIMEOUT. I guess a quick workaround to prevent this condition would be to increase max_in and max_out for collators; @BulatSaif any idea what values we use there for asset-hub?

altonen commented 7 months ago

Nothing prevents other nodes from taking those extra slots, so I'm not sure how effective that fix is. If you want to stay connected to the peer, why can't it be kept as a reserved peer?

alexggh commented 7 months ago

Nothing prevents other nodes from taking those extra slots, so I'm not sure how effective that fix is. If you want to stay connected to the peer, why can't it be kept as a reserved peer?

I agree that we should keep them as reserved; I was just looking for a quick way to test that, if collators don't disconnect so often from validators, we prevent them from getting out of sync and missing collation advertisements.

My question for you @altonen: is the out-of-sync behaviour I describe above just a flavour of the known issue you described here https://github.com/paritytech/polkadot-sdk/issues/2995#issuecomment-1900340823 ?

altonen commented 7 months ago

Yeah it could be the same issue, at least by the sound of it. There is actually a plan to finally fix it, here's some information: https://github.com/paritytech/devops/issues/3164#issuecomment-1968623373

If you want/can, you could try deploying #3426 on one of the rococo collators/validators and see if the issue disappears. If the issue is the one discussed in #2995, it should disappear with #3426

alexggh commented 6 months ago

@BulatSaif: Can you help me with deploying asset-hub on rococo from https://github.com/paritytech/polkadot-sdk/pull/3544/files, so I can check if it improves the situation.

alexggh commented 6 months ago

Over the weekend I saw a lot of dial failures because of this Unexpected peer ID error:

Libp2p => Failed to reach PeerId("12D3KooWRFeDwgX2BHz2jstWUFj61zQFfCZzed6ZbVqfM5K5Wo5k"): Dial error: Unexpected peer ID 12D3KooWP93Dzk8T7GWxyWw9jhLcz8Pksokk3R9vL2eEH337bNkT at Dialer { address: "/ip4/34.147.118.161/tcp/30333/p2p/12D3KooWRFeDwgX2BHz2jstWUFj61zQFfCZzed6ZbVqfM5K5Wo5k", role_override: Dialer }.   
altonen commented 6 months ago

It's a peer ID mismatch which can happen because the DHT contains stale peer records and when the local node attempts to dial a peer with a stale record, the dial may succeed but the Noise handshake fails because the peer is using a new ID.

alexggh commented 6 months ago

Looked a bit at this error: https://grafana.teleport.parity.io/goto/6Nz6bIASR?orgId=1

Connection attempt to PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav") 
failed with WrongPeerId { obtained: PeerId("12D3KooWMh2imeAzsZKGQgm2cv6Uoep3GBYtwGfujt1bs5YfVzkH"),
endpoint: Dialer { 
address: "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", 
role_override: Dialer } }.

When I look with subp2p-explorer, there are multiple peers reporting the same IP and port; I wouldn't expect that to happen. E.g. 34.32.137.4/tcp/30333 seems to be discovered on: rococo-validator-f-node-6, rococo-validator-d-node-7, rococo-validator-e-node-5, rococo-validator-d-node-5, rococo-validator-i-node-3

164:authority="5CyAwh1sbGPqneT1ZdpMArRMLEf69RL6iThLxAxt2ZAuCHnD" peer_id=PeerId("12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-6)" version={"/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.89.111/tcp/32012/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.89.111/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.234.172.171/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn"} 
167:authority="5DLx7NUaNEVKazF8vVCcXKMqYRdV5t9Hpaxn19FsM7YSzzEb" peer_id=PeerId("12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-7)" version={"/ip4/34.141.146.74/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.89.111/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.91.205.11/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.90.84.170/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.91.151.87/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.147.59.8/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.90.210.63/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.89.111/tcp/30737/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp"} 
225:authority="5GMtYUSu9Lf5DkFAj8XiuZ4GxJURu9xofkvfs4Kx1xzhQoYn" peer_id=PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-1)" version={"/ip4/35.204.19.37/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.19.37/tcp/31532/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav"} 
230:authority="5GYnjEgkfjG2XK52XuZ3dUqha1QqXPkhpGhBvfeQnrF8EZ6X" peer_id=PeerId("12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-e-node-5)" version={"/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.234.172.171/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.204.237.204/tcp/32356/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs"} 
239:authority="5GukpQ8z2EeUf6oauwkAaZfHqkEewZryTeLAwAZnkbCkFch2" peer_id=PeerId("12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-5)" version={"/ip4/35.204.134.137/tcp/32315/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.205.11/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.90.84.170/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.147.59.8/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.147.83.199/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.151.87/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD"} 
249:authority="5HGraXWF7cRJatExhny7n2D4BQsLAddnngQnn5DhvDbkzAxS" peer_id=PeerId("12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-i-node-3)" version={"/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.147.63.230/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.91.196.154/tcp/30792/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.90.92.174/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA"} 

Is this expected?

alexggh commented 6 months ago

Updates

The current state of affairs is that validators are publishing to the DHT ports and IPs that are not meant to be used from outside, so collators will randomly fail to connect to validators when using these ports and IPs. E.g.: https://grafana.teleport.parity.io/goto/aMX9dQAIR?orgId=1

The addresses the collator sees for that node are all the IPs that the validator pod had allocated during its lifetime:

Addresses of PeerId("12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7"): [
"/ip4/34.90.232.201/tcp/30032", "/ip4/34.147.63.230/tcp/30333", "/ip4/35.234.172.171/tcp/30333", "/ip4/34.90.232.201/tcp/30333", "/ip4/35.204.190.160/tcp/30333", "/ip4/34.147.36.17/tcp/30333", 
"/ip4/34.141.218.8/tcp/30333", "/ip4/35.234.160.139/tcp/30333", "/ip4/34.91.123.48/tcp/30333", 
"/ip4/34.90.210.63/tcp/30333", "/ip4/34.90.1.96/tcp/30333",
 "/ip4/34.90.1.96/tcp/30032", "/ip4/34.147.95.62/tcp/30333", 
"/ip4/34.147.95.62/tcp/30032", "/ip4/34.90.62.145/tcp/30333", 
"/ip4/34.34.35.247/tcp/30333", "/ip4/34.32.249.188/tcp/30032",
 "/ip4/34.32.249.188/tcp/30333"
] 

The fix here should be to find a way not to expose these addresses externally; @altonen suggested something like this:

We could introduce a CLI flag --no-external-address-discovery that would call
 [Swarm::remove_external_address()](https://docs.rs/libp2p/latest/libp2p/struct.Swarm.html#method.remove_external_address) [here](https://github.com/paritytech/polkadot-sdk/blob/1c435e91c117b877c803427a91c0ccd18c527382/substrate/client/network/src/discovery.rs#L642-L658) 
when a new address is discovered, unless it discovers one of the specified addresses. 

Based on the issue we saw in December with the misconfigured proxy, https://github.com/paritytech/polkadot-sdk/issues/2523, I'd assume people external to Parity might be interested in this feature as well.
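A rough sketch of how that flag could work, assuming libp2p's Swarm::remove_external_address API; the function and parameter names are illustrative, not the actual polkadot-sdk discovery code:

use libp2p::{swarm::{NetworkBehaviour, Swarm}, Multiaddr};

/// Illustrative only: drop any discovered external address that was not
/// explicitly configured via `--public-addr`, so it never gets published.
fn on_external_address_discovered<B: NetworkBehaviour>(
    swarm: &mut Swarm<B>,
    public_addrs: &[Multiaddr],
    discovered: &Multiaddr,
    no_external_address_discovery: bool,
) {
    if no_external_address_discovery && !public_addrs.contains(discovered) {
        swarm.remove_external_address(discovered);
    }
}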

@BulatSaif @PierreBesson @altonen : Feel free to add more context.

bkchr commented 6 months ago

The addresses the collator sees for that node are all the IPs that the validator pod had allocated during its lifetime.

Maybe we need to prune old addresses more eagerly. However, I don't get how the collator is not able to connect to the validator? Or do you want to tell me that none of the addresses in this list is actually the address of the validator right now? Otherwise I would expect that when I want to connect to the node, it will try all the addresses.

alexggh commented 6 months ago

The addresses the collator sees for that node are all the IPs that the validator pod had allocated during its lifetime.

Maybe we need to prune old addresses more eagerly.

I'm really noob around that part, so I'm fine with anything that would get us out of this flaky state.

However, I don't get how the collator is not able to connect to the validator? Or do you want to tell me that none of the addresses in this list is actually the address of the validator right now? Otherwise I would expect that when I want to connect to the node, it will try all the addresses.

In this case none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is allocated somewhere else.

Connection attempt to PeerId("12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7") failed with WrongPeerId { obtained: PeerId("12D3KooWM56JbKWAXsDyWh313z73aKYVMp1Hj2nSnAKY3q6MnoC9"), endpoint: Dialer { address: "/ip4/34.90.210.63/tcp/30333/p2p/12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7", role_override: Dialer } }.

However, in the past days I did find cases where even when trying an expected port the dial was failing.

bkchr commented 6 months ago

In this case none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is allocated somewhere else.

This sounds like the "underlying bug" to me. I mean given the way nodes collect these addresses, we should not stop trying to connect because the peerid behind some address is something different. We need to try to connect to all addresses until we have found the peer we are searching for or we tried all addresses.

bkchr commented 6 months ago

@dmitry-markin @lexnv this is something we should fix ASAP. It should also not be that complicated to fix. (Assuming it is not buried somewhere really deep inside libp2p)

dmitry-markin commented 6 months ago

There are three things we should try:

  1. Check if https://github.com/paritytech/polkadot-sdk/pull/3426 is safe for merging (real life tests are needed on Polkadot+Kusama+Westend, as in the past we had issues when there were not enough inbound slots on the network) and if it improves the situation with validators not noticing collator reconnections.
  2. Implement what was proposed above with --no-external-address-discovery to safeguard against incorrectly reported external addresses by remote nodes.
  3. Understand why none of the advertised addresses is the correct external address:

    In this case none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is allocated somewhere else.

Bullet 3 is the one I understand the least, as the node should report at least one correct address. It is IMO the most important one, as it directly affects connectivity.

dmitry-markin commented 6 months ago

In this case none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is allocated somewhere else.

This sounds like the "underlying bug" to me. I mean given the way nodes collect these addresses, we should not stop trying to connect because the peerid behind some address is something different. We need to try to connect to all addresses until we have found the peer we are searching for or we tried all addresses.

libp2p concurrently dials 8 known addresses: https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool/concurrent_dial.rs#L56-L61 https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool.rs#L1132 And tries the rest as soon as some dial fails: https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool/concurrent_dial.rs#L91-L96

In substrate, we just log such errors once the connection is finally established: https://github.com/paritytech/polkadot-sdk/blob/b0f34e4b2953f814254f92c8d151f22ffcf9c034/substrate/client/network/src/service.rs#L1513-L1514

If we expect nodes to have more than 8 addresses in the DHT, it probably makes sense to increase the concurrency factor, as we currently don't set the transport timeout for p2p connections (and with default TCP SYN retries it's about 2 mins on Linux). I would also set the transport timeout to some defined smaller value, like 15-20 secs.
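A sketch of both knobs, assuming a libp2p 0.51-era SwarmBuilder and TransportTimeout; this is illustrative wiring, not the actual substrate service code:

use std::{num::NonZeroU8, time::Duration};
use libp2p::{
    core::{muxing::StreamMuxerBox, transport::{timeout::TransportTimeout, Boxed}},
    swarm::{NetworkBehaviour, Swarm, SwarmBuilder},
    PeerId, Transport,
};

/// Illustrative only: dial more known addresses in parallel and give up on
/// unreachable ones after 20 seconds instead of waiting for TCP SYN retries.
fn build_swarm<B: NetworkBehaviour>(
    transport: Boxed<(PeerId, StreamMuxerBox)>,
    behaviour: B,
    local_peer_id: PeerId,
) -> Swarm<B> {
    let transport = TransportTimeout::new(transport, Duration::from_secs(20)).boxed();
    SwarmBuilder::with_tokio_executor(transport, behaviour, local_peer_id)
        .dial_concurrency_factor(NonZeroU8::new(16).expect("16 is non-zero"))
        .build()
}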

dmitry-markin commented 6 months ago

Why are the validators changing their public addresses so frequently in the first place? Is it because they run on spot instances? Maybe we shouldn't use spot instances for validators?

alexggh commented 6 months ago

Why are the validators changing their public addresses so frequently in the first place? Is it because they run on spot instances? Maybe we shouldn't use spot instances for validators?

The arguments these nodes are started with are:

--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed in the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.

altonen commented 6 months ago

Extending Alex's reply, my understanding of the issue is that the pod private port 30333 is accidentally exposed, instead of the public address port, 32685. The private port could be exposed either by port reuse, whereby it binds to 30333 and the remote nodes somehow observe the node dialing from this address, or through address translation (link, link), where it translates the observed address to its listen address. To verify what is happening, you may need libp2p trace logs from both the listener and the dialer.

FWIW, I think there were two different issues: peer ID mismatch and connection refused.

dmitry-markin commented 6 months ago

From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed in the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.

Remote nodes learn about our addresses via the Identify protocol and add the reported addresses to the DHT. rust-libp2p combines listen addresses with external addresses and sends all of them via the Identify message: https://github.com/libp2p/rust-libp2p/blob/v0.51/protocols/identify/src/behaviour.rs#L341-L346 and we later add them to the DHT: https://github.com/paritytech/polkadot-sdk/blob/b0f34e4b2953f814254f92c8d151f22ffcf9c034/substrate/client/network/src/service.rs#L1415-L1419

What we can try doing to ensure remote nodes do not learn our listen addresses (and do not add them to the DHT) is to not report them to the Identify behaviour here: https://github.com/paritytech/polkadot-sdk/blob/b0f34e4b2953f814254f92c8d151f22ffcf9c034/substrate/client/network/src/peer_info.rs#L417
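A minimal sketch of that idea, with illustrative names (not the actual peer_info.rs code): only forward explicitly configured public addresses to Identify, so listen addresses never reach remote nodes and therefore never land in the DHT.

use libp2p::Multiaddr;

/// Illustrative only: the addresses we allow Identify to report to remote peers.
fn addresses_to_report(public_addrs: &[Multiaddr], listen_addrs: &[Multiaddr]) -> Vec<Multiaddr> {
    // Listen addresses are deliberately ignored; only `--public-addr` values are shared.
    let _ = listen_addrs;
    public_addrs.to_vec()
}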

dmitry-markin commented 6 months ago

Extending Alex's reply, my understanding of the issue is that the pod private port 30333 is accidentally exposed, instead of the public address port, 32685.

In this case it's not (only) an accident, but the current logic of Identify, which always publishes our listen addresses in addition to observed/set external addresses.

dmitry-markin commented 6 months ago

@alexggh Can you check if the issue is mitigated if a collator runs a patch from https://github.com/paritytech/polkadot-sdk/pull/3656?

We also have a fix that should remove rogue listen addresses from DHT for validators https://github.com/paritytech/polkadot-sdk/pull/3657, but need to run a burn-in for it to make sure it's safe.

alexggh commented 6 months ago

@alexggh Can you check if the issue is mitigated if a collator runs a patch from https://github.com/paritytech/polkadot-sdk/pull/3656?

@BulatSaif is our chief burn-in officer, can you please deploy asset-hub with the above PR. Thank you!

lexnv commented 6 months ago

Looking a bit at the logs Alex provided above:

Advertising DHT addresses

TL;DR: I don't think https://github.com/paritytech/polkadot-sdk/pull/3657 solves the issue here. @altonen @dmitry-markin could you double check please? 🙏

Different PeerID but same address

225:authority="5GMtYUSu9Lf5DkFAj8XiuZ4GxJURu9xofkvfs4Kx1xzhQoYn"
   peer_id=PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav")
   addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-1)" 
  "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav"

239:authority="5GukpQ8z2EeUf6oauwkAaZfHqkEewZryTeLAwAZnkbCkFch2"
   peer_id=PeerId("12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD")
   addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-5)" 
  "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD"

In this case, the DHT record contains the address /ip4/34.32.137.4/tcp/30333/p2p/ for different peer IDs 12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav and 12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD.

I'd have to look again at the subp2p-explorer output to make sure I report the data correctly there. Although Alex mentioned in https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1976608361 that the WrongPeerID error was encountered in our grafana logs.

This could happen when the nodes that we manage share the same external address. And they go down this code path:

https://github.com/paritytech/polkadot-sdk/blob/1ead59773e2dab336d2b54295419bbc3fe7c687f/substrate/client/authority-discovery/src/worker.rs#L329

From this line, if rococo-validator-f-node-1 and rococo-validator-d-node-5 share /ip4/34.32.137.4/tcp/30333, we'll add the peer ID of the local node and advertise that address in the DHT. And so we'll end up with two peers pointing at the same address.

I think this could happen if the nodes are placed under the same network (likely)? Would love to get your thoughts on this as I don't have that much context for the rococo validators here 🙏
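To illustrate the code path described above (simplified, not the exact authority-discovery worker code): if two validators end up observing the same external address, each appends its own peer ID before publishing, which yields two DHT records pointing at the same socket.

use libp2p::{multiaddr::Protocol, Multiaddr, PeerId};

/// Illustrative only: append the local peer ID to a (possibly shared)
/// external address before publishing it in the authority-discovery record.
fn published_address(shared_addr: &Multiaddr, local_peer_id: PeerId) -> Multiaddr {
    shared_addr.clone().with(Protocol::P2p(local_peer_id.into()))
}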

BulatSaif commented 6 months ago

The arguments these nodes are started with are:

--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed in the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.

Here is how you can reproduce the issue locally:

  1. Start the node:

    docker run -it --rm parity/polkadot:v1.8.0 --chain rococo --public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/127.0.0.1/tcp/22

  2. After a few seconds you will see logs like these:

      2024-03-12 12:06:36 🔍 Discovered new external address for our node: /ip4/35.204.131.85/tcp/32685/p2p/12D3KooWLf25UasBJcozLYUQiN1dgZidPpzAi7kPTg4eoXt81gam
      ...
      2024-03-12 12:06:37 🔍 Discovered new external address for our node: /ip4/212.227.198.212/tcp/22/p2p/12D3KooWLf25UasBJcozLYUQiN1dgZidPpzAi7kPTg4eoXt81gam

The second discovered address is incorrect: 212.227.198.212 is your ISP IP.

This mainly affects testnets since they are usually deployed within a container environment, such as Kubernetes. However, I've also seen cases where community mainnet RPC bootnodes were deployed in a private network and exposed with port forwarding on the edge router; they have the same issue.

dmitry-markin commented 6 months ago

@lexnv My understanding of the issue is that the validators are run on spot instances, which have both:

  1. A configured external address (some snat/dnat configuration?)
  2. Some ephemeral global address of the spot instance the node is listening on.

This ephemeral address is reported via Identify to remote nodes, and they add it to DHT because it's global. And this ephemeral address changes periodically, and gets added to DHT again and again.

As a result, we have plenty of obsolete addresses in the DHT. And as we can see from https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1988182237 all those addresses are global — they are the addresses the pods had at some point.

BulatSaif commented 6 months ago

@alexggh Can you check if the issue is mitigated if a collator runs a patch from #3656?

@BulatSaif is our chief burn-in officer, can you please deploy asset-hub with the above PR. Thank you!

@alexggh @dmitry-markin #3656 is deployed on asset-hub-rococo

lexnv commented 6 months ago

Looking at the logs from grafana:


2024-03-12 15:38:04.239 DEBUG tokio-runtime-worker libp2p_swarm: [Relaychain] Connection attempt to PeerId("12D3KooWJhwBW1UjxhxHaCUCWYmCRjsWx6ZEbwHTqfQmPjNx8GjC") failed with WrongPeerId { obtained: PeerId("12D3KooWMh2imeAzsZKGQgm2cv6Uoep3GBYtwGfujt1bs5YfVzkH"), endpoint: Dialer { address: "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWJhwBW1UjxhxHaCUCWYmCRjsWx6ZEbwHTqfQmPjNx8GjC", role_override: Dialer } }. 

...

2024-03-12 15:26:24.212  WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7 from SetId(3).    

2024-03-12 15:20:36.109  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x1f91c6421302e9acbf6d5ecd02369e2a9e6787fbe9722d86f46e3fe9a9d2d77a pov_hash=0x069b0c4cd14d90105f911b6e0ca4d1a3209498441628baa835524b9bd9f8f8bc traceID=41962972065863927574614223903885008426

2024-03-12 15:20:24.223  WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWPpC6rv56G2NiKKj4V6wHqQ1vMzymLsXrLm4wihDG4Acr from SetId(3).    

2024-03-12 15:17:33.053  WARN tokio-runtime-worker libp2p_kad::behaviour: [Relaychain] Locating closest peers for replication failed: Err(Timeout { key: Key(b"\x83\x1e\xec\x1d\x1f<\x18\x88&\xf4\\9\xc7]\x9a<\x15\xd9\x9d0Xj\xda\xe2m1'c\xe6Kw\x1c"), success: [], quorum: 20 })    

2024-03-12 15:07:38.250  WARN tokio-runtime-worker peerset: [Relaychain] Report 12D3KooWGX1RBgFn1rqAJDr1X5gm1wBcFPziAqXR7eTHYhBTBBE2: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    

2024-03-12 14:57:40.743  WARN                 main sc_service::config: [Parachain] Using default protocol ID "sup" because none is configured in the chain specs    
dmitry-markin commented 6 months ago

The arguments these nodes are started with are:

--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

@alexggh just to clarify, how is the listen address mapped to the public address (i.e., why would dialing 35.204.131.85:32685 reach the node)?

Is there some dnat / port forwarding rule in the firewall?

alexggh commented 6 months ago

@BulatSaif should be able to help you here ^^ I'm not from the infra team, so I'm not able to explain our networking setup.

BulatSaif commented 6 months ago


We use Kubernetes NodePort, which has various implementations, but the simplest one involves iptables rules. Any traffic received on the specific port by any VM in the cluster will be forwarded to the correct pod.

dmitry-markin commented 6 months ago

We had a discussion with @BulatSaif about validators' networking in kubernetes on spot instances, but it's still not completely clear how multiple external addresses end up in the authority discovery DHT record.

For sure, there are two issues:

  1. Address translation reported in https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1991246855, which is the reason why private ports 30333 end up in the authority discovery record.
  2. Accumulation of temporary/ephemeral pod addresses in the authority discovery record. This shouldn't happen from our understanding of pod networking, but it nevertheless happens. The authority discovery record in the DHT should be overwritten by the most recent --public-addr used, but this doesn't happen for some reason. My wild guess is that the DHT might be partitioned / put_record() requests do not propagate to all target peers, and there are multiple authority records present at the same time, all found by subp2p-explorer.

I'm working on a fix https://github.com/paritytech/polkadot-sdk/pull/3657 that will resolve issue 1 once extended to authority discovery DHT records and all addresses, including external addresses not specified with --public-addr. It could resolve issue 2 if the guess about DHT partitioning is wrong, or if something is off with the validator kubernetes networking.

dmitry-markin commented 6 months ago

@BulatSaif can we deploy up to date master to one of the validators to see what https://github.com/paritytech/polkadot-sdk/pull/3668 is reporting? This is needed to rule out that the validator is publishing rogue addresses into the DHT authority discovery record.

-lsub-authority-discovery=debug flag is needed.

BulatSaif commented 6 months ago

@BulatSaif can we deploy up to date master to one of the validators to see what #3668 is reporting? This is needed to rule out that the validator is publishing rogue addresses into the DHT authority discovery record.

-lsub-authority-discovery=debug flag is needed.

rococo-validator-a* updated

alexggh commented 6 months ago

@bkchr root-caused parts of this issue here: https://github.com/paritytech/polkadot-sdk/issues/3673#issuecomment-2009863496

I think because of this, we are going to truncate out the good addresses. https://github.com/paritytech/polkadot-sdk/blob/a6713c55fd5082d333518c3ca13f2a4294726fcc/substrate/client/authority-discovery/src/worker.rs#L571-L572
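A minimal illustration of that concern; the constant name and limit are assumptions for the sketch, not the exact polkadot-sdk values. If stale or rogue addresses happen to come first in the list, truncating to a fixed maximum drops the reachable addresses at the end:

/// Illustrative only: cap on how many addresses are published per authority.
const MAX_ADDRESSES_PER_AUTHORITY: usize = 10;

fn addresses_to_publish(mut discovered: Vec<String>) -> Vec<String> {
    // If the first entries are stale pod IPs, the good addresses at the end
    // of the list are silently dropped here.
    discovered.truncate(MAX_ADDRESSES_PER_AUTHORITY);
    discovered
}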