alexggh opened this issue 7 months ago
Checked a few occasions; it seems the collators can't dial the validator they are assigned to. Some examples:
Collator was assigned to validator rococo-validator-a-node-7
https://grafana.teleport.parity.io/goto/dmUVC9oIg?orgId=1
Collator was assigned to validator rococo-validator-h-node-5
https://grafana.teleport.parity.io/goto/nWjOjroIg?orgId=1
The validators seem to be receiving collations when they get assigned to other parachains, e.g: https://grafana.teleport.parity.io/goto/rM-xC9TIR?orgId=1
Reach failure for Some(PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")) through "/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH": Other(Custom { kind: Other, error: Custom { kind: Other, error: Timeout } })
@dmitry-markin @altonen @BulatSaif any thoughts on whether this is a bug in our connection logic or just an infrastructure problem where we don't have full connectivity to some nodes?
It seems like rococo-rockmine-collator-experimental-node-0 (https://grafana.teleport.parity.io/goto/BUtrj9TIg?orgId=1) can connect to the same validators, but it is not the collator producing blocks, so this leads me to think that it is just a network partition problem.
I was looking into the rococo-rockmine-collator-node-0 logs and there's something in them that I don't understand (logs are in reverse):
2024-02-29 10:28:06.117 parachain::network-bridge-rx: [Relaychain] action="PeerDisconnected" peer_set=Collation peer=PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d")
2024-02-29 10:28:06.117 peerset: [Relaychain] Reserved peer 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d (Outbound) dropped from SetId(3).
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] External API <= Closed(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] PSM <= Dropped(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) <= Close(SetId(3))
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) => CloseDesired(SetId(3))
2024-02-29 10:28:04.481 sub-libp2p: [Relaychain] Libp2p => Disconnected(12D3KooWQELiM7oVu7tQPgj51rWN733a1FoDEg2XVHQhzJyUCXYR, SetId(3), ConnectionId(9856)): Disabled.
2024-02-29 10:28:04.104 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=1 removed=0
2024-02-29 10:28:04.104 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=1 failed_to_resolve=0
2024-02-29 10:28:04.100 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWRsMJngwJwc614i8Rbik2nsm9d4NuKTcY8cvRJH1bupsE, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/23006" }, ConnectionId(9910)): Not requested by PSM, disabling.
2024-02-29 10:28:03.022 WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x0f105c1e361e7ddba3e93e02820f39d50b21f3a1c694cf3962bc9e8d94e1f9d8 pov_hash=0xf3d1343ed98b05a2b765556253d3ba7c64308920182b74103eeadeb59cc67c42 traceID=20023365061787759369622147911581120981
2024-02-29 10:28:01.555 sub-libp2p: [Relaychain] Libp2p => Disconnected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), ConnectionId(9907)): Disabled.
2024-02-29 10:28:01.469 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/33269" }, ConnectionId(9907)): Secondary connection. Leaving closed.
2024-02-29 10:28:01.469 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/62370" }, ConnectionId(9906)): Not requested by PSM, disabling.
2024-02-29 10:28:01.150 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWGX1RBgFn1rqAJDr1X5gm1wBcFPziAqXR7eTHYhBTBBE2, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/46197" }, ConnectionId(9905)): Not requested by PSM, disabling.
2024-02-29 10:28:00.829 sub-libp2p: [Relaychain] External API <= Message(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))
2024-02-29 10:28:00.829 sub-libp2p: [Relaychain] Handler(ConnectionId(9881)) => Notification(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3), 38 bytes)
2024-02-29 10:28:00.828 parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d") role=Authority
2024-02-29 10:28:00.828 sub-libp2p: [Relaychain] External API <= Open(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))
2024-02-29 10:28:00.828 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) => OpenResultOk(SetId(3))
2024-02-29 10:28:00.827 sub-libp2p: [Relaychain] Handler(PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d"), ConnectionId(9881)) <= Open(SetId(3))
2024-02-29 10:28:00.827 sub-libp2p: [Relaychain] PSM => Connect(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3)): Enabling connections.
2024-02-29 10:28:00.826 peerset: [Relaychain] Connecting to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d on SetId(3) (0/0 num_out/max_out).
2024-02-29 10:28:00.152 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=1 removed=0
2024-02-29 10:28:00.152 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=1 failed_to_resolve=0
There's a ConnectToValidators request, the substream is opened successfully, and the peer is reported to the NetworkBridge, but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?
I looked at some of the dial failures and I don't think they're a node problem, except maybe https://github.com/paritytech/polkadot-sdk/issues/498 which I also saw in some of the dial failures. But I agree that they look more like a network problem.
> Reach failure for Some(PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")) through "/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH": Other(Custom { kind: Other, error: Custom { kind: Other, error: Timeout } })

> @dmitry-markin @altonen @BulatSaif any thoughts if this is a bug in our connecting logic or just an infrastructure problem where we don't have full connectivity to some nodes.
The port is incorrect: it should be 32685, not 30333. Fixed ports are only used on bootnodes.
Also, the rococo validators are on spot instances and may restart and change VM (and therefore IP).
We use these two flags to set the address:

```
--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333
```

30333 is the internal port and should not be shared.
Full CLI:

```
polkadot --name=rococo-validator-a-node-7 --base-path=/chain-data --keystore-path=/keystore --chain=rococo --validator --database=paritydb --state-pruning=1000 --prometheus-external --prometheus-port 9615 --unsafe-rpc-external --rpc-port=9944 --rpc-cors=all --rpc-methods=unsafe --node-key-file /keystore/node-key --log=parachain=debug --insecure-validator-i-know-what-i-do --enable-offchain-indexing true --public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333
```
These logs look interesting. The node sees the correct port:
2024-02-28 15:58:08.055 INFO main sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.90.232.201/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH
but later gets a new external address with the incorrect port:
2024-02-28 15:58:08.652 INFO tokio-runtime-worker sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.90.232.201/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH
With subp2p-explorer this is what I get:
authority="5C5jWPr5ygq4VgwfgsnrcHjp8H11xMT1gYXqR9a4oDJ7CeWF"
peer_id=PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-a-node-7)"
version={
"/ip4/34.91.211.207/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH",
"/ip4/34.32.166.239/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH",
"/ip4/34.32.166.239/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH",
"/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"}
It seems it is advertising everything, although all of them do seem to be tried:
Failed to reach PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"): Failed to negotiate transport protocol(s):
[(/ip4/35.204.235.25/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH: : Timeout has been reached)
(/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH: : Timeout has been reached)]
> There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

Yeah, that seems to be a problem as well, so there are two patterns: one where we can't connect to the validator, and the other where it seems that even if the collator is connected we aren't able to advertise the collation.
@BulatSaif could you also enable `parachain=trace`? Thank you!
> There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

> Yeah, that seems to be a problem as well, so there are two patterns: one where we can't connect to the validator, and the other where it seems that even if the collator is connected we aren't able to advertise the collation.
One reason for collation not being advertised in this case is a missing PeerView update from the validator.
Regarding the connectivity, something definitely seems to be off; for example, I can telnet to some of the ports here:
authority="5C5HkAmbAGC7nqv7Uo2o3LFxBsFZ4Q62av6L7YDJnD56oGq1"
peer_id=PeerId("12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3")
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-9)"
version={"/ip4/35.204.163.222/tcp/30333/p2p/12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3",
"/ip4/35.204.163.222/tcp/30156/p2p/12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3"}
But I can't telnet here:
authority="5C5jWPr5ygq4VgwfgsnrcHjp8H11xMT1gYXqR9a4oDJ7CeWF"
peer_id=PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-a-node-7)"
version={"/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH",
"/ip4/34.91.211.207/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH",
"/ip4/34.32.166.239/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH",
"/ip4/34.32.166.239/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"
}
> @BulatSaif could you also enable `parachain=trace`? Thank you!
Done, enabled on rococo-validator-a.*
> @BulatSaif could you also enable `parachain=trace`? Thank you!

> Done, enabled on rococo-validator-a.*
My bad, I wanted it on the asset-hub collators; could you please add it there as well? Thank you.
> My bad, I wanted it on the asset-hub collators; could you please add it there as well? Thank you.
Done :heavy_check_mark:
> There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

> Yeah, that seems to be a problem as well, so there are two patterns: one where we can't connect to the validator, and the other where it seems that even if the collator is connected we aren't able to advertise the collation.

> One reason for collation not being advertised in this case is a missing PeerView update from the validator.
This was spot-on: it seems that the collator and validator are not in sync when it comes to PeerConnected events for each other. This is why, even when there is a connection between the collator and validator, the collation is not advertised, e.g.:
You can see that the collator connects and disconnects every time it has something to send to the validator, but the validator's bridge doesn't get the same signals. The validator keeps thinking the collator is connected while the collator thinks it is disconnected from the validator; this lasts for a few seconds and recovers after a few blocks.
So, I think we've got two problems on this flow:
1. Why are we disconnecting so often from the validators we got allocated to? The answer is this code: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/collator-protocol/src/collator_side/mod.rs#L1472, which gets called every 4 seconds; since the previous collation had already been advertised, we emit ConnectToValidators with 0 peers, which disconnects the currently connected ones.
2. Even with 1, why are they getting out of sync? It should still work, shouldn't it?

One possible fix would be to keep connections to the backing group as long as it doesn't change. From the validator's perspective we would only disconnect inactive collators after 4 blocks, so we should be fine:
```rust
CollatorEvictionPolicy {
    inactive_collator: Duration::from_secs(24),
    undeclared: Duration::from_secs(1),
}
```
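The first problem above (a ConnectToValidators request with 0 peers disconnecting the currently connected validators) can be sketched with a minimal std-only model. The `ReservedPeers` type and `set_reserved_peers` method below are illustrative names, not the actual polkadot-sdk API:

```rust
use std::collections::HashSet;

/// Illustrative model of the collation peer set's reserved peers.
/// `set_reserved_peers` mirrors the effect of a `ConnectToValidators`
/// request: peers absent from the new set are disconnected.
struct ReservedPeers {
    connected: HashSet<&'static str>,
}

impl ReservedPeers {
    /// Replace the reserved set; returns the peers that get disconnected.
    fn set_reserved_peers(&mut self, new: HashSet<&'static str>) -> Vec<&'static str> {
        let dropped: Vec<&'static str> = self.connected.difference(&new).cloned().collect();
        self.connected = new;
        dropped
    }
}

fn main() {
    let mut peers = ReservedPeers {
        connected: ["validator-a"].into_iter().collect(),
    };
    // The 4-second reconnect timer fires after the collation was already
    // advertised, so the request carries 0 peers...
    let dropped = peers.set_reserved_peers(HashSet::new());
    // ...which disconnects the validator we were assigned to.
    assert_eq!(dropped, vec!["validator-a"]);
}
```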
The disconnects on the collator side happen because of this: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/src/protocol_controller.rs#L520

```rust
let disconnect = self.reserved_only ||
    match direction {
        Direction::Inbound => self.num_in >= self.max_in,
        Direction::Outbound => self.num_out >= self.max_out,
    };

if disconnect {
    // Disconnect the node.
    trace!(
        target: LOG_TARGET,
        "Disconnecting previously reserved node {peer_id} ({direction:?}) on {:?}.",
        self.set_id,
    );
    self.drop_connection(peer_id);
}
```
This is because the collator removes all the peers from reserved_peers every RECONNECT_TIMEOUT. I guess a quick workaround to prevent this condition would be to increase max_in and max_out for collators. @BulatSaif, any idea what values we use there for asset-hub?
Nothing prevents other nodes from taking those extra slots, so I'm not sure how effective that fix is. If you want to stay connected to the peer, why can't it be kept as a reserved peer?
> Nothing prevents other nodes from taking those extra slots, so I'm not sure how effective that fix is. If you want to stay connected to the peer, why can't it be kept as a reserved peer?
I agree that we should keep them as reserved; I was just looking for a quick way to test whether, if collators don't disconnect so often from validators, we prevent them from getting out of sync and missing collation advertisements.
My question for you @altonen is: is the out-of-sync behaviour I describe above just a flavour of the known issue you described here https://github.com/paritytech/polkadot-sdk/issues/2995#issuecomment-1900340823 ?
Yeah it could be the same issue, at least by the sound of it. There is actually a plan to finally fix it, here's some information: https://github.com/paritytech/devops/issues/3164#issuecomment-1968623373
If you want/can, you could try deploying #3426 on one of the rococo collators/validators and see if the issue disappears. If the issue is the one discussed in #2995, it should disappear with #3426
@BulatSaif: Can you help me with deploying asset-hub on rococo from https://github.com/paritytech/polkadot-sdk/pull/3544/files, so I can check if it improves the situation.
Over the weekend I saw a lot of dial failures because of this Unexpected peer ID error:
Libp2p => Failed to reach PeerId("12D3KooWRFeDwgX2BHz2jstWUFj61zQFfCZzed6ZbVqfM5K5Wo5k"): Dial error: Unexpected peer ID 12D3KooWP93Dzk8T7GWxyWw9jhLcz8Pksokk3R9vL2eEH337bNkT at Dialer { address: "/ip4/34.147.118.161/tcp/30333/p2p/12D3KooWRFeDwgX2BHz2jstWUFj61zQFfCZzed6ZbVqfM5K5Wo5k", role_override: Dialer }.
It's a peer ID mismatch which can happen because the DHT contains stale peer records and when the local node attempts to dial a peer with a stale record, the dial may succeed but the Noise handshake fails because the peer is using a new ID.
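This explanation can be illustrated with a small std-only sketch (`verify_dialed_peer` and `DialOutcome` are hypothetical names, not the actual libp2p API): the TCP dial itself may succeed, but the identity authenticated during the Noise handshake no longer matches the peer ID from the stale DHT record:

```rust
/// Illustrative check: after the transport connects and the handshake
/// authenticates the remote, the swarm compares the authenticated peer ID
/// against the one we intended to dial.
#[derive(Debug, PartialEq)]
enum DialOutcome {
    Ok,
    WrongPeerId { expected: String, obtained: String },
}

fn verify_dialed_peer(expected: &str, obtained: &str) -> DialOutcome {
    if expected == obtained {
        DialOutcome::Ok
    } else {
        // Stale DHT record: the IP/port now belongs to a different node,
        // so the handshake yields a different identity than advertised.
        DialOutcome::WrongPeerId {
            expected: expected.to_string(),
            obtained: obtained.to_string(),
        }
    }
}

fn main() {
    // The address record pointed at one peer ID, but the node now behind
    // that IP/port identifies as another.
    let out = verify_dialed_peer("12D3KooWRFeDwgX2...", "12D3KooWP93Dzk8T...");
    assert!(matches!(out, DialOutcome::WrongPeerId { .. }));
}
```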
Looked a bit at this error: https://grafana.teleport.parity.io/goto/6Nz6bIASR?orgId=1
Connection attempt to PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav")
failed with WrongPeerId { obtained: PeerId("12D3KooWMh2imeAzsZKGQgm2cv6Uoep3GBYtwGfujt1bs5YfVzkH"),
endpoint: Dialer {
address: "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav",
role_override: Dialer } }.
When I look with subp2p-explorer, there are multiple peers reporting the same IP and port, which I wouldn't expect to happen. E.g., 34.32.137.4/tcp/30333/ seems to be discovered on: rococo-validator-f-node-6, rococo-validator-d-node-7, rococo-validator-e-node-5, rococo-validator-d-node-5, rococo-validator-i-node-3.
164:authority="5CyAwh1sbGPqneT1ZdpMArRMLEf69RL6iThLxAxt2ZAuCHnD" peer_id=PeerId("12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-6)" version={"/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.89.111/tcp/32012/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.89.111/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.234.172.171/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn"}
167:authority="5DLx7NUaNEVKazF8vVCcXKMqYRdV5t9Hpaxn19FsM7YSzzEb" peer_id=PeerId("12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-7)" version={"/ip4/34.141.146.74/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.89.111/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.91.205.11/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.90.84.170/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.91.151.87/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.147.59.8/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.90.210.63/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.89.111/tcp/30737/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp"}
225:authority="5GMtYUSu9Lf5DkFAj8XiuZ4GxJURu9xofkvfs4Kx1xzhQoYn" peer_id=PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-1)" version={"/ip4/35.204.19.37/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.19.37/tcp/31532/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav"}
230:authority="5GYnjEgkfjG2XK52XuZ3dUqha1QqXPkhpGhBvfeQnrF8EZ6X" peer_id=PeerId("12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-e-node-5)" version={"/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.234.172.171/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.204.237.204/tcp/32356/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs"}
239:authority="5GukpQ8z2EeUf6oauwkAaZfHqkEewZryTeLAwAZnkbCkFch2" peer_id=PeerId("12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-5)" version={"/ip4/35.204.134.137/tcp/32315/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.205.11/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.90.84.170/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.147.59.8/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.147.83.199/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.151.87/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD"}
249:authority="5HGraXWF7cRJatExhny7n2D4BQsLAddnngQnn5DhvDbkzAxS" peer_id=PeerId("12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-i-node-3)" version={"/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.147.63.230/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.91.196.154/tcp/30792/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.90.92.174/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA"}
Is this expected?
The current state of affairs: it seems that validators are publishing to the DHT ports and IPs that are not meant to be used from outside, so collators randomly fail to connect to validators when using these ports and IPs, e.g.: https://grafana.teleport.parity.io/goto/aMX9dQAIR?orgId=1
The addresses the collator sees for that node: these are all the IPs that the validator pod had allocated during its lifetime.
Addresses of PeerId("12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7"): [
"/ip4/34.90.232.201/tcp/30032", "/ip4/34.147.63.230/tcp/30333", "/ip4/35.234.172.171/tcp/30333", "/ip4/34.90.232.201/tcp/30333", "/ip4/35.204.190.160/tcp/30333", "/ip4/34.147.36.17/tcp/30333",
"/ip4/34.141.218.8/tcp/30333", "/ip4/35.234.160.139/tcp/30333", "/ip4/34.91.123.48/tcp/30333",
"/ip4/34.90.210.63/tcp/30333", "/ip4/34.90.1.96/tcp/30333",
"/ip4/34.90.1.96/tcp/30032", "/ip4/34.147.95.62/tcp/30333",
"/ip4/34.147.95.62/tcp/30032", "/ip4/34.90.62.145/tcp/30333",
"/ip4/34.34.35.247/tcp/30333", "/ip4/34.32.249.188/tcp/30032",
"/ip4/34.32.249.188/tcp/30333"
]
The fix here should be to find a way not to expose these addresses externally; @altonen suggested something like this:

> We could introduce a CLI flag --no-external-address-discovery that would call [Swarm::remove_external_address()](https://docs.rs/libp2p/latest/libp2p/struct.Swarm.html#method.remove_external_address) [here](https://github.com/paritytech/polkadot-sdk/blob/1c435e91c117b877c803427a91c0ccd18c527382/substrate/client/network/src/discovery.rs#L642-L658) when a new address is discovered, unless it discovers one of the specified addresses.
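A minimal sketch of what the suggested flag could do (all names here are hypothetical; a real implementation would hook into `Swarm::remove_external_address()` as described above): a discovered external address is kept only if the operator listed it explicitly via `--public-addr`:

```rust
/// Illustrative policy for `--no-external-address-discovery`: keep a
/// discovered external address only if it matches one of the addresses
/// the operator configured explicitly.
fn keep_discovered_address(discovered: &str, public_addrs: &[&str]) -> bool {
    public_addrs.contains(&discovered)
}

fn main() {
    let public = ["/ip4/35.204.131.85/tcp/32685"];
    // The address we set via --public-addr survives discovery...
    assert!(keep_discovered_address("/ip4/35.204.131.85/tcp/32685", &public));
    // ...but a discovered address pointing at the internal listen port
    // would be removed instead of being advertised.
    assert!(!keep_discovered_address("/ip4/34.90.232.201/tcp/30333", &public));
}
```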
Based on issue https://github.com/paritytech/polkadot-sdk/issues/2523, which we saw in December with the misconfigured proxy, I'd assume people external to Parity might be interested in this feature as well.
@BulatSaif @PierreBesson @altonen : Feel free to add more context.
> The addresses the collator sees for that node: these are all the IPs that the validator pod had allocated during its lifetime.
Maybe we need to prune old addresses more eagerly. However, I don't get how the collator is not able to connect to the validator. Or are you telling me that none of the addresses in this list is actually the address of the validator right now? Otherwise I would expect that when I want to connect to the node, it tries all the addresses.
> The addresses the collator sees for that node: these are all the IPs that the validator pod had allocated during its lifetime.

> Maybe we need to prune old addresses more eagerly.
I'm really a noob in that part of the code, so I'm fine with anything that gets us out of this flaky state.
> However, I don't get how the collator is not able to connect to the validator? Or do you want to tell me that none of the addresses in this list is actually the address of the validator right now? Otherwise I would expect that when I want to connect to the node, it will try all the addresses.
In this case, none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is now allocated somewhere else:
Connection attempt to PeerId("12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7") failed with WrongPeerId { obtained: PeerId("12D3KooWM56JbKWAXsDyWh313z73aKYVMp1Hj2nSnAKY3q6MnoC9"), endpoint: Dialer { address: "/ip4/34.90.210.63/tcp/30333/p2p/12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7", role_override: Dialer } }.
However, in the past few days I did find cases where the dial failed even when trying an expected port.
> In this case, none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is now allocated somewhere else.
This sounds like the "underlying bug" to me. I mean, given the way nodes collect these addresses, we should not stop trying to connect just because the peer ID behind some address is different. We need to keep trying addresses until we have found the peer we are searching for or we have tried all of them.
@dmitry-markin @lexnv this is something we should fix ASAP. It should also not be that complicated to fix. (Assuming it is not buried somewhere really deep inside libp2p)
There are three things we should try:

- `--no-external-address-discovery` to safeguard against incorrectly reported external addresses by remote nodes.

> In this case none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is allocated somewhere else.

Bullet 3 I understand the least, as the node should report at least one correct address. This one is IMO the most important, as it directly affects connectivity.
> In this case none of those ports are actually meant to be reachable from outside, and the dials fail because that IP is allocated somewhere else.

> This sounds like the "underlying bug" to me. I mean, given the way nodes collect these addresses, we should not stop trying to connect just because the peer ID behind some address is different. We need to keep trying addresses until we have found the peer we are searching for or we have tried all of them.
libp2p concurrently dials 8 known addresses: https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool/concurrent_dial.rs#L56-L61 https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool.rs#L1132 And tries the rest as soon as some dial fails: https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool/concurrent_dial.rs#L91-L96
In substrate, we just log such errors once the connection is finally established: https://github.com/paritytech/polkadot-sdk/blob/b0f34e4b2953f814254f92c8d151f22ffcf9c034/substrate/client/network/src/service.rs#L1513-L1514
If we expect nodes to have more than 8 addresses in the DHT, it probably makes sense to increase the concurrency factor. We also currently don't set the transport timeout for p2p connections (and with default TCP SYN retries it's about 2 minutes on Linux), so I would set the transport timeout to some defined smaller value, like 15-20 secs.
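The dialing behaviour described above can be modelled with a small std-only sketch (`dial_batches` is a hypothetical helper, not rust-libp2p code): with a concurrency factor of 8, a node with 13 advertised addresses dials 8 of them in parallel and queues the remaining 5, which are only attempted as earlier dials fail:

```rust
/// Split the known addresses into the batch dialed immediately and the
/// queue tried only as earlier dials fail (rust-libp2p's default dial
/// concurrency factor is 8).
fn dial_batches<'a>(
    addrs: &'a [&'a str],
    concurrency: usize,
) -> (&'a [&'a str], &'a [&'a str]) {
    addrs.split_at(concurrency.min(addrs.len()))
}

fn main() {
    // 13 stale-and-fresh addresses for one peer, as seen in the DHT dumps above.
    let addrs = vec!["/ip4/x.x.x.x/tcp/30333"; 13];
    let (active, pending) = dial_batches(&addrs, 8);
    assert_eq!(active.len(), 8);
    // With no transport timeout set (~2 min TCP default), the 5 pending
    // addresses may wait a long time before being attempted.
    assert_eq!(pending.len(), 5);
}
```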
Why are the validators changing their public addresses so frequently in the first place? Is it because they run on spot instances? Maybe we shouldn't use spot instances for validators?
> Why are the validators changing their public addresses so frequently in the first place? Is it because they run on spot instances? Maybe we shouldn't use spot instances for validators?
The arguments these nodes are started with are:

```
--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333
```
From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed on the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.
Extending Alex's reply, my understanding of the issue is that the pod private port 30333 is accidentally exposed instead of the public address port, 32685. The private port could be exposed either by port reuse, whereby it binds to 30333 and the remote nodes somehow observe the node dialing from this address, or through address translation (link, link), where it translates the observed address to its listen address. To verify what is happening, you may need libp2p trace logs from both the listener and dialer.
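The address-translation path mentioned above can be sketched as follows (a simplified illustration, not the actual libp2p translation code): take the IP from the address a remote observed us on and substitute our own listen port, which is exactly how the internal port 30333 could end up attached to a public IP:

```rust
/// Illustrative libp2p-style address translation: keep the observed IP,
/// but replace the observed (ephemeral) port with our listen port.
fn translate(observed: &str, listen_port: u16) -> String {
    // `observed` looks like "/ip4/34.90.232.201/tcp/48012".
    let mut parts: Vec<&str> = observed.split('/').collect();
    let port = listen_port.to_string();
    let last = parts.len() - 1;
    parts[last] = port.as_str();
    parts.join("/")
}

fn main() {
    // A remote saw us dial from a public IP and an ephemeral port;
    // translation attaches our private listen port to that public IP.
    let translated = translate("/ip4/34.90.232.201/tcp/48012", 30333);
    assert_eq!(translated, "/ip4/34.90.232.201/tcp/30333");
}
```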
FWIW, I think there were two different issues: peer ID mismatch and connection refused.
> From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed on the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.
Remote nodes learn about our addresses via Identify protocol and add reported addresses to DHT. rust-libp2p combines listen addresses with external addresses and sends all of them via Identify message: https://github.com/libp2p/rust-libp2p/blob/v0.51/protocols/identify/src/behaviour.rs#L341-L346 and we later add them to DHT: https://github.com/paritytech/polkadot-sdk/blob/b0f34e4b2953f814254f92c8d151f22ffcf9c034/substrate/client/network/src/service.rs#L1415-L1419
What we can try doing to ensure remote nodes do not learn our listen addresses (and do not add them to DHT), is not report them to Identify behavior here: https://github.com/paritytech/polkadot-sdk/blob/b0f34e4b2953f814254f92c8d151f22ffcf9c034/substrate/client/network/src/peer_info.rs#L417
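That suggestion can be sketched like this (illustrative, std-only; `addresses_to_report` is a hypothetical helper): only addresses that are not plain listen addresses get handed to the Identify behaviour, so remote nodes never put the internal listen port into the DHT:

```rust
/// Illustrative filter for the addresses we report via Identify:
/// drop plain listen addresses and report only external addresses,
/// so remotes never learn (and never publish) the internal port.
fn addresses_to_report<'a>(
    candidate: &'a [&'a str],
    listen: &'a [&'a str],
) -> Vec<&'a str> {
    candidate
        .iter()
        .copied()
        .filter(|a| !listen.contains(a))
        .collect()
}

fn main() {
    let candidate = [
        "/ip4/35.204.131.85/tcp/32685", // set via --public-addr
        "/ip4/10.20.73.24/tcp/30333",   // internal listen address
    ];
    let listen = ["/ip4/10.20.73.24/tcp/30333"];
    // Only the public address survives and gets sent in Identify messages.
    assert_eq!(
        addresses_to_report(&candidate, &listen),
        vec!["/ip4/35.204.131.85/tcp/32685"]
    );
}
```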
> Extending Alex's reply, my understanding of the issue is that the pod private port 30333 is accidentally exposed instead of the public address port, 32685.
In this case it's not (only) an accident, but the current logic of Identify, which always publishes our listen addresses in addition to observed/set external addresses.
@alexggh Can you check if the issue is mitigated if a collator runs a patch from https://github.com/paritytech/polkadot-sdk/pull/3656?
We also have a fix that should remove rogue listen addresses from DHT for validators https://github.com/paritytech/polkadot-sdk/pull/3657, but need to run a burn-in for it to make sure it's safe.
> @alexggh Can you check if the issue is mitigated if a collator runs a patch from https://github.com/paritytech/polkadot-sdk/pull/3656?
@BulatSaif is our chief burn-in officer, can you please deploy asset-hub with the above PR. Thank you!
Looking a bit at the logs Alex provided above: the `Identify` protocol is only used to extract the `agent_version`; the published addresses come from the `network.external_addresses()` method, and the `peerInfo` comes from here. The Identify protocol indeed reported some mismatches, thanks @dmitry-markin for solving this in https://github.com/paritytech/polkadot-sdk/pull/3657 🙏. `publish_non_global_ips` was set to true, and I think `is_global()` is doing the right thing there.

TLDR; I don't think https://github.com/paritytech/polkadot-sdk/pull/3657 solves the issue here. @altonen @dmitry-markin could you double check please? 🙏
```
authority="5GMtYUSu9Lf5DkFAj8XiuZ4GxJURu9xofkvfs4Kx1xzhQoYn"
peer_id=PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav")
agent_version="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-1)"
addresses="/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav"

authority="5GukpQ8z2EeUf6oauwkAaZfHqkEewZryTeLAwAZnkbCkFch2"
peer_id=PeerId("12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD")
agent_version="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-5)"
addresses="/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD"
```

In this case, the DHT record contains the address `/ip4/34.32.137.4/tcp/30333/p2p/` for two different peer IDs, `12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav` and `12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD`.
I'd have to look again at subp2p-explorer to make sure I report the data correctly there, although Alex mentioned in https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1976608361 that the `WrongPeerId` error was encountered in our grafana logs.
This could happen when the nodes that we manage share an external address and go down this code path: from this line, if `rococo-validator-f-node-1` and `rococo-validator-d-node-5` share `/ip4/34.32.137.4/tcp/30333`, we'll add the peer ID of the local node and advertise that address in the DHT, so we'll end up with two peers pointing at the same address.

I think this could happen if the nodes are placed under the same network (likely)? Would love to get your thoughts on this, as I don't have that much context for the rococo validators here 🙏
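The failure mode above can be sketched in a few lines. The `advertise` helper is hypothetical (not a polkadot-sdk API); it just shows how two nodes sharing one external address each append their own peer ID before publishing, leaving the DHT with two records that dial the same socket:

```rust
use std::collections::HashMap;

// Hypothetical helper: a node appends its own /p2p/ suffix to the shared
// address before publishing it to the DHT.
fn advertise(shared_addr: &str, local_peer_id: &str) -> String {
    format!("{shared_addr}/p2p/{local_peer_id}")
}

fn main() {
    let shared = "/ip4/34.32.137.4/tcp/30333";
    let peers = [
        ("rococo-validator-f-node-1", "12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav"),
        ("rococo-validator-d-node-5", "12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD"),
    ];

    let mut dht: HashMap<String, &str> = HashMap::new();
    for (node, peer_id) in peers {
        dht.insert(advertise(shared, peer_id), node);
    }

    // Two distinct DHT records, one underlying ip:port. At most one peer ID
    // can match whoever answers there, so the dial to the other record
    // fails with a WrongPeerId error, as seen in the grafana logs.
    assert_eq!(dht.len(), 2);
    assert!(dht.keys().all(|a| a.starts_with(shared)));
}
```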
The arguments these nodes are started with are:

```
--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333
```
> From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed on the DHT because you can't use them for connecting.
Here is how you can reproduce the issue locally. Start a node:

```
docker run -it --rm parity/polkadot:v1.8.0 --chain rococo --public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/127.0.0.1/tcp/22
```

```
2024-03-12 12:06:36 🔍 Discovered new external address for our node: /ip4/35.204.131.85/tcp/32685/p2p/12D3KooWLf25UasBJcozLYUQiN1dgZidPpzAi7kPTg4eoXt81gam
...
2024-03-12 12:06:37 🔍 Discovered new external address for our node: /ip4/212.227.198.212/tcp/22/p2p/12D3KooWLf25UasBJcozLYUQiN1dgZidPpzAi7kPTg4eoXt81gam
```

The second discovered address is incorrect: 212.227.198.212 is your ISP IP.
This mainly affects testnets, since they are usually deployed in a container environment such as Kubernetes. However, I've also seen cases where community mainnet RPC bootnodes were deployed in a private network and exposed with port forwarding on the edge router; they have the same issue.
@lexnv My understanding of the issue is that the validators are run on spot instances, which have both:
This ephemeral address is reported via Identify to remote nodes, and they add it to the DHT because it's global. The ephemeral address changes periodically and gets added to the DHT again and again.

As a result, we have plenty of obsolete addresses in the DHT. And as we can see from https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1988182237, all those addresses are global: they are the addresses the pods had at some point.
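This is also why a "publish only global IPs" filter doesn't help here. The sketch below (the `looks_global` helper approximates the still-unstable `Ipv4Addr::is_global` with stable checks; the address sequence is invented for illustration) shows each ephemeral VM address passing the globality check and piling up in the record:

```rust
use std::collections::BTreeSet;
use std::net::Ipv4Addr;

// Approximation of `Ipv4Addr::is_global` (unstable) using stable methods:
// reject private, loopback, link-local, and unspecified addresses.
fn looks_global(ip: Ipv4Addr) -> bool {
    !(ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified())
}

fn main() {
    // Hypothetical sequence of spot-instance addresses over the pod's lifetime.
    let ephemeral: [Ipv4Addr; 3] = [
        "34.32.137.4".parse().unwrap(),
        "35.204.235.25".parse().unwrap(),
        "212.227.198.212".parse().unwrap(),
    ];

    let mut dht_record: BTreeSet<Ipv4Addr> = BTreeSet::new();
    for ip in ephemeral {
        // Every ephemeral address is globally routable, so the filter
        // happily lets it through...
        assert!(looks_global(ip));
        dht_record.insert(ip);
    }
    // ...and the record keeps growing with addresses that are already dead.
    assert_eq!(dht_record.len(), 3);
}
```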
> @alexggh Can you check if the issue is mitigated if a collator runs a patch from #3656?

> @BulatSaif is our chief burn-in officer, can you please deploy asset-hub with the above PR. Thank you!
@alexggh @dmitry-markin #3656 is deployed on asset-hub-rococo
Looking at the logs from grafana:

```
2024-03-12 15:38:04.239 DEBUG tokio-runtime-worker libp2p_swarm: [Relaychain] Connection attempt to PeerId("12D3KooWJhwBW1UjxhxHaCUCWYmCRjsWx6ZEbwHTqfQmPjNx8GjC") failed with WrongPeerId { obtained: PeerId("12D3KooWMh2imeAzsZKGQgm2cv6Uoep3GBYtwGfujt1bs5YfVzkH"), endpoint: Dialer { address: "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWJhwBW1UjxhxHaCUCWYmCRjsWx6ZEbwHTqfQmPjNx8GjC", role_override: Dialer } }.
...
2024-03-12 15:26:24.212 WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7 from SetId(3).
2024-03-12 15:20:36.109 WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x1f91c6421302e9acbf6d5ecd02369e2a9e6787fbe9722d86f46e3fe9a9d2d77a pov_hash=0x069b0c4cd14d90105f911b6e0ca4d1a3209498441628baa835524b9bd9f8f8bc traceID=41962972065863927574614223903885008426
2024-03-12 15:20:24.223 WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWPpC6rv56G2NiKKj4V6wHqQ1vMzymLsXrLm4wihDG4Acr from SetId(3).
2024-03-12 15:17:33.053 WARN tokio-runtime-worker libp2p_kad::behaviour: [Relaychain] Locating closest peers for replication failed: Err(Timeout { key: Key(b"\x83\x1e\xec\x1d\x1f<\x18\x88&\xf4\\9\xc7]\x9a<\x15\xd9\x9d0Xj\xda\xe2m1'c\xe6Kw\x1c"), success: [], quorum: 20 })
2024-03-12 15:07:38.250 WARN tokio-runtime-worker peerset: [Relaychain] Report 12D3KooWGX1RBgFn1rqAJDr1X5gm1wBcFPziAqXR7eTHYhBTBBE2: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.
2024-03-12 14:57:40.743 WARN main sc_service::config: [Parachain] Using default protocol ID "sup" because none is configured in the chain specs
```
> The arguments these nodes are started with are:
> `--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333`
@alexggh Just to clarify, how is the listen address mapped to the public address (i.e., why would dialing 35.204.131.85:32685 reach the node)? Is there some DNAT / port-forwarding rule in the firewall?
@BulatSaif should be able to help you here ^^ I'm not from the infra team, so I'm not able to explain our networking setup.
We use Kubernetes NodePort, which has various implementations, but the simplest one involves iptables rules. Any traffic received on the specific port by any VM in the cluster will be forwarded to the correct pod.
We had a discussion with @BulatSaif about validators' networking in kubernetes on spot instances, but it's still not completely clear how multiple external addresses end up in the authority discovery DHT record.
For sure, there are two issues:

1. Rogue listen addresses end up in the DHT in addition to the address set with `--public-addr`.
2. Obsolete records should be overridden in the DHT so that only the address set with `--public-addr` is used, but this doesn't happen for some reason. My wild guess is that the DHT might be partitioned / `put_record()` requests do not propagate to all target peers, and there are multiple authority records present at the same time, all found by subp2p-explorer.

I'm working on a fix https://github.com/paritytech/polkadot-sdk/pull/3657 that will resolve issue 1 once extended to authority discovery DHT records and all addresses, including external addresses not specified with `--public-addr`. It could resolve issue 2 if the guess about DHT partitioning is wrong or something is off with the validator kubernetes networking.
@BulatSaif can we deploy up-to-date master to one of the validators to see what https://github.com/paritytech/polkadot-sdk/pull/3668 is reporting? This is needed to rule out that the validator is publishing rogue addresses into the DHT authority discovery record. The `-lsub-authority-discovery=debug` flag is needed.
> @BulatSaif can we deploy up to date master to one of the validators to see what #3668 is reporting? This is needed to rule out that the validator is publishing rogue addresses into the DHT authority discovery record. The `-lsub-authority-discovery=debug` flag is needed.
`rococo-validator-a*` updated
@bkchr root-caused parts of this issue here: https://github.com/paritytech/polkadot-sdk/issues/3673#issuecomment-2009863496
I think because of this, we are going to truncate out the good addresses. https://github.com/paritytech/polkadot-sdk/blob/a6713c55fd5082d333518c3ca13f2a4294726fcc/substrate/client/authority-discovery/src/worker.rs#L571-L572
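The truncation concern can be sketched like this. The cap value and the address ordering below are assumptions for illustration only (the real constant lives in `worker.rs`); the point is that a fixed-size cap applied to a list where stale entries come first can evict the one address that actually works:

```rust
// Hypothetical sketch of capping a published address list, not the real
// authority-discovery worker code.
fn truncate_addresses(mut addrs: Vec<String>, max: usize) -> Vec<String> {
    addrs.truncate(max);
    addrs
}

fn main() {
    const MAX_ADDRESSES: usize = 10; // assumed cap, not the real constant
    let good = "/ip4/35.204.131.85/tcp/32685".to_string();

    // Twelve stale pod addresses accumulated first...
    let mut addrs: Vec<String> =
        (0..12).map(|i| format!("/ip4/10.0.0.{i}/tcp/30333")).collect();
    addrs.push(good.clone()); // ...and the good address arrives last.

    let published = truncate_addresses(addrs, MAX_ADDRESSES);
    // The good address was truncated away along with two stale ones.
    assert!(!published.contains(&good));
    assert_eq!(published.len(), MAX_ADDRESSES);
}
```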
Looking at block times for rococo-asset-hub and others, we can see that there are moments when collators decide not to advertise their collations to validators; the logs point to that.

On rococo the validator group size is 1, so there isn't a lot of redundancy; however, this seems to happen way too often to be ignored. Additionally, asset-hub has async backing enabled, but that does not seem to be related, because other parachains appear to be affected as well.

Looking at the logs, it seems like sometimes the collators can't connect to their assigned validators.
Next steps