bkontur opened 4 months ago
Yes, we do caching, but I would also like to add monitoring of RPC/runtime calls (and subscriptions) to see what we call and how often, and whether there is any room for optimization.
Also, maybe a separate 6-relayer setup could help by itself.
Again, these errors stop finality relaying:
```
[BridgeHubPolkadot_to_BridgeHubKusama_MessageLane_00000001] 2024-07-23 14:47:28 +00 ERROR bridge Error retrieving state from BridgeHubKusama node: FailedToGetSystemHealth { chain: "BridgeHubPolkadot", error: RpcError(RestartNeeded(Transport(connection closed
[Kusama_to_BridgeHubPolkadot_Sync] 2024-07-23 14:47:27 +00 ERROR bridge Finality sync loop iteration has failed with error: Source(ChannelError("Background task of Kusama client has exited with result: Err(ChannelError(\"Mandatory best headers subscription for Kusama has finished\"))"))
[Polkadot_to_BridgeHubKusama_Sync] 2024-07-23 14:47:23 +00 ERROR bridge Finality sync loop iteration has failed with error: Target(FailedToGetSystemHealth { chain: "BridgeHubKusama", error: RpcError(RestartNeeded(Transport(connection closed
```
Investigate/check:
- `RestartNeeded`: does it stop the loop, or does the loop restart? Or is the only solution to restart substrate-relay?

Possible improvement 1:
Now we are connected to one exact node URI, e.g.:

If that node is down or has some other problem, we could configure a list of URIs, so that on `RestartNeeded` we rotate and try another URI, e.g.:

So, if one node is overloaded, we just try another one.
Possible improvement 2 - connect substrate-relay to some "load balancer"
This "load balancer" would route requests to a live, non-overloaded node, instead of us handling this in our own code.
Some logs from 2024-07-12/15
https://matrix.to/#/!FqmgUhjOliBGoncGwm:parity.io/$OjKXcX4aO9lkzM46fRLKXTMi-mf9vcpdJN_RDMgIn6o?via=parity.io
e.g.: