paritytech / polkadot-sdk

Failure in relay chain rpc interface's connection attempt to external RPC servers #5514

Closed iulianbarbu closed 2 months ago

iulianbarbu commented 2 months ago

I am trying to start zombienet natively based on https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml. The collators always fail on my machine with the following error:

2024-08-22 18:51:42.758  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. index=0 url="ws://127.0.0.1:43283/"
2024-08-22 18:51:42.764  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. index=1 url="ws://127.0.0.1:37583/"
2024-08-22 18:51:42.776  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. index=2 url="ws://127.0.0.1:42159/"
2024-08-22 18:51:42.776 ERROR tokio-runtime-worker reconnecting-websocket-client: [Parachain] No valid RPC url found. Stopping RPC worker.
2024-08-22 18:51:42.777 ERROR tokio-runtime-worker sc_service::task_manager: [Parachain] Essential task `relay-chain-rpc-worker` failed. Shutting down service.    
thread 'main' panicked at cumulus/test/service/src/main.rs:144:18:
could not create Cumulus test service: Application(WorkerCommunicationError("RPC worker channel closed. This can hint and connectivity issues with the supplied RPC endpoints. Message: oneshot canceled"))
stack backtrace:
2024-08-22 18:51:42.774  INFO tokio-runtime-worker prometheus: [Relaychain] 〽️ Prometheus exporter started at 127.0.0.1:9616    
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::expect
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1059:23
   4: test_parachain::main
             at ./polkadot-sdk/cumulus/test/service/src/main.rs:103:44
   5: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

There are ways to fix this in zombienet through better coordination when starting the nodes, but I feel that can lead to overengineering: e.g. it would require starting nodes based on node dependencies, which other providers (k8s/podman) support in various ways, but I personally still don't find it straightforward.

The alternative is to add retry logic in cumulus/client/relay-chain-rpc-interface that is sufficiently reliable for most cases, and if it becomes relevant beyond zombienet-related testing, we could expose it in the node configuration so users can tune it as they need.
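A minimal sketch of what I have in mind, assuming a hypothetical `try_connect_any` helper that stands in for the existing "iterate the URL list once" behaviour (names, constants, and signatures are illustrative, not the actual relay-chain-rpc-interface API):

```rust
use std::time::Duration;

/// Illustrative bounds; whether these stay hardcoded or become configurable
/// is exactly the open question.
const STARTUP_RETRY_ROUNDS: usize = 5;
const STARTUP_RETRY_DELAY: Duration = Duration::from_secs(5);

/// Stand-in for the current behaviour: iterate the supplied URLs once and
/// return the first one that accepts a WebSocket connection.
async fn try_connect_any(urls: &[String]) -> Option<String> {
    for url in urls {
        if ws_handshake_ok(url).await {
            return Some(url.clone());
        }
    }
    None
}

/// Placeholder for the real connection attempt.
async fn ws_handshake_ok(_url: &str) -> bool {
    false
}

/// Proposed wrapper: instead of stopping the RPC worker after a single pass
/// over the list, sleep and retry a bounded number of rounds before failing.
async fn connect_with_retries(urls: &[String]) -> Result<String, String> {
    for round in 1..=STARTUP_RETRY_ROUNDS {
        if let Some(url) = try_connect_any(urls).await {
            return Ok(url);
        }
        eprintln!(
            "No RPC endpoint reachable yet, retrying (round {}/{})",
            round, STARTUP_RETRY_ROUNDS
        );
        tokio::time::sleep(STARTUP_RETRY_DELAY).await;
    }
    Err("no valid RPC url found after all retry rounds".into())
}

#[tokio::main]
async fn main() {
    let urls = vec!["ws://127.0.0.1:43283/".to_string()];
    match connect_with_retries(&urls).await {
        Ok(url) => println!("connected to {url}"),
        Err(e) => eprintln!("{e}"),
    }
}
```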

skunert commented 2 months ago

The collators always fail on my machine with the following error:

That this test always fails is a bit strange; it means that none of the three relay chain nodes given as connection targets came up. Did you check the relay chain node logs? Maybe they face some issue and that is why the collators cannot connect? It is strange that it fails consistently for you; I ran this on my local machine just fine (and it is also running in CI).

The alternative is to add retry logic in cumulus/client/relay-chain-rpc-interface

For this there is an existing issue: https://github.com/paritytech/polkadot-sdk/issues/4278. It would definitely improve usability if the nodes did not crash in that scenario.

iulianbarbu commented 2 months ago

Did you check the relay chain node logs?

Both validators and RPC nodes look fine (no errors/warnings); it is just that the RPC nodes' RPC API comes online a bit later than when both collators attempt to connect to them. E.g. for the error I attached in the issue description, both collators attempted connections at roughly similar times (~3-4 seconds apart), while the RPC nodes' API came online ~20-30 seconds later.

Maybe they face some issue and that is why the collators cannot connect? It is strange that it fails consistently for you; I ran this on my local machine just fine (and it is also running in CI).

Given there are no errors/warnings in the logs, do you suggest my reproductions hint at deeper issues that degrade the startup path of the RPC nodes and are visible only on my machine? To me it feels like plain slowness on the RPC nodes' startup path (reproducible for some reason on my machine, maybe because it is slower than others), but nothing serious (based on the logs of the RPC nodes and validators).

For this there is this issue: https://github.com/paritytech/polkadot-sdk/issues/4278

I see. It looks like retry logic is also relevant for validators/RPC nodes going offline. The RPC WS client currently tries reconnecting to the next external RPC server, but at worst it iterates through the entire list of external RPC servers once and then stops if all connections fail. The current issue is more about the collator startup path, which ideally would be more insistent at the start through basic retry logic, but collators can benefit from the same retry logic when RPC nodes go offline (making the reconnections slightly better).

One curiosity I have: do you have a rough idea how frequently all of the external RPC servers configured for a collator go offline for users in general? I am asking just to get an idea of how nuanced this issue is, and whether configurable retry logic (sketched below) is a better fit than my initial suggestion of making the retry logic limited and hardcoded (which I thought was a good fit if it is mostly relevant to testing scenarios rather than production).
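For the configurable variant, this is roughly the shape I imagine: a small settings struct whose defaults mimic today's behaviour (a single pass over the list, no waiting), which the node configuration could then override for testing or for flakier setups. All names and defaults here are hypothetical, purely to illustrate the idea:

```rust
use std::time::Duration;

/// Hypothetical knobs for the RPC (re)connection behaviour. The defaults
/// mimic the current behaviour: one pass over the URL list, no delay.
#[derive(Clone, Debug)]
pub struct RpcRetrySettings {
    /// How many passes over the full URL list before giving up.
    pub max_rounds: u32,
    /// Delay applied before the first retry round.
    pub initial_delay: Duration,
    /// Multiplier applied after each failed round (exponential backoff).
    pub backoff_factor: u32,
    /// Upper bound for the delay between rounds.
    pub max_delay: Duration,
}

impl Default for RpcRetrySettings {
    fn default() -> Self {
        Self {
            max_rounds: 1,
            initial_delay: Duration::ZERO,
            backoff_factor: 2,
            max_delay: Duration::from_secs(30),
        }
    }
}

impl RpcRetrySettings {
    /// Delay to apply after the given (1-based) failed round.
    pub fn delay_after(&self, round: u32) -> Duration {
        let factor = self.backoff_factor.saturating_pow(round.saturating_sub(1));
        (self.initial_delay * factor).min(self.max_delay)
    }
}

fn main() {
    // A testing-oriented profile: be insistent at startup, backing off up to 30s.
    let testing = RpcRetrySettings {
        max_rounds: 10,
        initial_delay: Duration::from_secs(2),
        ..Default::default()
    };
    for round in 1..=3 {
        println!("delay after round {}: {:?}", round, testing.delay_after(round));
    }
}
```

With defaults like these the behaviour would stay exactly as it is today unless someone opts into retries, which hopefully addresses the testing-vs-production concern.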