paritytech / zombienet

A cli tool to easily spawn ephemeral Polkadot/Substrate networks and perform tests against them.
https://paritytech.github.io/zombienet/
GNU General Public License v3.0
161 stars 91 forks source link

Spinning "Error fetching metrics" #1730

Closed Overkillus closed 8 months ago

Overkillus commented 8 months ago

Issue Description

When simply spawning networks it stalls at Error fetching metrics from: http://127.0.0.1:45667/metrics especially for bigger networks. Sometimes it never completes and aborts network creation.

It might be a natural consequence of starting a big network which just takes a while to boot up but then the error is not particularly telling of what's actually happening.

Steps to reproduce the issue

Steps to reproduce the issue

  1. zombienet spawn <> -p native

Sometimes after printing errors for 30s+ it goes through the finishes up booting the network. Generally for bigger networks it seems to stall for longer often going over the limit and aborting.

Config used: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/zombienet_tests/functional/0002-parachains-disputes.toml

It is not unique to the above one and I have pretty much the same effect on all other network definitions of that size.

Describe the results you received

Below is the standard output of a spawn but at one point it simply stalls at: Error fetching metrics from: http://127.0.0.1:xxxxx/metrics repeating it hundreds of times.

Full output: https://pastebin.com/HageqCnD

Describe the results you expected

Network spinning up or at least understanding what is it getting stuck at.

Zombienet version

1.3.92

Provider

Native

Upstream Latest Release

Yes

Additional environment details

Ubuntu, Thinkpad X1

Additional information

Alice log:

2024-02-23 23:37:24.819  INFO main sc_cli::runner: Parity Polkadot    
2024-02-23 23:37:24.819  INFO main sc_cli::runner: ✌️  version 1.7.0-cb1d2106fdf    
2024-02-23 23:37:24.819  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2017-2024    
2024-02-23 23:37:24.819  INFO main sc_cli::runner: 📋 Chain specification: Rococo Local Testnet    
2024-02-23 23:37:24.819  INFO main sc_cli::runner: 🏷  Node name: alice    
2024-02-23 23:37:24.819  INFO main sc_cli::runner: 👤 Role: AUTHORITY    
2024-02-23 23:37:24.819  INFO main sc_cli::runner: 💾 Database: RocksDb at /tmp/zombie-bebc765b2df92e6af53435ad7ab9fa35_-71219-O2ImRQRAiuHs/alice/data/chains/rococo_local_testnet/db/full    
2024-02-23 23:37:28.136  INFO main sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xc6c6…d88a, header-hash: 0x4145…755f)    
2024-02-23 23:37:28.144 DEBUG main parachain::chain-selection: Using dispute aware relay-chain selection algorithm
2024-02-23 23:37:28.144  INFO main grandpa: 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.    
2024-02-23 23:37:29.347  INFO main babe: 👶 Creating empty BABE epoch changes on what appears to be first startup.    
2024-02-23 23:37:29.646  INFO main polkadot_service: 🚀 Using prepare-worker binary at: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-prepare-worker"    
2024-02-23 23:37:29.646  INFO main polkadot_service: 🚀 Using execute-worker binary at: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-execute-worker"    
2024-02-23 23:37:29.646  INFO main sub-libp2p: 🏷  Local node identity is: 12D3KooWQCkBm1BYtkHpocxCwMgR8yjitEeHGx8spzcDLGt2gkBm    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 Operating system: linux    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 CPU architecture: x86_64    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 Target environment: gnu    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 CPU: 12th Gen Intel(R) Core(TM) i7-1260P    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 CPU cores: 12    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 Memory: 31792MB    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 Kernel: 6.5.0-21-generic    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 Linux distribution: Ubuntu 22.04.3 LTS    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 💻 Virtual machine: no    
2024-02-23 23:37:29.664  INFO main sc_service::builder: 📦 Highest known block at #0    
2024-02-23 23:37:29.664  INFO main sc_rpc_server: Running JSON-RPC server: addr=0.0.0.0:35933, allowed origins=["*"]    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 🏁 CPU score: 1.17 GiBs    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 🏁 Memory score: 16.94 GiBs    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 🏁 Disk score (seq. writes): 710.87 MiBs    
2024-02-23 23:37:29.664  INFO main sc_sysinfo: 🏁 Disk score (rand. writes): 339.95 MiBs    
2024-02-23 23:37:29.664  WARN main polkadot_service: ⚠️  The hardware does not meet the minimal requirements Failed checks: Seq Write(expected: 950.00 MiBs, found: 710.87 MiBs), Rnd Write(expected: 420.00 MiBs, found: 339.95 MiBs),  for role 'Authority' find out more at:
https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware    
2024-02-23 23:37:29.664  INFO tokio-runtime-worker substrate_prometheus_endpoint: 〽️ Prometheus exporter started at 0.0.0.0:34997    
2024-02-23 23:37:29.665 DEBUG tokio-runtime-worker parachain::pvf: starting PVF validation host config=Config { cache_path: "/tmp/zombie-bebc765b2df92e6af53435ad7ab9fa35_-71219-O2ImRQRAiuHs/alice/data/chains/rococo_local_testnet/db/full/pvf-artifacts", node_version: Some("1.7.0"), secure_validator_mode: true, prepare_worker_program_path: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-prepare-worker", prepare_worker_spawn_timeout: 3s, prepare_workers_soft_max_num: 1, prepare_workers_hard_max_num: 1, execute_worker_program_path: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-execute-worker", execute_worker_spawn_timeout: 3s, execute_workers_max_num: 2 }
2024-02-23 23:37:29.666  INFO                 main babe: 👶 Starting BABE Authorship worker    
2024-02-23 23:37:29.666  INFO tokio-runtime-worker parachain::approval-voting: Starting with an empty approval vote DB.
2024-02-23 23:37:29.666  INFO tokio-runtime-worker beefy: 🥩 BEEFY gadget waiting for BEEFY pallet to become available...    
2024-02-23 23:37:29.669  INFO tokio-runtime-worker parachain::pvf: 👮‍♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines. 
More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
2024-02-23 23:37:34.667 DEBUG tokio-runtime-worker parachain::chain-selection: Prepared 0 stagnant entries for pruning up_to=1708641454 min_ts=0 max_ts=0
2024-02-23 23:37:34.667  INFO tokio-runtime-worker substrate: 💤 Idle (0 peers), best: #0 (0x4145…755f), finalized #0 (0x4145…755f), ⬇ 0 ⬆ 0    
2024-02-23 23:37:39.667 DEBUG tokio-runtime-worker parachain::chain-selection: Prepared 0 stagnant entries for pruning up_to=1708641459 min_ts=0 max_ts=0
2024-02-23 23:37:39.667  INFO tokio-runtime-worker substrate: 💤 Idle (0 peers), best: #0 (0x4145…755f), finalized #0 (0x4145…755f), ⬇ 0 ⬆ 0    
2024-02-23 23:37:44.667 DEBUG tokio-runtime-worker parachain::chain-selection: Prepared 0 stagnant entries for pruning up_to=1708641464 min_ts=0 max_ts=0
2024-02-23 23:37:44.668  INFO tokio-runtime-worker substrate: 💤 Idle (0 peers), best: #0 (0x4145…755f), finalized #0 (0x4145…755f), ⬇ 0 ⬆ 0    
2024-02-23 23:37:46.039 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWHmTccx3iQVudoetE7K4ANiproMfppKrXgYd2Gfd1UcTQ") role=Authority
2024-02-23 23:37:46.474 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWPKzmmE2uYgF3z13xjpbFTp63g9dZFag8pG6MgnpSLF4S") role=Authority
2024-02-23 23:37:47.347 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWKM1HeSv61ryZwAiBTZnqmass5pYM48k9Z7obzhTbnphm") role=Authority
2024-02-23 23:37:48.951 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWRkZhiRhsqmrQ28rt73K7V3aCBpqKrLGSXmZ99PTcTZby") role=Authority
2024-02-23 23:37:49.667 DEBUG tokio-runtime-worker parachain::chain-selection: Prepared 0 stagnant entries for pruning up_to=1708641469 min_ts=0 max_ts=0
2024-02-23 23:37:49.668  INFO tokio-runtime-worker substrate: 💤 Idle (4 peers), best: #0 (0x4145…755f), finalized #0 (0x4145…755f), ⬇ 6.4kiB/s ⬆ 6.9kiB/s    

Bob log:

**2024-02-23 23:37:30.319  INFO main sc_cli::runner: Parity Polkadot    
2024-02-23 23:37:30.319  INFO main sc_cli::runner: ✌️  version 1.7.0-cb1d2106fdf    
2024-02-23 23:37:30.319  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2017-2024    
2024-02-23 23:37:30.319  INFO main sc_cli::runner: 📋 Chain specification: Rococo Local Testnet    
2024-02-23 23:37:30.319  INFO main sc_cli::runner: 🏷  Node name: bob    
2024-02-23 23:37:30.319  INFO main sc_cli::runner: 👤 Role: AUTHORITY    
2024-02-23 23:37:30.319  INFO main sc_cli::runner: 💾 Database: RocksDb at /tmp/zombie-bebc765b2df92e6af53435ad7ab9fa35_-71219-O2ImRQRAiuHs/bob/data/chains/rococo_local_testnet/db/full    
2024-02-23 23:37:37.487  INFO main sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xc6c6…d88a, header-hash: 0x4145…755f)    
2024-02-23 23:37:37.494 DEBUG main parachain::chain-selection: Using dispute aware relay-chain selection algorithm
2024-02-23 23:37:37.494  INFO main grandpa: 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.    
2024-02-23 23:37:48.495  INFO main babe: 👶 Creating empty BABE epoch changes on what appears to be first startup.    
2024-02-23 23:37:48.883  INFO main polkadot_service: 🚀 Using prepare-worker binary at: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-prepare-worker"    
2024-02-23 23:37:48.883  INFO main polkadot_service: 🚀 Using execute-worker binary at: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-execute-worker"    
2024-02-23 23:37:48.884  INFO main sub-libp2p: 🏷  Local node identity is: 12D3KooWRkZhiRhsqmrQ28rt73K7V3aCBpqKrLGSXmZ99PTcTZby    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 Operating system: linux    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 CPU architecture: x86_64    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 Target environment: gnu    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 CPU: 12th Gen Intel(R) Core(TM) i7-1260P    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 CPU cores: 12    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 Memory: 31792MB    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 Kernel: 6.5.0-21-generic    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 Linux distribution: Ubuntu 22.04.3 LTS    
2024-02-23 23:37:48.909  INFO main sc_sysinfo: 💻 Virtual machine: no    
2024-02-23 23:37:48.909  INFO main sc_service::builder: 📦 Highest known block at #0    
2024-02-23 23:37:48.910  INFO tokio-runtime-worker substrate_prometheus_endpoint: 〽️ Prometheus exporter started at 0.0.0.0:43505    
2024-02-23 23:37:48.926  INFO                 main sc_rpc_server: Running JSON-RPC server: addr=0.0.0.0:45453, allowed origins=["*"]    
2024-02-23 23:37:48.926  INFO                 main sc_sysinfo: 🏁 CPU score: 1.02 GiBs    
2024-02-23 23:37:48.926  INFO                 main sc_sysinfo: 🏁 Memory score: 6.85 GiBs    
2024-02-23 23:37:48.926  INFO                 main sc_sysinfo: 🏁 Disk score (seq. writes): 519.48 MiBs    
2024-02-23 23:37:48.926  INFO                 main sc_sysinfo: 🏁 Disk score (rand. writes): 118.58 MiBs    
2024-02-23 23:37:48.926  WARN                 main polkadot_service: ⚠️  The hardware does not meet the minimal requirements Failed checks: Copy(expected: 11.49 GiBs, found: 6.85 GiBs), Seq Write(expected: 950.00 MiBs, found: 519.48 MiBs), Rnd Write(expected: 420.00 MiBs, found: 118.58 MiBs),  for role 'Authority' find out more at:
https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware    
2024-02-23 23:37:48.927  INFO                 main babe: 👶 Starting BABE Authorship worker    
2024-02-23 23:37:48.933 DEBUG tokio-runtime-worker parachain::pvf: starting PVF validation host config=Config { cache_path: "/tmp/zombie-bebc765b2df92e6af53435ad7ab9fa35_-71219-O2ImRQRAiuHs/bob/data/chains/rococo_local_testnet/db/full/pvf-artifacts", node_version: Some("1.7.0"), secure_validator_mode: true, prepare_worker_program_path: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-prepare-worker", prepare_worker_spawn_timeout: 3s, prepare_workers_soft_max_num: 1, prepare_workers_hard_max_num: 1, execute_worker_program_path: "/home/overkillus/Parity/polkadot-sdk/target/testnet/polkadot-execute-worker", execute_worker_spawn_timeout: 3s, execute_workers_max_num: 2 }
2024-02-23 23:37:48.954  INFO tokio-runtime-worker parachain::approval-voting: Starting with an empty approval vote DB.
2024-02-23 23:37:48.965  INFO tokio-runtime-worker beefy: 🥩 BEEFY gadget waiting for BEEFY pallet to become available...    
2024-02-23 23:37:48.977 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWQCkBm1BYtkHpocxCwMgR8yjitEeHGx8spzcDLGt2gkBm") role=Authority
2024-02-23 23:37:49.086  INFO tokio-runtime-worker parachain::pvf: 👮‍♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines. 
More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
2024-02-23 23:37:49.890 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWHmTccx3iQVudoetE7K4ANiproMfppKrXgYd2Gfd1UcTQ") role=Authority
2024-02-23 23:37:49.892 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWKM1HeSv61ryZwAiBTZnqmass5pYM48k9Z7obzhTbnphm") role=Authority
2024-02-23 23:37:49.892 DEBUG tokio-runtime-worker parachain::network-bridge-rx: action="PeerConnected" peer_set=Validation version=3 peer=PeerId("12D3KooWPKzmmE2uYgF3z13xjpbFTp63g9dZFag8pG6MgnpSLF4S") role=Authority
**

Eve log:

2024-02-23 23:37:33.616  INFO main sc_cli::runner: Parity Polkadot    
2024-02-23 23:37:33.617  INFO main sc_cli::runner: ✌️  version 1.7.0-cb1d2106fdf    
2024-02-23 23:37:33.617  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2017-2024    
2024-02-23 23:37:33.617  INFO main sc_cli::runner: 📋 Chain specification: Rococo Local Testnet    
2024-02-23 23:37:33.617  INFO main sc_cli::runner: 🏷  Node name: eve    
2024-02-23 23:37:33.617  INFO main sc_cli::runner: 👤 Role: AUTHORITY    
2024-02-23 23:37:33.617  INFO main sc_cli::runner: 💾 Database: RocksDb at /tmp/zombie-bebc765b2df92e6af53435ad7ab9fa35_-71219-O2ImRQRAiuHs/eve/data/chains/rococo_local_testnet/db/full    
2024-02-23 23:37:47.730  INFO main sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xc6c6…d88a, header-hash: 0x4145…755f)    
2024-02-23 23:37:47.861 DEBUG main parachain::chain-selection: Using dispute aware relay-chain selection algorithm
2024-02-23 23:37:47.862  INFO main grandpa: 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.    

Collator log:

2024-02-23 23:37:37.804  INFO main sc_cli::runner: Parity Zombienet/Undying    
2024-02-23 23:37:37.805  INFO main sc_cli::runner: ✌️  version 1.0.0    
2024-02-23 23:37:37.805  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2022-2024    
2024-02-23 23:37:37.805  INFO main sc_cli::runner: 📋 Chain specification: Rococo Local Testnet    
2024-02-23 23:37:37.805  INFO main sc_cli::runner: 🏷  Node name: collator    
2024-02-23 23:37:37.805  INFO main sc_cli::runner: 👤 Role: FULL    
2024-02-23 23:37:37.805  INFO main sc_cli::runner: 💾 Database: RocksDb at /tmp/zombie-bebc765b2df92e6af53435ad7ab9fa35_-71219-O2ImRQRAiuHs/collator/data/chains/rococo_local_testnet/db/full    
2024-02-23 23:37:37.805  INFO main test_parachain_undying_collator: PoV target size: 25000 bytes. Graveyard size: (159 x 159)    
2024-02-23 23:37:37.805  INFO main test_parachain_undying_collator: PVF time complexity: 1    
2024-02-23 23:37:48.154  INFO main sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xc6c6…d88a, header-hash: 0x4145…755f)    
2024-02-23 23:37:48.167 DEBUG main parachain::chain-selection: Using dispute aware relay-chain selection algorithm
2024-02-23 23:37:48.167  INFO main grandpa: 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.    

All other instances of logs are pretty much identical to one of those 4.

pepoviola commented 8 months ago

Hi @Overkillus, thanks for the feedback and reporting this issue. I added a new config key (in settings) that allow you to disable the verification of the nodes. e.g

[settings]
timeout = 1000
node_verifier = "None"

If is not set, the default behavior is the current one using prometheus server to query the startTime metric.

This is already available in the latest version v1.3.93.

Thx!