s0me0ne-unkn0wn opened 10 months ago
20 validators without parachains have been running for ~20 hours, producing ~9000 unfinalized blocks. Memory usage growth is roughly linear and homogeneous across all the validators (I'm not averaging the graphs because we have some outliers that would skew the average significantly). Taking one of the validators as a sample, allocated memory grew from 650 MB to 1.27 GB, roughly doubling, in 20 hours.
Metrics used: polkadot_memory_allocated and polkadot_memory_resident, respectively.
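For reference, here is a minimal sketch of how allocated/resident figures of this kind can be read from jemalloc statistics, assuming the node is built with the jemalloc allocator and the `tikv-jemalloc-ctl`/`tikv-jemallocator` crates; the actual metric plumbing in the node may differ:

```rust
// Assumed dependencies: tikv-jemallocator, tikv-jemalloc-ctl.
use tikv_jemalloc_ctl::{epoch, stats};

#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() {
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::advance().unwrap();

    // Bytes currently allocated by the application (an "allocated"-style metric).
    let allocated = stats::allocated::read().unwrap();
    // Bytes in physically resident pages mapped by the allocator
    // (a "resident"-style metric).
    let resident = stats::resident::read().unwrap();

    println!("allocated = {allocated} B, resident = {resident} B");
}
```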
Out of curiosity: in the Memory Allocated graph, what causes the sudden drops in memory used? (low-priority question)
After this test, Versi validators were reset to the master branch. The finality lag was ~9000 blocks; it started to catch up quickly but stalled again at ~1000 unfinalized blocks, and the lag started to increase. Also, while the vast majority of validators were reporting the same finalized block height, some of them stalled with lower heights and didn't progress (see graph). We decided to reset the network for now, but that behavior should be further investigated later.
We deployed a modified version of the latest release on a Kusama validator we use to burn in changes. The image we used is built by CI in https://github.com/paritytech/polkadot-sdk/pull/2649.
Surprisingly, the node didn't OOM as expected; instead it stopped allocating memory and went out of sync due to I/O wait times after it started hammering RocksDB and using 100% of disk bandwidth for reads. Quoting @BulatSaif's findings from a different ticket:
The 100% CPU usage was caused by iowait. Possible scenario: the application ran out of memory, which evicted all the free memory that was being used as the disk cache for RocksDB, so the node basically lost its connection to RocksDB since all requests would time out. Interestingly, the application did not allocate more memory; if it had, it would simply have been restarted.
CPU usage looked like this, most likely due to a greatly reduced page cache (around 600 MB).
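As a rough illustration of how the page-cache figure mentioned above can be read off a Linux host, here's a small stand-alone sketch; `page_cache_kb` is a hypothetical helper, not part of the node:

```rust
use std::fs;

/// Hypothetical helper: returns the page cache size ("Cached" in
/// /proc/meminfo) in kB on a Linux host.
fn page_cache_kb() -> Option<u64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    let line = meminfo.lines().find(|l| l.starts_with("Cached:"))?;
    line.split_whitespace().nth(1)?.parse().ok()
}

fn main() {
    match page_cache_kb() {
        Some(kb) => println!("page cache: {} MB", kb / 1024),
        None => eprintln!("could not read /proc/meminfo"),
    }
}
```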
First warnings appear: https://grafana.teleport.parity.io/goto/fOxsnBDSR?orgId=1
11:45:48.315 WARN tokio-runtime-worker parachain::chain-selection: Missing block weight for new head. Skipping chain. hash=0xa930f8d2195e36d8d71f6e5bd2117e72c7972beef5d792817edb5586fd5b624c
12:15:48.605 WARN tokio-runtime-worker parachain::dispute-distribution: error=Sender(Runtime(RuntimeRequest(NotSupported { runtime_api_name: "session_index_for_child" }))) ctx="on FromOrchestra"
12:15:48.605 WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: Runtime code error: `:code` hash not found
14:28:06.000 WARN tokio-runtime-worker babe: Unable to fetch auxiliary data for block 0xf6e46eceb6dd83d40529016fc0877227c7203794e5180a4b60f29f4d76d8a97b: Current state of blockchain has invalid authorities set
Logs: https://grafana.teleport.parity.io/goto/O5174fvSg?orgId=1
16:59:06.350 ERROR tokio-runtime-worker runtime::system: Logic error: Unexpected underflow in reducing consumer
Logs: https://grafana.teleport.parity.io/goto/7Csv4BvIg?orgId=1
WARN tokio-runtime-worker db::pin: Pinned block cache limit reached. Evicting value. hash = 0x89f3…0328
We only pin one block per session, so we should be limited to 6 pinned blocks: https://github.com/paritytech/polkadot-sdk/blob/bc82eb6ec9888e0598df9ecdc2d17bcc1f3cd8d4/polkadot/node/subsystem-util/src/runtime/mod.rs#L132.
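For context, a simplified, illustrative sketch of what a bounded pinned-block cache with eviction looks like; the names below are made up and the real implementation lives in the substrate client:

```rust
use std::collections::VecDeque;

// Illustrative stand-in for a real block hash type.
type Hash = [u8; 32];

/// A FIFO cache of pinned block hashes with a hard upper bound. When the
/// bound is exceeded, the oldest pin is evicted and a warning like the one
/// quoted above would be emitted.
struct PinnedBlockCache {
    limit: usize,
    pinned: VecDeque<Hash>,
}

impl PinnedBlockCache {
    fn new(limit: usize) -> Self {
        Self { limit, pinned: VecDeque::new() }
    }

    fn pin(&mut self, hash: Hash) {
        if self.pinned.len() >= self.limit {
            let evicted = self.pinned.pop_front();
            eprintln!("Pinned block cache limit reached. Evicting value. {evicted:?}");
        }
        self.pinned.push_back(hash);
    }
}

fn main() {
    // With one pin per session and six sessions kept, the cache should stay
    // within its limit; seeing the warning means that assumption broke.
    let mut cache = PinnedBlockCache::new(6);
    for i in 0u8..8 {
        cache.pin([i; 32]);
    }
}
```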
Logs: https://grafana.teleport.parity.io/goto/_uTP4fvIR?orgId=1
03:52:16.141 WARN tokio-runtime-worker trie-cache: Timeout while trying to acquire a write lock for the shared trie cache
Logs: https://grafana.teleport.parity.io/goto/OT2zSBvSR?orgId=1
19:40:14.627 WARN tokio-runtime-worker libp2p_kad::behaviour: Locating closest peers for replication failed: Err(Timeout { key: Key(b"\x9a\x9b\x99G>\x13\xea\x16\x07\xd0\x0e9\x91\x91\xce\xa4G\xe6&6\x866R\xec\xda\0\xbf\xb8\xa2\xa2;%"), success: [], quorum: 20 })
https://grafana.teleport.parity.io/goto/DT1MIfDIg?orgId=1
Things look normal until 19:00, when there is a huge gap of missing metric data points. This is the exact point where we reach maximum memory usage for the node: https://grafana.teleport.parity.io/goto/uQ0IIBDSg?orgId=1
After the gap there are huge load spikes (1000% CPU usage) from the basic-block-import-worker, authority-discovery, and babe tasks.
This is explained by the latency introduced by DB reads not being serviced from cache.
In #2510, block authoring backoff has been disabled for production networks. This is the tracking issue for tests that try to determine how this change affects validator memory usage across different network scales and scenarios.
Validators are deployed from the #2574 image with the following changes to the usual logic: 1) backoff is disabled for all networks; 2) the dispute coordinator always fails to determine the undisputed chain, so finality never catches up.
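A minimal sketch of what change 1 amounts to, assuming the service selects block-authoring backoff via `sc_consensus_slots::BackoffAuthoringOnFinalizedHeadLagging` (the exact wiring in the test image may differ):

```rust
use sc_consensus_slots::BackoffAuthoringOnFinalizedHeadLagging;

/// Hypothetical helper mirroring the service-level choice: returning `None`
/// disables authoring backoff entirely, which is what the test image does
/// for all networks.
fn backoff_authoring_blocks(
    disable_backoff: bool,
) -> Option<BackoffAuthoringOnFinalizedHeadLagging<u32>> {
    if disable_backoff {
        // No backoff: validators keep authoring even while finality lags,
        // which is exactly the scenario under test here.
        None
    } else {
        // Default strategy: back off authoring when the finalized head
        // lags too far behind the best block.
        Some(BackoffAuthoringOnFinalizedHeadLagging::default())
    }
}
```

With backoff disabled and finality artificially stalled by change 2, unfinalized blocks keep accumulating, which is the memory behaviour being measured above.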