s0me0ne-unkn0wn opened 10 months ago
20 validators without parachains have been running for ~20 hours, producing ~9000 unfinalized blocks. Memory usage growth is roughly linear and homogeneous across all the validators (I'm not averaging the graphs because we have some outliers that would skew the average significantly). Taking one of the validators as a sample, allocated memory grew from 650 MB to 1.27 GB, roughly doubling, in 20 hours.
Metrics used: polkadot_memory_allocated and polkadot_memory_resident, respectively.
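For reference, here is a minimal sketch of how allocated/resident figures of this kind can be read from jemalloc statistics, assuming the node is built with the jemalloc allocator and the `tikv-jemalloc-ctl`/`tikv-jemallocator` crates; the actual metric plumbing in the node may differ:

```rust
// Assumed dependencies: tikv-jemallocator, tikv-jemalloc-ctl.
use tikv_jemalloc_ctl::{epoch, stats};

#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() {
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::advance().unwrap();

    // Bytes currently allocated by the application (an "allocated"-style metric).
    let allocated = stats::allocated::read().unwrap();
    // Bytes in physically resident pages mapped by the allocator
    // (a "resident"-style metric).
    let resident = stats::resident::read().unwrap();

    println!("allocated = {allocated} B, resident = {resident} B");
}
```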
Out of curiosity: in the Memory Allocated graph, what causes the sudden drops in memory used? (low-priority question)
After this test, Versi validators were reset to the master branch. The finality lag was ~9000 blocks; it started to catch up quickly but stalled again at ~1000 unfinalized blocks, and the lag started to increase. Also, while the vast majority of validators were reporting the same finalized block height, some of them stalled with lower heights and didn't progress (see graph). We decided to reset the network for now, but that behavior should be further investigated later.
We deployed a modified version of the latest release on a Kusama validator we use to burn in changes. The image we used is built by CI in https://github.com/paritytech/polkadot-sdk/pull/2649.
Surprisingly, the node didn't OOM as expected; instead it stopped allocating memory and went out of sync due to I/O wait times after it started hammering RocksDB and using 100% of disk bandwidth for reads. Quoting @BulatSaif's findings from a different ticket:
The 100% CPU usage was caused by iowait. Possible scenario: the application ran out of memory, which evicted all the free memory that was being used as the disk cache for RocksDB, so the node basically lost its connection to RocksDB since all requests would time out. Interestingly, the application did not allocate more memory; if it had, it would simply have been restarted.
CPU usage looked like this, most likely due to a greatly reduced page cache (around 600 MB).
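As a rough illustration of how the page-cache figure mentioned above can be read off a Linux host, here's a small stand-alone sketch; `page_cache_kb` is a hypothetical helper, not part of the node:

```rust
use std::fs;

/// Hypothetical helper: returns the page cache size ("Cached" in
/// /proc/meminfo) in kB on a Linux host.
fn page_cache_kb() -> Option<u64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    let line = meminfo.lines().find(|l| l.starts_with("Cached:"))?;
    line.split_whitespace().nth(1)?.parse().ok()
}

fn main() {
    match page_cache_kb() {
        Some(kb) => println!("page cache: {} MB", kb / 1024),
        None => eprintln!("could not read /proc/meminfo"),
    }
}
```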
First warnings appear: https://grafana.teleport.parity.io/goto/fOxsnBDSR?orgId=1
11:45:48.315 WARN tokio-runtime-worker parachain::chain-selection: Missing block weight for new head. Skipping chain. hash=0xa930f8d2195e36d8d71f6e5bd2117e72c7972beef5d792817edb5586fd5b624c
12:15:48.605 WARN tokio-runtime-worker parachain::dispute-distribution: error=Sender(Runtime(RuntimeRequest(NotSupported { runtime_api_name: "session_index_for_child" }))) ctx="on FromOrchestra"
12:15:48.605 WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: Runtime code error: `:code` hash not found
14:28:06.000 WARN tokio-runtime-worker babe: Unable to fetch auxiliary data for block 0xf6e46eceb6dd83d40529016fc0877227c7203794e5180a4b60f29f4d76d8a97b: Current state of blockchain has invalid authorities set
Logs: https://grafana.teleport.parity.io/goto/O5174fvSg?orgId=1
16:59:06.350 ERROR tokio-runtime-worker runtime::system: Logic error: Unexpected underflow in reducing consumer
Logs: https://grafana.teleport.parity.io/goto/7Csv4BvIg?orgId=1
WARN tokio-runtime-worker db::pin: Pinned block cache limit reached. Evicting value. hash = 0x89f3…0328
We only pin one block per session, so we should be limited to 6 pinned blocks: https://github.com/paritytech/polkadot-sdk/blob/bc82eb6ec9888e0598df9ecdc2d17bcc1f3cd8d4/polkadot/node/subsystem-util/src/runtime/mod.rs#L132.
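For context, a simplified, illustrative sketch of what a bounded pinned-block cache with eviction looks like; the names below are made up and the real implementation lives in the substrate client:

```rust
use std::collections::VecDeque;

// Illustrative stand-in for a real block hash type.
type Hash = [u8; 32];

/// A FIFO cache of pinned block hashes with a hard upper bound. When the
/// bound is exceeded, the oldest pin is evicted and a warning like the one
/// quoted above would be emitted.
struct PinnedBlockCache {
    limit: usize,
    pinned: VecDeque<Hash>,
}

impl PinnedBlockCache {
    fn new(limit: usize) -> Self {
        Self { limit, pinned: VecDeque::new() }
    }

    fn pin(&mut self, hash: Hash) {
        if self.pinned.len() >= self.limit {
            let evicted = self.pinned.pop_front();
            eprintln!("Pinned block cache limit reached. Evicting value. {evicted:?}");
        }
        self.pinned.push_back(hash);
    }
}

fn main() {
    // With one pin per session and six sessions kept, the cache should stay
    // within its limit; seeing the warning means that assumption broke.
    let mut cache = PinnedBlockCache::new(6);
    for i in 0u8..8 {
        cache.pin([i; 32]);
    }
}
```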
Logs: https://grafana.teleport.parity.io/goto/_uTP4fvIR?orgId=1
03:52:16.141 WARN tokio-runtime-worker trie-cache: Timeout while trying to acquire a write lock for the shared trie cache
Logs: https://grafana.teleport.parity.io/goto/OT2zSBvSR?orgId=1
19:40:14.627 WARN tokio-runtime-worker libp2p_kad::behaviour: Locating closest peers for replication failed: Err(Timeout { key: Key(b"\x9a\x9b\x99G>\x13\xea\x16\x07\xd0\x0e9\x91\x91\xce\xa4G\xe6&6\x866R\xec\xda\0\xbf\xb8\xa2\xa2;%"), success: [], quorum: 20 })
https://grafana.teleport.parity.io/goto/DT1MIfDIg?orgId=1
Things look normal until 19:00, when there is a huge gap of missing metric data points. This is the exact point where we reach maximum memory usage for the node: https://grafana.teleport.parity.io/goto/uQ0IIBDSg?orgId=1
After the gap there are huge load spikes (1000% CPU usage) from the basic-block-import-worker, authority-discovery, and babe tasks.
This is explained by the latency introduced by DB reads not being serviced from cache.
In #2510, block authoring backoff has been disabled for production networks. This is the tracking issue for tests that try to determine how this change affects validator memory usage across different network scales and scenarios.
Validators are deployed from the #2574 image with the following changes to the usual logic: 1) backoff is disabled for all networks; 2) the dispute coordinator always fails to determine the undisputed chain, so finality never catches up.
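A minimal sketch of what change 1 amounts to, assuming the service selects block-authoring backoff via `sc_consensus_slots::BackoffAuthoringOnFinalizedHeadLagging` (the exact wiring in the test image may differ):

```rust
use sc_consensus_slots::BackoffAuthoringOnFinalizedHeadLagging;

/// Hypothetical helper mirroring the service-level choice: returning `None`
/// disables authoring backoff entirely, which is what the test image does
/// for all networks.
fn backoff_authoring_blocks(
    disable_backoff: bool,
) -> Option<BackoffAuthoringOnFinalizedHeadLagging<u32>> {
    if disable_backoff {
        // No backoff: validators keep authoring even while finality lags,
        // which is exactly the scenario under test here.
        None
    } else {
        // Default strategy: back off authoring when the finalized head
        // lags too far behind the best block.
        Some(BackoffAuthoringOnFinalizedHeadLagging::default())
    }
}
```

With backoff disabled and finality artificially stalled by change 2, unfinalized blocks keep accumulating, which is the memory behaviour being measured above.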