If in active set, Kusama validator regularly runs out of memory.
This happens both on current master (153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1) and on the refactor/notifications (b5834724b3f18b1d694051cae1afff06d1ee7652, PR #2269) branches, although it happens less regularly on the former (~ every hour or two) than the latter (~ every 10 minutes), but it also accumulates far fewer era points on the former (~ dozens per era), than on the latter (~ hundreds per era), so this may be due to the latter resolving networking issues and thereby enabling the validator to engage in consensus more actively.
This happens on a host with 64 GB RAM. If this is insufficient for operating a Kagome validator, please close this issue.
An example log from current master (153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1):
Nov 17 06:44:37 v1-oslo kagome[557803]: 24.11.17 06:44:37.445020 worker.4 Info BlockStorage Added block #25815332 (0x17f2…2868) as child of #25815331 (0x2280…3aec)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.301360 main_runner Info BlockExecutor Imported block #25815332 (0x17f2…2868) within 1159 ms. (lag 2301 ms.)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.301406 main_runner Info Timeline Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.305339 statement-distr Info StatementDistribution Send my view. (peers_count=737)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.578040 main_runner Info Timeline Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.616411 main_runner Warning SyncProtocolObserver cannot find a requested block with id 22092206
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.657594 main_runner Info Timeline Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:40 v1-oslo kagome[557803]: 24.11.17 06:44:40.682522 main_runner Warning SyncProtocolObserver cannot find a requested block with id 25398643
Nov 17 06:44:41 v1-oslo kagome[557803]: 24.11.17 06:44:41.300233 main_runner Warning SyncProtocolObserver cannot find a requested block with id 25398643
Nov 17 06:44:42 v1-oslo kagome[557803]: 24.11.17 06:44:42.575388 worker.23 Info BlockStorage Added block #25815333 (0xa039…b617) as child of #25815332 (0x17f2…2868)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.511382 main_runner Info BlockExecutor Imported block #25815333 (0xa039…b617) within 1196 ms. (lag 1511 ms.)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.530897 statement-distr Info StatementDistribution Send my view. (peers_count=753)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.801411 main_runner Info Timeline Caught up block #25815333 (0xa039…b617)
Nov 17 06:44:45 v1-oslo kagome[557803]: 24.11.17 06:44:45.851714 main_runner Warning SyncProtocolObserver cannot find a requested block with id 25398643
Nov 17 06:44:48 v1-oslo kagome[557803]: 24.11.17 06:44:48.891030 worker.4 Info BlockStorage Added block #25815334 (0xf69e…46cc) as child of #25815333 (0xa039…b617)
Nov 17 06:44:49 v1-oslo kagome[557803]: 24.11.17 06:44:49.808740 main_runner Info BlockExecutor Imported block #25815334 (0xf69e…46cc) within 1419 ms. (lag 1808 ms.)
Nov 17 06:44:49 v1-oslo kagome[557803]: 24.11.17 06:44:49.816217 statement-distr Info StatementDistribution Send my view. (peers_count=773)
Nov 17 06:44:50 v1-oslo kagome[557803]: 24.11.17 06:44:50.217742 main_runner Warning SyncProtocolObserver cannot find a requested block with id 22092206
Nov 17 06:44:50 v1-oslo kagome[557803]: 24.11.17 06:44:50.901453 main_runner Info Timeline Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:52 v1-oslo kagome[557803]: 24.11.17 06:44:52.180790 main_runner Warning SyncProtocolObserver cannot find a requested block with id 25398643
Nov 17 06:44:55 v1-oslo kagome[557803]: 24.11.17 06:44:55.830957 main_runner Info Timeline Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:55 v1-oslo kagome[557803]: 24.11.17 06:44:55.845407 main_runner Info Timeline Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:56 v1-oslo kagome[557803]: 24.11.17 06:44:56.081477 main_runner Info Timeline Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:56 v1-oslo kagome[557803]: 24.11.17 06:44:56.281898 worker.25 Info BlockStorage Added block #25815335 (0xe1ab…0d05) as child of #25815334 (0xf69e…46cc)
Nov 17 06:44:57 v1-oslo kagome[557803]: 24.11.17 06:44:57.291405 main_runner Info BlockExecutor Imported block #25815335 (0xe1ab…0d05) within 1314 ms. (lag 3291 ms.)
Nov 17 06:44:57 v1-oslo kagome[557803]: 24.11.17 06:44:57.308741 statement-distr Info StatementDistribution Send my view. (peers_count=880)
Nov 17 06:44:58 v1-oslo kagome[557803]: 24.11.17 06:44:58.313495 main_runner Warning SyncProtocolObserver cannot find a requested block with id 22092206
Nov 17 06:44:59 v1-oslo kagome[557803]: 24.11.17 06:44:59.728689 main_runner Warning SyncProtocolObserver cannot find a requested block with id 25398643
Nov 17 06:45:00 v1-oslo kagome[557803]: 24.11.17 06:45:00.631230 main_runner Warning SyncProtocolObserver cannot find a requested block with id 25398643
Nov 17 06:45:00 v1-oslo kagome[557803]: 24.11.17 06:45:00.756659 worker.14 Info BlockStorage Added block #25815336 (0x1c59…59ec) as child of #25815335 (0xe1ab…0d05)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.609773 main_runner Info BlockExecutor Imported block #25815336 (0x1c59…59ec) within 1320 ms. (lag 1609 ms.)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.617510 statement-distr Info StatementDistribution Send my view. (peers_count=814)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.853707 main_runner Info Timeline Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:02 v1-oslo kagome[557803]: 24.11.17 06:45:02.015478 main_runner Info Timeline Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:02 v1-oslo kagome[557803]: 24.11.17 06:45:02.028220 main_runner Warning SyncProtocolObserver cannot find a requested block with id 22092206
Nov 17 06:45:06 v1-oslo kagome[557803]: 24.11.17 06:45:06.464939 main_runner Info Timeline Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:06 v1-oslo kagome[557803]: 24.11.17 06:45:06.951102 worker.2 Info BlockStorage Added block #25815337 (0x415a…0681) as child of #25815336 (0x1c59…59ec)
Nov 17 06:45:08 v1-oslo kagome[557803]: 24.11.17 06:45:08.013042 main_runner Info BlockExecutor Imported block #25815337 (0x415a…0681) within 1530 ms. (lag 2012 ms.)
Nov 17 06:45:08 v1-oslo kagome[557803]: 24.11.17 06:45:08.030263 statement-distr Info StatementDistribution Send my view. (peers_count=903)
Nov 17 06:45:10 v1-oslo kagome[557803]: 24.11.17 06:45:10.150924 main_runner Warning SyncProtocolObserver cannot find a requested block with id 25398643
Nov 17 06:45:10 v1-oslo systemd[1]: kagosama.service: A process of this unit has been killed by the OOM killer.
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Main process exited, code=killed, status=9/KILL
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Failed with result 'oom-kill'.
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Consumed 1h 7min 16.322s CPU time, received 35.8G IP traffic, sent 2.5G IP traffic.
Nov 17 06:45:42 v1-oslo systemd[1]: kagosama.service: Scheduled restart job, restart counter is at 10.
An example log from current refactor/notifications (b5834724b3f18b1d694051cae1afff06d1ee7652):
Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.777435 statement-distr Info StatementDistribution Fetch attested candidate success. (relay parent=0x8eeb…7431, candidate=0xe3e0…9547, group index=38, statements=4)
Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.777466 statement-distr Info StatementDistribution Not importable. (relay parent=0x8eeb…7431, candidate=0xe3e0…9547, group index=38)
Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.778373 main_runner Info ParachainProcessorImpl Candidate backed.(candidate=0xe3e0…9547, para id=11, relay_parent=0x8eeb…7431)
Nov 15 13:55:27 v1-oslo kagome[518209]: 24.11.15 13:55:27.379774 main_runner Info BlockTree Finalized block #25791090 (0xb5c5…9ce1)
Nov 15 13:55:29 v1-oslo kagome[518209]: 24.11.15 13:55:29.378418 grandpa Info GrandpaEnvironment Found best chain is longer than approved: #25791092 (0x8eeb…7431) > #25791090 (0xb5c5…9ce1); truncate it
Nov 15 13:55:31 v1-oslo kagome[518209]: 24.11.15 13:55:31.269221 worker.9 Info BlockStorage Added block #25791093 (0xcb20…2797) as child of #25791092 (0x8eeb…7431)
Nov 15 13:55:35 v1-oslo kagome[518209]: 24.11.15 13:55:35.104468 main_runner Info BlockExecutor Imported block #25791093 (0xcb20…2797) within 4589 ms. (lag 5103 ms.)
Nov 15 13:55:37 v1-oslo kagome[518209]: 24.11.15 13:55:37.459163 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:38.246090 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:39.634165 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:41.287427 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:43.217174 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:44.695502 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:45.461195 main_runner Info BlockTree Finalized block #25791091 (0xd078…f4c2)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:46.080193 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:46.866109 grandpa Info GrandpaEnvironment Found best chain is longer than approved: #25791093 (0xcb20…2797) > #25791091 (0xd078…f4c2); truncate it
Nov 15 13:55:49 v1-oslo kagome[518209]: 24.11.15 13:55:48.796014 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:50.125748 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:52.024332 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:52.081410 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:55.035874 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:57.375167 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:58.591698 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:00.915122 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:02.325781 statement-distr Info StatementDistribution Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0x5ac1…f9fc, group index=45, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:05.281818 main_runner Info ApprovalDistribution Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:06.386996 statement-distr Info StatementDistribution Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0x7ded…3ca9, group index=47, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:08.579577 main_runner Info BlockTree Finalized block #25791092 (0x8eeb…7431)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:09.463514 grandpa Info GrandpaEnvironment Found best chain is longer than approved: #25791093 (0xcb20…2797) > #25791092 (0x8eeb…7431); truncate it
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:11.829982 statement-distr Info StatementDistribution Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0xb80f…dd69, group index=54, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:18.836390 statement-distr Info StatementDistribution Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0xf22e…772f, group index=48, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Main process exited, code=killed, status=9/KILL
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Failed with result 'signal'.
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Consumed 7min 144ms CPU time, 57.8G memory peak, 0B memory swap peak, read 30.5G from disk, written 18.8G to disk, received 1.7G IP traffic, sent 1.0G IP traffic.
Expected Behavior
The validator does not OOM? ^^
System Information
NixOS 24.5 with kernel 6.11.5
Compiler: gcc 13.2.0
CMake: cmake version 3.25.3
Bug Summary
Validator regularly runs out of memory on Kusama
Bug Description
If in active set, Kusama validator regularly runs out of memory. This happens both on current
master
(153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1
) and on therefactor/notifications
(b5834724b3f18b1d694051cae1afff06d1ee7652
, PR #2269) branches, although it happens less regularly on the former (~ every hour or two) than the latter (~ every 10 minutes), but it also accumulates far fewer era points on the former (~ dozens per era), than on the latter (~ hundreds per era), so this may be due to the latter resolving networking issues and thereby enabling the validator to engage in consensus more actively.This happens on a host with 64 GB RAM. If this is insufficient for operating a Kagome validator, please close this issue.
Steps to Reproduce
Mode: Validator number of nodes: 1 Command:
kagome --chain kusama -d [...] --validator --listen-addr [...] --public-addr [...] --name [...] --rpc-port [...] --telemetry-url [...] --telemetry-url [...] --node-key-file [...]
Effects of the Bug
The node gets culled by the OOM killer.
An example log from current
master
(153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1
):An example log from current
refactor/notifications
(b5834724b3f18b1d694051cae1afff06d1ee7652
):Expected Behavior
The validator does not OOM? ^^
System Information
NixOS 24.5 with kernel 6.11.5 Compiler: gcc 13.2.0 CMake: cmake version 3.25.3
Built using flake from https://github.com/qdrvm/kagome/pull/2257
Relevant for this issue: host has 64 GB RAM.
Additional Context
Issue does not occur when running without
--validator
flag.