qdrvm / kagome

Kagome - C++20 implementation of Polkadot Host
https://kagome.readthedocs.io
Apache License 2.0
160 stars 34 forks source link

[Bug]: Validator regularly OOMs on Kusama #2276

Open Lederstrumpf opened 1 day ago

Lederstrumpf commented 1 day ago

Bug Summary

Validator regularly runs out of memory on Kusama

Bug Description

If in active set, Kusama validator regularly runs out of memory. This happens both on current master (153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1) and on the refactor/notifications (b5834724b3f18b1d694051cae1afff06d1ee7652, PR #2269) branches, although it happens less regularly on the former (~ every hour or two) than the latter (~ every 10 minutes), but it also accumulates far fewer era points on the former (~ dozens per era), than on the latter (~ hundreds per era), so this may be due to the latter resolving networking issues and thereby enabling the validator to engage in consensus more actively.

This happens on a host with 64 GB RAM. If this is insufficient for operating a Kagome validator, please close this issue.

Steps to Reproduce

Mode: Validator number of nodes: 1 Command: kagome --chain kusama -d [...] --validator --listen-addr [...] --public-addr [...] --name [...] --rpc-port [...] --telemetry-url [...] --telemetry-url [...] --node-key-file [...]

Effects of the Bug

The node gets culled by the OOM killer.

An example log from current master (153ed85bcc3dda1f6e1fc2cb0efcca33e8e49aa1):

Nov 17 06:44:37 v1-oslo kagome[557803]: 24.11.17 06:44:37.445020  worker.4         Info      BlockStorage  Added block #25815332 (0x17f2…2868) as child of #25815331 (0x2280…3aec)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.301360  main_runner      Info      BlockExecutor  Imported block #25815332 (0x17f2…2868) within 1159 ms. (lag 2301 ms.)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.301406  main_runner      Info      Timeline  Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.305339  statement-distr  Info      StatementDistribution  Send my view. (peers_count=737)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.578040  main_runner      Info      Timeline  Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.616411  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:44:38 v1-oslo kagome[557803]: 24.11.17 06:44:38.657594  main_runner      Info      Timeline  Caught up block #25815332 (0x17f2…2868)
Nov 17 06:44:40 v1-oslo kagome[557803]: 24.11.17 06:44:40.682522  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:41 v1-oslo kagome[557803]: 24.11.17 06:44:41.300233  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:42 v1-oslo kagome[557803]: 24.11.17 06:44:42.575388  worker.23        Info      BlockStorage  Added block #25815333 (0xa039…b617) as child of #25815332 (0x17f2…2868)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.511382  main_runner      Info      BlockExecutor  Imported block #25815333 (0xa039…b617) within 1196 ms. (lag 1511 ms.)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.530897  statement-distr  Info      StatementDistribution  Send my view. (peers_count=753)
Nov 17 06:44:43 v1-oslo kagome[557803]: 24.11.17 06:44:43.801411  main_runner      Info      Timeline  Caught up block #25815333 (0xa039…b617)
Nov 17 06:44:45 v1-oslo kagome[557803]: 24.11.17 06:44:45.851714  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:48 v1-oslo kagome[557803]: 24.11.17 06:44:48.891030  worker.4         Info      BlockStorage  Added block #25815334 (0xf69e…46cc) as child of #25815333 (0xa039…b617)
Nov 17 06:44:49 v1-oslo kagome[557803]: 24.11.17 06:44:49.808740  main_runner      Info      BlockExecutor  Imported block #25815334 (0xf69e…46cc) within 1419 ms. (lag 1808 ms.)
Nov 17 06:44:49 v1-oslo kagome[557803]: 24.11.17 06:44:49.816217  statement-distr  Info      StatementDistribution  Send my view. (peers_count=773)
Nov 17 06:44:50 v1-oslo kagome[557803]: 24.11.17 06:44:50.217742  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:44:50 v1-oslo kagome[557803]: 24.11.17 06:44:50.901453  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:52 v1-oslo kagome[557803]: 24.11.17 06:44:52.180790  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:44:55 v1-oslo kagome[557803]: 24.11.17 06:44:55.830957  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:55 v1-oslo kagome[557803]: 24.11.17 06:44:55.845407  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:56 v1-oslo kagome[557803]: 24.11.17 06:44:56.081477  main_runner      Info      Timeline  Caught up block #25815334 (0xf69e…46cc)
Nov 17 06:44:56 v1-oslo kagome[557803]: 24.11.17 06:44:56.281898  worker.25        Info      BlockStorage  Added block #25815335 (0xe1ab…0d05) as child of #25815334 (0xf69e…46cc)
Nov 17 06:44:57 v1-oslo kagome[557803]: 24.11.17 06:44:57.291405  main_runner      Info      BlockExecutor  Imported block #25815335 (0xe1ab…0d05) within 1314 ms. (lag 3291 ms.)
Nov 17 06:44:57 v1-oslo kagome[557803]: 24.11.17 06:44:57.308741  statement-distr  Info      StatementDistribution  Send my view. (peers_count=880)
Nov 17 06:44:58 v1-oslo kagome[557803]: 24.11.17 06:44:58.313495  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:44:59 v1-oslo kagome[557803]: 24.11.17 06:44:59.728689  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:45:00 v1-oslo kagome[557803]: 24.11.17 06:45:00.631230  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:45:00 v1-oslo kagome[557803]: 24.11.17 06:45:00.756659  worker.14        Info      BlockStorage  Added block #25815336 (0x1c59…59ec) as child of #25815335 (0xe1ab…0d05)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.609773  main_runner      Info      BlockExecutor  Imported block #25815336 (0x1c59…59ec) within 1320 ms. (lag 1609 ms.)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.617510  statement-distr  Info      StatementDistribution  Send my view. (peers_count=814)
Nov 17 06:45:01 v1-oslo kagome[557803]: 24.11.17 06:45:01.853707  main_runner      Info      Timeline  Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:02 v1-oslo kagome[557803]: 24.11.17 06:45:02.015478  main_runner      Info      Timeline  Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:02 v1-oslo kagome[557803]: 24.11.17 06:45:02.028220  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 22092206
Nov 17 06:45:06 v1-oslo kagome[557803]: 24.11.17 06:45:06.464939  main_runner      Info      Timeline  Caught up block #25815336 (0x1c59…59ec)
Nov 17 06:45:06 v1-oslo kagome[557803]: 24.11.17 06:45:06.951102  worker.2         Info      BlockStorage  Added block #25815337 (0x415a…0681) as child of #25815336 (0x1c59…59ec)
Nov 17 06:45:08 v1-oslo kagome[557803]: 24.11.17 06:45:08.013042  main_runner      Info      BlockExecutor  Imported block #25815337 (0x415a…0681) within 1530 ms. (lag 2012 ms.)
Nov 17 06:45:08 v1-oslo kagome[557803]: 24.11.17 06:45:08.030263  statement-distr  Info      StatementDistribution  Send my view. (peers_count=903)
Nov 17 06:45:10 v1-oslo kagome[557803]: 24.11.17 06:45:10.150924  main_runner      Warning   SyncProtocolObserver  cannot find a requested block with id 25398643
Nov 17 06:45:10 v1-oslo systemd[1]: kagosama.service: A process of this unit has been killed by the OOM killer.
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Main process exited, code=killed, status=9/KILL
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Failed with result 'oom-kill'.
Nov 17 06:45:12 v1-oslo systemd[1]: kagosama.service: Consumed 1h 7min 16.322s CPU time, received 35.8G IP traffic, sent 2.5G IP traffic.
Nov 17 06:45:42 v1-oslo systemd[1]: kagosama.service: Scheduled restart job, restart counter is at 10.

An example log from current refactor/notifications (b5834724b3f18b1d694051cae1afff06d1ee7652):

Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.777435  statement-distr  Info      StatementDistribution  Fetch attested candidate success. (relay parent=0x8eeb…7431, candidate=0xe3e0…9547, group index=38, statements=4)
Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.777466  statement-distr  Info      StatementDistribution  Not importable. (relay parent=0x8eeb…7431, candidate=0xe3e0…9547, group index=38)
Nov 15 13:55:25 v1-oslo kagome[518209]: 24.11.15 13:55:25.778373  main_runner      Info      ParachainProcessorImpl  Candidate backed.(candidate=0xe3e0…9547, para id=11, relay_parent=0x8eeb…7431)
Nov 15 13:55:27 v1-oslo kagome[518209]: 24.11.15 13:55:27.379774  main_runner      Info      BlockTree  Finalized block #25791090 (0xb5c5…9ce1)
Nov 15 13:55:29 v1-oslo kagome[518209]: 24.11.15 13:55:29.378418  grandpa          Info      GrandpaEnvironment  Found best chain is longer than approved: #25791092 (0x8eeb…7431) > #25791090 (0xb5c5…9ce1); truncate it
Nov 15 13:55:31 v1-oslo kagome[518209]: 24.11.15 13:55:31.269221  worker.9         Info      BlockStorage  Added block #25791093 (0xcb20…2797) as child of #25791092 (0x8eeb…7431)
Nov 15 13:55:35 v1-oslo kagome[518209]: 24.11.15 13:55:35.104468  main_runner      Info      BlockExecutor  Imported block #25791093 (0xcb20…2797) within 4589 ms. (lag 5103 ms.)
Nov 15 13:55:37 v1-oslo kagome[518209]: 24.11.15 13:55:37.459163  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:38.246090  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:39.634165  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:41.287427  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:43.217174  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:44.695502  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:45.461195  main_runner      Info      BlockTree  Finalized block #25791091 (0xd078…f4c2)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:46.080193  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:55:47 v1-oslo kagome[518209]: 24.11.15 13:55:46.866109  grandpa          Info      GrandpaEnvironment  Found best chain is longer than approved: #25791093 (0xcb20…2797) > #25791091 (0xd078…f4c2); truncate it
Nov 15 13:55:49 v1-oslo kagome[518209]: 24.11.15 13:55:48.796014  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:50.125748  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:52.024332  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:52.081410  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:55.035874  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:57.375167  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:55:58.591698  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:00.915122  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:02.325781  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0x5ac1…f9fc, group index=45, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:05.281818  main_runner      Info      ApprovalDistribution  Sending an approval to peers. (block=0xcb20…2797, num peers=4)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:06.386996  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0x7ded…3ca9, group index=47, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:08.579577  main_runner      Info      BlockTree  Finalized block #25791092 (0x8eeb…7431)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:09.463514  grandpa          Info      GrandpaEnvironment  Found best chain is longer than approved: #25791093 (0xcb20…2797) > #25791092 (0x8eeb…7431); truncate it
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:11.829982  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0xb80f…dd69, group index=54, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:28 v1-oslo kagome[518209]: 24.11.15 13:56:18.836390  statement-distr  Info      StatementDistribution  Fetch attested candidate returned an error. (relay parent=0xcb20…2797, candidate=0xf22e…772f, group index=48, error=N6libp2p10connection6Stream5ErrorE(10) Stream: reset by remote peer)
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Main process exited, code=killed, status=9/KILL
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Failed with result 'signal'.
Nov 15 13:56:31 v1-oslo systemd[1]: kagosama.service: Consumed 7min 144ms CPU time, 57.8G memory peak, 0B memory swap peak, read 30.5G from disk, written 18.8G to disk, received 1.7G IP traffic, sent 1.0G IP traffic.

Expected Behavior

The validator does not OOM? ^^

System Information

NixOS 24.5 with kernel 6.11.5 Compiler: gcc 13.2.0 CMake: cmake version 3.25.3

Built using flake from https://github.com/qdrvm/kagome/pull/2257

Relevant for this issue: host has 64 GB RAM.

Additional Context

Issue does not occur when running without --validator flag.

kamilsa commented 15 hours ago

Thanks for reporting! We have few ideas what to check

kamilsa commented 13 minutes ago

Hey @Lederstrumpf can you please check if https://github.com/qdrvm/kagome/pull/2278 improves situation?