sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

4.6.0 beacon_node memory usage issue #5227

Open SjonHortensius opened 9 months ago

SjonHortensius commented 9 months ago

Description

I realize 4.6.0 contains #4918, a fix for a previous OOM issue (which I never experienced), but ever since I upgraded I've been getting OOMs with some pretty big numbers (between 20 and 50 GiB used), which makes my setup highly unstable.

Version

latest stable Lighthouse v4.6.0-1be5253

Present Behaviour

I don't think my bn setup includes anything special, but FWIW:

/usr/bin/lighthouse -d /var/lib/lighthouse/beacon beacon_node --validator-monitor-auto --checkpoint-sync-url http://XXX:5052 --staking --port 9000 --http-port 5052 --http-address 0.0.0.0 --execution-endpoint http://127.0.0.1:8551 --execution-jwt /var/lib/lighthouse/beacon/jwtsecret --builder http://localhost:18550 --builder-profit-threshold XXX

Frequent OOMs, roughly 5-10 per day with varying amounts allocated:

Out of memory: Killed process 2045154 (lighthouse) total-vm:49913604kB, anon-rss:7651908kB, file-rss:616kB, shmem-rss:0kB, UID:64470 pgtables:68484kB oom_score_adj:0
Out of memory: Killed process 2289480 (lighthouse) total-vm:38670388kB, anon-rss:7477340kB, file-rss:1576kB, shmem-rss:0kB, UID:64470 pgtables:47296kB oom_score_adj:0
Out of memory: Killed process 2309773 (lighthouse) total-vm:32929348kB, anon-rss:7625508kB, file-rss:524kB, shmem-rss:0kB, UID:64470 pgtables:37356kB oom_score_adj:0
Out of memory: Killed process 2310656 (lighthouse) total-vm:41775684kB, anon-rss:7218900kB, file-rss:2396kB, shmem-rss:0kB, UID:64470 pgtables:47284kB oom_score_adj:0
Out of memory: Killed process 2340820 (lighthouse) total-vm:35665580kB, anon-rss:7158488kB, file-rss:4144kB, shmem-rss:0kB, UID:64470 pgtables:36064kB oom_score_adj:0
Out of memory: Killed process 2345368 (lighthouse) total-vm:20267340kB, anon-rss:7031200kB, file-rss:1488kB, shmem-rss:0kB, UID:64470 pgtables:19156kB oom_score_adj:0
Out of memory: Killed process 2345709 (lighthouse) total-vm:46387020kB, anon-rss:8294852kB, file-rss:0kB, shmem-rss:0kB, UID:64470 pgtables:53688kB oom_score_adj:0
Out of memory: Killed process 2371985 (lighthouse) total-vm:42253384kB, anon-rss:7546744kB, file-rss:2524kB, shmem-rss:0kB, UID:64470 pgtables:56480kB oom_score_adj:0
Out of memory: Killed process 2414108 (lighthouse) total-vm:19549408kB, anon-rss:7417528kB, file-rss:2304kB, shmem-rss:0kB, UID:64470 pgtables:18764kB oom_score_adj:0
Out of memory: Killed process 2414413 (lighthouse) total-vm:35328152kB, anon-rss:7206648kB, file-rss:1300kB, shmem-rss:0kB, UID:64470 pgtables:43248kB oom_score_adj:0
Out of memory: Killed process 2426583 (lighthouse) total-vm:18473132kB, anon-rss:7055376kB, file-rss:0kB, shmem-rss:0kB, UID:64470 pgtables:18672kB oom_score_adj:0
Out of memory: Killed process 2426890 (lighthouse) total-vm:40653920kB, anon-rss:7911312kB, file-rss:2952kB, shmem-rss:0kB, UID:64470 pgtables:52820kB oom_score_adj:0
Out of memory: Killed process 2459040 (lighthouse) total-vm:38758536kB, anon-rss:7408472kB, file-rss:1948kB, shmem-rss:0kB, UID:64470 pgtables:48676kB oom_score_adj:0
Out of memory: Killed process 2487203 (lighthouse) total-vm:26128836kB, anon-rss:7450632kB, file-rss:632kB, shmem-rss:0kB, UID:64470 pgtables:20984kB oom_score_adj:0
Out of memory: Killed process 2487581 (lighthouse) total-vm:22814708kB, anon-rss:7217240kB, file-rss:1772kB, shmem-rss:0kB, UID:64470 pgtables:22144kB oom_score_adj:0
Out of memory: Killed process 2487874 (lighthouse) total-vm:21608876kB, anon-rss:7101656kB, file-rss:1976kB, shmem-rss:0kB, UID:64470 pgtables:21812kB oom_score_adj:0
Out of memory: Killed process 2488170 (lighthouse) total-vm:37179496kB, anon-rss:7037800kB, file-rss:2448kB, shmem-rss:0kB, UID:64470 pgtables:34700kB oom_score_adj:0
Out of memory: Killed process 2489135 (lighthouse) total-vm:50698824kB, anon-rss:7391844kB, file-rss:3028kB, shmem-rss:0kB, UID:64470 pgtables:62416kB oom_score_adj:0


michaelsproul commented 9 months ago

If you have debug logs from this machine during the OOM (check $datadir/beacon/logs), please DM them to me on Discord (@sproul) or email them to me ($surname@sigmaprime.io).

michaelsproul commented 9 months ago

It may be that the message dequeueing isn't happening fast enough, so https://github.com/sigp/lighthouse/pull/5175 will help.
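To illustrate the general idea behind "dequeueing isn't happening fast enough", here is a minimal sketch of how a queue's backlog behaves when messages arrive faster than they are drained. This is not Lighthouse code: the function name, parameters, and the fixed per-tick rates are all hypothetical and chosen only for illustration.

```python
def backlog_over_time(arrivals_per_tick, drained_per_tick, ticks, capacity=None):
    """Return the queue length after each tick.

    Hypothetical model: a fixed number of messages arrives every tick and a
    fixed number is drained. If `capacity` is set, messages that do not fit
    at enqueue time are dropped (a bounded queue).
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick
        if capacity is not None:
            # Bounded queue: excess arrivals are dropped instead of stored.
            backlog = min(backlog, capacity)
        # Drain as many messages as the consumer can handle this tick.
        backlog = max(0, backlog - drained_per_tick)
        history.append(backlog)
    return history


# Unbounded queue, producer faster than consumer: backlog grows every tick.
print(backlog_over_time(5, 3, 4))
# Same rates with a bounded queue: backlog levels off.
print(backlog_over_time(5, 3, 4, capacity=6))
```

In this toy model, memory held by an unbounded queue grows linearly for as long as arrivals outpace draining, while a bounded queue trades that growth for dropped messages.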

michaelsproul commented 9 months ago

@SjonHortensius I've just noticed that the RSS for all of these crashes is in the 7GB range. You can ignore the higher total-vm number; it's not relevant here.

I think this is probably still a bug on the Lighthouse side; we're looking into it. Logs would be great.
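To compare the two numbers directly, a small sketch like the following can pull `total-vm` and `anon-rss` out of kernel OOM lines in the format quoted above and convert them to GiB. The regular expression and the function name `parse_oom_line` are assumptions written for this thread's log format, not part of any Lighthouse or kernel tooling.

```python
import re

# Matches the fields of interest in lines such as:
# "Out of memory: Killed process 2045154 (lighthouse) total-vm:49913604kB, anon-rss:7651908kB, ..."
OOM_LINE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KIB_PER_GIB = 1024 * 1024


def parse_oom_line(line):
    """Return pid, process name, and sizes in GiB, or None if the line does not match."""
    m = OOM_LINE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        "total_vm_gib": int(m["total_vm"]) / KIB_PER_GIB,
        "anon_rss_gib": int(m["anon_rss"]) / KIB_PER_GIB,
    }


sample = (
    "Out of memory: Killed process 2045154 (lighthouse) "
    "total-vm:49913604kB, anon-rss:7651908kB, file-rss:616kB, "
    "shmem-rss:0kB, UID:64470 pgtables:68484kB oom_score_adj:0"
)
result = parse_oom_line(sample)
print(f"{result['name']} pid={result['pid']} "
      f"total-vm={result['total_vm_gib']:.1f} GiB "
      f"anon-rss={result['anon_rss_gib']:.1f} GiB")
```

For the first line in the report, this gives a total-vm of about 47.6 GiB against an anon-rss of about 7.3 GiB, which is the gap described above.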

SjonHortensius commented 9 months ago

@michaelsproul you're right wrt the memory usage, I misinterpreted those.

I have relevant logs, but I am unwilling to publish them unscrubbed. I'll send some parts by email.

luarx commented 8 months ago

Execution layer: Erigon
Network: Mainnet

Lighthouse params:

"--debug-level=info",
"--datadir=/beacondata",
"--network=mainnet",
"beacon_node",
"--disable-enr-auto-update",
"--enr-address=127.0.0.1",
"--enr-tcp-port=9000",
"--enr-udp-port=9000",
"--port=9000" ,
"--discovery-port=9000",
"--eth1",
"--http",
"--http-address=0.0.0.0",
"--http-port=5052",
"--metrics",
"--metrics-address=0.0.0.0",
"--metrics-port=5054",
"--listen-address=0.0.0.0",
"--target-peers=100",
"--http-allow-sync-stalled",
"--disable-packet-filter",
"--execution-endpoint=http://localhost:9545",
"--jwt-secrets=/tmp/jwtsecret",
"--disable-deposit-contract-sync",
"--checkpoint-sync-url=https://beaconstate-mainnet.chainsafe.io"

Adding info from my own case here. Apart from memory spikes, I also see CPU spikes (maybe they are related).

I think the memory/CPU issue is not fixed yet. cc @michaelsproul @AgeManning

pawanjay176 commented 8 months ago

@luarx could you share some of your debug logs? Feel free to ping me on Discord.

luarx commented 8 months ago

@pawanjay176 what's your Discord username?

pawanjay176 commented 8 months ago

I'm pawan#7432. You should be able to find me under the SigmaPrime role on our Discord.