Closed: alecalve closed this issue 2 months ago
Compiling with --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" --profile "maxperf"
makes the application not run out of memory
Thanks for raising this! We'll take a look into it.
Could you try removing or increasing the value for --state-cache-size=2?
I haven't tried testing with a low state cache size since v5.2.0, but with the introduction of tree-states, state caches are now 32x cheaper. We recommend adjusting this to a higher number (64 if you want to allocate the equivalent of 2 state caches in 5.1.3), or just use the default of 128. Using a low state cache size may result in more cache misses and copying, and potentially more memory usage elsewhere, although I'm not completely sure and need to take a deeper look.
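A quick sketch of that suggestion, assuming the same CLI shape as the --state-cache-size=2 override quoted above (other flags elided):

# drop the override to get the default of 128, or set it explicitly:
lighthouse beacon_node --state-cache-size=128 ...
# 64 allocates roughly the equivalent of 2 pre-tree-states caches on 5.1.3:
lighthouse beacon_node --state-cache-size=64 ...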
Ah yes, the low value for --state-cache-size was my initial attempt at getting memory under control until I realized it came from a much deeper issue. I'll remove it, but I can say that the issue was occurring with the default value for that setting.
Do you have metrics tracking the memory usage of this instance? It would be helpful to see whether it's a linear or spiky increase and at what rate. Also, please send us debug logs if possible so we can dive deeper.
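One way to capture such a trace, assuming the node runs under Docker as described in this issue; the container name lighthouse-bn is hypothetical, and process_resident_memory_bytes is the standard Prometheus process metric, which this build may or may not expose:

# sample resident memory once a minute to see whether growth is linear or spiky
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  curl -s http://localhost:5054/metrics | grep '^process_resident_memory_bytes'
  sleep 60
done
# or, per container:
docker stats --no-stream --format '{{.Name}} {{.MemUsage}}' lighthouse-bn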
Here's what memory usage looked like; it was OOM-killed each time it exceeded 64G:
I don't have debug logs available.
I wonder if maxperf is causing the compiler to find some optimisation that prevents an OOM. I will try to repro and profile the mem usage of a release build.
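A sketch of that comparison, assuming the Makefile honours the FEATURES and PROFILE variables mentioned later in this thread:

# release-profile build from source, for comparison
FEATURES="portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" PROFILE=release make
# the reporter's non-OOMing build
FEATURES="portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" PROFILE=maxperf make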
I couldn't repro the OOM with LH v5.2.0 compiled from source with the release profile. I ran for about 2 hours and mem didn't bump above 5GB. I didn't run under Docker though.
I'm trying a long-range sync now, as one of the other OOM reports we had was a node syncing a month of history.
No luck with the long-range sync either. Mem usage didn't bump above 3GB.
Compiling with --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" --profile "maxperf" makes the application not run out of memory
We have another report from @rz4884 on Discord, who is facing the same issue, and the user confirmed that it is because he doesn't include --features jemalloc.
@michaelsproul mentions that the Dockerfile does not appear to enable jemalloc by default. We will fix this, thanks for reporting.
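Until that fix lands, a workaround sketch using the same build-args referenced elsewhere in this thread (the image tag is arbitrary):

docker build \
  --build-arg="FEATURES=portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" \
  --build-arg="PROFILE=maxperf" \
  -t lighthouse:jemalloc .
# then confirm the allocator in the resulting image:
docker run --rm lighthouse:jemalloc lighthouse --version | grep Allocator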
Closed by #5995
It is happening again using 5.3.0 when building from source.
The image is built from commit d6ba8c397557f5c977b70f0d822a9228e98ca214 using
docker build --build-arg="PROFILE=maxperf" ...
@alecalve does the resulting binary show jemalloc as the allocator in ./lighthouse --version?
It does:
$ lighthouse --version
Lighthouse v5.3.0-b11261e
BLS library: blst
BLS hardware acceleration: true
SHA256 hardware acceleration: true
Allocator: jemalloc
Profile: maxperf
Specs: mainnet (true), minimal (false), gnosis (false)
What's the memory usage getting to now on 5.3.0 when the OOM occurs?
It must be something other than the lack of jemalloc. Things to check:
- The output of curl http://localhost:5054/metrics, posted here or via DM (@sproul on Discord). In particular it would be good to check beacon_fork_choice_process_attestation_seconds_* to rule out the issue described here: https://github.com/sigp/lighthouse/issues/6206.
- The flags the node is running with: --state-cache-size, --reconstruct-historic-states, etc.
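For the first item, a minimal way to pull just that histogram out of the metrics endpoint named above:

curl -s http://localhost:5054/metrics \
  | grep '^beacon_fork_choice_process_attestation_seconds'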
The beacon_fork_choice_process_attestation_seconds_* metrics look healthy, metrics are here. The node runs with:

beacon_node
--datadir=/opt/data
--http
--listen-address=0.0.0.0
--listen-address=::
--http-address=::
--http-allow-origin=*
--http-port=5052
--checkpoint-sync-url=https://mainnet-checkpoint-sync.attestant.io/
--reconstruct-historic-states
--target-peers=50
--slots-per-restore-point=256
--historic-state-cache-size=1
--prune-blobs=false
--execution-jwt=/secrets/jwtsecret
--execution-endpoint=http://geth-1-cluster:8551
--port=9000
--port6=9000
--network=mainnet
--metrics
--disable-deposit-contract-sync
The node has finished reconstructing states; we do see a lot of State cache missed logs though.
we do see a lot of State cache missed logs though.
This sounds like the issue.
The state cache referred to in this log is for the unfinalized portion of the chain. It shouldn't frequently miss with the default --state-cache-size of 128. It doesn't have anything to do with state reconstruction.
Can you post the output of curl http://localhost:5052/lighthouse/database/info | jq? I want to check if the split.slot is advancing normally and keeping the unfinalized part of the DB a manageable size.
Can you also provide some info on what's making state queries? How many requests per second, are they made concurrently, etc?
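As an aside, if this gets polled repeatedly, a narrower query for just the split slot (field names assumed from the output below):

curl -s http://localhost:5052/lighthouse/database/info | jq -r .split.slot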
"split": {
"slot": "9760096",
"state_root": "0x19325b996b812c1c1d11728a0481f1c333c224e0fa6b10ebd9aefeddb34d9f44",
"block_root": "0x52ea319a5ff08c1ca9914952690dff649c59808028cee0e450c50274faad04dc"
}
But the node is way beyond that slot:
Aug 26 08:04:17.000 INFO Synced slot: 9819619, block: … empty, epoch: 306863, finalized_epoch: 306861, finalized_root: 0x732f…d236, exec_hash: 0x3837…f93b (verified), peers: 13, service: slot_notifier
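(As a rough aside, using the two numbers above: the head is 9819619 - 9760096 = 59523 slots past the split, which at mainnet's 12-second slot time is roughly 8.3 days of history piling up in the unfinalized/hot part of the database.)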
Full output:
{
"schema_version": 21,
"config": {
"slots_per_restore_point": 256,
"slots_per_restore_point_set_explicitly": true,
"block_cache_size": 5,
"state_cache_size": 128,
"historic_state_cache_size": 1,
"compact_on_init": false,
"compact_on_prune": true,
"prune_payloads": true,
"prune_blobs": false,
"epochs_per_blob_prune": 1,
"blob_prune_margin_epochs": 0
},
"split": {
"slot": "9760096",
"state_root": "0x19325b996b812c1c1d11728a0481f1c333c224e0fa6b10ebd9aefeddb34d9f44",
"block_root": "0x52ea319a5ff08c1ca9914952690dff649c59808028cee0e450c50274faad04dc"
},
"anchor": null,
"blob_info": {
"oldest_blob_slot": "9483873",
"blobs_db": true
}
}
RPC-wise, the only users are some L2 nodes; I don't have insight into how frequently they query the node.
But the node is way beyond that slot
This is the issue. The state migration must be failing.
Do you see an error log like:
WARN Block pruning failed
There was an old issue prior to v4.6.0 that could cause DB corruption similar to this. But I'm guessing that, seeing as it happened in the last 8 days, there's a new cause :\
Please send me debug logs (@sproul on Discord) and I'll also open a new issue once we have some idea of what the problem is.
After some discussion and more troubleshooting, the issue was pinned down to the node OOM-killing itself by loading too many states during DB consolidation/pruning while the chain had a long period of non-finality.
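A small sketch of how one might watch for this condition recurring, reusing the HTTP API port from this thread; the standard beacon API headers endpoint and the jq field paths are assumptions about the response shape:

split=$(curl -s http://localhost:5052/lighthouse/database/info | jq -r .split.slot)
head=$(curl -s http://localhost:5052/eth/v1/beacon/headers/head | jq -r .data.header.message.slot)
# a persistently growing gap means finalized states are not being migrated/pruned
echo "split=$split head=$head lag=$((head - split)) slots"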
Description
We build a Docker image from the Lighthouse v5.2.0 Dockerfile with very minor changes:
- curl inside the image directly

The binary itself is built the same way as in this repo's Dockerfile, using make and the default values for FEATURES, PROFILE, etc.
The node is run using these arguments:
Version
The Docker image is built from v5.2.0.
Present Behaviour
Once started, the application runs and eventually runs out of memory after hitting the 64GB limit we assigned to it.
This did not happen with the same modifications applied to previous versions of Lighthouse. The last one we had tested this with was v5.0.0.
Using the image you provide (sigp/lighthouse:v5.2.0) with the same arguments on the same datadir results in reasonable, stable memory usage.
Expected Behaviour
The application should have a stable memory footprint and not run out of memory.