sigp/lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

v5.2.0 built from source runs out of memory #5970

Closed: alecalve closed this issue 2 months ago

alecalve commented 3 months ago

Description

We build a Docker image from the Lighthouse v5.2.0 Dockerfile with very minor changes:

The binary itself is built the same way as in this repo's Dockerfile, using make and the default values for FEATURES, PROFILE, etc.

diff --git a/Dockerfile b/Dockerfile
index e0c48699b..da033a4bf 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -13,6 +13,21 @@ FROM ubuntu:22.04
 RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
   libssl-dev \
   ca-certificates \
-  && apt-get clean \
-  && rm -rf /var/lib/apt/lists/*
+  curl && \
+  apt-get clean && \
+  rm -rf /var/lib/apt/lists/*
+
 COPY --from=builder /usr/local/cargo/bin/lighthouse /usr/local/bin/lighthouse
+RUN useradd -m -u 1000 -s /bin/bash lighthouse
+RUN chown -R lighthouse:lighthouse /opt
+USER lighthouse
+RUN mkdir /opt/data
+WORKDIR /opt/lighthouse
+
+# Firewall configurations
+# P2P TCP, UDP
+EXPOSE 9001/TCP
+EXPOSE 9001/UDP
+# RPC
+EXPOSE 5053
+ENTRYPOINT [ "lighthouse" ]

The node is run using these arguments:

      --datadir=/opt/data
      --http
      --listen-address=0.0.0.0
      --listen-address=::
      --http-address=::
      --http-allow-origin=*
      --http-port=5052
      --target-peers=50
      --slots-per-restore-point=32
      --prune-blobs=false
      --execution-jwt=/secrets/jwtsecret
      --execution-endpoint=http://....:8551
      --port=9000
      --port6=9000
      --network=mainnet
      --disable-deposit-contract-sync
      --state-cache-size=2
      --metrics
      --metrics-address=::
      --metrics-port=8080
      --metrics-allow-origin=*

Version

The Docker image is built from v5.2.0

Present Behaviour

Once started, the application runs for a while and eventually runs out of memory after hitting the 64 GB limit we assigned to it.

This did not happen with the same modifications applied to previous versions of Lighthouse. The last one we had tested this with was v5.0.0.

Using the image you provide (sigp/lighthouse:v5.2.0) with the same arguments on the same datadir results in reasonable, stable memory usage.
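
(For context on where that limit comes from: a hard cap of this kind is applied at the container level. The exact deployment isn't shown here, but as a sketch, with the image name being a placeholder, it looks something like:)

    # hypothetical sketch: capping the container at 64 GB of memory
    docker run --memory=64g --memory-swap=64g our-lighthouse-image lighthouse beacon_node ...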

Expected Behaviour

The application should have a stable memory footprint and not run out of memory.

alecalve commented 3 months ago

Compiling with --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" --profile "maxperf" makes the application not run out of memory
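
For reference, a build with those settings looks roughly like this (a sketch; the repo's Makefile wraps this, so the exact invocation may differ):

    # sketch: build the lighthouse binary with the listed features and profile
    cargo install --path lighthouse --locked \
        --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" \
        --profile maxperf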

jimmygchen commented 3 months ago

Thanks for raising this! We'll take a look into it.

Could you try removing or increasing the value for --state-cache-size=2?

I haven't tested with a low state cache size since v5.2.0, but with the introduction of tree-states, state caches are now 32x cheaper, and we recommend adjusting this to a higher number (64 if you want to allocate the equivalent of 2 state caches in 5.1.3), or just using the default of 128. Using a low state cache size may result in more cache misses and copying, and potentially more memory usage elsewhere, although I'm not certain and need to take a deeper look.
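
For example (a sketch of the suggested change, using the same flag syntax as the arguments above):

    # either drop the flag to get the default of 128, or set it explicitly, e.g.
    lighthouse beacon_node --state-cache-size=64 ...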

alecalve commented 3 months ago

Ah yes, the low value for --state-cache-size was my initial attempt at getting memory under control, before I realized it came from a much deeper issue. I'll remove it, but I can say that the issue was occurring with the default value for that setting.

dapplion commented 3 months ago

Do you have metrics tracking the memory usage of this instance? It would be helpful to see whether the increase is linear or spiky, and at what rate. Also, please send us debug logs if possible so we can dive deeper.
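
For example (a sketch, not from this thread; --debug-level is the CLI's global log-verbosity flag, to the best of my knowledge):

    # sketch: raise log verbosity to debug so the logs can be captured and shared
    lighthouse --debug-level debug beacon_node --datadir=/opt/data ...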

alecalve commented 3 months ago

Here's what memory usage looked like; the process was OOM-killed each time it exceeded 64 GB:

[memory usage graph]

I don't have debug logs available.

michaelsproul commented 3 months ago

I wonder if maxperf is causing the compiler to find some optimisation that prevents an OOM. I will try to repro and profile the mem usage of a release build.

michaelsproul commented 3 months ago

I couldn't repro the OOM with LH v5.2.0 compiled from source with release profile. I ran for about 2 hours and mem didn't bump above 5GB. I didn't run under Docker though.

michaelsproul commented 3 months ago

I'm trying a long-range sync now, as one of the other OOM reports we had was a node syncing a month of history.

michaelsproul commented 3 months ago

No luck with the long-range sync either. Mem usage didn't bump above 3GB

chong-he commented 3 months ago

Compiling with --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" --profile "maxperf" makes the application not run out of memory

We have another report from @rz4884 on Discord who is facing the same issue, and that user confirmed it was because their build did not include --features jemalloc.

@michaelsproul mentioned that the Dockerfile appears not to enable jemalloc by default. We will fix this, thanks for reporting.
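
Until that fix lands, a workaround when building the image yourself would be along these lines (a sketch, assuming FEATURES and PROFILE are passed through as build args as described earlier in this thread):

    # sketch: opt in to jemalloc explicitly when building the Docker image
    docker build \
        --build-arg FEATURES=jemalloc \
        --build-arg PROFILE=maxperf \
        -t lighthouse:v5.2.0-jemalloc .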

michaelsproul commented 2 months ago

Closed by #5995

alecalve commented 1 month ago

It is happening again with v5.3.0 when building from source.

The image is built from commit d6ba8c397557f5c977b70f0d822a9228e98ca214 using

docker build --build-arg="PROFILE=maxperf" ...

michaelsproul commented 1 month ago

@alecalve does the resulting binary show jemalloc as the allocator in ./lighthouse --version?

alecalve commented 1 month ago

It does:

$ lighthouse --version
Lighthouse v5.3.0-b11261e
BLS library: blst
BLS hardware acceleration: true
SHA256 hardware acceleration: true
Allocator: jemalloc
Profile: maxperf
Specs: mainnet (true), minimal (false), gnosis (false)

michaelsproul commented 1 month ago

What's the memory usage getting to now on 5.3.0 when the OOM occurs?

It must be something other than the lack of jemalloc. Things to check:

alecalve commented 1 month ago

      beacon_node
      --datadir=/opt/data
      --http
      --listen-address=0.0.0.0
      --listen-address=::
      --http-address=::
      --http-allow-origin=*
      --http-port=5052
      --checkpoint-sync-url=https://mainnet-checkpoint-sync.attestant.io/
      --reconstruct-historic-states
      --target-peers=50
      --slots-per-restore-point=256
      --historic-state-cache-size=1
      --prune-blobs=false
      --execution-jwt=/secrets/jwtsecret
      --execution-endpoint=http://geth-1-cluster:8551
      --port=9000
      --port6=9000
      --network=mainnet
      --metrics
      --disable-deposit-contract-sync

The node has finished reconstructing states; we do see a lot of State cache missed logs, though.

michaelsproul commented 1 month ago

we do see a lot of State cache missed logs though.

This sounds like the issue.

The state cache referred to in this log is for the unfinalized portion of the chain. It shouldn't frequently miss with the default --state-cache-size of 128. It doesn't have anything to do with state reconstruction.
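
As a rough gauge (a sketch, not something requested in this thread), the frequency of those misses can be estimated from the captured logs:

    # count occurrences of the log message quoted above in a saved log file
    grep -c "State cache missed" beacon.log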

Can you post the output of curl http://localhost:5052/lighthouse/database/info | jq? I want to check if the split.slot is advancing normally and keeping the unfinalized part of the DB a manageable size.

Can you also provide some info on what's making state queries? How many requests per second, are they made concurrently, etc?

alecalve commented 1 month ago

  "split": {
    "slot": "9760096",
    "state_root": "0x19325b996b812c1c1d11728a0481f1c333c224e0fa6b10ebd9aefeddb34d9f44",
    "block_root": "0x52ea319a5ff08c1ca9914952690dff649c59808028cee0e450c50274faad04dc"
  }

But the node is way beyond that slot:

Aug 26 08:04:17.000 INFO Synced                                  slot: 9819619, block:    …  empty, epoch: 306863, finalized_epoch: 306861, finalized_root: 0x732f…d236, exec_hash: 0x3837…f93b (verified), peers: 13, service: slot_notifier

Full output:

{
  "schema_version": 21,
  "config": {
    "slots_per_restore_point": 256,
    "slots_per_restore_point_set_explicitly": true,
    "block_cache_size": 5,
    "state_cache_size": 128,
    "historic_state_cache_size": 1,
    "compact_on_init": false,
    "compact_on_prune": true,
    "prune_payloads": true,
    "prune_blobs": false,
    "epochs_per_blob_prune": 1,
    "blob_prune_margin_epochs": 0
  },
  "split": {
    "slot": "9760096",
    "state_root": "0x19325b996b812c1c1d11728a0481f1c333c224e0fa6b10ebd9aefeddb34d9f44",
    "block_root": "0x52ea319a5ff08c1ca9914952690dff649c59808028cee0e450c50274faad04dc"
  },
  "anchor": null,
  "blob_info": {
    "oldest_blob_slot": "9483873",
    "blobs_db": true
  }
}

alecalve commented 1 month ago

RPC-wise, the only users are some L2 nodes; I don't have insight into how frequently they query the node.

michaelsproul commented 1 month ago

But the node is way beyond that slot

This is the issue. The state migration must be failing.

Do you see an error log like:

WARN Block pruning failed

There was an old issue prior to v4.6.0 that could cause DB corruption similar to this:

But I'm guessing that seeing as it happened in the last 8 days, there's a new cause :\
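
For reference, the lag can be checked with something like this (a sketch combining the database endpoint used above with the standard beacon API headers endpoint):

    # sketch: compare the DB split slot with the current head slot
    SPLIT=$(curl -s http://localhost:5052/lighthouse/database/info | jq -r '.split.slot')
    HEAD=$(curl -s http://localhost:5052/eth/v1/beacon/headers/head | jq -r '.data.header.message.slot')
    echo "split=$SPLIT head=$HEAD lag=$((HEAD - SPLIT)) slots"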

michaelsproul commented 1 month ago

Please send me debug logs (@sproul on Discord), and I'll also open a new issue once we have some idea of what the problem is.

alecalve commented 1 month ago

After some discussion and more troubleshooting, the issue was pinned down to the node running itself out of memory by loading too many states during DB consolidation/pruning while there was a long period of non-finality.