sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Lighthouse dying because of short memory spikes #5263

Open Aracki opened 9 months ago

Aracki commented 9 months ago

Description

A few times a day, our Lighthouse containers hit short but high memory spikes. Just before each spike, the logs show a lot of:

ERRO Unable to validate attestation error: ObservedAttestationsError(SlotTooLow { slot: Slot(1035781), lowest_permissible_slot: Slot(1035797) }), peer_id: 16Uiu2HAkzJkSWstNAD1PR915MBhyByDbw8W1DJ5q7JJkMZa3saQd, type: "aggregated", slot: Slot(1035781), beacon_block_root: 0x74c201e1dfd6cc77dbc13ab8b93a791a0fc5f76aa0fcf42971cd0d4666aff3d1"

Besides this error, nothing useful can be seen (we haven't enabled DEBUG though).

Version

sigp/lighthouse:v4.6.0

Interestingly, this happens only on the Holesky network. We also run Lighthouse for Mainnet and Sepolia, but we don't see the issue there.

GirnaarNodes commented 9 months ago

We are also facing the same issue.

michaelsproul commented 9 months ago

How much memory is Lighthouse using when it OOMs? If you run sudo dmesg -T | grep killed, the output will show the resident set size (RSS).
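As a rough illustration of reading the RSS out of such a line, here is a minimal Python sketch. It assumes the OOM-killer line carries fields of the form anon-rss:NNNkB, file-rss:NNNkB and shmem-rss:NNNkB, which is common on recent Linux kernels but varies between versions. The sample line and the function names are made up for illustration and do not come from this issue. The match on "killed process" is case-insensitive because some kernels capitalise the word.

```python
import re

# Matches fields such as "anon-rss:8123456kB" in an OOM-killer log line.
# Assumption: the kernel prints RSS components in this form.
RSS_FIELD = re.compile(r"(anon|file|shmem)-rss:(\d+)kB")


def parse_oom_rss(line):
    """Return the RSS components (in kB) of one OOM-killer line, or None."""
    if "killed process" not in line.lower():
        return None
    fields = {name: int(kb) for name, kb in RSS_FIELD.findall(line)}
    return fields or None


def total_rss_gib(fields):
    """Sum the RSS components and convert from kB (1024 bytes) to GiB."""
    return sum(fields.values()) / (1024 * 1024)


# Hypothetical sample line with invented values, for illustration only.
SAMPLE = (
    "[Mon Feb 19 10:00:00 2024] Memory cgroup out of memory: "
    "Killed process 4242 (lighthouse) total-vm:20971520kB, "
    "anon-rss:12582912kB, file-rss:0kB, shmem-rss:0kB"
)

fields = parse_oom_rss(SAMPLE)
print(fields, total_rss_gib(fields))
```

Comparing the summed RSS with the container's memory limit shows whether the process was killed at the limit or well below it.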

On Holesky, Lighthouse routinely needs ~8GB of RAM. We are working on optimising this, but Holesky is inherently more resource hungry than other networks due to the higher validator count.

We've also fixed some issues for v5.0.0 which should improve memory usage, see:

And related issue:

michaelsproul commented 9 months ago

The SlotTooLow error isn't related. That was a separate, low-impact bug, also fixed for v5.0.0:

Aracki commented 9 months ago

Mean usage is around ~8Gi yes, but it seems it hits our memory limit of 12Gi very often.

chong-he commented 9 months ago

> Mean usage is around ~8Gi yes, but it seems it hits our memory limit of 12Gi very often.

12GB is tough. At least 16GB is required and 32GB is recommended, particularly on Holesky, where the number of validators is high.

michaelsproul commented 9 months ago

This PR will also help: https://github.com/sigp/lighthouse/pull/5270

Once that's merged you could try running unstable with --state-cache-size 2. We'll release v5.1.0 with that change quite soon.

SimplyCorey commented 1 week ago

I'm seeing this issue at least 2-3 times a day. I was seeing it roughly weekly before I upgraded to the latest release. I've tried setting the --state-cache-size flag and it doesn't appear to help. Is there anything I can provide here to help debug?

michaelsproul commented 1 week ago

@SimplyCorey I was doing some memory debugging recently, and a large amount of Lighthouse's spiky behaviour comes from doing I/O. Do you have a fast NVMe SSD?

michaelsproul commented 1 week ago

If you DM me debug logs I can take a closer look to rule out any other problem. I'm @sproul on Discord. You can find debug logs in $datadir/beacon/logs. Best to compress them before sending.
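One possible way to compress the logs before sending is sketched below in Python, using only the standard library. The function name compress_logs and the archive file name are hypothetical. The log directory is passed in explicitly, because the location of $datadir depends on how the node was started.

```python
import tarfile
from pathlib import Path


def compress_logs(log_dir, archive_path):
    """Pack everything under log_dir into a gzip-compressed tar archive."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        raise FileNotFoundError(f"log directory not found: {log_dir}")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps paths inside the archive relative to the logs folder.
        archive.add(log_dir, arcname=log_dir.name)
    return Path(archive_path)
```

For example, compress_logs(Path(datadir) / "beacon" / "logs", "lighthouse-debug-logs.tar.gz"), where datadir is the node's data directory. Write the archive outside the logs folder so it does not end up inside itself.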