sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

core dumped #3703

Closed happysalada closed 1 year ago

happysalada commented 1 year ago

Description

I've been running Lighthouse with Erigon, and yesterday Lighthouse got stuck in a core-dump-and-restart loop. Here is the log:

Bus error               (core dumped) /nix/store/bilkbnjl8zwlv4d1bwnm1hc21a9485j3-lighthouse-3.2.1/bin/lighthouse beacon_node --disable-upnp --disable-deposit-contract-sync --port 9000 --listen-address 0.0.0.0 --network mainnet --datadir /var/lib/lighthouse-beacon/mainnet --execution-endpoint http://localhost:8551 --execution-jwt /run/agenix/ERIGON_JWT --http --http-address 127.0.0.1 --http-port 5052

I'm afraid it doesn't contain a lot of info; I'm happy to gather more if you let me know what you need.

Version

3.2.1

Present Behaviour

stuck in a core dump loop

Expected Behaviour

no core dump, or the node recovers after an automated restart

Steps to resolve

not sure

michaelsproul commented 1 year ago

Likely a hardware issue; I'd run a memtest: https://memtest.org/

happysalada commented 1 year ago

Just to be sure: the system had been syncing for about a day before that. After a reboot, sync is continuing. I'll leave it running for a while and see if it happens again. I'll try to run a memtest this weekend.

michaelsproul commented 1 year ago

ok cool

can also recommend checkpoint sync to avoid the agony of syncing from genesis: https://lighthouse-book.sigmaprime.io/checkpoint-sync.html

happysalada commented 1 year ago

oh, thanks a lot for that!

happysalada commented 1 year ago

I have a small question if you have a moment. I naively thought that Lighthouse would consume 100% of CPU while syncing, but it seems to sit at 30%. Is there a setting I missed, or am I capped by network speed?

happysalada commented 1 year ago

Tiny bit of feedback: once, during sync from genesis, Lighthouse got stuck in a loop:

Nov 10 13:09:31.429 ERRO Database write failed!                  error: DBError { message: "Error { message: \"Corruption: bad block contents\" }" }, msg: Restoring fork choice from disk, service: beacon
Nov 10 13:09:31.826 CRIT Beacon block processing error           error: DBError(DBError { message: "Error { message: \"Corruption: bad block contents\" }" }), service: beacon
Nov 10 13:09:31.826 WARN BlockProcessingFailure                  outcome: DBError(DBError { message: "Error { message: \"Corruption: bad block contents\" }" }), msg: unexpected condition in processing block.
Nov 10 13:09:34.413 ERRO Database write failed!                  error: DBError { message: "Error { message: \"Corruption: bad block contents\" }" }, msg: Restoring fork choice from disk, service: beacon

After restarting my server, the sync is back on.

michaelsproul commented 1 year ago

Hey @happysalada, I've never seen either of these messages Corruption: bad block contents or Bus error on healthy hardware. This is not a Lighthouse issue, but something specific to your hardware/software stack. Some hypervisors like Proxmox have broken implementations of hardware instruction pass-through (particularly AES) that can cause mysterious errors like this. Is your Nix setup using some kind of virtualisation?

I naively thought that Lighthouse would consume 100% of CPU while syncing, but it seems to sit at 30%. Is there a setting I missed, or am I capped by network speed?

There are some single threaded bottlenecks in block processing: although we can verify some things in parallel, we need to finish fully processing all prior blocks before processing a new block. So it's not as simple as being able to max out the CPU all the time (sync isn't embarrassingly parallel).
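A rough sketch of the shape of that pipeline (with hypothetical stand-in functions, not Lighthouse's actual code): the independent checks fan out across threads, but the state transitions must be applied strictly in order, which is where the single-threaded bottleneck lives.

```rust
use std::thread;

// Stand-in for an independent per-block check (e.g. a signature check):
// it needs no prior state, so many can run at once.
fn verify_signature(sig: u64) -> bool {
    sig % 2 == 0 // placeholder predicate for illustration
}

// Stand-in for a state transition: block N+1 needs the post-state of
// block N, so these must run one after another.
fn apply_block(state: &mut u64, block: u64) {
    *state = state.wrapping_add(block);
}

fn main() {
    let blocks: Vec<u64> = (0..8).map(|b| b * 2).collect();

    // Parallel phase: verification is embarrassingly parallel.
    let handles: Vec<_> = blocks
        .iter()
        .map(|&b| thread::spawn(move || verify_signature(b)))
        .collect();
    let all_valid = handles.into_iter().all(|h| h.join().unwrap());
    assert!(all_valid);

    // Sequential phase: cannot be parallelised, so it caps CPU usage
    // no matter how many cores are available.
    let mut state = 0u64;
    for b in blocks {
        apply_block(&mut state, b);
    }
    println!("final state: {}", state);
}
```

The sequential phase is the reason overall CPU usage can sit well below 100% even when the parallel phase saturates every core.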

happysalada commented 1 year ago

No, everything is running on bare metal.

One more thing I found surprising is that the sync rate slows down over time. It started at 40 slots per second, then went down to 20, and is now at 5. Just a little surprising.

michaelsproul commented 1 year ago

The reason for the slowdown is the presence of more validators on the network as time goes on, so there are many more signatures to verify per slot.
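A back-of-envelope illustration (every number here is invented purely to mirror the figures in this thread, not measured): if signature verification dominates and a CPU can check a roughly fixed number of signatures per second, then slots per second fall in inverse proportion to the per-slot signature count.

```rust
fn main() {
    // Hypothetical fixed verification budget for this machine.
    let sig_budget_per_sec = 40_000.0_f64;

    // Hypothetical per-slot signature counts at three points in history:
    // as the validator set grows, each slot carries more signatures.
    for sigs_per_slot in [1_000.0_f64, 2_000.0, 8_000.0] {
        let slots_per_sec = sig_budget_per_sec / sigs_per_slot;
        println!("{:>6} sigs/slot -> {:.0} slots/sec", sigs_per_slot, slots_per_sec);
    }
}
```

With these made-up inputs the model reproduces a 40 -> 20 -> 5 slots-per-second decline; the real cost model has more moving parts, but the inverse relationship is the key intuition.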

michaelsproul commented 1 year ago

No, everything is running on bare metal.

Have you tried running a memory test and a SMART disk check?

happysalada commented 1 year ago

I'm running ZFS; that could be the reason for the problem. Let me check smartctl just to be sure.

michaelsproul commented 1 year ago

You might also have more luck with checkpoint sync. Syncing from genesis is not something we've optimised for because it is massively slower and no more secure than using checkpoint sync due to weak subjectivity. Our checkpoint sync implementation is optimised so you can sync in a few minutes, and there is now a diverse network of checkpoint providers to choose from: https://eth-clients.github.io/checkpoint-sync-endpoints/

michaelsproul commented 1 year ago

Closing due to inactivity and the lack of any issue identified in Lighthouse. Let me know if you'd like it re-opened.