Closed happysalada closed 1 year ago
likely a hardware issue, I'd run a memtest: https://memtest.org/
Just to be sure: the system had been syncing for about a day before that. After a reboot, sync is continuing. I'll leave it running for a while and see if it happens again. I'll try to run a memtest this weekend.
ok cool
can also recommend checkpoint sync to avoid the agony of syncing from genesis: https://lighthouse-book.sigmaprime.io/checkpoint-sync.html
oh, thanks a lot for that!
I have a small question if you have a moment. I naively thought that lighthouse would consume 100% of the CPU while syncing, but it seems to sit at 30%. Is there a setting I missed? Or am I capped by network speed?
Tiny feedback: once, during sync from genesis, lighthouse got stuck in a loop:
```
Nov 10 13:09:31.429 ERRO Database write failed! error: DBError { message: "Error { message: \"Corruption: bad block contents\" }" }, msg: Restoring fork choice from disk, service: beacon
Nov 10 13:09:31.826 CRIT Beacon block processing error error: DBError(DBError { message: "Error { message: \"Corruption: bad block contents\" }" }), service: beacon
Nov 10 13:09:31.826 WARN BlockProcessingFailure outcome: DBError(DBError { message: "Error { message: \"Corruption: bad block contents\" }" }), msg: unexpected condition in processing block.
Nov 10 13:09:34.413 ERRO Database write failed! error: DBError { message: "Error { message: \"Corruption: bad block contents\" }" }, msg: Restoring fork choice from disk, service: beacon
```
After restarting my server, the sync is back on.
Hey @happysalada, I've never seen either of these messages (`Corruption: bad block contents` or `Bus error`) on healthy hardware. This is not a Lighthouse issue, but something specific to your hardware/software stack. Some hypervisors like Proxmox have broken implementations of hardware instruction pass-through (particularly AES) that can cause mysterious errors like this. Is your Nix setup using some kind of virtualisation?
> I naively thought that lighthouse would consume 100% of CPU syncing, it seems to sit at 30%, is there a setting that I missed? Or am I capped by network speed?
There are some single threaded bottlenecks in block processing: although we can verify some things in parallel, we need to finish fully processing all prior blocks before processing a new block. So it's not as simple as being able to max out the CPU all the time (sync isn't embarrassingly parallel).
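The shape of that bottleneck can be sketched roughly as follows. This is not Lighthouse code, just a toy model of the two phases described above: signature checks within a batch are independent and can run on all cores, but each block's state transition needs the state produced by the previous block, so that phase is strictly sequential.

```python
# Toy sketch (not Lighthouse code) of why sync can't max out every core.
from concurrent.futures import ThreadPoolExecutor

def verify_signature(block):
    # Stand-in for a real signature check: embarrassingly parallel,
    # depends only on the block itself. Here, blocks divisible by 7
    # are arbitrarily treated as invalid.
    return block % 7 != 0

def apply_state_transition(state, block):
    # Stand-in for block processing: must see the state produced by
    # every prior block, so it cannot be parallelised.
    return state + block

def sync(blocks):
    # Phase 1: verify all signatures concurrently.
    with ThreadPoolExecutor() as pool:
        valid = list(pool.map(verify_signature, blocks))
    # Phase 2: apply transitions strictly in order -- single threaded.
    state = 0
    for block, ok in zip(blocks, valid):
        if ok:
            state = apply_state_transition(state, block)
    return state
```

However many cores you throw at phase 1, phase 2 still runs on one core, which caps overall CPU utilisation well below 100%.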
No, everything is running on bare metal.
One more thing I found surprising is that the sync rate slows down over time: it started at 40 slots per second, then went down to 20, and is now at 5. Just a little surprising.
The reason for the slow down is the presence of more validators on the network as time goes on, so there are many more signatures to verify.
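As a back-of-the-envelope illustration of that effect: if every validator attests roughly once per epoch (32 slots), the number of signatures per slot grows with the validator set, so at a fixed verification throughput the slots-per-second rate falls in proportion. The throughput figure and validator counts below are made-up illustrative numbers, not measurements of Lighthouse.

```python
def slots_per_second(validators, sigs_checked_per_second=800_000):
    # Assume each validator attests once per epoch (32 slots), so a
    # slot carries roughly validators / 32 signatures to verify.
    # 800k sigs/sec is an arbitrary illustrative throughput.
    sigs_per_slot = validators / 32
    return sigs_checked_per_second / sigs_per_slot
```

With these toy numbers, 640k validators give 40 slots/sec and 5.12M give 5 slots/sec: an 8x larger validator set means 8x slower sync, matching the kind of slowdown observed as sync approaches the present.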
> No everything is running on baremetal
Have you tried running a memory test and a SMART disk check?
I'm running ZFS, that could be the reason for the problem. Let me check smartctl just to be sure.
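For reference, a typical way to run those checks (device and pool names below are examples; substitute your own):

```shell
# Kick off a long SMART self-test on the disk, then review results later:
smartctl -t long /dev/sda
smartctl -a /dev/sda

# ZFS can also surface silent corruption itself:
zpool status -v
zpool scrub tank   # "tank" is a placeholder pool name; re-check status after
```

A ZFS scrub verifies checksums on every block in the pool, so it can catch corruption that SMART alone would miss.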
You might also have more luck with checkpoint sync. Syncing from genesis is not something we've optimised for because it is massively slower and no more secure than using checkpoint sync due to weak subjectivity. Our checkpoint sync implementation is optimised so you can sync in a few minutes, and there is now a diverse network of checkpoint providers to choose from: https://eth-clients.github.io/checkpoint-sync-endpoints/
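For anyone following along, a checkpoint-synced start looks roughly like this. The flag names are from the Lighthouse book linked above; the endpoint and execution URL are example values, so substitute a provider and your own execution client:

```shell
# Start the beacon node from a recent finalized checkpoint instead of genesis.
# --checkpoint-sync-url: any trusted provider from the list linked above.
# --execution-endpoint: your local execution client's engine API (example URL).
lighthouse bn \
  --network mainnet \
  --checkpoint-sync-url https://mainnet.checkpoint.sigp.io \
  --execution-endpoint http://localhost:8551
```

The node fetches the finalized state from the provider, verifies it against weak subjectivity, and then syncs forward from there in minutes rather than weeks.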
Closing due to inactivity and lack of any issue identified in Lighthouse. Let me know if you'd like it re-opened.
Description
I've been running lighthouse with erigon, and yesterday lighthouse got stuck in a core dump and restart loop. Here is the log.
I'm afraid it doesn't contain a lot of info; happy to try to get more if you tell me what you need.
Version
3.2.1
Present Behaviour
stuck in a core dump loop
Expected Behaviour
No core dump, or it goes away after an automated restart.
Steps to resolve
Not sure.