trezor / blockbook

:blue_book: Trezor address/account balance backend
https://trezor.io
GNU Affero General Public License v3.0

[ETH]: Blockbook stuck and then Inconsistent DB state #1057

Open knaperek opened 3 months ago

knaperek commented 3 months ago

Hi, I'm having issues with the latest Ethereum Blockbook (v0.4.0).

After ~2 days of building its DB while connected to a local, fully synced Erigon node, and roughly an hour before catching up (my rough estimate), it suddenly can't connect to Erigon anymore and keeps retrying the connection over and over, without luck.

After a while I tried to stop it (systemctl stop blockbook-ethereum.service) and restart it, but it wouldn't respond to SIGTERM and eventually got KILLed by systemd (I even tried increasing the default timeout in the unit file, which didn't help).

Once restarted, the DB is broken and I have to wipe it all and start over, only to end up with the same problem once it catches up again:

E0326 11:14:32.985938 31129 blockbook.go:222] internalState: database is in inconsistent state and cannot be used

Is there any way to truncate the last additions to the DB so that Blockbook can pick up from the last known good state instead of starting from scratch? I'd be happy to investigate the connection issues further, but that is really impossible if I have to wait a couple of days every time. Thanks for any help!

System details:

- CPU: AMD Ryzen 9, 16 cores
- RAM: 128 GiB
- NVMe: 7 TB
- OS: Debian 10

martinboehm commented 3 months ago

Unfortunately, there is no safe way to get out of the inconsistent state. You will have to delete the database and start the sync again. As the sync takes a long time, I would recommend stopping the sync from time to time and backing up the database. Stopping the service works better when there are no issues with the backend connection.
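
For illustration, the backup can be as simple as stopping the service and copying the RocksDB directory aside, roughly like the sketch below. The paths are assumptions (the deb packages usually put the database under /opt/coins/data/ethereum/blockbook/db, and the backup target is arbitrary); run it only while blockbook-ethereum.service is stopped, so the files are not being modified during the copy.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// Assumed paths - adjust to your installation. Run this only while
// blockbook-ethereum.service is stopped so the RocksDB files are not
// being modified during the copy.
const (
	dbDir     = "/opt/coins/data/ethereum/blockbook/db"
	backupDir = "/opt/coins/backups"
)

func main() {
	dst := filepath.Join(backupDir, "db-"+time.Now().Format("20060102-150405"))
	if err := copyDir(dbDir, dst); err != nil {
		fmt.Fprintln(os.Stderr, "backup failed:", err)
		os.Exit(1)
	}
	fmt.Println("backup written to", dst)
}

// copyDir recursively copies src into dst, preserving the directory layout.
func copyDir(src, dst string) error {
	return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(target)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}
```

Restoring is the reverse: stop the service, replace the db directory with the backed-up copy, and start again.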

knaperek commented 3 months ago

I see... Well, thanks anyway. Is there any way to contribute back by sharing some crash dumps, or is this too rare to be worth the effort?

Perhaps it would be beneficial to take a closer look at the SIGTERM handler to make sure it closes the DB connection as soon as possible, with a special focus on backend connection issues and retries in progress at the time, as these are likely blocking proper program termination.
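
To illustrate what I mean: the usual Go approach would be to turn the signal into a context cancellation that every retry loop checks; if a worker keeps retrying without ever looking at that context, the process appears to ignore SIGTERM. A minimal sketch of that pattern (not Blockbook's actual code; connectBlock is just a hypothetical stand-in for the failing backend call):

```go
package main

import (
	"context"
	"errors"
	"log"
	"os/signal"
	"syscall"
	"time"
)

// connectBlock is a hypothetical stand-in for the backend call that keeps
// failing with "Block not found".
func connectBlock() error { return errors.New("Block not found") }

func main() {
	// Turn SIGTERM/SIGINT into a context cancellation.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	// A retry loop that never checks ctx makes the process look like it
	// ignores signals; checking ctx lets SIGTERM end the retries.
	for {
		err := connectBlock()
		if err == nil {
			return
		}
		log.Printf("connect block error %v. Retrying...", err)
		select {
		case <-time.After(time.Second):
		case <-ctx.Done():
			log.Print("shutdown requested, stopping retries")
			return
		}
	}
}
```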

knaperek commented 3 months ago

It happened again, just before finally getting synced up.

Blockbook

| Field | Value |
| --- | --- |
| Coin | Ethereum |
| Host | eth-blockbook |
| Version / Commit / Build | 0.4.0 / [fdc2b863-dirty](https://github.com/trezor/blockbook/commit/fdc2b863-dirty) / 2023-04-14T12:25:23+00:00 |
| Synchronized | false |
| Last Block | 19426362 |
| Last Block Update | 3 hours ago |
| Mempool in Sync | false |
| Last Mempool Update | |
| Transactions in Mempool | 0 |
| Size On Disk | 267327232450 |

Backend

| Field | Value |
| --- | --- |
| Chain | mainnet |
| Version | erigon/2.59.0/linux-amd64/go1.21.6 |
| Consensus Version | Prysm/v5.0.1 (linux amd64) |
| Last Block | 19534275 |
| Difficulty | 0 |

It's stuck, rapidly printing these log lines over and over:

Mar 29 09:18:33 eth-blockbook blockbook[5852]: E0329 09:18:33.855531    5852 sync.go:339] getBlockWorker 1 connect block error Block not found. Retrying...
Mar 29 09:18:33 eth-blockbook blockbook[5852]: E0329 09:18:33.855563    5852 sync.go:339] getBlockWorker 3 connect block error Block not found. Retrying...
Mar 29 09:18:33 eth-blockbook blockbook[5852]: E0329 09:18:33.855637    5852 sync.go:339] getBlockWorker 7 connect block error Block not found. Retrying...
Mar 29 09:18:33 eth-blockbook blockbook[5852]: E0329 09:18:33.855651    5852 sync.go:339] getBlockWorker 0 connect block error Block not found. Retrying...
Mar 29 09:18:33 eth-blockbook blockbook[5852]: E0329 09:18:33.855690    5852 sync.go:339] getBlockWorker 2 connect block error Block not found. Retrying...

The backend (Erigon) seems totally fine and healthy. When I stop Erigon, Blockbook logs the following:

Mar 29 09:25:03 eth-blockbook blockbook[5852]: E0329 09:25:03.902686    5852 sync.go:339] getBlockWorker 4 connect block error hash 0x1bedb6db8951408ac3ba9cb98242e896a805757ad142088be070608513132431, height 19426594: dial tcp 127.0.0.1:8036: connect: connection refused. Retrying...
Mar 29 09:25:03 eth-blockbook blockbook[5852]: E0329 09:25:03.902821    5852 sync.go:339] getBlockWorker 1 connect block error hash 0x4409f5637295fd4aba370f5504a23d5ea1f8a55ec530ece9c40f6b02b596da79, height 19426589: dial tcp 127.0.0.1:8036: connect: connection refused. Retrying...
Mar 29 09:25:03 eth-blockbook blockbook[5852]: E0329 09:25:03.902921    5852 sync.go:339] getBlockWorker 7 connect block error hash 0x63a0ef1050655a7982408b779ad8034a92f085576439aab3cfbc9187d6565db1, height 19426588: dial tcp 127.0.0.1:8036: connect: connection refused. Retrying...

Instead of stopping Blockbook with systemctl (and risking a KILL after TimeoutStopSec), I've tried manually sending various signals to the main blockbook process, both while Erigon was running and while it was stopped. This is the list of signals I've tried:

SIGTERM
SIGINT
SIGQUIT
SIGUSR1
SIGUSR2
SIGCHLD

I tried them multiple times and in various combinations while watching the blockbook logs. Absolutely no effect. Does it catch and silently ignore all POSIX signals?

martinboehm commented 3 months ago

Hmm, normally Blockbook handles stopping well; however, it seems that it cannot get out of the Block not found retry loop, even though there is code that should break it.
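
For illustration only, the kind of bound that should normally end such a loop looks roughly like the sketch below: retry a limited number of times and also respect cancellation. This is a simplified example, not the actual code in sync.go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryConnectBlock retries a backend call a bounded number of times and
// stops on success, on context cancellation, or after maxRetries, instead
// of spinning forever on a permanent error like "Block not found".
func retryConnectBlock(ctx context.Context, connect func() error) error {
	const maxRetries = 10
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = connect(); err == nil {
			return nil
		}
		select {
		case <-time.After(time.Second):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	// Simulate a block that never appears on the backend.
	err := retryConnectBlock(context.Background(), func() error {
		return errors.New("Block not found")
	})
	fmt.Println(err)
}
```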

However, what interests me even more is how you managed to get into this situation at all. The block with the hash 0x1bedb6db8951408ac3ba9cb98242e896a805757ad142088be070608513132431 does not exist in Ethereum. Are you syncing Blockbook while Erigon is still syncing? That is generally a bad idea, as Blockbook in bulk mode is not able to handle reorgs.

knaperek commented 3 months ago

Erigon has been running fully synced and following the chain in real time. I have no idea why Blockbook is picking up those non-existent block hashes.

I did as you suggested with the incremental backups. I now have a backup that is very close (~5 minutes away) to the critical point. It seems I can reproduce this 100% now.

Interestingly, it is a moving target, but not by much. Now it got stuck at:

Blockbook: 19 426 377
Backend:   19 542 459

That is only 15 blocks more than in the example described above, so the gap up to the chain tip has actually grown since.

knaperek commented 3 months ago

I'm still wondering what I'm doing wrong. On another (fresh) server I've built and set up the following services, hoping to get things working painlessly:

make all-ethereum-archive-consensus
make all-ethereum_archive

And installed all 3 deb packages:

backend-ethereum-archive-consensus_5.0.1-satoshilabs-1_amd64.deb
backend-ethereum-archive_2.58.2-satoshilabs-1_amd64.deb
blockbook-ethereum-archive_0.4.0_amd64.deb

It is revision b5cfbdfde, and this time I did not change anything in the config (on purpose) - not even the blockbook URL.

It's been running for 6 days and still has not synced up. Once in a while I turn on Blockbook to see if Erigon is already synced, as I can't see this easily on Erigon itself. Unfortunately it's not as convenient as geth, where I used to attach a console (geth attach) and check eth.syncing etc.; I haven't figured out how to check the sync status easily on Erigon yet. Anyway, I start Blockbook (systemctl start blockbook-ethereum-archive.service) every day to check the status, and for the last 2 days the backend has been reporting the same Last block: 18 999 999.
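
I suppose the plain eth_syncing JSON-RPC call should work against Erigon as well, so a geth-style console isn't strictly needed. A rough sketch of what I have in mind, assuming the backend's HTTP RPC is reachable at http://127.0.0.1:8036 as in this package config (adjust for your setup):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Assumed RPC endpoint - matches the 127.0.0.1:8036 seen in the logs above,
// but adjust it to wherever your Erigon serves HTTP JSON-RPC.
const rpcURL = "http://127.0.0.1:8036"

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_syncing","params":[]}`)
	resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintln(os.Stderr, "rpc call failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var reply struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}

	// eth_syncing returns false when the node considers itself synced,
	// otherwise an object with currentBlock/highestBlock as hex numbers.
	if string(reply.Result) == "false" {
		fmt.Println("Erigon reports it is synced")
		return
	}
	fmt.Println("still syncing:", string(reply.Result))
}
```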

When I look at the backend logs while Blockbook is turned on, they're flooded with:

Mar 31 22:44:32 eth-blockbook-2 sh[88110]: [WARN] [03-31|22:44:32.859] [rpc] served                             method=eth_call reqid=1744013 t=185.029µs err="execution reverted"
Mar 31 22:44:32 eth-blockbook-2 sh[88110]: [WARN] [03-31|22:44:32.871] [rpc] served                             method=eth_call reqid=1744014 t=184.198µs err="execution reverted"
Mar 31 22:44:33 eth-blockbook-2 sh[88110]: [WARN] [03-31|22:44:33.103] [rpc] served                             method=eth_call reqid=1744077 t=169.76µs err="execution reverted"
Mar 31 22:44:33 eth-blockbook-2 sh[88110]: [WARN] [03-31|22:44:33.127] [rpc] served                             method=eth_call reqid=1744086 t=88.286µs err="execution reverted"

You mentioned earlier that "Blockbook in bulk mode is not able to handle reorgs" - this is interesting; I didn't know there was a "bulk mode", and I guess there is some other mode used to follow the chain once the initial sync is done? Is there any way to see which mode Blockbook is currently running in? And if the "bulk mode" is so dangerous while the backend is still syncing up, wouldn't it be safer to make Blockbook wait with the sync until it is safe? I guess a default behavior like this would be safer and more convenient: one could just start both services and wait until everything is ready, instead of coming back and manually checking whether the backend has already synced before starting Blockbook. Just an idea...
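
As for seeing which mode Blockbook is currently in: I suppose the status JSON served at the public /api endpoint could be polled; as far as I can tell it mirrors what the status page shows (in-sync flag, best height, and in recent versions an initial-sync indicator, though the exact field names may differ between versions). A rough sketch that just dumps the two status sections, assuming the default Ethereum public port 9136 over plain HTTP - adjust scheme and port to your configuration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Assumed address: 9136 is the default public port for Ethereum in the
// packaged setup; change scheme/port to match your configuration.
const statusURL = "http://127.0.0.1:9136/api"

func main() {
	resp, err := http.Get(statusURL)
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Decode into raw sections so this keeps working even if individual
	// field names differ between Blockbook versions.
	var status map[string]json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}
	fmt.Println("blockbook:", string(status["blockbook"]))
	fmt.Println("backend:  ", string(status["backend"]))
}
```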

knaperek commented 3 months ago

So I couldn't resist turning Blockbook on (on the second server), even though the backend was (reportedly) not making any progress after reaching block 18 999 999. Now Blockbook has just caught up and everything is green, saying "Synchronized: TRUE". But the last block is still reported as 18 999 999, for both Blockbook and the Backend. That's it, no further progress :-(

Could there be a problem with Prysm? I'm using the built-in backend-ethereum-archive-consensus.service, but I'm wondering whether it would be better to just run Erigon with --internalcl. Is there any reason to run a standalone consensus layer service (Prysm) alongside Erigon? Thanks!