ledgerwatch / erigon

Ethereum implementation on the efficiency frontier
GNU Lesser General Public License v3.0
3.03k stars 1.06k forks source link

Upgrading to 2.58.1 leads to some missing historical blocks #9612

Open ryanschneider opened 4 months ago

ryanschneider commented 4 months ago

System information

Erigon version: erigon version 2.58.1 OS & Version: Linux

Erigon Command (with flags/config):

--db.size.limit=8TB --db.pagesize=4KB --snap.stop --chain=mainnet  --pprof --rpc.gascap 500000000 --nat=stun  --authrpc.addr=0.0.0.0 --authrpc.vhosts=* --http --http.addr 0.0.0.0 --http.corsdomain * --http.port 8545 --http.api eth,trace,net,engine,web3 --http.vhosts * --port 30303 --maxpeers 50 --txpool.nolocals --txpool.accountslots 32 --txpool.globalslots 8192 --txpool.accountqueue 128 --txpool.globalqueue 2048 --txpool.lifetime 3h0m0s --datadir /mnt/local/ethereum --ws    --authrpc.jwtsecret /mnt/local/ethereum/engine-jwt.key

Consensus Layer: Teku

--log-destination=console --data-base-path=/mnt/local/ethereum --network=mainnet --ee-endpoint=http://localhost:8551 --eth1-endpoint=http://localhost:8545 --ee-jwt-secret-file=/mnt/local/ethereum/engine-jwt.key --rest-api-enabled=true --rest-api-host-allowlist=* --rest-api-interface=0.0.0.0 --rest-api-port=5051 --metrics-enabled=true --metrics-host-allowlist=* --metrics-interface=0.0.0.0 --metrics-port=8008 --metrics-block-timing-tracking-enabled=true

Chain/Network: Mainnet

Expected behaviour

Upgrading from 2.52.5 to 2.58.1 should work without issue.

Actual behaviour

When upgrading from 2.52.5 to 2.58.1 our nodes routinely end up missing access to all blocks between 18500000 and 19198999.

Note however that the block ranges do show up in the snapshots directory. Actually, multiple overlapping files show up, for example I see files both in the range v1-018500-018600* as well as additional smaller files w/ a smaller range (v1-018500-018501*, 501-502* etc). In fact I see three overlapping files starting at 018500-:

-rw-r--r-- 1 erigon erigon 4.8K Mar  4 17:04 v1-018500-018501-bodies.idx
-rw-r--r-- 1 erigon erigon 239K Mar  4 17:04 v1-018500-018501-bodies.seg
-rw-r--r-- 1 erigon erigon 4.9K Mar  4 18:04 v1-018500-018501-headers.idx
-rw-r--r-- 1 erigon erigon 489K Mar  4 17:04 v1-018500-018501-headers.seg
-rw-r--r-- 1 erigon erigon 608K Mar  4 20:22 v1-018500-018501-transactions-to-block.idx
-rw-r--r-- 1 erigon erigon 814K Mar  4 20:22 v1-018500-018501-transactions.idx
-rw-r--r-- 1 erigon erigon  97M Mar  4 17:04 v1-018500-018501-transactions.seg
-rw-r--r-- 1 erigon erigon  36M Mar  5 22:12 v1-018500-018600-bodies.seg
-rw-r--r-- 1 erigon erigon  747 Mar  5 20:40 v1-018500-018600-bodies.seg.torrent
-rw-r--r-- 1 erigon erigon  47M Mar  5 21:23 v1-018500-018600-headers.seg
-rw-r--r-- 1 erigon erigon  868 Mar  5 20:40 v1-018500-018600-headers.seg.torrent
-rw-r--r-- 1 erigon erigon 9.5G Mar  5 22:29 v1-018500-018600-transactions.seg
-rw-r--r-- 1 erigon erigon  95K Mar  5 20:40 v1-018500-018600-transactions.seg.torrent
-rw-r--r-- 1 erigon erigon 2.7M Mar  5 17:15 v1-018500-019000-bodies.idx
-rw-r--r-- 1 erigon erigon 161M Mar  5 17:15 v1-018500-019000-bodies.seg
-rw-r--r-- 1 erigon erigon 2.1K Mar  5 17:08 v1-018500-019000-bodies.seg.torrent
-rw-r--r-- 1 erigon erigon 2.7M Mar  5 17:14 v1-018500-019000-headers.idx
-rw-r--r-- 1 erigon erigon 234M Mar  5 17:14 v1-018500-019000-headers.seg
-rw-r--r-- 1 erigon erigon 2.8K Mar  5 17:08 v1-018500-019000-headers.seg.torrent

Here we see files 018500-18600, 18500-19000 and 18500-18501. I'm not sure if this is expected or correct behavior.

Steps to reproduce the behaviour

Mar 05 17:09:52 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:09:52.639] [1/12 Snapshots] Fetching torrent files metadata
Mar 05 17:09:52 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:09:52.645] [snapshots] progress                     name=v1-018000-018500-transactions.seg progress=34.16%
Mar 05 17:10:02 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:10:02.455] [snapshots] progress                     name=v1-018000-018500-transactions.seg progress=59.24%
Mar 05 17:10:12 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:10:12.646] [1/12 Snapshots] downloading (or verifying) progress="95.75% 424.8GB/443.6GB" time-left=999hrs:99m total-time=20s download=0B/s upload=0B/s peers=3 files=>
Mar 05 17:10:22 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:10:22.454] [snapshots] progress                     name=v1-018000-018500-transactions.seg progress=95.36%
Mar 05 17:10:32 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:10:32.645] [1/12 Snapshots] downloading (or verifying) progress="99.52% 441.5GB/443.6GB" time-left=999hrs:99m total-time=40s download=0B/s upload=0B/s peers=4 files=>
Mar 05 17:10:52 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:10:52.645] [1/12 Snapshots] download finished       time=1m0.000217971s
Mar 05 17:11:03 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:11:03.964] [snapshots] Blocks Stat                  blocks=19198k indices=19198k alloc=3.8GB sys=4.1GB
Mar 05 17:11:03 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:11:03.974] [1/12 Snapshots] DONE                    in=1m11.492641314s
Mar 05 17:11:03 ip-10-0-66-234 erigon[184185]: [INFO] [03-05|17:11:03.980] [snapshots] Retire Blocks                range=19198k-19199k

It also seems to repeatedly emit PruneBlocks lines like this where none of the values seem to change:

Mar 06 16:32:57 ip-10-0-66-234 erigon[344772]: [INFO] [03-06|16:32:57.137] [snapshots] Prune Blocks                 to=18500000 limit=100
$ curl -H "Content-Type: application/json" -d '{"id": 1, "jsonrpc": "2.0", "method": "eth_getBlockByNumber", "params": ["0x11a49a0", false] }' http://10.0.66.234:8545/
{"jsonrpc":"2.0","id":1,"result":null}

$curl -H "Content-Type: application/json" -d '{"id": 1, "jsonrpc": "2.0", "method": "eth_getBlockByNumber", "params": ["0x124f417", false] }' http://10.0.66.234:8545/
{"jsonrpc":"2.0","id":1,"result":null}

Whereas blocks 18499999 (0x11a499f) and 19199000 (0x124f418) both return data as expected.

ryanschneider commented 4 months ago

BTW I scanned our logs and the log Some blocks are not in snapshots and not in db from WaitForDownloader does not appear in the logs.

Is there a way for me to determine the value of RoSnapshots.SegmentsMin/Max() and RoSnapshots.IndicesMax() on demand? I do see this when restarting the node:

Mar 06 16:32:17 ip-10-0-66-234 erigon[344772]: [INFO] [03-06|16:32:17.265] [snapshots:download] Blocks Stat         blocks=18500k indices=18500k alloc=3.5GB sys=4.7GB
Mar 06 16:32:17 ip-10-0-66-234 erigon[344772]: [INFO] [03-06|16:32:17.265] [snapshots:missed-idx] Blocks Stat       blocks=18500k indices=18500k alloc=3.5GB sys=4.7GB
Mar 06 16:32:17 ip-10-0-66-234 erigon[344772]: [INFO] [03-06|16:32:17.272] [snapshots] Prune Blocks                 to=18500000 limit=100
AskAlexSharov commented 4 months ago

ok you see gap in files:

-rw-r--r-- 1 erigon erigon  95K Mar  5 20:40 v1-018500-018600-transactions.seg.torrent
-rw-r--r-- 1 erigon erigon 2.7M Mar  5 17:15 v1-018500-019000-bodies.idx

try:

ryanschneider commented 4 months ago

@AskAlexSharov I tried this on another node that was exhibiting similar issues (the one I posted yesterday is currently in use for another task), however it didn't seem to help. This particular node is missing blocks 18400000 to 19265999, but I didn't see mention of any of the blocks during the restart:

Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.306] [downloader] local file hash does not match known file=v1-015500-016000-transactions.seg local=cc17ea1005feaf50a9acbfc9703a4c958d674b0e known=d80f4fffaa0e6dfc746be5d78c43ca7df12a5601
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-015000-015500-transactions.seg local=a5c9edb8b0f3531263d671e2afba5716ccfcb097 known=6e1058c74fa7d83a480fa6e864bc2d852d1f71a9
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-017000-017500-bodies.seg local=071372e228cf8e1d656242115424de5d030ab4e1 known=eb0b538a5d7c1eacfc0bc8cf1214c2e381ce3087
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-015000-015500-headers.seg local=f25e21b64a62a4c77fdcacc41e73b7103dfee4af known=276a668d2b8d6491b7a01c5cc06d32c6106ffe33
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-015500-016000-headers.seg local=0f438dc59f666369bb39e571812084db72cff577 known=1abeaf2936b61b1c2661ef2e7f42644db5aaa564

Mar 07 16:05:31 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:31.674] [1/12 Snapshots] Requesting downloads
Mar 07 16:05:36 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:36.573] [snapshots:download] Blocks Stat         blocks=18400k indices=18400k alloc=3.4GB sys=4.6GB
Mar 07 16:05:36 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:36.574] [snapshots:missed-idx] Blocks Stat       blocks=18400k indices=18400k alloc=3.4GB sys=4.6GB
Mar 07 16:05:36 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:36.575] [snapshots] Prune Blocks                 to=18400000 limit=100
Mar 07 16:06:06 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:06:06.520] [NewPayload] Handling new payload        height=19384372 hash=0x3373800af95927dc61179ba6e5f839585ff5a04734af053aca7454c2b8ba19fa

Note that these nodes are running with --snap.stop due to previous issues, would that prevent the files from being re-downloaded?

ryanschneider commented 4 months ago

Ok, I just removed --snap.stop and now I'm seeing some indexing!

Mar 07 16:17:29 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:17:29.977] [snapshots:download] Blocks Stat         blocks=18400k indices=18400k alloc=3.0GB sys=4.4GB
Mar 07 16:17:29 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:17:29.977] [snapshots:missed-idx] Blocks Stat       blocks=18400k indices=18400k alloc=3.0GB sys=4.4GB
Mar 07 16:17:50 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:17:50.014] [1/12 Snapshots] Indexing                progress="v1-018400-018500-transactions.seg=3%, v1-018500-018600-transactions.seg=2%, v1-018600-018700-transactions.seg=2%, v1-018700-018800-transactions.seg=2%, v1-018800-018900-transactions.seg=2%, v1-018900-019000-transactions.seg=3%" total-indexing-time=20s alloc=4.9GB sys=6.8GB
ryanschneider commented 4 months ago

Nevermind, removing --snap.stop caused even more blocks to now be missing:

It started out promising:

Mar 07 16:24:30 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:30.015] [1/12 Snapshots] Indexing                progress="v1-018600-018700-transactions.seg=50%, v1-018700-018800-transactions.seg=50%, v1-018900-019000-transactions.seg=50%" total-indexing-time=7m0s alloc=4.2GB sys=9.2GB
Mar 07 16:24:46 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:46.042] [snapshots:missed-idx:reopen] Blocks Stat blocks=19266k indices=19266k alloc=5.0GB sys=9.2GB
Mar 07 16:24:46 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:46.045] [1/12 Snapshots] DONE                    in=7m20.657725855s
Mar 07 16:24:46 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:46.048] [snapshots] Prune Blocks                 to=19266000 limit=100

But then hit this error:

Mar 07 16:25:49 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:25:49.434] [snapshots] Prune Blocks                 to=19266000 limit=100
Mar 07 16:25:56 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:25:56.312] [snapshots:retire] Blocks Stat           blocks=18000k indices=18000k alloc=3.4GB sys=9.2GB
Mar 07 16:26:01 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:01.187] [NewPayload] Handling new payload        height=19384477 hash=0x7f0af19a41e5608d8878a947a8221dd77ec7e9e931
b8823643f3cf6711f33a13
Mar 07 16:26:01 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:01.647] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
Mar 07 16:26:02 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:02.089] RPC Daemon notified of new headers       from=19384476 to=19384477 hash=0x7f0af19a41e5608d8878a947a8221dd7
7ec7e9e931b8823643f3cf6711f33a13 header sending=16.072µs log sending=217ns
Mar 07 16:26:02 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:02.089] head updated                             hash=0x7f0af19a41e5608d8878a947a8221dd77ec7e9e931b8823643f3cf6711
f33a13 number=19384477
Mar 07 16:26:02 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:02.089] [snapshots] Prune Blocks                 to=18000000 limit=100
...
Mar 07 16:26:38 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:38.210] [snapshots] Prune Blocks                 to=18000000 limit=100
Mar 07 16:26:39 ip-10-0-66-205 erigon[512525]: [WARN] [03-07|16:26:39.079] [snapshots] retire blocks                err="TransactionsIdx: at=19000000-19100000, pre index building, expect: 17082580, got 0"
...
Mar 07 16:26:48 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:48.653] [snapshots] Prune Blocks                 to=18000000 limit=100
Mar 07 16:26:48 ip-10-0-66-205 erigon[512525]: [WARN] [03-07|16:26:48.657] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18000
000,  hash=95b198e154acbfc64109dfd22d8224fe927fd8dfdedfae01587674482ba4baf3"

And now all blocks between 18000000 and 19266000 seem to be unavailable. Do I need to make sure my CL is completely stopped while reindexing?

ryanschneider commented 4 months ago

If I stop my CL and remove --snap.stop it does repair the missing blocks up to 18999999, but is still missing blocks 19000000 to 19265999 with this error:

Mar 07 17:12:25 ip-10-0-66-205 erigon[519508]: [INFO] [03-07|17:12:25.518] [snapshots:download] Blocks Stat         blocks=19000k indices=19000k alloc=3.7GB sys=4.8GB
Mar 07 17:12:25 ip-10-0-66-205 erigon[519508]: [INFO] [03-07|17:12:25.518] [snapshots:missed-idx] Blocks Stat       blocks=19000k indices=19000k alloc=3.7GB sys=4.8GB
Mar 07 17:12:25 ip-10-0-66-205 erigon[519508]: [EROR] [03-07|17:12:25.709] Could not start execution service        err="[1/12 Snapshots] can't build missed indices: TransactionsIdx: at=19000000-19100000, pre index building, expect: 17082580, got 0"
toolchainx commented 4 months ago

@AskAlexSharov when upgrading to v2.58.1 I came across the same block not found issue with error:

[WARN] [03-08|04:15:55.952] [rpc] served                             conn=10.80.0.96:40984 method=eth_getLogs reqid=6 t=881.934µs err="block not found 19061365"

the eth_getLogs requests was coming from prysm node

AskAlexSharov commented 4 months ago

pre index building, expect: 17082580, got 0 - i working on it now. temporary workaround --snap.stop

muyinliu commented 3 months ago

Same issue. Keep logging [WARN] [03-18|14:12:49.764] [snapshots] retire blocks err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000, hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"(see #6044) after command ./build/bin/integration stage_headers --reset --datadir $DATA_DIR --chain mainnet executed.

Fix with following steps:

  1. check directory $DATA_DIR/snapshots
  2. find the conflict block-number ranges(e.g. v1-018500-019000-* & v1-018500-018600-*), there might be multiple conflict files, use https://github.com/ledgerwatch/erigon-snapshot/blob/main/mainnet.toml to verify the snapshot files
  3. stop erigon
  4. delete v1-018500-019000-*(or just move to other directory)
  5. start erigon with option --downloader.verify
  6. wait for erigon to complete snapshot verify and sync

Note: this bug might cause by the commit between v2.55.1 and v2.58.1(NOT sure which commit)

muyinliu commented 3 months ago

@AskAlexSharov missing following torrents from d.torrentClient.Torrents() in source code file erigon-lib/downloader/downloader.go(why ?):

v1-018300-018400-bodies.seg
v1-018300-018400-headers.seg
v1-018300-018400-transactions.seg
v1-018400-018500-bodies.seg
v1-018400-018500-headers.seg
v1-018400-018500-transactions.seg

After recover these files to directory snapshots/, erigon works fine.

I download these files with a brand new erigon instance for Ethereum(with different --datadir), after the Snapshots stage. The process takes about 24 hour to finished(about 580.9GB data to download). It's there any other way to download these missing snapshots files?

Note: it looks like the first time erigon download snapshots with a different way other than erigon already synced once.

soad003 commented 3 months ago

Hi everyone, I see the same issues on block range 19210000 - 19335999 after updating from 2.55.1 to 2.58.2. Is there any update on this?

> erigon_1      | [INFO] [03-27|16:19:33.739] [snapshots] Prune Blocks                 to=19210000 limit=100
> erigon_1      | [WARN] [03-27|16:19:33.741] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in 
> db: block_num=19210000,  hash=d5275585c7083b702c82e5b8473a57adcd317e65c519e557fcd057a1479fe0b1"
curl localhost:8545 -X POST -H "Content-Type: application/json" --data '{"method":"eth_getBlockByNumber","params":["19335000",false],"id":1,"jsonrpc":"2.0"}'
{"jsonrpc":"2.0","id":1,"result":null}

Happy to provide more details if needed.

muyinliu commented 3 months ago

Upgrade to v2.59.3 NOT fix the issue(NOT fix the corrupted database). Erigon works fine when synced, but the same issue happened after about 2 hours later...

IMPORTANT NOTE: sync from start with Erigon v2.59.3 is fine, no block is missing.

sazhiv commented 3 months ago

Hi guys, i have the same issue. Is there a solution to this problem ? Erigon version 2.59.3

battlmonstr commented 2 weeks ago

This is a bug, but unfortunately there's not enough information to be able to work on a fix.

Workaround 1: For people who want to have a synced node and can afford waiting - use the latest erigon version, delete the datadir (except nodekey file) and resync from scratch.

Workaround 2: For people who want to migrate the data with some downtime: 1) delete the headers from the database, 2) delete the bad snapshot files 3) resync from the last snapshot file and database state. The comment from @muyinliu has the required commands (stage_headers --reset and snapshots/ cleanup).

For people who can't afford any downtime, please provide us more debug info:

taratorio commented 3 days ago

@ryanschneider @toolchainx @muyinliu @soad003 @sazhiv hey, can you please let us know what your current status is?

I just tried to reproduce these issues on a v2.59.3 instance that we have which was sync-ed from scratch. I do not see these issues on it and everything seems to be ok.

My guess after looking at the logs is that we've had a non-backward compatible/breaking change to the snapshot retirement logic between v2.55.1 and v2.58.1 which has now been addressed in v2.59.3 and higher. I suggest using latest v2.60.2 and above when they get released as v2.60.* versions are more stable.

We recently improved our testing infrastructure a bit so hopefully there will be considerably less breaking changes like this going forward.

If you guys have all upgraded successfully using a fresh sync from scratch then I suggest we move on. Will wait for a week to hear back from you all before closing this.

ryanschneider commented 2 days ago

Hey @taratorio thanks for the update, I've changed roles so no longer have access to the erigon nodes I originally saw this issue on, but IIRC we did something very similar to what @battlmonstr suggested to recover the missing data without doing a full resync.