Open ryanschneider opened 4 months ago
BTW I scanned our logs and the log Some blocks are not in snapshots and not in db
from WaitForDownloader
does not appear in the logs.
Is there a way for me to determine the value of RoSnapshots.SegmentsMin/Max()
and RoSnapshots.IndicesMax()
on demand? I do see this when restarting the node:
Mar 06 16:32:17 ip-10-0-66-234 erigon[344772]: [INFO] [03-06|16:32:17.265] [snapshots:download] Blocks Stat blocks=18500k indices=18500k alloc=3.5GB sys=4.7GB
Mar 06 16:32:17 ip-10-0-66-234 erigon[344772]: [INFO] [03-06|16:32:17.265] [snapshots:missed-idx] Blocks Stat blocks=18500k indices=18500k alloc=3.5GB sys=4.7GB
Mar 06 16:32:17 ip-10-0-66-234 erigon[344772]: [INFO] [03-06|16:32:17.272] [snapshots] Prune Blocks to=18500000 limit=100
ok you see gap in files:
-rw-r--r-- 1 erigon erigon 95K Mar 5 20:40 v1-018500-018600-transactions.seg.torrent
-rw-r--r-- 1 erigon erigon 2.7M Mar 5 17:15 v1-018500-019000-bodies.idx
try:
rm /mnt/erigon/snapshots/*.lock
, rm /mnt/erigon/snapshots/snapshot-lock.json
@AskAlexSharov I tried this on another node that was exhibiting similar issues (the one I posted yesterday is currently in use for another task), however it didn't seem to help. This particular node is missing blocks 18400000
to 19265999
, but I didn't see mention of any of the blocks during the restart:
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.306] [downloader] local file hash does not match known file=v1-015500-016000-transactions.seg local=cc17ea1005feaf50a9acbfc9703a4c958d674b0e known=d80f4fffaa0e6dfc746be5d78c43ca7df12a5601
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-015000-015500-transactions.seg local=a5c9edb8b0f3531263d671e2afba5716ccfcb097 known=6e1058c74fa7d83a480fa6e864bc2d852d1f71a9
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-017000-017500-bodies.seg local=071372e228cf8e1d656242115424de5d030ab4e1 known=eb0b538a5d7c1eacfc0bc8cf1214c2e381ce3087
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-015000-015500-headers.seg local=f25e21b64a62a4c77fdcacc41e73b7103dfee4af known=276a668d2b8d6491b7a01c5cc06d32c6106ffe33
Mar 07 16:05:07 ip-10-0-66-205 erigon[510781]: [WARN] [03-07|16:05:07.307] [downloader] local file hash does not match known file=v1-015500-016000-headers.seg local=0f438dc59f666369bb39e571812084db72cff577 known=1abeaf2936b61b1c2661ef2e7f42644db5aaa564
Mar 07 16:05:31 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:31.674] [1/12 Snapshots] Requesting downloads
Mar 07 16:05:36 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:36.573] [snapshots:download] Blocks Stat blocks=18400k indices=18400k alloc=3.4GB sys=4.6GB
Mar 07 16:05:36 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:36.574] [snapshots:missed-idx] Blocks Stat blocks=18400k indices=18400k alloc=3.4GB sys=4.6GB
Mar 07 16:05:36 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:05:36.575] [snapshots] Prune Blocks to=18400000 limit=100
Mar 07 16:06:06 ip-10-0-66-205 erigon[510781]: [INFO] [03-07|16:06:06.520] [NewPayload] Handling new payload height=19384372 hash=0x3373800af95927dc61179ba6e5f839585ff5a04734af053aca7454c2b8ba19fa
Note that these nodes are running with --snap.stop
due to previous issues, would that prevent the files from being re-downloaded?
Ok, I just removed --snap.stop
and now I'm seeing some indexing!
Mar 07 16:17:29 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:17:29.977] [snapshots:download] Blocks Stat blocks=18400k indices=18400k alloc=3.0GB sys=4.4GB
Mar 07 16:17:29 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:17:29.977] [snapshots:missed-idx] Blocks Stat blocks=18400k indices=18400k alloc=3.0GB sys=4.4GB
Mar 07 16:17:50 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:17:50.014] [1/12 Snapshots] Indexing progress="v1-018400-018500-transactions.seg=3%, v1-018500-018600-transactions.seg=2%, v1-018600-018700-transactions.seg=2%, v1-018700-018800-transactions.seg=2%, v1-018800-018900-transactions.seg=2%, v1-018900-019000-transactions.seg=3%" total-indexing-time=20s alloc=4.9GB sys=6.8GB
Nevermind, removing --snap.stop
caused even more blocks to now be missing:
It started out promising:
Mar 07 16:24:30 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:30.015] [1/12 Snapshots] Indexing progress="v1-018600-018700-transactions.seg=50%, v1-018700-018800-transactions.seg=50%, v1-018900-019000-transactions.seg=50%" total-indexing-time=7m0s alloc=4.2GB sys=9.2GB
Mar 07 16:24:46 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:46.042] [snapshots:missed-idx:reopen] Blocks Stat blocks=19266k indices=19266k alloc=5.0GB sys=9.2GB
Mar 07 16:24:46 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:46.045] [1/12 Snapshots] DONE in=7m20.657725855s
Mar 07 16:24:46 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:24:46.048] [snapshots] Prune Blocks to=19266000 limit=100
But then hit this error:
Mar 07 16:25:49 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:25:49.434] [snapshots] Prune Blocks to=19266000 limit=100
Mar 07 16:25:56 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:25:56.312] [snapshots:retire] Blocks Stat blocks=18000k indices=18000k alloc=3.4GB sys=9.2GB
Mar 07 16:26:01 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:01.187] [NewPayload] Handling new payload height=19384477 hash=0x7f0af19a41e5608d8878a947a8221dd77ec7e9e931
b8823643f3cf6711f33a13
Mar 07 16:26:01 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:01.647] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
Mar 07 16:26:02 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:02.089] RPC Daemon notified of new headers from=19384476 to=19384477 hash=0x7f0af19a41e5608d8878a947a8221dd7
7ec7e9e931b8823643f3cf6711f33a13 header sending=16.072µs log sending=217ns
Mar 07 16:26:02 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:02.089] head updated hash=0x7f0af19a41e5608d8878a947a8221dd77ec7e9e931b8823643f3cf6711
f33a13 number=19384477
Mar 07 16:26:02 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:02.089] [snapshots] Prune Blocks to=18000000 limit=100
...
Mar 07 16:26:38 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:38.210] [snapshots] Prune Blocks to=18000000 limit=100
Mar 07 16:26:39 ip-10-0-66-205 erigon[512525]: [WARN] [03-07|16:26:39.079] [snapshots] retire blocks err="TransactionsIdx: at=19000000-19100000, pre index building, expect: 17082580, got 0"
...
Mar 07 16:26:48 ip-10-0-66-205 erigon[512525]: [INFO] [03-07|16:26:48.653] [snapshots] Prune Blocks to=18000000 limit=100
Mar 07 16:26:48 ip-10-0-66-205 erigon[512525]: [WARN] [03-07|16:26:48.657] [snapshots] retire blocks err="DumpBlocks: DumpBodies: header missed in db: block_num=18000
000, hash=95b198e154acbfc64109dfd22d8224fe927fd8dfdedfae01587674482ba4baf3"
And now all blocks between 18000000
and 19266000
seem to be unavailable. Do I need to make sure my CL is completely stopped while reindexing?
If I stop my CL and remove --snap.stop
it does repair the missing blocks up to 18999999
, but is still missing blocks 19000000
to 19265999
with this error:
Mar 07 17:12:25 ip-10-0-66-205 erigon[519508]: [INFO] [03-07|17:12:25.518] [snapshots:download] Blocks Stat blocks=19000k indices=19000k alloc=3.7GB sys=4.8GB
Mar 07 17:12:25 ip-10-0-66-205 erigon[519508]: [INFO] [03-07|17:12:25.518] [snapshots:missed-idx] Blocks Stat blocks=19000k indices=19000k alloc=3.7GB sys=4.8GB
Mar 07 17:12:25 ip-10-0-66-205 erigon[519508]: [EROR] [03-07|17:12:25.709] Could not start execution service err="[1/12 Snapshots] can't build missed indices: TransactionsIdx: at=19000000-19100000, pre index building, expect: 17082580, got 0"
@AskAlexSharov when upgrading to v2.58.1 I came across the same block not found
issue with error:
[WARN] [03-08|04:15:55.952] [rpc] served conn=10.80.0.96:40984 method=eth_getLogs reqid=6 t=881.934µs err="block not found 19061365"
the eth_getLogs requests was coming from prysm node
pre index building, expect: 17082580, got 0
- i working on it now. temporary workaround --snap.stop
Same issue. Keep logging [WARN] [03-18|14:12:49.764] [snapshots] retire blocks err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000, hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
(see #6044) after command ./build/bin/integration stage_headers --reset --datadir $DATA_DIR --chain mainnet
executed.
Fix with following steps:
$DATA_DIR/snapshots
v1-018500-019000-*
& v1-018500-018600-*
), there might be multiple conflict files, use https://github.com/ledgerwatch/erigon-snapshot/blob/main/mainnet.toml to verify the snapshot filesv1-018500-019000-*
(or just move to other directory)--downloader.verify
Note: this bug might cause by the commit between v2.55.1 and v2.58.1(NOT sure which commit)
@AskAlexSharov missing following torrents from d.torrentClient.Torrents()
in source code file erigon-lib/downloader/downloader.go
(why ?):
v1-018300-018400-bodies.seg
v1-018300-018400-headers.seg
v1-018300-018400-transactions.seg
v1-018400-018500-bodies.seg
v1-018400-018500-headers.seg
v1-018400-018500-transactions.seg
After recover these files to directory snapshots/
, erigon works fine.
I download these files with a brand new erigon instance for Ethereum(with different --datadir
), after the Snapshots
stage. The process takes about 24 hour to finished(about 580.9GB data to download). It's there any other way to download these missing snapshots files?
Note: it looks like the first time erigon download snapshots with a different way other than erigon already synced once.
Hi everyone, I see the same issues on block range 19210000 - 19335999 after updating from 2.55.1 to 2.58.2. Is there any update on this?
> erigon_1 | [INFO] [03-27|16:19:33.739] [snapshots] Prune Blocks to=19210000 limit=100
> erigon_1 | [WARN] [03-27|16:19:33.741] [snapshots] retire blocks err="DumpBlocks: DumpBodies: header missed in
> db: block_num=19210000, hash=d5275585c7083b702c82e5b8473a57adcd317e65c519e557fcd057a1479fe0b1"
curl localhost:8545 -X POST -H "Content-Type: application/json" --data '{"method":"eth_getBlockByNumber","params":["19335000",false],"id":1,"jsonrpc":"2.0"}'
{"jsonrpc":"2.0","id":1,"result":null}
Happy to provide more details if needed.
Upgrade to v2.59.3 NOT fix the issue(NOT fix the corrupted database). Erigon works fine when synced, but the same issue happened after about 2 hours later...
IMPORTANT NOTE: sync from start with Erigon v2.59.3 is fine, no block is missing.
Hi guys, i have the same issue. Is there a solution to this problem ? Erigon version 2.59.3
This is a bug, but unfortunately there's not enough information to be able to work on a fix.
Workaround 1: For people who want to have a synced node and can afford waiting - use the latest erigon version, delete the datadir (except nodekey
file) and resync from scratch.
Workaround 2: For people who want to migrate the data with some downtime: 1) delete the headers from the database, 2) delete the bad snapshot files 3) resync from the last snapshot file and database state. The comment from @muyinliu has the required commands (stage_headers --reset
and snapshots/
cleanup).
For people who can't afford any downtime, please provide us more debug info:
integration print_stages --log.console.verbosity=5
and send us logs--rpc.dbg.single=true
and dbg: true
(see here) and send us logsls -al
of the snapshots folder, erigon version before migration and after migration.@ryanschneider @toolchainx @muyinliu @soad003 @sazhiv hey, can you please let us know what your current status is?
I just tried to reproduce these issues on a v2.59.3 instance that we have which was sync-ed from scratch. I do not see these issues on it and everything seems to be ok.
My guess after looking at the logs is that we've had a non-backward compatible/breaking change to the snapshot retirement logic between v2.55.1 and v2.58.1 which has now been addressed in v2.59.3 and higher. I suggest using latest v2.60.2 and above when they get released as v2.60.* versions are more stable.
We recently improved our testing infrastructure a bit so hopefully there will be considerably less breaking changes like this going forward.
If you guys have all upgraded successfully using a fresh sync from scratch then I suggest we move on. Will wait for a week to hear back from you all before closing this.
Hey @taratorio thanks for the update, I've changed roles so no longer have access to the erigon nodes I originally saw this issue on, but IIRC we did something very similar to what @battlmonstr suggested to recover the missing data without doing a full resync.
System information
Erigon version:
erigon version 2.58.1
OS & Version: LinuxErigon Command (with flags/config):
Consensus Layer: Teku
Chain/Network: Mainnet
Expected behaviour
Upgrading from
2.52.5
to2.58.1
should work without issue.Actual behaviour
When upgrading from
2.52.5
to2.58.1
our nodes routinely end up missing access to all blocks between18500000
and19198999
.Note however that the block ranges do show up in the
snapshots
directory. Actually, multiple overlapping files show up, for example I see files both in the rangev1-018500-018600*
as well as additional smaller files w/ a smaller range (v1-018500-018501*
,501-502*
etc). In fact I see three overlapping files starting at018500-
:Here we see files
018500-18600
,18500-19000
and18500-18501
. I'm not sure if this is expected or correct behavior.Steps to reproduce the behaviour
2.52.5
, ensure all blocks are available.2.58.1
, also enable--snap.stop
(tried both doing this and not doing this)Snapshots
stage files:It also seems to repeatedly emit
PruneBlocks
lines like this where none of the values seem to change:18500000
(0x11a49a0
) and19198999
(0x124f417
) inclusive returnnull
(e.g. in response toeth_getBlockByNumber
):Whereas blocks
18499999
(0x11a499f
) and19199000
(0x124f418
) both return data as expected.