maticnetwork / bor

Official repository for the Polygon Blockchain
https://polygon.technology/
GNU Lesser General Public License v3.0

Genesis mismatch after running bor snapshot prune-block #1274

Closed alecalve closed 3 months ago

alecalve commented 3 months ago

System information

Bor client version: 1.3.3

OS & Version: Linux

Environment: Polygon Mainnet

Command used:

            - server
            - --chain=mainnet
            - --datadir=/opt/data
            - --syncmode=full
            - --gcmode=full
            - --bor.heimdall=https://heimdall-api.polygon.technology
            - --bor.logs
            - --http
            - --http.addr=0.0.0.0
            - --http.api=eth,web3,net,debug,bor
            - --http.vhosts=*
            - --ipcdisable
            - --snapshot=false
            - --txlookuplimit=0
            - --nat
            - extip:$(EXTERNAL_IP)

Overview of the problem

After running the new bor snapshot prune-block command, the node won't start:

INFO [06-24|08:52:39.305] GRPC Server started                      addr=[::]:3131
INFO [06-24|08:52:39.306] Set global gas cap                       cap=50,000,000
INFO [06-24|08:52:39.306] Allocated trie memory caches             clean=255.00MiB dirty=256.00MiB
INFO [06-24|08:52:43.317] Using leveldb as the backing database
INFO [06-24|08:52:43.317] Allocated cache and file handles         database=/opt/data/bor/chaindata cache=512.00MiB handles=524,288 compactionTableSize=0 compactionTableSizeMultiplier=0.000 compactionTotalSize=0 compactionTotalSizeMultiplier=0.000
INFO [06-24|08:53:08.162] Using LevelDB as the backing database
INFO [06-24|08:53:08.162] Found legacy ancient chain path          location=/opt/data/bor/chaindata/ancient
INFO [06-24|08:53:08.268] Opened ancient database                  database=/opt/data/bor/chaindata/ancient readonly=false
Chain metadata
  databaseVersion: 8 (0x8)
  headBlockHash: 0xb5487b24a6716327bc9a63976aee60f623359a184a06336f4af8914d4a494a42
  headFastBlockHash: 0xb5487b24a6716327bc9a63976aee60f623359a184a06336f4af8914d4a494a42
  headHeaderHash: 0xb5487b24a6716327bc9a63976aee60f623359a184a06336f4af8914d4a494a42
  lastPivotNumber: <nil>
  len(snapshotSyncStatus): 0 bytes
  snapshotDisabled: false
  snapshotJournal: 0 bytes
  snapshotRecoveryNumber: <nil>
  snapshotRoot: 0x0000000000000000000000000000000000000000000000000000000000000000
  txIndexTail: 0 (0x0)
  fastTxLookupLimit: <nil>

genesis mismatch: 0xa9c28ce2141b56c474f1dc504bee9b01eb1bd7d1a507580d5519d4437a97de1b (leveldb) != 0xbde6bf03b73ea78b97bc72c3d0d98ab1f59822f87e0739656ad80fab6532cb7c (ancients)

manav2401 commented 3 months ago

Hi @alecalve, can you share the exact command you used to run the ancient pruner?

Could you also send the output of bor snapshot inspect-ancient-db --datadir <datadir> --datadir.ancient <ancient_dir> so we can debug further? Thanks!

alecalve commented 3 months ago

I ran:

bor snapshot prune-block --datadir=/opt/data --block-amount-reserved=16000000

The node was not caught up and was using a lot of disk space, but I didn't want to prune blocks we hadn't processed yet, hence the high value for reserved blocks.

Here's the output of the inspect command:

+--------------------------------+----------+
|             FIELD              |  ITEMS   |
+--------------------------------+----------+
| Start block number of          | 22853632 |
| ancientDB (offset)             |          |
| End block number of ancientDB  | 38853631 |
| Remaining items in ancientDB   | 16000000 |
+--------------------------------+----------+
|    ANCIENTSTORE INFORMATION    |
+--------------------------------+----------+
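
As a quick sanity check on the table above, the three figures should satisfy end = offset + items - 1. The sketch below only does that arithmetic; the `ancientRange` type and its field names are made up for illustration and are not bor's internal types.

```go
package main

import "fmt"

// ancientRange is a hypothetical helper describing the block span reported by
// `bor snapshot inspect-ancient-db`. It does not mirror any bor type.
type ancientRange struct {
	Offset uint64 // start block number of ancientDB (offset)
	Items  uint64 // remaining items in ancientDB
}

// End returns the last block number covered by the range (inclusive).
// ok is false when the range holds no items.
func (r ancientRange) End() (end uint64, ok bool) {
	if r.Items == 0 {
		return 0, false
	}
	return r.Offset + r.Items - 1, true
}

func main() {
	// Values copied from the inspect output in this comment.
	r := ancientRange{Offset: 22853632, Items: 16000000}
	end, ok := r.End()
	fmt.Println(end, ok) // 38853631 true
}
```

The computed end block matches the reported 38853631, so the inspect output is at least internally consistent.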

manav2401 commented 3 months ago

I see, thanks for the info. I'm afraid a wrong offset value is being set or used, and that is what's causing this. Can you run the following script and send back the results? It would really help with debugging. Thanks!

https://gist.github.com/manav2401/157a102434eaa5b28983a9a477caa78d (You might want to create a new Go project and run this file as main.go)

manav2401 commented 3 months ago

It would also help if you could share the logs from when the ancient pruner was running (full logs are better for spotting any errors). Thanks!

alecalve commented 3 months ago

Unfortunately I don't have the logs, but I remember seeing no errors.

alecalve commented 3 months ago

Here's the output of your script:

offsetOfCurrentAncientFreezer: 22853632
offsetOfLastAncientFreezer: 0

alecalve commented 3 months ago

Ah, I do have the logs. One thing that happened is that the Docker container that ran the prune command was restarted in a loop once it had finished. Could that explain it? On the subsequent retries it logged:

Backup old ancientDB error               err="the number of old blocks is the same to reserved blocks, ancientItems=16000000"

The first run ended with:

Backup old ancientDB done                "current start blockNumber in ancientDB"=22,853,632

manav2401 commented 3 months ago

Thanks. I think this is fine, since it didn't prune a second time. I have a guess as to which code path is causing the issue, but I need to validate it first. What's your setup like? Do you run the published packages, or can you run a new bor branch?

alecalve commented 3 months ago

We use Docker and run the official image but we can build an image from any source.

manav2401 commented 3 months ago

Alright, can you please deploy this branch (which is cut from 1.3.3) on your setup, try restarting bor, and send the logs across? It doesn't fix anything; it only adds logs, which will be very helpful for debugging. I know this is not the ideal way to debug, but I don't think this issue is directly reproducible.

https://github.com/maticnetwork/bor/tree/manav/ancient-pruner-debug

Thanks!

alecalve commented 3 months ago

Oh I think the issue was that I deployed 1.3.2 over the pruned data dir.

Your branch is working fine and 1.3.3 too.

Sorry for the trouble!
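
For anyone landing here with the same error, one plausible reading (my assumption, not something confirmed in this thread) is that a binary unaware of the pruning offset looks up block 0 in the ancient store and gets back the first retained block instead. The toy model below illustrates that idea only; `toyFreezer` and the placeholder hashes are invented and say nothing about bor's real freezer layout.

```go
package main

import "fmt"

// toyFreezer is a made-up model of a pruned ancient store:
// items[i] holds the hash of block offset+i.
type toyFreezer struct {
	offset uint64
	items  []string
}

// hashOffsetAware maps the block number to an item index before reading.
func (f toyFreezer) hashOffsetAware(number uint64) (string, bool) {
	if number < f.offset || number-f.offset >= uint64(len(f.items)) {
		return "", false // not in ancients; a caller would look elsewhere
	}
	return f.items[number-f.offset], true
}

// hashOffsetUnaware treats the block number as the item index directly,
// which is only correct when offset is 0.
func (f toyFreezer) hashOffsetUnaware(number uint64) (string, bool) {
	if number >= uint64(len(f.items)) {
		return "", false
	}
	return f.items[number], true
}

func main() {
	f := toyFreezer{
		offset: 22853632,
		items:  []string{"hash-of-22853632", "hash-of-22853633"},
	}
	h, ok := f.hashOffsetAware(0)
	fmt.Printf("%q %v\n", h, ok) // "" false
	h, ok = f.hashOffsetUnaware(0)
	fmt.Printf("%q %v\n", h, ok) // "hash-of-22853632" true
}
```

In this model the offset-unaware lookup hands back a non-genesis hash for block 0, which would then disagree with the genesis hash stored in leveldb.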

manav2401 commented 3 months ago

Phew. Closing this issue for now. Feel free to re-open if needed. Thanks!

alecalve commented 3 months ago

I may have found a follow-up issue: https://github.com/maticnetwork/bor/issues/1275