streamingfast / go-ethereum

Official Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0
9 stars 5 forks source link

Polygon misses "statesync" transaction if heimdall is unreachable #4

Closed sduchesneau closed 1 year ago

sduchesneau commented 1 year ago

When Heimdall is unreachable for a while, the "SprintEnd" blocks that should contain system transactions are produced, with the correct block hash, but they do not contain the StateSync transactions.

Example log sequence for block 40335152 that should contain system transactions:

########## BAD BLOCK #########
Chain config: {ChainID: 137 ..... bor}

Number: 40335152
Hash: 0xf13b363616ae2ea40b2c9173a0e7393c2b67dd5d000d063542bd547d4e06a378
...
Error: invalid merkle root (remote: 5132c05119a8afa2e7c47c1d96a8c613777c30979a8624a50be5fe0c5ddf8f57 local: 3679c655048318d563c817eee56664e4b2af42d0fc7dcef395e570fb9363299c)
#############################

INFO [03-14|12:14:34.954] Fetching state updates from Heimdall     fromID=2,612,092 to=2023-03-14T12:10:44Z
INFO [03-14|12:14:34.954] Fetching state sync events               queryParams="from-id=2612092&to-time=1678795844&limit=50"
INFO [03-14|12:14:34.956] StateSyncData                            gas=0         number=40,335,152 lastStateID=2,612,091 total records=0  fetch time=2       process time=0

^ these messages get repeated during the period where heimdall is unavailable.

The next "BAD BLOCK" message is preceded by INFO [03-14|12:14:34.991] Skip duplicated bad block number=40,335,152 hash=f13b36..06a378

Then eventually we get:

INFO [03-14|12:18:09.505] Fetching state updates from Heimdall     fromID=2,612,092 to=2023-03-14T12:10:44Z
INFO [03-14|12:18:09.505] Fetching state sync events               queryParams="from-id=2612092&to-time=1678795844&limit=50"
INFO [03-14|12:18:09.515] → committing new state                   eventRecord="id 2612092, contract 0xD9c7C4ED4B66858301D0cb28Cc88bf655Fe34861, data: 0x000000000000000000000000519e9d6846e0bdbb715d4bd8d2261863740bce5d0000000000000000000000007d1afa7b718fb893db30a3abc0cfc608aacfebb000000000000000000000000000000000000000000000000013e62837b17b62000000000000000000000000000000000000000000000000000000000019e1fa02, txHash: 0x913c29409976126a62df552986fc1a0c2baf039e78b091628865daee1b3df7df, logIndex: 156, chainId: 137, time 2023-03-14T12:10:39Z, gasUsed 74888"
INFO [03-14|12:18:09.517] → committing new state                   eventRecord="id 2612093, contract 0xA6FA4fB5f76172d178d61B04b0ecd319C5d1C0aa, data: 0x87a7811f4bfedea3d341ad165680ae306b01aaeacc205d227629cf157dd9f821000000000000000000000000000000000000000000000000000000000000004000000000000000000000000000000000000000000000000000000000000000a0000000000000000000000000072f16ecddd7ebe1c166c3141b4e75484e65a401000000000000000000000000eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee0000000000000000000000000000000000000000000000000000000000000060000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000e35fa931a00000, txHash: 0x157e21ca2dded88b93fe1da9efd2a1e1ede4204aa4c697b967943fbe9e7b49a6, logIndex: 418, chainId: 137, time 2023-03-14T12:10:39Z, gasUsed 61470"
INFO [03-14|12:18:09.520] → committing new state                   eventRecord="id 2612094, contract 0xA6FA4fB5f76172d178d61B04b0ecd319C5d1C0aa, data: 0x87a7811f4bfedea3d341ad165680ae306b01aaeacc205d227629cf157dd9f821000000000000000000000000000000000000000000000000000000000000004000000000000000000000000000000000000000000000000000000000000000a0000000000000000000000000f31f2d41b7639ef06c7638a8046f95edb92580910000000000000000000000007fc66500c84a76ad7e9c93437bfc5ac33e2ddae900000000000000000000000000000000000000000000000000000000000000600000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000033cb036c5391c00, txHash: 0x5d0dcdec6968213bf7add2d47369e3d4360ff7c1aea33d105dc19554a11886f1, logIndex: 97, chainId: 137, time 2023-03-14T12:10:39Z, gasUsed 57715"
INFO [03-14|12:18:09.523] → committing new state                   eventRecord="id 2612095, contract 0xA6FA4fB5f76172d178d61B04b0ecd319C5d1C0aa, data: 0x87a7811f4bfedea3d341ad165680ae306b01aaeacc205d227629cf157dd9f821000000000000000000000000000000000000000000000000000000000000004000000000000000000000000000000000000000000000000000000000000000a0000000000000000000000000dd81387444364a9cc93156da7b686e4171b187e8000000000000000000000000a0b86991c6218b36c1d19d4a2e9eb0ce3606eb48000000000000000000000000000000000000000000000000000000000000006000000000000000000000000000000000000000000000000000000000000000200000000000000000000000000000000000000000000000000000000199109f3e, txHash: 0x2f0939444385456c1f34da2f936973612234f5bf60ec54b0c4f26774fdab1b6a, logIndex: 157, chainId: 137, time 2023-03-14T12:10:39Z, gasUsed 57439"
INFO [03-14|12:18:09.523] StateSyncData                            gas=251,512   number=40,335,152 lastStateID=2,612,095 total records=4  fetch time=8       process time=9
INFO [03-14|12:18:12.307] Deep froze chain segment                 blocks=3   elapsed=4.606ms     number=40,245,154 hash=542859..04a222
sduchesneau commented 1 year ago

The committing new state' message comes fromfunc (*GenesisContractsClient) CommitState`

I believe the block (40335152) was produced into firehose, then this happened:

https://github.com/streamingfast/go-ethereum/blob/release/polygon-0.x-fh2/core/blockchain.go#L1834 -> fmt.Errorf("invalid merkle root (remote: %x local: %x)"...)

if the block was already "closed" at that moment, then we won't create the same block again in firehose.. right? investigating...

sduchesneau commented 1 year ago

system transactions come from:

(core/state_processor) Process() (between finalizeblock and endblock) ->
  (bor) Finalize() ->
    (bor) CommitStates() ->
      (bor/GenesisContractsClient) CommitState() -> 
         (bor/statefull) ApplyMessage() -> 
              (firehose/context) StartTransactionRaw()

however, the validation of the block is done after that! so the block would be ended completely, THEN it would be rejected. THEN it would be "produced again" ... but since it has the same hash, the incomplete version of the block is kept.

sduchesneau commented 1 year ago

After catching another similar event with the full FIRE logs activated, I see that the block 40343056 is emitted mulitple times WITHOUT the system transactions while the heimdall is not up-to-date, failing "after" the END_BLOCK statement.

When the heimdall is back in-sync, the block is emitted again, this time with the system transactions (between "FIRE FINALIZE_BLOCK 40343056" and "FIRE END_BLOCK 40343056"). the console reader, will only take the first block into account (mostly because it does not ovewritee the one-block-files on disk), so we end up with the invalid block.

Solution will be to call END_BLOCK only after the block is validated, and trigger a "failure" if it is invalid.

sduchesneau commented 1 year ago

Fixed