status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide
Other
543 stars 233 forks source link

Corrupted deposit history detected / Resyncing loop #5145

Open kiss81 opened 1 year ago

kiss81 commented 1 year ago

@arnetheduck thank you for looking at the issue, but it closed a bit too soon I think :) Follow of issue #5136 as it got closed.

I do:

sudo -u nimbus rm -rf /var/lib/nimbus/db
sudo -u nimbus /usr/bin/nimbus_beacon_node trustedNodeSync --network=mainnet  --trusted-node-url=https://checkpointz.pietjepuk.net --data-dir=/var/lib/nimbus --network=mainnet --backfill=false

then start the node with systemd

After it does the backfill for a few hours all is good for a few minutes and then this happens:

ait=2m39s939ms95us565ns nextAttestationSlot=6767405 nextProposalSlot=-1 syncCommitteeDuties=none head=4242829b:6767391
Jun 29 11:58:47 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:58:47.000+02:00 Slot start                                 topics="beacnde" slot=6767392 epoch=211481 sync=synced peers=17 head=4242829b:6767391 finalized=211478:6dcec773 delay=875us179ns
Jun 29 11:58:48 ethVal2 nimbus_beacon_node[182]: WRN 2023-06-29 11:58:48.741+02:00 Discovered new external address but ENR auto update is off topics="eth p2p discv5" majority=some(87.209.245.227:9000) previous=none(Address)
Jun 29 11:58:51 ethVal2 nimbus_beacon_node[182]: ERR 2023-06-29 11:58:51.114+02:00 Corrupted deposits history detected        topics="elmon" ourDepositsCount=831007 targetDepositsCount=831007 ourDepositsRoot=62a4356e targetDepositsRoot=c5384abd
Jun 29 11:58:55 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:58:55.912+02:00 State replayed                             topics="chaindag" blocks=0 slots=32 current=4242829b:6767391@6767392 ancestor=180843f4:6767392 target=180843f4:6767392@6767424 ancestorStateRoot=97bfca88 targetStateRoot=23214641 found=true assignDur=9ms906us965ns replayDur=728ms555us875ns
Jun 29 11:58:55 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:58:55.952+02:00 Slot end                                   topics="beacnde" slot=6767392 nextActionWait=2m27s47ms658us680ns nextAttestationSlot=6767405 nextProposalSlot=-1 syncCommitteeDuties=none head=180843f4:6767392
Jun 29 11:58:59 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:58:59.000+02:00 Slot start                                 topics="beacnde" slot=6767393 epoch=211481 sync=synced peers=17 head=180843f4:6767392 finalized=211479:2ad226d8 delay=703us691ns
Jun 29 11:59:07 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:59:07.062+02:00 Slot end                                   topics="beacnde" slot=6767393 nextActionWait=2m15s937ms431us342ns nextAttestationSlot=6767405 nextProposalSlot=-1 syncCommitteeDuties=none head=5f0842fe:6767393
Jun 29 11:59:11 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:59:11.000+02:00 Slot start                                 topics="beacnde" slot=6767394 epoch=211481 sync=synced peers=17 head=5f0842fe:6767393 finalized=211479:2ad226d8 delay=774us194ns
Jun 29 11:59:19 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:59:19.035+02:00 Slot end                                   topics="beacnde" slot=6767394 nextActionWait=2m3s964ms743us682ns nextAttestationSlot=6767405 nextProposalSlot=-1 syncCommitteeDuties=none head=e3c4099c:6767394
Jun 29 11:59:23 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:59:23.000+02:00 Slot start                                 topics="beacnde" slot=6767395 epoch=211481 sync=synced peers=16 head=e3c4099c:6767394 finalized=211479:2ad226d8 delay=664us693ns
Jun 29 11:59:28 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:59:28.877+02:00 Eth1 sync progress                         topics="elmon" blockNumber=11185311 depositsProcessed=1
Jun 29 11:59:29 ethVal2 nimbus_beacon_node[182]: INF 2023-06-29 11:59:29.697+02:00 Eth1 sync progress   

After half an hour it's in sync again and after a few minutes the process starts all over again in a loop.

When I look in the besu log I see this:

Jun 29 11:58:25 ethVal2 besu[181]: 2023-06-29 11:58:25.998+02:00 | vert.x-worker-thread-0 | INFO  | AbstractEngineForkchoiceUpdated | VALID for fork-choice-update: head: 0x81ec6b00dd566d41e414d9ad87d3110a766f11a82c0cd4689749416b1e31d048, finalized: 0x2bd7fb93d354bfa0242b58adf88ffe6ab24e57676a69f31820dc11b4e1354cca, safeBlockHash: 0x273737b152fc88f8ab9fe661f437b61ba8a3e8fd1d1b678d01c33362c4d8e701
Jun 29 11:58:38 ethVal2 besu[181]: 2023-06-29 11:58:38.058+02:00 | vert.x-worker-thread-0 | INFO  | AbstractEngineNewPayload | Imported #17,584,214 / 132 tx / 16 ws / base fee 14.94 gwei / 13,434,162 (44.8%) gas / (0x75eef592929135d56fd15d3f02d64575102f1532ddc6010144c61410a610833c) in 0.740s. Peers: 25
>>> when it starts resyncing I see this message:

AbstractEngineForkchoiceUpdated | VALID for fork-choice-update: head: 0x276e8191a6a248f7215ee7b82782a960c4ed9928f6dc2b1ea10df144a09445de, finalized: 0x273737b152fc88f8ab9fe661f437b61ba8a3e8fd1d1b678d01c33362c4d8e701, safeBlockHash: 0xe2ed3902def82f09983ee6f67f123475e6d64353e2f3c5e63a6bba0fba6b0c30

Now I found the "Corrupted deposits history" it seems I am not alone: #4669 @ibhagwan (found a solution yet?)

edit: now testing the clear .caches folder from besu.

kiss81 commented 1 year ago

Update: clearing the suggested ~/.besu/caches directory didn't work for me: I still see this error and the deposit sync starts all over again:

Jun 29 13:47:37 ethVal2 nimbus_beacon_node[184]: ERR 2023-06-29 13:47:37.333+02:00 Corrupted deposits history detected

kiss81 commented 1 year ago

Really like to use nimbus as my beacon node as it's so nice on resources, but right now it makes the system run hot. Would it be an idea to go back and try history:archive instead of prune mode or maybe skip the trusted sync? Any ideas are welcome as I have no clue.

the729 commented 7 months ago

Same problem here.

Besu version: 24.3 Nimbus verison: 24.3.0

log:

NTC 2024-04-14 12:13:10.358+00:00 Attestation failed to match head           topics="chaindag" epoch=276750 validator=87a3ef28
INF 2024-04-14 12:13:11.001+00:00 Slot start                                 topics="beacnde" head=627179d7:8856063 delay=1ms94us524ns finalized=276749:ff359206 peers=158 slot=8856064 sync=synced epoch=276752
INF 2024-04-14 12:13:11.048+00:00 Prepared beacon proposers                  topics="rest_validatorapi" numUpdatedFeeRecipients=0 numRefreshedFeeRecipients=1
ERR 2024-04-14 12:13:15.661+00:00 Corrupted deposits history detected        topics="elchain" ourDepositsCount=1434821 targetDepositsCount=1434829 ourDepositsRoot=1f385454 targetDepositsRoot=379589eb

I've tried deleting besu cache, or disabling besu cache with --auto-log-bloom-caching-enabled=false, no difference.

I found the query to get deposits was like the following:

{"jsonrpc":"2.0","method":"eth_getLogs","params":[{"fromBlock":"0x12a7000","toBlock":"0x12a8000","address":"0x00000000219ab540356cbb839cbe05303d7705fa","topics":["0x649bbc62d0e31342afea4e5cd82d4049e7e1ee912fc0889aa790803be39038c5"]}], "id":1}

So I tried to compare my besu node's response with infura's. So I wrote a script to compare all the response from 0x10a0000 to the latest, and found no difference at all. That made me to think it is nimbus's problem.

etan-status commented 7 months ago

If could be that the initially imported deposit snapshot was incorrect (doing a new checkpoint sync should fix it).

the729 commented 7 months ago

If could be that the initially imported deposit snapshot was incorrect (doing a new checkpoint sync should fix it).

  • Did this start happening after an update, or on a fresh sync?
  • Is this on a single-EL or on a multi-EL setup?
  • Was one of the EL restarted while Nimbus was running?

Thanks for your reply.

  1. I think this started to happen roughly around 2024.4.6 (according to CPU usage metrics). I was using nimbus 24.2.2 by then, and upgraded to 24.3.0 later. I noted the problem because of the upgrade, but according to the very similar CPU usage patterns, it is highly probable that it started before the upgrade.
  2. I have one instance of besu and nimbus and validator each. That is single-EL?
  3. I don't know what happened around 2024.4.6. The upgrade did involve restarting both nimbus and besu. After noticing the problem, I did multiple restart of both besu and nimbus.

Another thing that might be interesting:

The difference between 'ourDepositsCount' and 'targetDepositsCount' does not stay fixed. The difference is fixed for a given targetCount, but as epochs advance and targetCount changes, the difference between these two will change. I have recorded the following logs:

ourDepositsCount=1432476 targetDepositsCount=1432497 ourDepositsRoot=a6630d57 targetDepositsRoot=c3c073e3
ourDepositsCount=1433043 targetDepositsCount=1433056 ourDepositsRoot=941e37dd targetDepositsRoot=a98323d9
ourDepositsCount=1434821 targetDepositsCount=1434829 ourDepositsRoot=1f385454 targetDepositsRoot=379589eb
ourDepositsCount=1456631 targetDepositsCount=1456656 ourDepositsRoot=1bf0fec3 targetDepositsRoot=6786bb2e

I'm not very familiar with the machanism, but I think if there are some fixed deposits that are missing, the difference should stay a fixed number.

the729 commented 6 months ago

I did resynced deposit snapshot. It worked for now.

etan-status commented 6 months ago

I did resynced deposit snapshot. It worked for now.

Great to hear that this is working now! Do you know whether the previous deposit snapshot was correct? If there is a way to reproduce the issue, we could investigate it more thoroughly.

the729 commented 6 months ago

Luckily I have a copy of the previous deposit snapshot, in two format.

From API:

$ curl -X GET http://localhost:5052/eth/v1/beacon/deposit_snapshot
{"data":{"finalized":["0xed851189578e43fe30657d1e6193b45bef7ee7a6e226eaebc9675d801081dd6d","0x51d304594b583192b6579834499c62376a0fca2233b96aee21e87c6fd5c28637","0xdd0ab193e510ea7ddf508fc9c2900c2098eae2fc49a10d7c25293975fa7c3ca9","0x77f61de440cb93a2c0222f8241f9a6726c75a5e919341abe02a30e0331b0c355","0xdbf9383ebffa839826fc89f993f1ef289c808a951310ff376866c46952c2c619","0xe7bb3cc87417966cf117c124f5ce0394e834d5350a3f3a699cd13271a657b01b","0xcd51fb3413c81cc4c88adca8bf399e96e3fdd3443b0c48f74bde393e92c44edd","0xcc1bc98fe5b78d66581d4f7ea317e642947806215095c738ee6263a6064b0923","0x78545fc1d24b409a8278a0f87985a792a430be3d63f2d94d4fcc17bee0e6b12e","0x102068d3bffe793aadd25cd44bfb6b840b0743c1772538dbbf79eec54c7aa756","0xe58dd9471eb26bee450525e5a4cf0f5141417a1bc10ec16f2cc91ea45188e242","0x8e3ef7a9c1604f777b9acb267b0a5b07bae59068bab7afe5158a9d2f5b4bbd0d","0xd47c6b785cafa0975c9edf49ff18f403a216c2259137c0e7d45f32a789d678f8"],"deposit_root":"0xedc61e5bb048448f184a0003560a33db848104623ea4055e287206b1d9292290","deposit_count":"780148","execution_block_hash":"0xd162f42047b00d9712b021bd8c7c80e299360126adb246237e64fd4e1a0f0d92","execution_block_height":"17403122"}}

From DB:

sqlite> select hex(value) from key_values where hex(key)='0E';
C808F4C202D162F42047B00D9712B021BD8C7C80E299360126ADB246237E64FD4E1A0F0D92D012949E8E973D86D7768D5636BD582787720E90A3E6C870F0C339E6A1EC0A27B9F77D5736CF432A404A7F6F682E06990DB70BF02636E30DB4CEA1A6FF0088E4D47C6B785CAFA0975C9EDF49FF18F403A216C2259137C0E7D45F32A789D678F8C4519D62E9133429F908DCF552C407F1A7019A2E3A14B22539E567DA6DDA35328E3EF7A9C1604F777B9ACB267B0A5B07BAE59068BAB7AFE5158A9D2F5B4BBD0DE58DD9471EB26BEE450525E5A4CF0F5141417A1BC10EC16F2CC91EA45188E242102068D3BFFE793AADD25CD44BFB6B840B0743C1772538DBBF79EEC54C7AA7569AAB14144DD209C2EDEE2DA32154BE858FE5003CFFDDA9312A80CB4BA801161078545FC1D24B409A8278A0F87985A792A430BE3D63F2D94D4FCC17BEE0E6B12ECC1BC98FE5B78D66581D4F7EA317E642947806215095C738EE6263A6064B0923CD51FB3413C81CC4C88ADCA8BF399E96E3FDD3443B0C48F74BDE393E92C44EDD87AA29529266D9F8813B779FB3DB5783627C21FB7D72DE768E79D3CD9B60C479018432F487C9B364F560DFAE337ED6E1B53A7D7F1A77C1C0EE26EAC9B0778FB8E7BB3CC87417966CF117C124F5CE0394E834D5350A3F3A699CD13271A657B01BDBF9383EBFFA839826FC89F993F1EF289C808A951310FF376866C46952C2C61977F61DE440CB93A2C0222F8241F9A6726C75A5E919341ABE02A30E0331B0C355DD0AB193E510EA7DDF508FC9C2900C2098EAE2FC49A10D7C25293975FA7C3CA951D304594B583192B6579834499C62376A0FCA2233B96AEE21E87C6FD5C28637D47152335D9460F2B6FB7ABA05CED32A52E9F46659CCD3DAA2059661D75A6308ED851189578E43FE30657D1E6193B45BEF7EE7A6E226EAEBC9675D801081DD6D0000000000000000000000000000000000000000000000000000000000000000000000FE1600FE1600FE1600FE1600FE1600E61600280BE774F28C090100000000

sqlite> select hex(value) from key_values where hex(key)='08';
C008F4C202D162F42047B00D9712B021BD8C7C80E299360126ADB246237E64FD4E1A0F0D92D012949E8E973D86D7768D5636BD582787720E90A3E6C870F0C339E6A1EC0A27B9F77D5736CF432A404A7F6F682E06990DB70BF02636E30DB4CEA1A6FF0088E4D47C6B785CAFA0975C9EDF49FF18F403A216C2259137C0E7D45F32A789D678F8C4519D62E9133429F908DCF552C407F1A7019A2E3A14B22539E567DA6DDA35328E3EF7A9C1604F777B9ACB267B0A5B07BAE59068BAB7AFE5158A9D2F5B4BBD0DE58DD9471EB26BEE450525E5A4CF0F5141417A1BC10EC16F2CC91EA45188E242102068D3BFFE793AADD25CD44BFB6B840B0743C1772538DBBF79EEC54C7AA7569AAB14144DD209C2EDEE2DA32154BE858FE5003CFFDDA9312A80CB4BA801161078545FC1D24B409A8278A0F87985A792A430BE3D63F2D94D4FCC17BEE0E6B12ECC1BC98FE5B78D66581D4F7EA317E642947806215095C738EE6263A6064B0923CD51FB3413C81CC4C88ADCA8BF399E96E3FDD3443B0C48F74BDE393E92C44EDD87AA29529266D9F8813B779FB3DB5783627C21FB7D72DE768E79D3CD9B60C479018432F487C9B364F560DFAE337ED6E1B53A7D7F1A77C1C0EE26EAC9B0778FB8E7BB3CC87417966CF117C124F5CE0394E834D5350A3F3A699CD13271A657B01BDBF9383EBFFA839826FC89F993F1EF289C808A951310FF376866C46952C2C61977F61DE440CB93A2C0222F8241F9A6726C75A5E919341ABE02A30E0331B0C355DD0AB193E510EA7DDF508FC9C2900C2098EAE2FC49A10D7C25293975FA7C3CA951D304594B583192B6579834499C62376A0FCA2233B96AEE21E87C6FD5C28637D47152335D9460F2B6FB7ABA05CED32A52E9F46659CCD3DAA2059661D75A6308ED851189578E43FE30657D1E6193B45BEF7EE7A6E226EAEBC9675D801081DD6D0000000000000000000000000000000000000000000000000000000000000000000000FE1600FE1600FE1600FE1600FE1600E61600080BE774

I see the snapshot was quite old. EL block 17403122 is roughly around the time when I setup the node. It seems it was not updated ever.

I remember every time when I restart nimbus before, it would sweep the deposits from that snapshot point, and became running quietly after the sweep. It was since recent weeks that I noticed it keep sweeping and showing the error after each time.