ybstaked commented 2 years ago

🐞 Bug Report

Description

prysm 3.0.0 beacon data gets corrupted while running

🔬 Minimal Reproduction

running prysm beacon 3.0.0

beacon-chain --accept-terms-of-use --datadir /home/prysm/data --rpc-host 0.0.0.0 --rpc-port 4000 --grpc-gateway-host 0.0.0.0 --grpc-gateway-port 4001 --log-file /tmp/ethereum2.log --block-batch-limit=512 --http-web3provider=http://el-geth-mainnet-4-node-svc.eth.svc.cluster.local:8551/ --jwt-secret=/home/prysm/data/jwt/jwt-secret --checkpoint-sync-url=XXX --genesis-beacon-api-url=XXX --mainnet --suggested-fee-recipient=XXX

🔥 Error

Before node is broken

time="2022-08-26 13:22:59" level=info msg="Peer summary" activePeers=45 inbound=0 outbound=45 prefix=p2p                                                                                                                                   │
│ time="2022-08-26 13:23:00" level=warning msg="Recovering state summary for finalized root: 0x6a097ddb6aa3" prefix=db                                                                                                                       │
│ time="2022-08-26 13:23:00" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0x6a097ddb6aa3: missing state summary for checkpoint root" prefix=init │
│ time="2022-08-26 13:23:00" level=info msg="Processing block batch of size 495 starting from  0x23e46bac... 4554913/4558013 - estimated time remaining 2m5s" blocksPerSecond=24.8 peers=45 prefix=initial-sync

Upon a restart after the previous state

time="2022-08-26 14:11:37" level=info msg="Peer summary" activePeers=43 inbound=0 outbound=43 prefix=p2p
time="2022-08-26 14:11:38" level=info msg="Processing block batch of size 498 starting from  0x4bdc9494... 4556449/4558256 - estimated time remaining 13s" blocksPerSecond=138.8 peers=43 prefix=initial-sync
time="2022-08-26 14:11:38" level=warning msg="Batch is not processed" error="beacon node doesn't have a parent in db with root: 0xf80913bd90d52370de2df89c98cdf9bc4d40ab07365d29ff2c8ce99e45a6b215 (in processBatchedBlocks, slot=4556449)" prefix=initial-sync
time="2022-08-26 14:11:47" level=info msg="Processing block batch of size 495 starting from  0x23e46bac... 4554913/4558257 - estimated time remaining 20s" blocksPerSecond=163.6 peers=46 prefix=initial-sync

Once node is broken

time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main

🌍 Your Environment

Operating System:

  
Kubernetes

What version of Prysm are you running? (Which release)

  
v3.0.0

Anything else relevant (validator index / public key)?

Caused missed attestations for:

beejiujitsu commented 2 years ago

Confirming I am having similar issues using Image: gcr.io/prysmaticlabs/prysm/beacon-chain:v3.0.0:

 time="2022-08-29 15:31:16" level=warning msg="Running on Ethereum Mainnet" prefix=flags
│ time="2022-08-29 15:31:16" level=info msg="Checking DB" database-path="/var/lib/prysm/beacon-node/beaconchaindata" prefix=node
│ time="2022-08-29 15:31:16" level=info msg="Opening Bolt DB at /var/lib/prysm/beacon-node/beaconchaindata/beaconchain.db" prefix=db
│ time="2022-08-29 15:31:16" level=info msg="Deposit contract: 0x00000000219ab540356cbb839cbe05303d7705fa" prefix=node
│ time="2022-08-29 15:31:16" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main
│ Stream closed EOF for int/erigon-2 (prysm-beacon)

I run multiple erigon nodes, and this only happens on one of the nodes (each node has k8s Pod with a erigon container and prysm-beacon container).

I was able to workaround this problem by scaling erigon replicas down below the bad state prysm-beacon, deleting the PersistentVolumeClaim, PersistentVolume, and the Disk in GCP, then scaling erigon replicas back up.

prestonvanloon commented 2 years ago

Thanks for the report. I'll admit, this error is a bit unhelpful as there are three places that unmarshalling errors could have returned. I have added some additional context in PR #11342.

We also run in kubernetes with go-ethereum and have never experienced this issue. Would you be willing to share a corrupted database privately?

ybstaked commented 2 years ago

Unfortunately, no, we had to cleanup the pvcs and restart from checkpoint so we didn't continue to miss attestations and lose rewards.

prestonvanloon commented 2 years ago

Unfortunately, no, we had to cleanup the pvcs and restart from checkpoint so we didn't continue to miss attestations and lose rewards.

OK, no worries. #11342 should help in v3.0.1. We will keep monitoring and investigation open until we can pin point the issue.

If anyone stumbles upon this issue and hasn't deleted their beaconchain.db, please save a copy and reach out to me preston@prysmaticlabs.com or discord. It would greatly help to see what bytes are failing to unmarshal.

ybstaked commented 2 years ago

See another set of errors in another node here:

time="2022-08-30 15:09:28" level=info msg="Processing block batch of size 199 starting from  0x8042c314... 4586753/4587345 - estimated time remaining 59s" blocksPerSecond=9.9 peers=46 prefix=initial-sync
time="2022-08-30 15:09:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=46 prefix=p2p
time="2022-08-30 15:09:37" level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
time="2022-08-30 15:10:07" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:10:07" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:10:07" level=info msg="Processing block batch of size 199 starting from  0x8042c314... 4586753/4587348 - estimated time remaining 59s" blocksPerSecond=9.9 peers=45 prefix=initial-sync
time="2022-08-30 15:10:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=46 prefix=p2p
time="2022-08-30 15:10:47" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:10:47" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:10:47" level=info msg="Processing block batch of size 507 starting from  0x8042c314... 4586753/4587352 - estimated time remaining 23s" blocksPerSecond=25.4 peers=46 prefix=initial-sync
time="2022-08-30 15:11:29" level=info msg="Peer summary" activePeers=45 inbound=0 outbound=45 prefix=p2p
time="2022-08-30 15:11:54" level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
time="2022-08-30 15:12:27" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:12:27" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:12:27" level=info msg="Processing block batch of size 507 starting from  0x8042c314... 4586753/4587360 - estimated time remaining 23s" blocksPerSecond=25.4 peers=44 prefix=initial-sync
time="2022-08-30 15:12:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=44 prefix=p2p
time="2022-08-30 15:13:29" level=info msg="Peer summary" activePeers=44 inbound=0 outbound=44 prefix=p2p

Don't have direct access to this mainnet beaconchain.db, so I cannot send it to you.

ybstaked commented 2 years ago

Maybe this is an issue when restoring from checkpoint sync?

prestonvanloon commented 2 years ago

@ybstaked Does this error occur right away when starting a node with checkpoint sync?

time="2022-08-30 15:10:47" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync

Edit: Could you share what flags you are using for this node and the node serving the checkpoint?

ybstaked commented 2 years ago

No, this error does not occur right away when starting a node with checkpoint sync. I shared our node's flags before, but here it is again. We are using infura currently as the checkpoint node.

beacon-chain --accept-terms-of-use --datadir /home/prysm/data --rpc-host 0.0.0.0 --rpc-port 4000 --grpc-gateway-host 0.0.0.0 --grpc-gateway-port 4001 --log-file /tmp/ethereum2.log --block-batch-limit=512 --http-web3provider=http://el-geth-mainnet-4-node-svc.eth.svc.cluster.local:8551/ --jwt-secret=/home/prysm/data/jwt/jwt-secret --checkpoint-sync-url=XXX@eth2-beacon-mainnet.infura.io --genesis-beacon-api-url=XXX@eth2-beacon-mainnet.infura.io --mainnet --suggested-fee-recipient=XXX

prestonvanloon commented 2 years ago

Thanks for that context. We are still investigating both of these issues. The first one may be related to an improper shutdown of the client. Kubernetes podSpec default value for terminationGracePeriodSeconds is 30. This may be too short for Prysm to shutdown properly. I would recommend increasing it to 120 and ensuring that you are using a solid state disk for faster writes.

ybstaked commented 2 years ago

Ok noted, but in this case, the first error I sent you here was before any restarts or shutdowns: https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231810731

ybstaked commented 2 years ago

@prestonvanloon I emailed you the beaconchain.db specifically from: https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231810731

prestonvanloon commented 2 years ago

Thanks @ybstaked. I was able to reproduce this issue. Still investigating...

potuz commented 2 years ago

I believe this issue is solved:

time="2022-08-30 15:10:47" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db

But this other one I do not really know how it happens. Would it be possible to check with the current develop images?

time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main

ybstaked commented 2 years ago

thanks @potuz, for this error, @prestonvanloon mentioned that it may be related to an improper shutdown, i.e.: https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231843716 and also said:

"Later this week, I'll try some experimentation with SIGKILL to see if I can replicate the state you were in. We may not be able to prevent improper shutdowns, but maybe we can repair damaged databases automatically."

We will on our end increase the terminationGracePeriodSeconds, but can this also be made better in the software?

time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main

ybstaked commented 2 years ago

Btw I just checked and we use a terminationGracePeriodSeconds of 180 seconds (this is what was set when I posted that error; more than your recommendation of 120 seconds)

rauljordan commented 1 year ago

hi @ybstaked wondering if you are still running into this issue post-merge

nisdas commented 1 year ago

@rauljordan This is still relevant, I am currently looking at it

leohhhn commented 1 year ago

having the same error. Firstly Prysm synced nicely to the provided checkpoint and was following the chain head. Shut it down in order to set up a systems service, tried starting Prysm with the same command, and now getting ERROR main: could not get ancestor state: slot 5616191 not in db due to checkpoint sync: cannot retrieve data for slot

prysmaticlabs / prysm

prysm 3.0.0 beacon data gets corrupted #11327

🐞 Bug Report

Description

🔬 Minimal Reproduction

🔥 Error

🌍 Your Environment