ybstaked opened this issue 2 years ago
Confirming I am having similar issues using image gcr.io/prysmaticlabs/prysm/beacon-chain:v3.0.0. Logs:
time="2022-08-29 15:31:16" level=warning msg="Running on Ethereum Mainnet" prefix=flags
│ time="2022-08-29 15:31:16" level=info msg="Checking DB" database-path="/var/lib/prysm/beacon-node/beaconchaindata" prefix=node
│ time="2022-08-29 15:31:16" level=info msg="Opening Bolt DB at /var/lib/prysm/beacon-node/beaconchaindata/beaconchain.db" prefix=db
│ time="2022-08-29 15:31:16" level=info msg="Deposit contract: 0x00000000219ab540356cbb839cbe05303d7705fa" prefix=node
│ time="2022-08-29 15:31:16" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main
│ Stream closed EOF for int/erigon-2 (prysm-beacon)
I run multiple erigon nodes, and this only happens on one of them (each node is a k8s Pod with an erigon container and a prysm-beacon container).
I was able to work around this problem by scaling the erigon replicas down past the pod whose prysm-beacon was in the bad state, deleting the PersistentVolumeClaim, PersistentVolume, and the disk in GCP, then scaling the erigon replicas back up.
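For anyone hitting the same corruption, a rough sketch of that workaround under assumed names (a StatefulSet called erigon in namespace eth, plus example PVC/PV/disk names; none of these identifiers come from the original report):

```sh
# Scale down so the pod whose prysm-beacon volume is corrupted goes away
kubectl -n eth scale statefulset erigon --replicas=1

# Delete the claim and its bound volume (check the actual names with `kubectl get pvc,pv`)
kubectl -n eth delete pvc prysm-beacon-data-erigon-2
kubectl delete pv <pv-bound-to-that-claim>

# Delete the underlying GCP persistent disk backing that PV
gcloud compute disks delete <disk-name> --zone=<zone>

# Scale back up; a fresh volume is provisioned and prysm-beacon resyncs from checkpoint
kubectl -n eth scale statefulset erigon --replicas=2
```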
Thanks for the report. I'll admit this error is a bit unhelpful, as there are three places from which this unmarshalling error could have been returned. I have added some additional context in PR #11342.
We also run in kubernetes with go-ethereum and have never experienced this issue. Would you be willing to share a corrupted database privately?
Unfortunately, no; we had to clean up the PVCs and restart from checkpoint so we didn't continue to miss attestations and lose rewards.
OK, no worries. #11342 should help in v3.0.1. We will keep the monitoring and investigation open until we can pinpoint the issue.
If anyone stumbles upon this issue and hasn't deleted their beaconchain.db, please save a copy and reach out to me at preston@prysmaticlabs.com or on Discord. It would greatly help to see what bytes are failing to unmarshal.
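In a Kubernetes setup like the one described above, copying the database out of the pod before removing anything might look like this (pod, namespace, and container names are illustrative; the database path matches the one in the logs above):

```sh
# Copy beaconchain.db out of the prysm-beacon container before deleting the PVC
kubectl -n eth cp erigon-2:/var/lib/prysm/beacon-node/beaconchaindata/beaconchain.db \
  ./beaconchain.db -c prysm-beacon
```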
Here is another set of errors from another node:
time="2022-08-30 15:09:28" level=info msg="Processing block batch of size 199 starting from 0x8042c314... 4586753/4587345 - estimated time remaining 59s" blocksPerSecond=9.9 peers=46 prefix=initial-sync
time="2022-08-30 15:09:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=46 prefix=p2p
time="2022-08-30 15:09:37" level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
time="2022-08-30 15:10:07" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:10:07" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:10:07" level=info msg="Processing block batch of size 199 starting from 0x8042c314... 4586753/4587348 - estimated time remaining 59s" blocksPerSecond=9.9 peers=45 prefix=initial-sync
time="2022-08-30 15:10:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=46 prefix=p2p
time="2022-08-30 15:10:47" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:10:47" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:10:47" level=info msg="Processing block batch of size 507 starting from 0x8042c314... 4586753/4587352 - estimated time remaining 23s" blocksPerSecond=25.4 peers=46 prefix=initial-sync
time="2022-08-30 15:11:29" level=info msg="Peer summary" activePeers=45 inbound=0 outbound=45 prefix=p2p
time="2022-08-30 15:11:54" level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
time="2022-08-30 15:12:27" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:12:27" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:12:27" level=info msg="Processing block batch of size 507 starting from 0x8042c314... 4586753/4587360 - estimated time remaining 23s" blocksPerSecond=25.4 peers=44 prefix=initial-sync
time="2022-08-30 15:12:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=44 prefix=p2p
time="2022-08-30 15:13:29" level=info msg="Peer summary" activePeers=44 inbound=0 outbound=44 prefix=p2p
I don't have direct access to this mainnet beaconchain.db, so I cannot send it to you.
Maybe this is an issue when restoring from checkpoint sync?
@ybstaked Does this error occur right away when starting a node with checkpoint sync?
time="2022-08-30 15:10:47" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
Edit: Could you share what flags you are using for this node and the node serving the checkpoint?
No, this error does not occur right away when starting a node with checkpoint sync. I shared our node's flags before, but here they are again. We are currently using Infura as the checkpoint node.
beacon-chain --accept-terms-of-use --datadir /home/prysm/data --rpc-host 0.0.0.0 --rpc-port 4000 --grpc-gateway-host 0.0.0.0 --grpc-gateway-port 4001 --log-file /tmp/ethereum2.log --block-batch-limit=512 --http-web3provider=http://el-geth-mainnet-4-node-svc.eth.svc.cluster.local:8551/ --jwt-secret=/home/prysm/data/jwt/jwt-secret --checkpoint-sync-url=XXX@eth2-beacon-mainnet.infura.io --genesis-beacon-api-url=XXX@eth2-beacon-mainnet.infura.io --mainnet --suggested-fee-recipient=XXX
Thanks for that context. We are still investigating both of these issues. The first one may be related to an improper shutdown of the client. The Kubernetes podSpec default for terminationGracePeriodSeconds is 30 seconds, which may be too short for Prysm to shut down properly. I would recommend increasing it to 120 and ensuring that you are using a solid-state disk for faster writes.
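For reference, one way to raise that value on an existing workload (the StatefulSet and namespace names here are illustrative, not taken from the report):

```sh
# Set terminationGracePeriodSeconds to 120 on the pod template of the erigon StatefulSet
kubectl -n eth patch statefulset erigon --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":120}}}}'
```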
OK, noted, but in this case the first error I sent you occurred before any restarts or shutdowns: https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231810731
@prestonvanloon I emailed you the beaconchain.db specifically from: https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231810731
Thanks @ybstaked. I was able to reproduce this issue. Still investigating...
I believe this issue is solved:
time="2022-08-30 15:10:47" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
But I don't really know how this other one happens. Would it be possible to check with the current develop images?
time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main
Thanks @potuz. For this error, @prestonvanloon mentioned that it may be related to an improper shutdown (see https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231843716) and also said:
"Later this week, I'll try some experimentation with SIGKILL to see if I can replicate the state you were in. We may not be able to prevent improper shutdowns, but maybe we can repair damaged databases automatically."
We will increase the terminationGracePeriodSeconds on our end, but can this also be made more robust in the software?
time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main
By the way, I just checked and we use a terminationGracePeriodSeconds of 180 seconds (this was already set when I posted that error, which is more than your recommended 120 seconds).
Hi @ybstaked, wondering if you are still running into this issue post-merge.
@rauljordan This is still relevant; I am currently looking at it.
I'm having the same error. At first, Prysm synced nicely from the provided checkpoint and was following the chain head. I shut it down in order to set up a system service, tried starting Prysm with the same command, and am now getting:
ERROR main: could not get ancestor state: slot 5616191 not in db due to checkpoint sync: cannot retrieve data for slot
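For anyone doing the same, a minimal sketch of such a service as a systemd unit (everything here is illustrative: the binary path, user, and flags simply mirror the ones quoted earlier in this thread and will differ per setup):

```sh
# Write an example unit file; adjust paths, user, and flags to your own node
sudo tee /etc/systemd/system/prysm-beacon.service > /dev/null <<'EOF'
[Unit]
Description=Prysm beacon chain node
After=network-online.target
Wants=network-online.target

[Service]
User=prysm
ExecStart=/usr/local/bin/beacon-chain \
  --accept-terms-of-use \
  --datadir=/home/prysm/data \
  --http-web3provider=http://localhost:8551 \
  --jwt-secret=/home/prysm/data/jwt/jwt-secret \
  --checkpoint-sync-url=<checkpoint-provider-url> \
  --genesis-beacon-api-url=<checkpoint-provider-url> \
  --mainnet
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prysm-beacon
```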
🐞 Bug Report
Description
Prysm 3.0.0 beacon data gets corrupted while running.
🔬 Minimal Reproduction
Run a Prysm 3.0.0 beacon node.
🔥 Error
Before node is broken
Upon a restart after the previous state
Once node is broken
🌍 Your Environment
Operating System:
What version of Prysm are you running? (Which release)
Anything else relevant (validator index / public key)?
Caused missed attestations for: