prysmaticlabs / prysm

Go implementation of Ethereum proof of stake
https://www.offchainlabs.com/
GNU General Public License v3.0

Bug: Prysm broadcasts blocks with invalid attestations (Antithesis Experiment) #13336

Open qu0b opened 7 months ago

qu0b commented 7 months ago

Describe the bug

While looking through an Antithesis experiment I noticed that Prysm nodes seemed to be keeping up with head but were broadcasting blocks with invalid attestations. Moments after a beacon node broadcast such a block, the node's very own validator client reported that the block contained invalid attestations, and the already-broadcast block was rejected.

[  1109.067258] [system;capturefs;files;/service_prysm-nethermind-0--prysm-bn]        [I] time="2023-11-17 12:18:34" level=debug msg="Computed state root" beaconStateRoot=0x9afaa99daee24634f948270a31d40fece2978e301577a3177da5689014462075 prefix="rpc/validator"
[  1109.067267] [system;capturefs;files;/service_prysm-nethermind-0--prysm-bn]        [I] time="2023-11-17 12:18:34" level=info msg="Finished building block" prefix="rpc/validator" sinceSlotStartTime=491.01603ms slot=90 validator=43
[  1109.069214] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] time="2023-11-17 12:18:34" level=debug msg="gRPC request finished." backend=[] duration=180.255652ms method="/ethereum.eth.v1alpha1.BeaconNodeValidator/GetBeaconBlock"
[  1109.069303] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] time="2023-11-17 12:18:34" level=debug msg="gRPC request finished." backend=[] duration=74.506µs method="/ethereum.eth.v1alpha1.BeaconNodeValidator/DomainData"
[  1109.071286] [system;capturefs;files;/service_prysm-nethermind-0--prysm-bn]        [I] time="2023-11-17 12:18:34" level=debug msg="Broadcasting block" blockRoot=5805216f04d000597b5e52509db8d12696147dc5c2dc5d9a11ff4b5e6c774035 prefix="rpc/validator"
[  1109.519298] [system;capturefs;files;/service_prysm-nethermind-0--prysm-bn]        [I] time="2023-11-17 12:18:34" level=debug msg="Latest eth1 chain event" blockHash=0x68add428e08ca0c7c5733709313760e08a7c386cdbadcfcff85b6e421ec2bf19 blockNumber=52 prefix=powchain
[  1110.178996] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] time="2023-11-17 12:18:35" level=debug msg="gRPC request finished." backend=[] duration=1.108721852s method="/ethereum.eth.v1alpha1.BeaconNodeValidator/ProposeBeaconBlock"
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] time="2023-11-17 12:18:35" level=error msg="Failed to propose block" blockSlot=90 error="rpc error: code = Unknown desc = could not process beacon block: failed to validate consensus state transition function: could not batch verify signature: some signatures are invalid. details:
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0xa0ff3aba55cef9175ddd50a497d54e9a729d18f718f18a88fbfe490f678703e30796af2a6dc29ac8d68bccf89ede83e808faa37205088b9f14c4274c023c959fa85aba150eaa16c8b682d3cb7bb4a7f54aca3d31de151608a7e64666624d8481, public key: 0xaefa4ece060bbd6555580c8e91d43d19ca0b7d3b46c1f5c9e92ddb05f1086afa778a77eae0e8decde0730339589106a5, message: 0x79a3bf6ace9ec3c0aa9791e899da49362ddad5fa61530615abe63105b2d4480c
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0x8686edd562bc3b545882ee55c5a2dcd9bbabedc85db5430e93ab726ac192f7f64b906b8d2cb1f5a1ce80308635f5e9b013704b9968631aa51024e1f10f7db4d99a9cf3beb0b6e1db61f031313d61e09dddaed87b4b5c3d824c3bec00b006a9b4, public key: 0x96bb8296644e0138b94da3af5ea34465199781daa931ba815d24c698621686eec279773eeca3e86f8101d83ade3fb4be, message: 0xe23467b2d126606f7d7f658ce40c05b1863eab19c81419bff88ca41740e69570
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0x8c7cc5f759291c9af006fce708b3ce381f14568b498f61376109ab2cbdbc4aea032f9c04095ff913eeb6a8e28b40b5b2010220e63c585e6c4f175daba69b871b1da6e1f610f3b75799ab11d243687bf9de996ce7da72c3e0e94106808f84c985, public key: 0x804a4521e71fc5d02c5e89259c2fcbb3c9db648b2c4a2c8f0457484bf7e86bc7c348bd13d87004eb126099252abd3b4f, message: 0x69683d85ae0bd9732ab91966981f02529138cb3025373ad590d2b9a8c3fe6c90
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0x997a44cf4e25f45ee111d0ba206b007934ed86c6740035198a62bee1fa662c8a2083ee0cf186d8974fed2cbd1c9d37fe00d64bfbb20528c368a2629ba417af0de052a540df055af455f84b6ff0e919d43b73eb9c1fda9377509ef923331be9ec, public key: 0xab311ad1160538324c9dd24f36c37a005f053cd8a0477064b1c8675a13189a605d427a7d6279db300a3e419088ec0b85, message: 0xa7c316363b6f9ca99f0a34b4021e293af4dd07ec2805bdf9c82c3d24a28d7ec1
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0x80be1c51828849d70bf5c1be8a57c9a5ee00d1d8c2ea868735080b653b0ac32f34e9f6bcbff23d03f4e3a070b3170da70ec261bbdbbeb8da0f2e613109194e9631eb5cb948f3a66f518b8c6ac32751aae34bfc80696aa39aba5acc7a29a85c8f, public key: 0x817e70689a7f980a72013952b32b0fafdd76797adc93655ca7384e5b0d105780207bc444f9b9f18c9766a793ee65bc22, message: 0xce05249417f4cb96a8ffe934915252fb967c5fee19f96021305affaf4338f15d
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0x8d891da023008d8915099b4e1ad92c1afefb48eea4f2c65245745592c5156428297a8f2b13333288fdbfd43601ef34c808c0a354dc871801075e2504373f74b630c071f45ef65e9597e98e5280c4a55ad326f2c862c3eb2d97ecd5e6ca6c2546, public key: 0xb4e8db046344e749d788b88b5513a8231c6cf04f6322bc1f4a29c995c6c2a9d6bb9358e356e6fae356dab5cdba7e525a, message: 0x065bc08dce2d90a42c305dc801377ccef5a4db736bbdf790e75e96bd73b21a6b
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0xb62f412895aaf9e1481403e6f6f3322bcf5a159c1efbc1c40284315856ed3b26164931b2c37659d5eb3d5f7e5433feab0106c337983942a4d554ba04cbbd7f782eb11663011d191baed90235912857e0b28ececcbb4bef1b74dfa82d8b896b6b, public key: 0xb216d6589db7136aa2f9063da57d4400e47d2f1dc1847c8670657af0340bcd580f0c67f808e289f22f198cf79cc5573e, message: 0xf03584643d15d9b3c22d47e76bf1e9780fc9a512073b14ea5e57c66a2736416b
[  1110.180478] [system;capturefs;files;/service_prysm-nethermind-0--prysm-vc]        [I] signature 'attestation signature' is invalid. signature: 0xb8aeeac4fcde7dbee173db4e855dacb73e0c7c376be8f220cfc863972fd8a16e669b5f858e7c9028819fc523dd9fda1b160b5fc797cec7308ed794522d2ecb4f24ebfbc39393a78512d51c89fcc71052fc8d9c3b25472ed450701a074d0ac90c, public key: 0x906ebc9061329e5032001ed39e3bdc9c547420546b5c18a4c7ff29fff3ac2ea897db835cfdb25443bdf92c0d6f2ede9b, message: 0xc975360833320ed2d941dad14c5cf6e9543ba9c6040c1acf5dddee37f474afc4" prefix=validator pubKey=0x8f2f44f075cd
[  1110.447913] [system;capturefs;files;/service_prysm-nethermind-0--prysm-bn]        [I] time="2023-11-17 12:18:35" level=debug msg="Latest eth1 chain event" blockHash=0x68add428e08ca0c7c5733709313760e08a7c386cdbadcfcff85b6e421ec2bf19 blockNumber=52 prefix=powchain

This behaviour seemed to continue until the first block was finalized. However, this is only an observed correlation and does not establish causation.

Using Teku (and the hex dump logs) I decoded all of the invalid blocks and found that only Prysm nodes had proposed them:

for file in /blocks/slot_{77,80,90,97,99,101,103,107,110,116,121}.hex; do
  echo "$file"
  docker run --rm -v "$PWD/blocks:/blocks" b5715a74040f teku debug-tools pp DENEB SignedBeaconBlock "$file" \
    | jq .message.body.graffiti | xxd -r -p | iconv -f UTF-8
  echo
done
/blocks/slot_77.hex   prysm-geth-0
/blocks/slot_80.hex   prysm-nethermind-0
/blocks/slot_90.hex   prysm-nethermind-0
/blocks/slot_97.hex   prysm-nethermind-0
/blocks/slot_99.hex   prysm-nethermind-0
/blocks/slot_101.hex prysm-nethermind-0
/blocks/slot_103.hex prysm-geth-0
/blocks/slot_107.hex prysm-nethermind-0
/blocks/slot_110.hex prysm-nethermind-0
/blocks/slot_116.hex prysm-geth-0
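
For anyone who would rather do the last step of that pipeline (hex graffiti to readable text) in Go, here is a minimal sketch. `decodeGraffiti` is a hypothetical helper, not part of Prysm or Teku, and it assumes the graffiti arrives as a hex string, optionally quoted and `0x`-prefixed, zero-padded on the right.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"strings"
)

// decodeGraffiti converts a hex-encoded graffiti field into text.
// Assumption: the input may be wrapped in JSON quotes and may carry a 0x prefix.
func decodeGraffiti(raw string) (string, error) {
	s := strings.Trim(strings.TrimSpace(raw), `"`)
	s = strings.TrimPrefix(s, "0x")
	b, err := hex.DecodeString(s)
	if err != nil {
		return "", fmt.Errorf("decode graffiti: %w", err)
	}
	// Graffiti is a fixed-size field; drop the trailing zero padding.
	return string(bytes.TrimRight(b, "\x00")), nil
}

func main() {
	// Build an example value locally: "prysm-geth-0" padded to 32 bytes.
	padded := make([]byte, 32)
	copy(padded, "prysm-geth-0")

	text, err := decodeGraffiti(`"0x` + hex.EncodeToString(padded) + `"`)
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

The example input is constructed in `main` rather than copied from the experiment, so it only illustrates the decoding step.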

The indices of the validators that submitted the invalid attestations show no obvious pattern:

validator index: validator pubkey

42: 0xaefa4ece060bbd6555580c8e91d43d19ca0b7d3b46c1f5c9e92ddb05f1086afa778a77eae0e8decde0730339589106a5
50: 0x96bb8296644e0138b94da3af5ea34465199781daa931ba815d24c698621686eec279773eeca3e86f8101d83ade3fb4be
16: 0x817e70689a7f980a72013952b32b0fafdd76797adc93655ca7384e5b0d105780207bc444f9b9f18c9766a793ee65bc22
62: 0xab311ad1160538324c9dd24f36c37a005f053cd8a0477064b1c8675a13189a605d427a7d6279db300a3e419088ec0b85
39: 0x804a4521e71fc5d02c5e89259c2fcbb3c9db648b2c4a2c8f0457484bf7e86bc7c348bd13d87004eb126099252abd3b4f
23: 0xb4e8db046344e749d788b88b5513a8231c6cf04f6322bc1f4a29c995c6c2a9d6bb9358e356e6fae356dab5cdba7e525a
3: 0xb216d6589db7136aa2f9063da57d4400e47d2f1dc1847c8670657af0340bcd580f0c67f808e289f22f198cf79cc5573e
7: 0x906ebc9061329e5032001ed39e3bdc9c547420546b5c18a4c7ff29fff3ac2ea897db835cfdb25443bdf92c0d6f2ede9b

Validator index distribution in the experiment:

0 - 10    prysm-geth
10 - 20   teku-geth
20 - 30   nimbus-geth
30 - 40   lighthouse-besu
40 - 50   prysm-nethermind
50 - 60   lighthouse-nethermind
60 - 70   nimbus-nethermind
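
To cross-check the two lists above, a small sketch that maps each offending validator index to its client pair. The ranges as written share their endpoints (10, 20, ...), so this assumes they are half-open, i.e. `[lo, hi)`; `clientForIndex` and the type names are hypothetical.

```go
package main

import "fmt"

// clientRange describes one row of the index distribution.
// Assumption: ranges are half-open, [lo, hi).
type clientRange struct {
	lo, hi int
	client string
}

var ranges = []clientRange{
	{0, 10, "prysm-geth"},
	{10, 20, "teku-geth"},
	{20, 30, "nimbus-geth"},
	{30, 40, "lighthouse-besu"},
	{40, 50, "prysm-nethermind"},
	{50, 60, "lighthouse-nethermind"},
	{60, 70, "nimbus-nethermind"},
}

// clientForIndex returns the client pair that runs the given validator index.
func clientForIndex(idx int) string {
	for _, r := range ranges {
		if idx >= r.lo && idx < r.hi {
			return r.client
		}
	}
	return "unknown"
}

func main() {
	// Indices taken from the list of invalid attestation signers above.
	for _, idx := range []int{42, 50, 16, 62, 39, 23, 3, 7} {
		fmt.Printf("%2d -> %s\n", idx, clientForIndex(idx))
	}
}
```

Under that half-open assumption the eight signers fall into all seven client pairs, which fits the observation that nothing stands out about them.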

Has this worked before in a previous version?

I believe this was introduced by optimizations that rely on finalization.

🔬 Minimal Reproduction

Run an experiment with network faults starting from genesis with the devnet-dencun-11 images.

Error

Antithesis Report

Platform(s)

Linux (x86)

What version of Prysm are you running? (Which release)

Prysm/v4.1.0/2850f4d989cde6e96ae841e9f77dfd494d22274c

Anything else relevant (validator index / public key)?

No response

parithosh commented 7 months ago

Also note that there was an extremely deep reorg (depth 28) seen on this network. There shouldn't be any straightforward case in which a reorg of that depth happens on a small network this close to genesis, so some other factors could be involved as well.

nisdas commented 7 months ago

Hey, I posted this offline too but for clarity will post the same answer here:

On why we broadcast and then process an invalid block, this is our workflow:

So it fails on the second step. We do it in this order so that blocks propagate across the network as fast as possible.
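
Going only by the wording above ("broadcast and then process") and the order of the log lines in the report, the behaviour can be sketched roughly as follows. Everything here is hypothetical and heavily simplified (`block`, `proposeBlock`, the `validSigs` flag); it is not Prysm's actual proposer code.

```go
package main

import (
	"errors"
	"fmt"
)

// block is a hypothetical stand-in; validSigs models whether batch
// signature verification of the included attestations would pass.
type block struct {
	slot      uint64
	validSigs bool
}

var errInvalidSignatures = errors.New("some signatures are invalid")

// proposeBlock broadcasts first and processes second, recording the order in events.
func proposeBlock(b block, events *[]string) error {
	// Step 1: gossip the block right away so peers receive it early.
	*events = append(*events, "broadcast")

	// Step 2: run the state transition locally; this is where the failure surfaces.
	if !b.validSigs {
		*events = append(*events, "process failed")
		return fmt.Errorf("could not process beacon block: %w", errInvalidSignatures)
	}
	*events = append(*events, "processed")
	return nil
}

func main() {
	var events []string
	err := proposeBlock(block{slot: 90, validSigs: false}, &events)
	fmt.Println(events, err)
}
```

The trade-off in this ordering is that a block failing step 2 has already left the node by the time the error is returned to the validator client.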

We have seen this issue before. It happens when a deep reorg leaves attestations from a different shuffling in the same pool, which is what causes the invalid blocks to be produced. There isn't an easy solution to it yet, because it would involve deep changes to the attestation pool that we are wary of.

A solution would be to make our attestation pool reorg-aware with respect to changed shufflings and simply purge attestations from the different (old) branch. @terencechain and @potuz might have more thoughts on whether this would be a good idea.
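
A rough sketch of what such a purge could look like, under loud assumptions: the `pool` and `attestation` types are hypothetical and far simpler than Prysm's attestation pool, and `shufflingID` is a made-up stand-in for whatever identifies the committee shuffling an attestation was signed under.

```go
package main

import "fmt"

// attestation is a hypothetical, stripped-down record.
type attestation struct {
	epoch       uint64
	shufflingID string // assumption: some identifier of the shuffling used when signing
}

// pool is a hypothetical in-memory attestation pool.
type pool struct {
	atts []attestation
}

// purgeOnReorg drops every attestation whose shuffling differs from the
// shuffling the new head implies for that attestation's epoch.
// It returns the number of attestations removed.
func (p *pool) purgeOnReorg(headShuffling func(epoch uint64) string) int {
	kept := p.atts[:0] // filter in place, reusing the backing array
	for _, a := range p.atts {
		if a.shufflingID == headShuffling(a.epoch) {
			kept = append(kept, a)
		}
	}
	removed := len(p.atts) - len(kept)
	p.atts = kept
	return removed
}

func main() {
	p := &pool{atts: []attestation{
		{epoch: 2, shufflingID: "old-branch"},
		{epoch: 2, shufflingID: "new-branch"},
		{epoch: 3, shufflingID: "new-branch"},
	}}
	// After the reorg, the head's shuffling is the new branch's for every epoch here.
	removed := p.purgeOnReorg(func(uint64) string { return "new-branch" })
	fmt.Println("removed:", removed, "remaining:", len(p.atts))
}
```

The hard part in practice is presumably the hook that calls this on head changes and the lookup behind `headShuffling`, which is where the deep pool changes mentioned above would come in.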

qu0b commented 7 months ago

@parithosh Both chain reorgs happened after the chain managed to finalize:

Last slot with invalid attestations: 116

First reorg is from slot 143 -> 124
Second reorg is from slot 150 -> 137

[  1804.570738] [system;capturefs;files;/service_prysm-nethermind-0--prysm-bn]        [I] time="2023-11-17 12:30:09" level=info msg="Chain reorg occurred" commonAncestorRoot=0x81ba77b368441be27d63f495eb42242fa22bb70faa3f08f6d4f044df03a542fe depth=28 distance=37 newRoot=0x3373e44863ea4045b8b8c8710cb3f238e11119650cf839566d96ce2f71c0481e newSlot=124 newWeight=0 oldRoot=0xa836cb36e853cbeae02bae40768e2e3b802db5630fa20710438321eee7cc4383 oldSlot=143 oldWeight=0 prefix=blockchain
[  1850.807387] [system;capturefs;files;/service_prysm-geth-0--prysm-bn]              [I] time="2023-11-17 12:30:56" level=info msg="Chain reorg occurred" commonAncestorRoot=0x8a8574988b86c24cc8eb3c2b82ce84b22077c21b2ee058e6ff86f19f65a8fa30 depth=15 distance=17 newRoot=0x4a2035133aafbab2a4139864a414c89163f71aa7b09bf35fd5ae14a8d4d7a497 newSlot=137 newWeight=0 oldRoot=0x9088c8169555338677bdada4ad9735e7f33ff7d75e95552105b40d9e634da871 oldSlot=150 oldWeight=0 prefix=blockchain

We also hold off on injecting network faults until around slot 120.

parithosh commented 7 months ago

I tried reproducing the issue elsewhere and unfortunately wasn't able to :/

But @nisdas, if the reorgs happened after the invalid sigs were seen, then they're likely unrelated, right?