Private blockchain stalling when less than 1/3 of validators are down - BftProcessorExecutor-QBFT-0 | INFO | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block

kkeaton-rcp commented 8 months ago

Description

I have a private blockchain with 7 validators, I take down 2 validators and the chain seems to freeze and is unable to produce blocks.

Acceptance Criteria

Blockchain is able to handle up to 1/3 of the validators going down (should be able to handle 2 validators going offline)

Steps to Reproduce (Bug)

I have tested this a few times and it is inconsistent. Sometimes the chain does not stall, other times the chain does stall.

Shut down two validators and then trigger events to happen on the blockchain.

Expected behavior: [What you expect to happen] Blockchain works with 2 validators down.

Actual behavior: [What actually happens] The blockchain will stall and gives BftProcessorExecutor-QBFT-0 | INFO | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block errors along with EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.

Frequency: [What percentage of the time does it occur?] This happens about half the times I have run testing.

Logs (if a bug)

2024-02-23 20:03:53.753+00:00 | nioEventLoopGroup-3-6 | DEBUG | EthPeer | handleDisconnect - EthPeer PeerId 0x9ee..., reputation PeerReputation 109, validated? true, disconnected? true, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 131.... with peer 0x9ee... inboundInitiated false initAt 1708716292348], enode enode://9ee...
2024-02-23 20:03:53.753+00:00 | nioEventLoopGroup-3-6 | DEBUG | EthPeers | Disconnected EthPeer PeerId 0x9ee..., reputation PeerReputation 109, validated? true, disconnected? true, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 1311626227 with peer 0x9ee... inboundInitiated false initAt 170...], enode enode://9e...
2024-02-23 20:03:53.754+00:00 | nioEventLoopGroup-3-6 | DEBUG | EthProtocolManager | Disconnect - Inbound - 0x08 CLIENT_QUITTING - 0x9e... - 9 peers left
9 EthPeers {
PeerId 0x0ff..., reputation PeerReputation 101, validated? true, disconnected? false, client: besu/v23.4.4/linux-x86_64/openjdk-java-17, connection [Connection with hashCode -1461... with peer 0x0ff... inboundInitiated true initAt 170...], enode enode://0ff...?discport=0,
PeerId 0x95..., reputation PeerReputation 101, validated? true, disconnected? false, client: besu/v23.4.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 134... with peer 0x95... inboundInitiated true initAt 170...], enode enode://95...?discport=0,

PeerId 0x0c7..., reputation PeerReputation 102, validated? true, disconnected? false, client: besu/v23.4.4/linux-x86_64/openjdk-java-17, connection [Connection with hashCode -166... with peer 0x0c7... inboundInitiated true initAt 170...], enode enode://0c7...?discport=0,

PeerId 0x31..., reputation PeerReputation 103, validated? true, disconnected? false, client: besu/v23.10.1/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 577... with peer 0x31... inboundInitiated true initAt 170...], enode enode://31...?discport=0,
PeerId 0xa5..., reputation PeerReputation 106, validated? true, disconnected? false, client: besu/v23.10.2/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 392... with peer 0xa5... inboundInitiated true initAt 170...], enode enode://a5...?discport=0,
PeerId 0x2f..., reputation PeerReputation 107, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 114... with peer 0x2f... inboundInitiated false initAt 170...], enode enode://2f...,
PeerId 0xbe..., reputation PeerReputation 107, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode -133... with peer 0xbe... inboundInitiated false initAt 170...], enode enode://be...,
PeerId 0x035..., reputation PeerReputation 111, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 202... with peer 0x035... inboundInitiated false initAt 170...], enode enode://035...,
PeerId 0x5a..., reputation PeerReputation 126, validated? true, disconnected? false, client: besu/v23.10.0/linux-x86_64/openjdk-java-17, connection [Connection with hashCode 317... with peer 0x5a... inboundInitiated false initAt 170...], enode enode://5a}
2024-02-23 20:03:54.353+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftRound | Importing proposed block to chain. round=ConsensusRoundIdentifier{Sequence=129..., Round=1}, hash=0xcfc....
2024-02-23 20:03:54.565+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftBesuControllerBuilder | Imported #12,904,735 / 4 tx / 0 pending / 21,993,755 (0.0%) gas / (0xcfc...)
2024-02-23 20:03:54.656+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftBesuControllerBuilder | Imported #12,904,736 / 0 tx / 0 pending / 0 (0.0%) gas / (0xc9b...)
2024-02-23 20:03:57.950+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:02.950+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:04.595+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftRound | Importing proposed block to chain. round=ConsensusRoundIdentifier{Sequence=129..., Round=1}, hash=0xd47...
2024-02-23 20:04:04.734+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | QbftBesuControllerBuilder | Imported #12,904,737 / 4 tx / 0 pending / 12,986,827 (0.0%) gas / (0xd47...)
2024-02-23 20:04:04.735+00:00 | EthScheduler-Workers-0 | INFO  | PersistBlockTask | Block 129... (0xd47...) is already imported
2024-02-23 20:04:07.951+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:12.951+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:17.465+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block
2024-02-23 20:04:17.952+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:22.952+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:27.953+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:32.953+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:33.592+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block
2024-02-23 20:04:37.953+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:42.954+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:47.954+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:52.955+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:04:57.955+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:02.956+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:05.711+00:00 | BftProcessorExecutor-QBFT-0 | INFO  | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block
2024-02-23 20:05:07.956+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:12.957+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.
2024-02-23 20:05:17.957+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected

Versions (Add all that apply)

Software version: [besu --version]
besu/v23.10.0/linux-x86_64/openjdk-java-17
Java version: [java -version]
openjdk 17.0.9 2023-10-17
OS Name & Version: [cat /etc/*release]
Ubuntu 22.04.2 LTS
Kernel Version: [uname -a]
Linux 22.04.1-Ubuntu SMP Wed Jan 10 22:54:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cloud VM, type, size: [Amazon Web Services I3-large]
AWS T2-Large

non-fungible-nelson commented 8 months ago

@matthew1001 any insight here?

matthew1001 commented 8 months ago

Can you confirm a couple of details please?

It looks like you have 9 peers currently, so 10 nodes in total that are available. I'm assuming some of those are non-validators (i.e. regular JSON/RPC nodes)?
You've got a variety of Besu versions from the looks of it. Is it possible to try running with a more recent version across all the nodes?
- 23.10.2 for all nodes would be a useful check if it's possible
Can you confirm that the 7 nodes you think are validators are in fact showing up when you issue qbft_getValidatorsByBlockNumber("latest") to some of the currently active nodes?
I'm assuming you haven't submitted any qbft_proposeValidatorVote calls around the time of stopping/starting the nodes?

kkeaton-rcp commented 8 months ago

@matthew1001

Correct - 7 are validators and the rest are regular JSON/RPC nodes
I confirmed this, all 7 are showing up
Correct

I upgraded all nodes to 23.10.2 - I'm getting a bit different behavior so there is a new exception I noticed in the logs (this is when the testing runs successfully)

java.util.concurrent.TimeoutException
    at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
    at org.hyperledger.besu.ethereum.blockcreation.txselection.BlockTransactionSelector.timeLimitedSelection(BlockTransactionSelector.java:181)
    at org.hyperledger.besu.ethereum.blockcreation.txselection.BlockTransactionSelector.buildTransactionListForBlock(BlockTransactionSelector.java:162)
    at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.selectTransactions(AbstractBlockCreator.java:372)
    at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:206)
    at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:154)
    at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:140)
    at org.hyperledger.besu.consensus.qbft.statemachine.QbftRound.createAndSendProposalMessage(QbftRound.java:130)
    at org.hyperledger.besu.consensus.qbft.statemachine.QbftBlockHeightManager.handleBlockTimerExpiry(QbftBlockHeightManager.java:136)
    at org.hyperledger.besu.consensus.common.bft.statemachine.BaseBftController.handleBlockTimerExpiry(BaseBftController.java:167)
    at org.hyperledger.besu.consensus.common.bft.EventMultiplexer.handleBftEvent(EventMultiplexer.java:65)
    at java.base/java.util.Optional.ifPresent(Optional.java:178)
    at org.hyperledger.besu.consensus.common.bft.BftProcessor.run(BftProcessor.java:65)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

but I still had it fail, but I am no longer getting 2024-02-23 20:04:17.465+00:00 | BftProcessorExecutor-QBFT-0 | INFO | ProposalValidator | Invalid Proposal Payload: Latest Prepared Metadata blockhash does not align with proposed block it is just all 2024-02-23 20:05:07.956+00:00 | EthScheduler-Timer-0 | DEBUG | WaitForPeerTask | Waiting for new peer connection. 9 peers currently connected.

matthew1001 commented 8 months ago

Sorry for the slow reply @kkeaton-rcp . I haven't been able to look into it in detail yet, but another contributor @Brindrajsinh-Chauhan might have some time to do so. I'll assign it over to him for now.

Brindrajsinh-Chauhan commented 8 months ago

Github does not seem to let me assign it to myself, would be grateful if @non-fungible-nelson you could take that action. Will start investigating on this piece.

matthew1001 commented 8 months ago

Looks like I can @Brindrajsinh-Chauhan - just assigned it to you

kkeaton-rcp commented 7 months ago

Just wondering if there was any update on this?

Brindrajsinh-Chauhan commented 7 months ago

Hey sorry, have not been able to recreate this based on the scenarios you provided.

kkeaton-rcp commented 7 months ago

Did you have a smart contract that took in some data and triggered that while the nodes were down?

Brindrajsinh-Chauhan commented 7 months ago

yes send a bunch of transactions to a smart contract with a few nodes down and then bringing them back up

kkeaton-rcp commented 6 months ago

Could a large amount of data going onto a smart contract cause this error? This error has now happened without any validator nodes being down. @Brindrajsinh-Chauhan @matthew1001

kkeaton-rcp commented 6 months ago

Or are there any known issues that deploying a smart contract with truffle (that is now deprecated) could cause? Would love to hear others opinions on this..

wonawunmi commented 5 months ago

@kkeaton-rcp, I suggested a solution on a similar issue #6732 .

In your case, I suspect the surviving validators are unable to produce blocks for your transactions within the set value for requesttimeoutseconds. Default recommendation for requesttimeoutseconds is 2x your blockperiodseconds, however the hardware resources of your remaining validators play a role in ensuring prompt block creation, especially, disk IO. I suggest you review your disk IO for performance. Try increasing your requesttimeoutseconds to something higher can help reduce the occurrence of this issue.

hyperledger / besu