Recover geth and story clients when geth client is upgraded after the upgrade height

limengformal commented 6 days ago

Description and context

In the event of a geth hard fork upgrade, if a node's geth client has not been upgraded by the upgrade block height, the story client may panic because it may fail to reach consensus with the rest of the nodes, whose geth clients have already been upgraded.

Upgrading only the geth client at this point does not help, since the story client has already verified or proposed a conflicting block. The node's only option at this point is to remove its data folder and sync from the genesis block, which takes a long time.

Definition of done

A node can roll back the incorrect block and restart from the correct block in the event of a late geth upgrade.

limengformal commented 4 days ago

How to reproduce:

versions
- before upgrade (story client version / geth version):
  - story: v0.10.0
  - geth: v0.9.2
- after upgrade, normal node (story client version / geth version):
  - story: v0.10.0
  - geth: v0.9.3
- after upgrade, not-upgrading node (story client version / geth version):
  - story: v0.10.0
  - geth: v0.9.2

upgrade procedure
- Before the upgrade height (you may specify the height by using --override.nostoi ${BLOCK_NUMBER}):
  - Stop story
  - Stop old geth
  - Start new geth with the upgrade height by using --override.nostoi ${BLOCK_NUMBER}
  - Start story
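
The procedure above only works while the chain is still below the upgrade height. A minimal sketch of a guard for that precondition follows; the function name `before_upgrade_height` is hypothetical, and treating "current height equal to upgrade height" as already too late is an assumption rather than something stated in this issue.

```shell
# Succeeds (exit status 0) only while the chain is still below the upgrade height.
# Both arguments are decimal block numbers.
# Assumption: a node whose height has reached the upgrade height counts as late.
before_upgrade_height() {
  current="$1"
  upgrade="$2"
  [ "$current" -lt "$upgrade" ]
}

# Possible wiring, assuming Foundry's cast is installed and geth serves RPC on 8545:
#   current=$(cast block-number --rpc-url http://localhost:8545)
#   if before_upgrade_height "$current" "$BLOCK_NUMBER"; then
#     echo "below upgrade height: safe to stop story, swap geth, restart"
#   else
#     echo "upgrade height reached: this node may hit the stuck scenario"
#   fi
```

Keeping the comparison in a small function makes it easy to check by hand before touching a live node.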
Trigger the issue
- Leave some nodes not upgraded to geth v0.9.3 after the specified ${BLOCK_NUMBER}
- Send a tx to call the EIP-7212 precompile as follows:

Install Cast (if not installed yet):

curl -L https://foundry.paradigm.xyz/ | bash
source /home/ec2-user/.zshenv
foundryup

cast call 0x0000000000000000000000000000000000000100  "0x4cee90eb86eaa050036147a12d49004b6b9c72bd725d39d4785011fe190f0b4da73bd4903f0ce3b639bbbf6e8e80d16931ff4bcf5993d58468e8fb19086e8cac36dbcd03009df8c59286b162af3bd7fcc0450c9aa81be5d10d312af6c66b1d604aebd3099c618202fcfe16ae7770b0c49ab5eadf74b754204a3bb6060e44eff37618b065f9832de4ca6ca971a7a1adc826d0f7c00181a5fb2ddf79ae00b4e10e" --rpc-url <node url to 8545 port, e.g. http://localhost:8545/>
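
To compare an upgraded node with a not-upgraded one, it can help to classify what the call above returns. The sketch below assumes the behaviour described in EIP-7212: a successful verification returns a 32-byte value of 1, while a failed verification returns empty data, as does a call to an address where no precompile is active. The function name `classify_p256_result` is hypothetical, and whether the payload above verifies successfully on an upgraded node should be confirmed against a real node.

```shell
# Classify the hex string printed by `cast call` against the precompile address.
classify_p256_result() {
  # 32-byte big-endian encoding of 1: 63 zero nibbles followed by a 1.
  expected="0x$(printf '%064x' 1)"
  case "$1" in
    "$expected")
      echo "verified" ;;    # precompile active and signature accepted
    0x|"")
      echo "empty" ;;       # precompile inactive, or signature rejected
    *)
      echo "unexpected" ;;  # anything else needs a manual look
  esac
}

# Possible usage (CALLDATA is a placeholder for the hex payload shown above):
#   result=$(cast call 0x0000000000000000000000000000000000000100 "$CALLDATA" --rpc-url http://localhost:8545)
#   classify_p256_result "$result"
```

Under these assumptions, an upgraded node reporting "verified" while a not-upgraded node reports "empty" for the same payload would show the two nodes executing the call differently.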

zsystm commented 3 hours ago

@limengformal I have successfully reproduced the stuck scenario.

Currently, I'm working on integrating the CometBFT rollback command into the Story codebase to test if this rollback feature can resolve the issue.

Since the Story codebase has some differences from a standard Cosmos SDK chain, I'm unable to directly apply the existing rollback command.
The integration work is in progress on the zsystm/rollback branch.

piplabs / story

Recover geth and story clients when geth client is upgraded after the upgrade height #144