We need to build a better understanding of the root cause behind the various issues encountered on mainnet after the upgrade to Cosmos SDK v0.50.5, which ultimately ended in a chain restart.
After careful consultation with the devs over at Cosmos, we have arrived at the following facts:
About 1/3 of active validators failed to perform the upgrade to Paloma v1.13.0, which introduced the upgrade to Cosmos SDK v0.50.5. Those validators were unable to rejoin the chain, whether by state sync, snapshots, or rewinds.
The problems encountered consisted mostly of AppHash errors (not BlockHash), with no further indication of underlying errors in the log files, as well as keys missing from a store, see:
During the lifetime of Paloma v1.13.0 (approx 2 weeks), additional validators eventually encountered an AppHash error and remained unable to rejoin the network, no matter which strategy was attempted. In some instances, these validators had been part of the active valset for more than a week before encountering this issue.
The team then made additional changes to the existing upgrade handlers, the IAVL version, and the CometBFT version, and released Paloma v1.13.1.
During the upgrade to v1.13.1, validators failed to reach consensus, and most nodes encountered one or more of the problems described above.
Only a small subset of validators were able to upgrade without issues.
There seems to be no noticeable difference between affected and unaffected nodes in terms of node hardware, or self-built vs. release binaries.
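One way to narrow down where an affected node goes wrong is to compare the app hash it reports at each height against a healthy node and find the first height where they disagree. The following is a minimal sketch, assuming the hashes have already been collected into plain dictionaries (for example from the `app_hash` field of the block header served by each node's CometBFT RPC). The function name `first_divergence` and the sample data are hypothetical and only for illustration.

```python
from typing import Dict, Optional


def first_divergence(reference: Dict[int, str], candidate: Dict[int, str]) -> Optional[int]:
    """Return the lowest shared height at which the two nodes report different app hashes.

    Returns None if the nodes agree on every height they both have data for.
    """
    shared_heights = sorted(set(reference) & set(candidate))
    for height in shared_heights:
        # Compare case-insensitively, since hex hashes may be logged in either case.
        if reference[height].upper() != candidate[height].upper():
            return height
    return None


# Hypothetical sample data: height -> hex-encoded app hash.
healthy = {100: "AA11", 101: "BB22", 102: "CC33"}
failing = {100: "aa11", 101: "BB22", 102: "DD44"}

print(first_divergence(healthy, failing))
```

Keep in mind that the `app_hash` in a block header at height H reflects the application state after executing block H-1, so the block to inspect is the one just before the reported height.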
Without additional information on what exactly caused the observed issues, Paloma remains at risk of losing validators and ending up in a chain halt again. It's crucial that we build an understanding of the underlying issues that led to the behaviours outlined above. Therefore, we have designed the following testing and debugging steps together with the Cosmos dev team:
Grab a snapshot made prior to the upgrade height of v1.13.1 and load it into a Paloma v1.13.1 node. Verify whether the node is able to replay all blocks and is ready to form a network.
Grab a snapshot made prior to the upgrade height of v1.13.1 and modify a build of Paloma v1.13.0 to use IAVL v1.0.3, with no other changes. Verify whether the node is able to replay all blocks and is ready to form a network.
Grab a snapshot made prior to the upgrade height of v1.13.1, load it into a v1.13.0 node, and sync until the upgrade height. Stop at the upgrade height, modify the DB to contain a single validator (instead of the full valset), and see whether the network continues. See https://docs.cosmos.network/main/build/building-apps/app-testnet for more information.
If the network continues, test different configurations to verify whether the issue is non-deterministic.
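To judge whether the issue is non-deterministic, each configuration can be run several times from the same snapshot, recording the resulting app hash at a fixed height. A configuration whose repeated runs yield different hashes points to non-determinism. The sketch below assumes the results are recorded as `(config, app_hash)` pairs. The helper `nondeterministic_configs` and the configuration labels are hypothetical placeholders, not actual Paloma settings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def nondeterministic_configs(runs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the configurations whose repeated runs produced more than one app hash."""
    hashes_by_config: Dict[str, Set[str]] = defaultdict(set)
    for config, app_hash in runs:
        # Normalise case so the same hex hash is not counted twice.
        hashes_by_config[config].add(app_hash.upper())
    return sorted(config for config, seen in hashes_by_config.items() if len(seen) > 1)


# Hypothetical results: (configuration label, app hash at the chosen height).
runs = [
    ("config-a", "AA11"),
    ("config-a", "aa11"),
    ("config-b", "AA11"),
    ("config-b", "BB22"),
]

print(nondeterministic_configs(runs))
```

A configuration that consistently produces the same but wrong hash would not show up here. That case suggests a deterministic fault tied to the configuration itself, and is better caught by comparing against a healthy node.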