palomachain / paloma

The fast blockchain messenger protocol
Apache License 2.0

Post Mortem: Understanding reasons behind the chain halt #1124

Closed by byte-bandit 7 months ago

byte-bandit commented 7 months ago

We need to build a better understanding of the root cause behind the issues encountered on mainnet after the upgrade to Cosmos SDK v0.50.5, which ultimately ended in a chain restart.

After careful consultation with the devs over at Cosmos, we have arrived at the following facts:

Without additional information on what exactly caused the observed issues, Paloma remains at risk of losing validators and ending up in a chain halt again. It's crucial that we build an understanding of the underlying issues that led to the behaviours outlined above. Therefore, we have designed the following testing and debugging steps together with the Cosmos dev team:

  1. Grab a snapshot made prior to the upgrade height of v1.13.1 and load it into a node running Paloma v1.13.1. Verify whether the node is able to replay all blocks and is ready to form a network.
  2. Grab a snapshot made prior to the upgrade height of v1.13.1 and modify a version of Paloma v1.13.0 to use IAVL v1.0.3, with no other changes. Verify whether the node is able to replay all blocks and is ready to form a network.
  3. Grab a snapshot made prior to the upgrade height of v1.13.1, load it into a v1.13.0 node, and sync until the upgrade height. Stop at the upgrade height, modify the DB to contain a single validator (instead of a full valset), and see if the network continues. See https://docs.cosmos.network/main/build/building-apps/app-testnet for more information.
  4. If the network continues, test different configurations to verify whether the issue is non-deterministic.
  5. If the issue persists, use the https://github.com/crypto-com/python-iavl tool to inspect the state at the upgrade height and identify differences.

Resources

vishal-kanna commented 7 months ago

@byte-bandit I'll start looking into it

taariq commented 7 months ago

Fixed on mainnet by upgrading IAVL to the dYdX version: https://github.com/palomachain/paloma/pull/1139/commits/e4e5ec212bf016ca1110622d7e0d4329bd611fd0