vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0
37 stars 19 forks source link

Node joining via State-sync snapshot during protocol upgrade fails #7619

Closed jgsbennett closed 1 year ago

jgsbennett commented 1 year ago

Problem encountered

If you stop a network for protocol upgrade (no visor), in a network where n nodes are required for consensus, start n-1 nodes from local snapshot, and then attempt to start node n via state-sync snapshot, the attempt fails (see attached logs).

Observed behaviour

Errors like:

2023-02-16T15:51:22.453Z    INFO    tendermint  light/client.go:1005    error from light block request from primary, replacing...   {"error": "height requested is too high", "height": 115, "primary": "http{http://127.0.0.1:26607}"}

In this case, the upgrade block was 113, and the command to restart this node was:

2023-02-16 15:51:11.977:[INFO] [MainProcess] [MainThread] [    network.py]:[       start]:[ 590] starting testnet-nodeset-validators-2-validator from snapshot type statesync at height 113
INFO:root:Making request to: http://127.0.0.1:26607/block?height=113
2023-02-16 15:51:12.005:[INFO] [MainProcess] [MainThread] [ tendermint.py]:[trusted_block_hash]:[  24] Making request to: http://127.0.0.1:26607/block?height=113
...
2023-02-16 15:51:12.013:[INFO] [MainProcess] [MainThread] [ tendermint.py]:[trusted_block_hash]:[  33] Found height: 113, hash: 63FCF9FF6F682C2DB2750BEAEF12FD16388D515FCA114C17E34A99C7FDA8B16A

(i.e: It looks to me like the tests did ask for height 113).

Expected behaviour

Third node is able to state-sync snapshot, and consensus is resumed on new version i.e: upgrade is successful.

System response

Describe what the system response was, include the output from the command, automation, or else.

Steps to reproduce

  1. Start a 3 node network with equal stake/delegation across three nodes.
  2. Schedule a protocol upgrade
  3. Wait for upgrade block to be reached
  4. Stop all nodes
  5. Start first two nodes on new version from local snapshot at upgrade_block_height
  6. unsafe_reset_all vega and tm directories on third node.
  7. Start third node from state-sync snapshot at upgrade_block_height

Automation

Test is on extra_protocol_upgrade_acs branch, case is: pytest -k test_validator_restarts_with_wrong_version

Evidence

Logs

state sync logs.zip

system-test.log

Additional context

Add any other context about the problem here including; system version numbers, components affected.

Definition of Done

ℹ️ Not every issue will need every item checked, however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

After Merging

wwestgarth commented 1 year ago

This is the interesting bit from the logs:

2023-02-16T15:51:22.449Z    INFO    tendermint  light/client.go:559 VerifyHeader    {"height": 114, "hash": "EB4019725B31ED704C03290A6740577B08C611D8938A38C0B2A710730AC3C3A4"}
2023-02-16T15:51:22.453Z    INFO    tendermint  light/client.go:1005    error from light block request from primary, replacing...   {"error": "height requested is too high", "height": 115, "primary": "http{http://127.0.0.1:26607}"}
2023-02-16T15:51:22.454Z    INFO    tendermint  statesync/syncer.go:260 failed to fetch and verify app hash {"err": "height requested is too high"}
2023-02-16T15:51:22.454Z    INFO    tendermint  statesync/syncer.go:206 Snapshot rejected   {"height": 113, "format": 0, "hash": "VNZDcfwmwkWxiLl1p8lBY+il4TGis/GJHF0zz33E0Jk="}
2023-02-16T15:51:22.458Z    INFO    tendermint  light/client.go:559 VerifyHeader    {"height": 111, "hash": "0484EB22139F0F3251C91777B84FCB45E88113937DFBFCF7249C97A488C89543"}
2023-02-16T15:51:22.470Z    INFO    tendermint  light/client.go:559 VerifyHeader    {"height": 112, "hash": "243BEC30E91E8E3EDC0DF644A61BB220548399BE42C208C061D93F82B7BFC6DE"}
2023-02-16T15:51:22.474Z    INFO    tendermint  statesync/syncer.go:323 Offering snapshot to ABCI app   {"height": 110, "format": 0, "hash": "XVQ3M5rS9PuFFTs1cjVk/t6R1bYN/8Tz2WNdCDp/j3I="}
2023-02-16T15:51:22.475Z    ERROR   core.protocol.processor processor/abci.go:602   snapshot rejected   {"error": "received snapshot height does not equal config height: 110 != 113"}
2023-02-16T15:51:22.475Z    INFO    tendermint  statesync/syncer.go:206 Snapshot rejected   {"height": 110, "format": 0, "hash": "XVQ3M5rS9PuFFTs1cjVk/t6R1bYN/8Tz2WNdCDp/j3I="}
2023-02-16T15:51:22.479Z    INFO    tendermint  light/client.go:559 VerifyHeader    {"height": 89, "hash": "CAC64E582CD6841EAF3C646583BCE56715D27FC4FF760AB2D9C6FE37E9563706"}
2023-02-16T15:51:22.534Z    INFO    tendermint  light/client.go:559 VerifyHeader    {"height": 90, "hash": "E737F9E51E4885F64FB1EDF85E4E999742C7B0F297568F527FAD3F03362DC098"}
2023-02-16T15:51:22.538Z    INFO    tendermint  statesync/syncer.go:323 Offering snapshot to ABCI app   {"height": 88, "format": 0, "hash": "nhtZuur+zM2R/svVuB3L6kWErddxjqYpYuot2e5MIMY="}
2023-02-16T15:51:22.539Z    ERROR   core.protocol.processor processor/abci.go:602   snapshot rejected   {"error": "received snapshot height does not equal config height: 88 != 113"}

we ask to statesync in with a snapshot at height 113, tendermint fails to get us that snapshot because of a header verification issue for a block thats in the future:

2023-02-16T15:51:22.453Z    INFO    tendermint  light/client.go:1005    error from light block request from primary, replacing...   {"error": "height requested is too high", "height": 115, "primary": "http{http://127.0.0.1:26607}"}

and then core is offered only snapshots at lower block heights, so we reject them.

It would be worth digging into, but I feel like maybe tendermint's statesync-ing isn't supposed to be used this way. I don't know why its looking into a future block -- the trusted-hash we use in the statesync config is for block-height 113.

wwestgarth commented 1 year ago

Found this in the tendermint statesync code:

// We have to fetch the next height, which contains the app hash for the previous height.
    header, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height+1), time.Now())
    if err != nil {
        return nil, err
    }
    // We also try to fetch the blocks at height H and H+2, since we need these
    // when building the state while restoring the snapshot. This avoids the race
    // condition where we try to restore a snapshot before H+2 exists.
    //

which suggests its by-design and is a limitation on their end. The conclusion being that if the network is stopped at a protocol upgrade, no one can statesync in at the latest block height. If they wanted to join at that time they would need to:

wwestgarth commented 1 year ago

Closing because its a tendermint limitation, its not directly related to an AC, and there is a workaround. At most it might be worth a little warning in a doc somewhere about the possible problems trying to join the network during the protocol upgrade.