Closed jgsbennett closed 1 year ago
This is the interesting bit from the logs:
2023-02-16T15:51:22.449Z INFO tendermint light/client.go:559 VerifyHeader {"height": 114, "hash": "EB4019725B31ED704C03290A6740577B08C611D8938A38C0B2A710730AC3C3A4"}
2023-02-16T15:51:22.453Z INFO tendermint light/client.go:1005 error from light block request from primary, replacing... {"error": "height requested is too high", "height": 115, "primary": "http{http://127.0.0.1:26607}"}
2023-02-16T15:51:22.454Z INFO tendermint statesync/syncer.go:260 failed to fetch and verify app hash {"err": "height requested is too high"}
2023-02-16T15:51:22.454Z INFO tendermint statesync/syncer.go:206 Snapshot rejected {"height": 113, "format": 0, "hash": "VNZDcfwmwkWxiLl1p8lBY+il4TGis/GJHF0zz33E0Jk="}
2023-02-16T15:51:22.458Z INFO tendermint light/client.go:559 VerifyHeader {"height": 111, "hash": "0484EB22139F0F3251C91777B84FCB45E88113937DFBFCF7249C97A488C89543"}
2023-02-16T15:51:22.470Z INFO tendermint light/client.go:559 VerifyHeader {"height": 112, "hash": "243BEC30E91E8E3EDC0DF644A61BB220548399BE42C208C061D93F82B7BFC6DE"}
2023-02-16T15:51:22.474Z INFO tendermint statesync/syncer.go:323 Offering snapshot to ABCI app {"height": 110, "format": 0, "hash": "XVQ3M5rS9PuFFTs1cjVk/t6R1bYN/8Tz2WNdCDp/j3I="}
2023-02-16T15:51:22.475Z ERROR core.protocol.processor processor/abci.go:602 snapshot rejected {"error": "received snapshot height does not equal config height: 110 != 113"}
2023-02-16T15:51:22.475Z INFO tendermint statesync/syncer.go:206 Snapshot rejected {"height": 110, "format": 0, "hash": "XVQ3M5rS9PuFFTs1cjVk/t6R1bYN/8Tz2WNdCDp/j3I="}
2023-02-16T15:51:22.479Z INFO tendermint light/client.go:559 VerifyHeader {"height": 89, "hash": "CAC64E582CD6841EAF3C646583BCE56715D27FC4FF760AB2D9C6FE37E9563706"}
2023-02-16T15:51:22.534Z INFO tendermint light/client.go:559 VerifyHeader {"height": 90, "hash": "E737F9E51E4885F64FB1EDF85E4E999742C7B0F297568F527FAD3F03362DC098"}
2023-02-16T15:51:22.538Z INFO tendermint statesync/syncer.go:323 Offering snapshot to ABCI app {"height": 88, "format": 0, "hash": "nhtZuur+zM2R/svVuB3L6kWErddxjqYpYuot2e5MIMY="}
2023-02-16T15:51:22.539Z ERROR core.protocol.processor processor/abci.go:602 snapshot rejected {"error": "received snapshot height does not equal config height: 88 != 113"}
we ask to statesync in with a snapshot at height 113, tendermint fails to get us that snapshot because of a header verification issue for a block thats in the future:
2023-02-16T15:51:22.453Z INFO tendermint light/client.go:1005 error from light block request from primary, replacing... {"error": "height requested is too high", "height": 115, "primary": "http{http://127.0.0.1:26607}"}
and then core is offered only snapshots at lower block heights, so we reject them.
It would be worth digging into, but I feel like maybe tendermint's statesync-ing isn't supposed to be used this way. I don't know why its looking into a future block -- the trusted-hash we use in the statesync config is for block-height 113.
Found this in the tendermint statesync code:
// We have to fetch the next height, which contains the app hash for the previous height.
header, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height+1), time.Now())
if err != nil {
return nil, err
}
// We also try to fetch the blocks at height H and H+2, since we need these
// when building the state while restoring the snapshot. This avoids the race
// condition where we try to restore a snapshot before H+2 exists.
//
which suggests its by-design and is a limitation on their end. The conclusion being that if the network is stopped at a protocol upgrade, no one can statesync in at the latest block height. If they wanted to join at that time they would need to:
Closing because its a tendermint limitation, its not directly related to an AC, and there is a workaround. At most it might be worth a little warning in a doc somewhere about the possible problems trying to join the network during the protocol upgrade.
Problem encountered
If you stop a network for protocol upgrade (no visor), in a network where n nodes are required for consensus, start n-1 nodes from local snapshot, and then attempt to start node n via state-sync snapshot, the attempt fails (see attached logs).
Observed behaviour
Errors like:
In this case, the upgrade block was 113, and the command to restart this node was:
(i.e: It looks to me like the tests did ask for height 113).
Expected behaviour
Third node is able to state-sync snapshot, and consensus is resumed on new version i.e: upgrade is successful.
System response
Describe what the system response was, include the output from the command, automation, or else.
Steps to reproduce
Automation
Test is on extra_protocol_upgrade_acs branch, case is: pytest -k test_validator_restarts_with_wrong_version
Evidence
Logs
state sync logs.zip
system-test.log
Additional context
Add any other context about the problem here including; system version numbers, components affected.
Definition of Done
Before Merging
After Merging
Done
if there is NO requirement for new system-tests