The test change-keyper-set only passed by accident, because of its parameterization and the specific timeframe in which the keyper-set change is concluded relative to the program state.
For example, the on-chain keyper set change in the test has the following parameters:
activation-block-number: 24
DKGStartBlockDelta: 1200
last-seen block: 19
Note that the last-seen block is the last l1 block that was communicated to the shuttermint chain,
and NOT the current l1 block number retrieved from the l1 JSON-RPC endpoint.
This makes the check in handleOnChainKeyperSetChanges (keyper.go, lines 238-246) pass.
However, if other values are chosen, the program goes into a livelock: the last-seen block that
makes the check pass is never updated again unless a new batch-config becomes active (keyper.go, lines 184-204).
This creates a circular dependency: the keypers cannot send out newly seen batch-configs and start the DKG for them, so no batch config becomes active anytime soon, and so the last-seen block is never updated.
The result is a livelock in which all new keyper sets on l1 are ignored by the keypers.
One can see that the last-seen block stalls and is thus far behind the current l1 block of 207, as well as behind the activation block number of the batch-config / keyper set change.
The config has not been sent out yet, and thus has not been voted upon or activated.
The fix for this issue is to stop using the last-seen block communicated to the tendermint chain as the trigger for sending out new batch configs, and to use the current l1 block number seen locally by the keyper instead.
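The difference between the two reference blocks can be illustrated with a minimal Go sketch. It does not reproduce the code in keyper.go: the function shouldSendBatchConfig and the assumed form of the check (the reference block must be within DKGStartBlockDelta blocks of the activation block) are hypothetical. The first call uses the test parameters from above; the activation block of 100 and the delta of 5 in the other calls are made-up values chosen to show the stall.

```go
package main

import "fmt"

// shouldSendBatchConfig is a hypothetical stand-in for the gating check.
// Assumed form: a new batch config may be sent out once the reference block
// is within dkgStartBlockDelta blocks of the activation block (or past it).
func shouldSendBatchConfig(referenceBlock, activationBlock, dkgStartBlockDelta uint64) bool {
	if referenceBlock >= activationBlock {
		return true
	}
	return activationBlock-referenceBlock <= dkgStartBlockDelta
}

func main() {
	// Test parameters: activation block 24, delta 1200, last-seen block 19.
	// The large delta lets the check pass no matter how stale the block is.
	fmt.Println(shouldSendBatchConfig(19, 24, 1200)) // true

	// Hypothetical values: activation block 100, delta 5.
	// Gating on the stalled last-seen block (19) never passes, and the
	// last-seen block only advances once a batch config becomes active.
	fmt.Println(shouldSendBatchConfig(19, 100, 5)) // false

	// Gating on the locally observed l1 block (207) passes, which breaks
	// the circular dependency.
	fmt.Println(shouldSendBatchConfig(207, 100, 5)) // true
}
```

Under this assumed check, the test only passes because its DKGStartBlockDelta is far larger than the gap between the last-seen block and the activation block.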
Code referenced above (keyper.go at commit 34e728e6101dc8ffd87c54165c61793d3124541a):
Check in handleOnChainKeyperSetChanges: https://github.com/shutter-network/rolling-shutter/blob/34e728e6101dc8ffd87c54165c61793d3124541a/rolling-shutter/keyper/keyper.go#L238-L246
Update of the last-seen block: https://github.com/shutter-network/rolling-shutter/blob/34e728e6101dc8ffd87c54165c61793d3124541a/rolling-shutter/keyper/keyper.go#L184-L204