likecoin / likecoin-chain

LikeCoin chain node binary - liked
https://docs.like.co/validator/likecoin-chain-node/setup-a-node
GNU General Public License v3.0
104 stars 23 forks source link

Further hotfix on panic when unbonding unbonded validator #75

Closed nnkken closed 2 years ago

nnkken commented 2 years ago

Also guard the case of validator not found (since it could be already removed in previous unbonding). Also add comments.

Related SDK commit: https://github.com/likecoin/cosmos-sdk/commit/6cd3d089f47487caf72e8a9f18a9ceda93ce3675

The cause of the problem:

  1. a validator (cosmos1abcd...) was unbonding before Bech32 address prefix migration
  2. after prefix migration, the validator was further jailed
  3. the SDK code tries to remove the old unbonding entry and insert a new one
  4. however, since the validator address for the unbonding entry was stored in string format, and the code for removing the unbonding entry is searching for the new Bech32 prefix address, the old entry was not removed
  5. so there are 2 unbonding entries for the same validator
  6. the first unbonding entry triggers, turning the validator status to unbonded
  7. the second unbonding entry triggers, and found that the validator was already unbonded, so the SDK code panic

Currently we are guarding the unbonding code so it won't panic if validator was already unbonded or removed, but leave an error log instead.

elise-ng commented 2 years ago

@nnkken L440 should also continue? 🤔

nnkken commented 2 years ago

@nnkken L440 should also continue? 🤔

I was keeping that because structurally it is possible to run k.RemoveValidator(ctx, val.GetOperator()). But logically speaking it could simply continue. But since we are fixing a bug on the logic itself... I think we should better keep it?

elise-ng commented 2 years ago

@nnkken L440 should also continue? 🤔

I was keeping that because structurally it is possible to run k.RemoveValidator(ctx, val.GetOperator()). But logically speaking it could simply continue. But since we are fixing a bug on the logic itself... I think we should better keep it?

if we continue at L440, we can revert L446-448 back to the vanilla sdk code, seems the code will look cleaner this way; so:

if !found {
    // due to address Bech32 prefix migration, it is possible that entry with address with old prefix was not
    // deleted (e.g. jailed during unbond), so we don't panic here and instead just skip this validator
    ctx.Logger().Error("validator in the unbonding queue was not found", "validator", valAddr)
    continue
}

if !val.IsUnbonding() {
    if val.IsUnbonded() {
        // same issue as the comments above
        ctx.Logger().Error("unbonding validator but the status was unbonded", "validator", valAddr)
        continue
    }
    panic("unexpected validator in unbonding queue; status was not unbonding or unbonded")
}

val = k.UnbondingToUnbonded(ctx, val)
if val.GetDelegatorShares().IsZero() {
    k.RemoveValidator(ctx, val.GetOperator())
}
rickmak commented 2 years ago

According to our discussion and the clause you wrote here. 5. so there are 2 unbonding entries for the same validator. The first one should already did the job. So logically if we discover the validator no longer in the transiting stage but transited state, we should skip the following action via continue.

Same logic for not found (already delete) and unbound (already transit of state).

rickmak commented 2 years ago

Oh, github down resulting my previously comment is later than @chihimng one. 🗡️

nnkken commented 2 years ago

PR updated. The SDK commit now is https://github.com/likecoin/cosmos-sdk/commit/6cd3d089f47487caf72e8a9f18a9ceda93ce3675.

nnkken commented 2 years ago

The new SDK commit (v0.44.8-dual-prefix-hotfix-3) combined the previous commits into one commit for better review.