Context
Currently, if a node missed the protocol upgrade announcement, it panics on the first block with a higher protocol version than it supports: https://github.com/near/nearcore/blob/master/chain/chain/src/chain.rs#L2139-L2143
To recover, only a binary upgrade is required. This is because, in the usual case, only the first block/chunk of the NEW protocol version could have caused a non-upgraded node to get stuck. The next chunk after that would have a different previous state root, which would diverge from the state root in ChunkExtra. To avoid saving an invalid state transition, we panic proactively.
Problem
The 70 -> 71 upgrade is one specific example where epoch info generation changes. As described above, this leads to an account state mismatch at the start of the epoch immediately preceding the first epoch with the new protocol version. We don't panic one epoch in advance, as Bowen mentioned, because usually we are still able to produce/process this epoch correctly and it doesn't make sense to miss rewards for it. So nodes enter an invalid state and get stuck.
Fix idea 1
For future upgrades, panic one epoch in advance. More concretely: here, if next_next_epoch_version > PROTOCOL_VERSION https://github.com/near/nearcore/blob/master/chain/epoch-manager/src/lib.rs#L734, we add a panic with the same message as at the link above.
In that case it seems the validator would only need to upgrade the binary - no new snapshot would be required. It's not clear whether it is worth the effort.
Drawback
If the validator was able to process that one epoch, it will miss the rewards for it because of the panic.
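The check from fix idea 1 could look roughly like the sketch below. It is only an illustration, not the actual epoch manager code: `check_next_next_epoch_version` is a hypothetical helper, the value of `PROTOCOL_VERSION` is made up, and the message text is a placeholder for the one at the link above. It returns an error instead of panicking so the condition is easy to test; the caller would turn the error into a panic.

```rust
/// Protocol version type; assumed here to be a plain u32.
type ProtocolVersion = u32;

/// Highest protocol version this binary supports (illustrative value only).
const PROTOCOL_VERSION: ProtocolVersion = 70;

/// Hypothetical helper: fails if the epoch after the next one is going to run
/// a protocol version that this binary does not support.
fn check_next_next_epoch_version(
    next_next_epoch_version: ProtocolVersion,
) -> Result<(), String> {
    if next_next_epoch_version > PROTOCOL_VERSION {
        // Placeholder message; the real one would match the existing panic.
        return Err(format!(
            "protocol version {} is newer than supported version {}, please upgrade the binary",
            next_next_epoch_version, PROTOCOL_VERSION
        ));
    }
    Ok(())
}
```

With this shape, the call site would do something like `check_next_next_epoch_version(v).unwrap_or_else(|msg| panic!("{}", msg))`, which keeps the drawback above: the node stops even if it could still have processed the current epoch.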
Fix idea 2
Find out whether it is necessary to lock an account's stake one epoch in advance. If it isn't, we could avoid the invalid state transition one epoch in advance; it would then happen only when the epoch with the new protocol version starts, which is natural to expect and is already handled.
Fix idea 3
If a node starts to observe higher protocol versions than it supports - in block infos, I guess - start actively displaying warnings that the protocol may be about to upgrade.
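A minimal sketch of the warning check from fix idea 3, assuming the node can collect the protocol versions seen in recent block infos into a slice. `upgrade_warning` and its parameters are hypothetical names, not existing nearcore functions.

```rust
/// Protocol version type; assumed here to be a plain u32.
type ProtocolVersion = u32;

/// Hypothetical helper: returns a warning message if any observed protocol
/// version is higher than the one this binary supports.
fn upgrade_warning(
    supported: ProtocolVersion,
    observed: &[ProtocolVersion],
) -> Option<String> {
    // No observations yet means there is nothing to warn about.
    let highest = observed.iter().copied().max()?;
    if highest > supported {
        Some(format!(
            "observed protocol version {} but this binary supports only {}; the protocol may be about to upgrade",
            highest, supported
        ))
    } else {
        None
    }
}
```

The caller would log the returned message periodically, so that an operator who missed the announcement still sees it before the node gets stuck.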
Fix idea 4
If a node does not know the voted protocol version, take a snapshot before creating the new epoch and discard it once the node is upgraded to the latest binary. This way, when the node reaches the new protocol version and stalls, the operator can upgrade the binary and, if that does not fix it, revert to the snapshot.
There will be some storage increase due to the snapshot if the operator fails to upgrade in time, but someone also pays for all the AWS traffic.
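The snapshot rule from fix idea 4 can be sketched as a small decision function. This assumes that "does not know the voted protocol version" means the voted version is higher than what the binary supports, and that the check runs just before the new epoch is created. `SnapshotAction` and `snapshot_action` are hypothetical names used only for illustration.

```rust
/// What the node should do with its local snapshot (hypothetical type).
#[derive(Debug, PartialEq)]
enum SnapshotAction {
    /// Take a snapshot before creating the new epoch.
    Take,
    /// The binary knows the voted version, so the snapshot is no longer needed.
    Discard,
    /// Nothing to do.
    Nothing,
}

/// Hypothetical helper: decides the snapshot action from the supported
/// version, the voted version, and whether a snapshot already exists.
fn snapshot_action(supported: u32, voted: u32, has_snapshot: bool) -> SnapshotAction {
    let knows_voted_version = voted <= supported;
    match (knows_voted_version, has_snapshot) {
        // Unknown voted version and no safety net yet: take one.
        (false, false) => SnapshotAction::Take,
        // Unknown voted version but a snapshot already exists: keep it.
        (false, true) => SnapshotAction::Nothing,
        // Binary was upgraded: the snapshot only costs storage now.
        (true, true) => SnapshotAction::Discard,
        (true, false) => SnapshotAction::Nothing,
    }
}
```

Keeping the existing snapshot in the `(false, true)` case bounds the storage increase to a single snapshot even if the operator misses the upgrade for a while.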
Full thread https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/incorrectly.20applied.20proposal/near/467585317