pokt-network / pocket-core

Official implementation of the Pocket Network Protocol
http://www.pokt.network
MIT License
209 stars 103 forks source link

Clear the global session cache when a node is unjailed #1529

Closed msmania closed 1 year ago

msmania commented 1 year ago

This is a fix for a consensus failure with wrong Block.Header.AppHash, which is probably the same issue as #1478.

Issue description:

For a node to be selected for a session, it must be valid not only at the start of a session but also at the end of a session. If a node is edit-staked, jailed, or unjailed in the middle of a session, a servicer set of the session is changed.

When a node serves relays, on the other hand, it stores session information into the global session cache. If the session's servicer set is changed, the node needs to update the cached session. Otherwise the node accepts a claim that should be rejected, or rejects a claim that should be accepted. As a result, the node ends up with a consensus failure with wrong Block.Header.AppHash.

Root cause and Fix:

The root cause is simple. We have a call to ClearSessionCache in EditStakeValidator, LegacyForceValidatorUnstake, ForceValidatorUnstake, and JailValidator, but not in UnjailValidator. The proposed fix is to add the call in UnjailValidator as well.

Olshansk commented 1 year ago

Thanks @msmania. Logically, this makes sense to me and I've confirmed that we call ClearSessionCache in all other relevant locations.

My only concern is that I don't have the resources right now to validate this and the testing coverage isn't good enough either.

Ccing @PoktBlade for a 2nd opinion if this is good to just 🚢 ?

nodiesBlade commented 1 year ago

Good find, you are correct that we should be clearing the session cache when unjailing, and this could cause consensus issues due to caching causing nondeterministic behavior.

Most validators and servicers should not have the session cached until the claim/proof lifecycle. They would only if the validator/servicer is in the same app session.

Furthermore, sessions are cleared in general based on certain transactions. TLDR - we're riding off luck with nondeterministic behavior

Where does the risk lie?: Given LeanPocket allows nodes to be stacked onto each other, validators will share the same session cache, and such could result in a consensus failure with enough luck if 1/n stacked validators are paired with the jailed servicer that submits a claim. This is because then all stacked validators have the 'wrong' session cached, and unless there is another transaction that clears the session cache, the stacked validators will likely all disagree with the other validators in the network.

Even without LP, the probability of it happening is non-zero.

I am unsure if we should merge this change, or if we should consider a consensus-breaking change for this otherwise we will have some validators clearing the cache and some validators not clearing the cache, which could cause more issues.

nodiesBlade commented 1 year ago

If we want to play it 100% safe, then an upgrade height would be ideal, and release it ASAP.

nodiesBlade commented 1 year ago

I don't mean to distract this conversation, but I added a sample implementation here:

https://github.com/pokt-network/pocket-core/pull/1532

It compiles, that's as far as I've tested it.

reviewpad[bot] commented 1 year ago

AI-Generated Pull Request Summary: This pull request includes two patches:

  1. Clear the global session cache when a node is unjailed. This change ensures the cache is cleared whenever a validator is unjailed to prevent any non-deterministic cache consistency issues.

  2. Porting the upgrade key from #1532. This patch introduces a new feature key, "CLSES", and adds more test cases. It also includes checking tolerance on the upgrade height which helps to enable certain business logic within some tolerance of feature activation, providing more confidence in the feature's release and avoids non-deterministic or hard-to-predict behavior.

reviewpad[bot] commented 1 year ago

AI-Generated Pull Request Summary: This pull request includes two patches:

  1. Clear the global session cache when a node is unjailed: This patch modifies the UnjailValidator function to clear the global session cache when a node is unjailed. It also updates the comments and adds a feature flag to conditionally enable the cache clearing after a specific upgrade height.

  2. Porting the upgrade key from #1532: This patch adds a new function IsOnNamedFeatureActivationHeightWithTolerance to the Codec that checks whether a specific feature is activated within a given height tolerance. Additionally, the patch introduces the ClearUnjailedValSessionKey constant, adds a test case for the new function, and updates the BeginBlock function in the gov module to clear the global session cache only for a few blocks around the upgrade height. The changes are made across multiple files, including codec.go, codec_test.go, module.go, and valStateChanges.go.

msmania commented 1 year ago

I added a few comments to the PR in #1532 that has the flag. Do you mind copying those changes into this PR?

Done!