pokt-network / pocket-core

Official implementation of the Pocket Network Protocol
http://www.pokt.network
MIT License

Pass a correct session end height to `NewSession` #1545

Closed msmania closed 1 year ago

msmania commented 1 year ago

With session rollover (#1536), we accept relays for older sessions. For example, a node at block 101 accepts a relay for session height 97.

There is a bug in the relay validation function `relay.Validate`. In the example above, the function sees the following values:

- `ctx.BlockHeight()` = 101
- `sessionBlockHeight = r.Proof.SessionBlockHeight` = 97
- `sessionCtx.BlockHeight()` = 97

If the corresponding session is not cached, it passes `sessionCtx` and `ctx` to `NewSession`. This may return a wrong session because the second argument is supposed to be the context at the end of the session, but in this case it is not.

The proposed fix: if `ctx` is beyond the relay's session, get a new context at the session end height and pass that to `NewSession` instead.
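For illustration, here is a minimal, self-contained sketch of that idea, assuming 4-block sessions (matching the 97 → 101 example above); `Context` and `contextAt` are hypothetical stand-ins, not pocket-core's real types, and the actual `NewSession` signature differs:

```go
package main

import "fmt"

// Context is a hypothetical stand-in for the node's real context type.
type Context struct{ height int64 }

func (c Context) BlockHeight() int64 { return c.height }

// contextAt stands in for obtaining a context pinned to a past block height.
func contextAt(h int64) Context { return Context{height: h} }

// Assumed session length, matching the 97 -> 101 example above.
const blocksPerSession = 4

func main() {
	ctx := Context{height: 101} // current block
	sessionCtx := contextAt(97) // session start (r.Proof.SessionBlockHeight)

	// The session starting at 97 ends at block 100. If ctx is already past
	// that, pin a context at the session end instead of passing ctx itself,
	// so the session is computed from the finalized session node set.
	endCtx := ctx
	sessionEnd := sessionCtx.BlockHeight() + blocksPerSession - 1
	if ctx.BlockHeight() > sessionEnd {
		endCtx = contextAt(sessionEnd)
	}
	fmt.Println(sessionCtx.BlockHeight(), endCtx.BlockHeight()) // prints: 97 100
}
```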

nodiesBlade commented 1 year ago

So if I'm understanding correctly, the second argument is to account for nodes that get jailed in between blocks within a specific session (i.e. blocks 97-100*).

Since we're allowing for session rollovers, we need to check the state of the jailed nodes at the last block (i.e. block 100*) within a specific session. This could happen if a node serves a session rollover relay after restarting the pocket core process, which would then not have the cached session anymore, since it's only persisted in memory.

Does that sound about right? If so, LGTM.

msmania commented 1 year ago

Thank you for your comment. It's not limited to jailing. Any of newstake/editstake/unstake/jail/unjail in the middle of a session can change the session node set. This means the session node set is not finalized until the session end block is committed. This is defined by the behavior of `ValidateClaim`.

> This could happen if a node serves a session rollover relay after restarting the pocket core process, which would then not have the cached session anymore,

Not exactly. This could be a problem if a node caches a session rollover relay for 97 and the session node set is changed at the next session height 101, AND the cached session is not cleared until the node starts receiving claims for session 97.

If that happens, the cached session is not correct, so the node accepts claims that should be rejected, or vice versa.

nodiesBlade commented 1 year ago

I think your concern(s) make sense, and this fix applies to both scenarios.

> any of newstake/editstake/unstake/jail/unjail in the middle of a session can change the session node set.

(May be unrelated to your issue.) But this sounds like session caching is fundamentally busted. If we cache the session by `SessionHeader`, then most actors won't reflect the state of validators/servicers for the session blocks (start+1, ..., end], because servicers will usually only cache the start of the session, until the cache is cleared (when it submits the claim). This means that, depending on when a servicer receives a relay, it's possible for two servicers to cache different 'session states'.

msmania commented 1 year ago

> (May be unrelated to your issue.) But this sounds like session caching is fundamentally busted.

Yes, and I think that's why, as a mitigation, we call `ClearSessionCache()` in several transactions that impact active sessions and at every new session.
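As a rough sketch of what that mitigation amounts to (hypothetical, simplified types; the real cache in pocket-core is keyed and stored differently):

```go
package main

import (
	"fmt"
	"sync"
)

// sessionHeader and session are hypothetical simplifications; the point
// here is only the clear-on-state-change pattern, not the real layout.
type sessionHeader struct {
	AppPubKey          string
	Chain              string
	SessionBlockHeight int64
}

type session struct{ nodes []string }

type sessionCache struct {
	mu sync.Mutex
	m  map[sessionHeader]session
}

func newSessionCache() *sessionCache {
	return &sessionCache{m: make(map[sessionHeader]session)}
}

func (c *sessionCache) Set(h sessionHeader, s session) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[h] = s
}

func (c *sessionCache) Get(h sessionHeader) (session, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok := c.m[h]
	return s, ok
}

// Clear drops all cached sessions. Calling it whenever a
// stake/unstake/jail/unjail tx lands, and at every new session, forces
// sessions to be recomputed from current state instead of a stale snapshot.
func (c *sessionCache) Clear() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m = make(map[sessionHeader]session)
}

func main() {
	cache := newSessionCache()
	h := sessionHeader{AppPubKey: "app1", Chain: "0001", SessionBlockHeight: 97}
	cache.Set(h, session{nodes: []string{"A", "B", "C", "D"}})

	cache.Clear() // e.g. a jail tx changed the node set mid-session
	if _, ok := cache.Get(h); !ok {
		fmt.Println("session for height 97 will be recomputed")
	}
}
```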

I agree that the current design of the session cache is not effective. It just hadn't been exposed much until LeanPocket.

nodiesBlade commented 1 year ago

Hmm... the fundamental issue still lies in the fact that nodes do not have a deterministic way to keep the same session state throughout handling relays. I think your fix patches up one of the problems that is a direct result of session rollover, but it's definitely something to think about in a future PR. Thank you for your explanations! This all makes sense and LGTM!

Each node has its own `SessionCache` in LP to isolate any shared concerns, except the first node, which holds a reference to `GlobalSessionCache`. In hindsight, I think this should be named `GlobalConsensusCache`, and no node (regardless of LP) should share this cache, to prevent a consensus hiccup.

So in your scenario:

> If that happens, the cached session is not correct, so the node accepts claims that should be rejected, or vice versa.

All validators should still share the same session state and reject the claim. I'm not sure if this can actually happen with LP, given that the nodes have individual session caches when handling relays and share a single common cache for consensus, unless it's the first node in LP. Can you confirm if my understanding is correct here?

msmania commented 1 year ago

First of all, thank you for signing off!

> In hindsight, I think this should be named `GlobalConsensusCache`, and no node (regardless of LP) should share this cache, to prevent a consensus hiccup.

Personally, the name `GlobalSessionCache` sounds right to me because it caches sessions, after all. `GlobalConsensusCache` sounds like it caches some data in the Tendermint layer.

> All validators should still share the same session state and reject the claim. I'm not sure if this can actually happen with LP, given that the nodes have individual session caches when handling relays and share a single common cache for consensus, unless it's the first node in LP. Can you confirm if my understanding is correct here?

I was thinking of the following scenario. What prevents this from happening is that we clear the caches at every session. If we don't, I think any secondary node can hit the wrong-apphash error, like the case of Node2 below. This is all theoretical; maybe I'm wrong and missing something.


SessionNodeCount = 4

Block 101: Session A/B/C/D
Block 102: Session A/B/C/D
Block 103: Session A/B/C/D
Block 104: Session A/C/D/E (B out; E in)

Node1 with LP: hosting A (and not hosting E)
Node2 with LP: hosting B
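To make the divergence concrete, here's a toy model of that scenario (all names hypothetical): Node2 cached the session at block 101, where the set still contained B, while the finalized set at the session end excludes B.

```go
package main

import "fmt"

func main() {
	// Node set as Node2 cached it at block 101 vs. the finalized set
	// at the session end (block 104), per the scenario above.
	cachedSet := map[string]bool{"A": true, "B": true, "C": true, "D": true}
	finalSet := map[string]bool{"A": true, "C": true, "D": true, "E": true}

	// A claim from B: a node using the stale cache accepts it, while
	// validators using the finalized set reject it. Divergent claim
	// handling means divergent state, i.e. the wrong-apphash case.
	claimant := "B"
	fmt.Println("stale cache accepts claim:", cachedSet[claimant])  // true
	fmt.Println("finalized set accepts claim:", finalSet[claimant]) // false
}
```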


Anyway, this is unrelated to this PR. Maybe I should create a new issue.