pokt-network / pocket-core

Official implementation of the Pocket Network Protocol
http://www.pokt.network
MIT License
209 stars 103 forks source link

Session roll over fix #1536

Closed nodiesBlade closed 1 year ago

nodiesBlade commented 1 year ago

Addresses #1464 by adding a configurable session block tolerance into pocket config client_session_sync_allowance with a default value of 1

This means that once a new session appears, servicer nodes will still respect old sessions up until 1 session block (4 blocks at the time of this PR).

DISCLAIMER: SHOULD DEFINITELY TEST.......................... USING THE REGULAR RELEASE CYCLE MECHANISMS

nodiesBlade commented 1 year ago

Newly added tests

=== RUN   TestKeeper_HandleRelay
I[2023-03-16|00:54:58.549] Initializing 1512ec818e72bfec7696f443e2324f79d90b4762 session and evidence cache 
--- PASS: TestKeeper_HandleRelay (0.20s)
=== RUN   TestKeeper_HandleRelayWithinTolerance
I[2023-03-16|00:54:58.726] Initializing 5408648319da126ccd8cce3c5efa44a93a9d8943 session and evidence cache 
--- PASS: TestKeeper_HandleRelayWithinTolerance (0.15s)
=== RUN   TestKeeper_HandleRelayOutsideTolerance
I[2023-03-16|00:54:58.904] Initializing 0e694b1904edc2fbdc13978ec27487a717d51bde session and evidence cache 
I[2023-03-16|00:54:59.058] Initializing d8dbfee56de47cb4c4708e61c012d1b99bb42d8b session and evidence cache 
--- PASS: TestKeeper_HandleRelayOutsideTolerance (0.33s)
PASS

Process finished with the exit code 0
nodiesBlade commented 1 year ago

Note: some nodes will seal their evidence claim immediately after sesson rollover and this will result in evidence sealed errors. In order to remediate this, we could modify claim and proof tx to only occur on 2,3,4th of a session block instead depending on the nodes address offset. incorporating this network wide will require more due diligence as the reason why this code exists in first place is to prevent too many claims/proofs from being submitted at once in a single block

Olshansk commented 1 year ago

@PoktBlade Thanks for doing this!

I haven't reviewed the code but just wanted to ACK that I'm aware and it's on my backlog.

DISCLAIMER: SHOULD DEFINITELY TEST.......................... USING THE REGULAR RELEASE CYCLE MECHANISMS

I, unfortunately, think that testing and getting confidence in this change is going to be one of the hardest parts, aside from the social coordination of deploying it.

I have two requests:

  1. Can you look into whether we can extend the e2e testing framework to support relays and have an integration test that validates this change.

  2. I have an open PR that enables spinning up a LocaNet and sending relays by specifying altruist endpoints (details are in the docs). 2.1 Can you use it to validate/test this change? 2.2 Please note that the infra is janky but the best we have (to the best of my knowledge), so I apologize in advance.

nodiesBlade commented 1 year ago

Hey,

I have set up my own localnet stack from rigging up my own E2E stack. So I can test to see if session roll over works.

the e2e automatic test suite is new to me, I don't have the bandwidth to test that. It was new and delivered abruptly so no one besides pni has context of that. The previous testing mechanisms included a QA suite of test cases and then a manual regression test using said qa test suite. I can add some scenarios to that, and someone can translate it to the automatic version.

Olshansk commented 1 year ago

tl;dr Noted, I will put the time into this myself.


I have set up my own localnet stack from rigging up my own E2E stack. So I can test to see if session roll over works.

I tried to put together good & easy-to-follow documentation (so it can be used by others), but if you can share the results of your own stack (potentially a video and/or documentation), that would be good too. Ref: https://github.com/pokt-network/pocket-core/blob/fixing_localnet/localnet.md

the e2e automatic test suite is new to me, I don't have the bandwidth to test that. It was new and delivered abruptly so no one besides pni has context of that

Understood. I tried to make the documentation easy to follow but will invest time in extending, improving and testing it. https://github.com/pokt-network/pocket-core/tree/staging/e2e

nodiesBlade commented 1 year ago

Hey @Olshansk I responded to most of your comments - thanks for taking the time to review it. I think the largest I concern was that you mentioned one of the tests were not running for you after reverting the major core piece of this branch, indicating something might be wrong. Can you provide your test results and we can take it from there, thank you!

Olshansk commented 1 year ago

Shouldn't this test fail if I revert the changes you made?

Screenshot 2023-03-28 at 7 58 28 PM

nodiesBlade commented 1 year ago

Shouldn't this test fail if I revert the changes you made?

Screenshot 2023-03-28 at 7 58 28 PM

Tests should not fail under the current designs.

Explanation

So the test scenario is mainly bootstrapped by a function setupHandleRelayTest which initializes practically everything that wasn't mockable and whatnot so that these functions actually run without runtime errors. This function also returns a signed relay which is encoded with a session block height of 976 (referred as relaySessionBlockHeight from now on), always. Finally, it mocks out the context so that way the current height is set by currentBlockHeight in the function signature.

Change made

In your scenario, you revert my changes, so the network will always grab the latest session block height dictated by currentBlockHeight.

Tests:

Test One (Simple handle relay):

  1. Block height 976
  2. relay encoded with relaySessionBlockHeight as 976
  3. GetLatestSessionBlockHeight is still 976,
  4. Validation passes, relays is handled.

Test Two (Within tolernace):

  1. Block height 977
  2. relay encoded with relaySessionBlockHeight as 976
  3. GetLatestSessionBlockHeight is still 976,
  4. Validation passes, relays is handled.

Test Three (outside tolernace):

  1. Block height 926
  2. relay encoded with relaySessionBlockHeight as 976
  3. GetLatestSessionBlockHeight is now 926
  4. so therelayBlockSessionHeight != GetLatestSessionBlockHeight, and therefore invalid block height error assertion matches.

pls explain

Now you might be asking, what's up with these random/magic numbers in these tests, and why are the tests structured this way. Unfortunately, this is just tech debt within the code base already, and I'm happy to resolve as much as I can and provide clarity as you see fit. What's more important to me is that you understand the code that I'm writing and you agree that the logic works. Everything else, I prefer to be a bit more pragmatic and timely on how to solve these PRs. I'll solve as much as I can in the same PR and will suggest moving the unresolved ones to another GH issue/PR. Can you mark the ones as blockers, and the ones as nice to have, and I will prioritize as needed.

P.S - I used delve to debug to see the values on the heap/stack and the function calls trace, good tool to use

nodiesBlade commented 1 year ago

Hey @Olshansk

Addressed most of your comments, try giving it a read again with the test cases, it should provide more clarity for your reading. If it takes too long, just message me async and I'll do my best to explain the logic.

nodiesBlade commented 1 year ago

Also, you mention we should have a feature flag.. Just a couple notes to that:

  1. The tests written previously never accounted for invalid block height tolerance, it was a simple test for the relay test in a happy route, that's why you don't see a 'failure case' test. The outside tolerance test adds a invalid block height test for both my changes and before my changes, effectively.
  2. The handleRelay function is purely client sided, it has nothing to do with consensus, so there is not really a need for a true 'feature flag'.
msmania commented 1 year ago

Sorry for interrupting the ongoing discussion. If I understand this patch correctly, the change in HandleDispatch is not necessary, right?

nodiesBlade commented 1 year ago

Sorry for interrupting the ongoing discussion. If I understand this patch correctly, the change in HandleDispatch is not necessary, right?

I would deem it necessary as if we are going to allow for “sessions to be rolled over” then we should allow the node to dispatch for the old sessions as well.

msmania commented 1 year ago

I would deem it necessary as if we are going to allow for “sessions to be rolled over” then we should allow the node to dispatch for the old sessions as well.

Session role over can be achieved by the change in HandleRelay, right? The other part is not necessary.

When someone, basically the Portal, starts a relay, it should always use the latest session and shouldn't request a relay with an old session because app or node's staking status may have been changed. And the Portal indeed does that by setting SessionHeader.SessionBlockHeight = 0 (I think it happens here), so it doesn't change the relay behavior anyway.

What your change in HandleDispatch achieves is to enable us to query an old session. I agree this is convenient, but I think it's out of the scope of this PR.

Moreover, you need to be careful about the second argument of NewSession, which must be at most the end of the session of the first argument. Otherwise it may return a wrong session as described in #1529.

nodiesBlade commented 1 year ago

What your change in HandleDispatch achieves is to enable us to query an old session. I agree this is convenient, but I think it's out of the scope of this PR.

Fair enough. Removed it. Portals who want to dispatch can later send in a PR or modify their own PCC to dispatch for an old session.

msmania commented 1 year ago

Thanks! One more ask about test code. In your test, each testcase calls setupHandleRelayTest, creating the whole thing from scratch. This is not an ideal design. Also, you should not hardcode block numbers even though they're stored separately. Please use ctx.BlockHeight(), which is 976, and keeper.BlocksPerSession(ctx), which is 25.

I think the test should be something like this. With this, one setup covers all scenarios and magic numbers are removed completely.

func TestKeeper_HandleRelay(t *testing.T) {
    setupHandleRelayTest(...)

    blockPerSession := keeper.BlocksPerSession(ctx)
    allowance = some positive value

    types.GlobalPocketConfig.ClientSessionBlockSyncAllowance = 0

    // Send relay with older heights without session roll over
    for i := 0; i < blockPerSession*(allowance+1); i++ {
        height := ctx.BlockHeight() - i
        // create a relay payload with `height`
        resp, err := keeper.HandleRelay(mockCtx, validRelay)
        // verify the result accordingly
    }

    types.GlobalPocketConfig.ClientSessionBlockSyncAllowance = allowance

    // Send relay with older heights with session roll over
    for i := 0; i < blockPerSession*(allowance+1); i++ {
        height := ctx.BlockHeight() - i
        // create a relay payload with `height`
        resp, err := keeper.HandleRelay(mockCtx, validRelay)
        // verify the result accordingly
    }
}
nodiesBlade commented 1 year ago

Thanks! One more ask about test code. In your test, each testcase calls setupHandleRelayTest, creating the whole thing from scratch. This is not an ideal design. Also, you should not hardcode block numbers even though they're stored separately. Please use ctx.BlockHeight(), which is 976, and keeper.BlocksPerSession(ctx), which is 25.

I think the test should be something like this. With this, one setup covers all scenarios and magic numbers are removed completely.

func TestKeeper_HandleRelay(t *testing.T) {
  setupHandleRelayTest(...)

  blockPerSession := keeper.BlocksPerSession(ctx)
  allowance = some positive value

  types.GlobalPocketConfig.ClientSessionBlockSyncAllowance = 0

  // Send relay with older heights without session roll over
  for i := 0; i < blockPerSession*(allowance+1); i++ {
      height := ctx.BlockHeight() - i
      // create a relay payload with `height`
      resp, err := keeper.HandleRelay(mockCtx, validRelay)
      // verify the result accordingly
  }

  types.GlobalPocketConfig.ClientSessionBlockSyncAllowance = allowance

  // Send relay with older heights with session roll over
  for i := 0; i < blockPerSession*(allowance+1); i++ {
      height := ctx.BlockHeight() - i
      // create a relay payload with `height`
      resp, err := keeper.HandleRelay(mockCtx, validRelay)
      // verify the result accordingly
  }
}

If we were using something like Ginkgo to structure our tests, I'd agree. Otherwise, having the tests separately helps isolate the environments and what we're testing for since these are unit tests and not integration / e2e tests.

Also, you should not hardcode block numbers even though they're stored separately. Please use ctx.BlockHeight(), which is 976, and keeper.BlocksPerSession(ctx), which is 25.

Also agree, but this would require more restructuring. I would push for a follow-up PR to improve tests if we really want to sharpen this old cold base (not worth it for this use case IMO), my bandwidth is quite limited atm to do more refactoring. If it's an absolute must, then please let me know and I'll find time to do it as I'd like see pokt continue to improve. (Or if you'd like to set up these changes, that'd be great too)

Olshansk commented 1 year ago

/reviewpad summarize

reviewpad[bot] commented 1 year ago

AI-Generated Pull Request Summary: This pull request makes several changes to the configuration file, including adding a new ClientSessionBlockSyncAllowance field, updating the default values and modifying the HandleRelay, GetCacheKey and IsBetween functions. Additionally, it adds new tests for the HandleRelay function and updates the existing tests. It also introduces a new IsProofSessionHeightWithinTolerance function to check if the latest session block height is within a specified tolerance range.

reviewpad[bot] commented 1 year ago

AI-Generated Pull Request Summary: This pull request modifies the PocketConfig struct, adding ClientSessionBlockSyncAllowance field while updating the DefaultConfig function accordingly. It introduces a new function called IsProofSessionHeightWithinTolerance, which checks if the relaySessionBlockHeight is within an acceptable range, and changes the HandleRelay function to use this new method. Additionally, new test cases have been added to service_test.go to test the updated HandleRelay function. Finally, a IsBetween utility function has been added to utils.go.

reviewpad[bot] commented 1 year ago

AI-Generated Pull Request Summary: This pull request updates the types/config.go and pocketcore/keeper/service.go files, adding a new configuration value ClientSessionBlockSyncAllowance and logic to handle cases where the relay proof session height is within the defined tolerance range. Additionally, a new utility function IsBetween has been added to types/utils.go, and new tests have been added to cover the new logic in pocketcore/keeper/service_test.go.

nodiesBlade commented 1 year ago

for your diagram, can you include the session block height for more clarity? P.S, I did change the max bounds back to latest session block height

2) all the relays are still via client side validation, so the client should still acknowledge that it is a valid relay if it's from a previous sessions within bounds and store it as part of its evidence storage. The time to submit a claim is bounded by ClaimSubmissionWindow, so as long as you submit your claim within this window, you will be rewarded

3) I'm not sure what the purpose of relay metadata is. Perhaps one of the reasons it is there to prevent nodes from serving a relay that has to do a large historical lookup (i.e block 1). I ended up making it its own client parameter. Both also serve a purpose to check if the node is in sync in its own ways.

nodiesBlade commented 1 year ago

Does this mean there is no protocol level validation that a servicer is submitting rewards only for the session where the relays came from?

No, that's not what I am saying. HandleRelay validates a relay proof (checks if it's signed by AAT, the correct session block height, etc), serves the relay request, then stores the relay proof inside a LevelDB database until it's ready to submit a claim. This entire function end to end is done locally and not broadcasted to the blockchain. The network is unaware until the claim/proof is submitted. The claim can be submitted after the session has ended, and then the servicer has up until the ClaimSubmissionWindow to submit the claim before it is considered matured (invalid).

The relay proof itself is validated to match the session it belongs in whenever it comes time for the proof tx, where a randomly selected leaf is selected, and the leaf's relay proof session block height itself is validated against the claim's session.

I think the best resources to get you up to speed on this is:

  1. The Claim/Proof TX documentation
  2. The code where a claim is considered mature (passed the claim submission window)
  3. The leaf validation code 1, 2 where it matches the session block height
msmania commented 1 year ago

I updated the diagram. @PoktBlade Can you confirm this is correct?

sequenceDiagram
    title 4 Blocks Per Session

    participant C as Client
    participant N as Node(h=100 sbh=97)

    alt w/o change or tolerance=0
        C->>+N: Relay for sbh=97
        N->>-C: Ok
    else w/o change or tolerance=0
        C->>+N: Relay for sbh!=97
        N->>-C: InvalidBlockHeightError
    else tolerance=2
        C->>+N: Relay for sbh=97
        N->>-C: Ok
        C->>+N: Relay for sbh=97-4*1
        N->>-C: Ok
        C->>+N: Relay for sbh=97-4*2
        N->>-C: Ok
        C->>+N: Relay for sbh=97-4*3
        N->>-C: InvalidBlockHeightError
        C->>+N: Relay for sbh=97+4*1
        N->>-C: InvalidBlockHeightError
    end

*sbh = Session Block Height

nodiesBlade commented 1 year ago

I updated the diagram. @PoktBlade Can you confirm this is correct?

*sbh = Session Block Height

Looks good!

Olshansk commented 1 year ago

@PoktBlade @msmania My main question has to do with validation of the RelayProof.

Assume:

  1. Node services the request
  2. Node stores it in local memory
  3. Node submits claim
  4. Chain requires proof for leaf with relay of height 93
  5. Node submits proof 6.93 != 97 because rp.SeeionBlogHeight != sessionBlockHeight so even though the relay was serviced, the node will not get rewarded

What am I missing here?

// "Validate" - Validates the relay proof object
func (rp RelayProof) Validate(appSupportedBlockchains []string, sessionNodeCount int, sessionBlockHeight int64) sdk.Error {
    // validate the session block height
    if rp.SessionBlockHeight != sessionBlockHeight {
        return NewInvalidBlockHeightError(ModuleName)
    }
msmania commented 1 year ago

@PoktBlade @msmania My main question has to do with validation of the RelayProof.

I think you mean this check, where a proof tx handler validates a random-picked proof. As you can see, the height check is against the sbh of the corresponding claim. With session roll over, a node at height 100 accepts a relay for sbh=93 and puts it in an evidence of sbh=93, not 97. After that, nobody knows that relay was a latecomer.

msmania commented 1 year ago

Now we have two parameters ClientBlockSyncAllowance and ClientSessionBlockSyncAllowance. The former is block, and the latter is session. It's very confusing. ClientSessionBlockSyncAllowance should be block too, or renamed to ClientSessionSyncAllowance or something.

Olshansk commented 1 year ago

Capturing offline discussion.

Code References

  1. Here is where we store the relay in the local DB.

    • Note that it uses the metadata of the relay, not the node
  2. Here is where we submit the claim after which point no more relays for that session be handled.

    • Note that it is random to smooth out the time at which the claim is submitted so its not all at once
    • There is potential opportunity to make this client configurable in the future
  3. For reference, here is where we compute session heights

    • Important to note that the division implicitly rounds to an int

Frame of thinking

On-chain limitations

Processing

reviewpad[bot] commented 1 year ago

AI-Generated Summary: This pull request introduces a new configuration parameter ClientSessionSyncAllowance to give the clients a range to sync session to node's height. The IsProofSessionHeightWithinTolerance method is added to check if the relay session block height is within an acceptable range. A new test case is introduced to verify the changes, and the HandleRelay method is updated accordingly.

reviewpad[bot] commented 1 year ago

AI-Generated Summary: This pull request introduces a new configuration option named client_session_sync_allowance, which allows setting a tolerance for the number of sessions that a relay request is allowed to be out-of-sync with the current session height. The implementation includes changes in the PocketConfig struct and DefaultConfig, as well as adding a new function IsProofSessionHeightWithinTolerance in the keeper. The corresponding documentation in quickstart.md has also been updated. Unit tests for testing the new functionality are added in service_test.go.

nodiesBlade commented 1 year ago

Squashed all my initial commits to rebase easily

I believe this is good to merge in. @Olshansk, merge in if you think so as well!

msmania commented 1 year ago

I'd move IsProofSessionHeightWithinTolerance to service.go, but I'm just fine. This patch all looks good to me. Thanks!

reviewpad[bot] commented 1 year ago

AI-Generated Summary: This pull request introduces a new configuration property client_session_sync_allowance to manage the allowed range of session heights in relay requests. It updates the types/config.go and app/config.go files to include this new property with a default value of 1. Also, it modifies the x/pocketcore/keeper/service.go to check if the relay's session height is within the allowed range while handling the relay. Lastly, relevant test cases are updated and added to x/pocketcore/keeper/service_test.go to ensure the new feature's correctness.

nodiesBlade commented 1 year ago

Resolved the nits, good to go again.

POKT-Discourse commented 1 year ago

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/into-the-gateway-verse-bootstrapping-a-multi-gateway-ecosystem/4532/1