feat: enforce max chains during new session

h5law commented 11 months ago

Description

Human Summary

This PR addresses the blocker for the Gandalf proposal. It does not implement any of the changes proposed in this proposal but instead enforces the MaxChains parameter which is currently unforced to an extent. It would prevent a node from joining a session if they are staked to more chains than allowed. This enables the future changes to this parameter to be enforceable on-chain by making sure any node joining a session is abiding by this parameter.

Fixes #1584

Summary generated by Reviewpad on 06 Dec 23 01:50 UTC

This pull request includes several changes that affect multiple files:

In the "claim.go" file:
- An import statement has been added for "github.com/tendermint/tendermint/rpc/client".
- The function "SendClaimTx" has been modified to include multiple arguments on separate lines for better readability.
- An additional condition has been added to the if statement inside the "SendClaimTx" function.
- Several log statements inside the "SendClaimTx" function have been modified to use chaining for better readability.
- The function "ValidateClaim" has been modified to add a new error condition for checking the maximum number of chains an app can be staked to.
- The "ValidateClaim" function now checks the length of the app's chains and compares it to the maximum allowed chains.
- Formatting changes have been made to the log statements inside the "ValidateClaim" function for better readability.
- The variable declaration for "msg" inside the "DeleteExpiredClaims" function has been modified to use the shorthand syntax.
In the "run" job:
- The test command timeout has been increased to 15 minutes by adding the argument "-timeout=15m". This change ensures that the tests have enough time to complete.
In the "go.mod" file:
- The version of the "github.com/google/go-cmp" package has been updated from "v0.5.6" to "v0.5.8".
- The version of the "golang.org/x/sys" package has been updated from "v0.1.0" to "v0.14.0".
In the "x/pocketcore/types/service.go" file:
- The import statements have been rearranged and reorganized.
- The "Validate" function of the "Relay" struct has been modified to accept additional parameters and has been formatted with proper line breaks for better readability.
- An additional validation has been added to check if the number of chains an app is staked to exceeds the permitted limit.
- Some variable assignments and conditional statements have been modified for clarity and consistency.
- The "sessionEndCtx" variable has been corrected to be properly assigned based on the block height.
- Some comments within the code have been updated for better clarity.
In the "codec.go" file:
- A new constant "EnforceMaxChainsUpdateKey" has been added.
- A new method "IsAfterEnforceMaxChainsUpgrade" has been added to check if the height is after a specific upgrade.
- The "IsOnNonCustodialUpgrade" method has been updated to include the height in the check.
- The "IsAfterNamedFeatureActivationHeight" method has been updated to include the height in the check.
- The "SliceToExistingMap" function has been updated to use shorthand declaration for the "fmap" variable.
- The "SliceToMap" function has been updated to use shorthand declaration for the "fmap" variable.
- The "MapToSlice" function has been updated to use shorthand declaration for the "fslice" variable.
In the "expectedKeepers.go" file:
- The function "MaxChains" has been added to the "PosKeeper" interface.
- The function "MaxChains" has been added to the "AppsKeeper" interface.
In the "error_codes.go" file:
- A new error code "CodeChainsOverLimitError" with a value of 91 has been added.
- A new error "ChainsOverLimitError" with a message "the number of staked chains is over the limit" has been added.
- A new function "NewChainsOverLimitError" has been added that creates an error with the "CodeChainsOverLimitError" code and a formatted error message that includes the number of chains that were received and the maximum allowed chains.
In the "session.go" file:
- In the "Session" struct, a new import statement for the "log" package has been added. Additionally, the "HashString" method has been modified to return "s.SessionHeader.HashString()" instead of "s.HashString()".
- The "NewSession" function has been modified to add new parameters and adjust the parameter alignment. The added parameters are "sessionNodesCount" and "blockHash".
- The "NewSessionNodes" function has been modified to add new parameters and adjust the parameter alignment. The added parameters are "sessionKey" and "sessionNodesCount".
- The "NewSessionNodes" function also includes additional code related to retrieving values from the "ctx" and "keeper", as well as checking node conditions and adding nodes to the session.
- The "MaxPossibleRelays" function includes a comment correction and a rounding operation for the returned value.
In the "service_test.go" file:
- Lines 26 and 27: The variables "ethereum" and "bitcoin" are modified to use octal escape sequences instead of hexadecimal ones. The value "01" is now "0o1" and the value "02" is now "0o2".
- Lines 148 and 152: The variable "ethereum" is modified to use an octal escape sequence instead of a hexadecimal one. The value "01" is now "0o1".
- Lines 197 and 201: The variable "ethereum" is modified to use an octal escape sequence instead of a hexadecimal one. The value "01" is now "0o1".
- Lines 326 and 330: A new method "MaxChains" is added to the "MockAppsKeeper" struct. This method returns the value 15.
- Lines 397 and 401: A new method "MaxChains" is added to the "MockPosKeeper" struct. This method returns the value 15.

These changes aim to improve various aspects of the code, such as readability, maintainability, error handling, and usage of newer package versions. Please review these changes thoroughly.

reviewpad[bot] commented 11 months ago

Thank you @h5law for this first contribution!

Olshansk commented 11 months ago

@msmania Could you take a look at this review when you have a chance?

msmania commented 11 months ago

Can you add some description to explain why we need this? Reviewpad covers what this patch does, but it doesn't cover why part. And if this is to unblock Gandalf, please mention it too.

msmania commented 11 months ago

Thank you for adding the description. Given that this upgrade is EnforceMaxChains, probably we may want to enforce app's MaxChains too as an opportunistic change? The change would be to add a similar check in Relay.Validate and ValidateClaim.

RossiNYC commented 11 months ago

@h5law - can you link this PR to https://github.com/orgs/pokt-network/projects/143?pane=issue&itemId=38406491

I think you're solving for it.

POKT-Discourse commented 11 months ago

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/pip-32-fix-maxchains-paramater-fmp/4805/1

h5law commented 11 months ago

@msmania I have updated the PR to check the app has not exceeded max chains when validating relays and during claim validation. Is this what you had in mind?

msmania commented 11 months ago

@msmania I have updated the PR to check the app has not exceeded max chains when validating relays and during claim validation. Is this what you had in mind?

Yes, the logic is something like that, but you need to use MaxChains in the AppKeeper (here). There are two parameters pos/MaximumChains and application/MaximumChains.

RossiNYC commented 11 months ago

Just FYI @Olshansk - during the last builder call PNF decided this PR needs a DAO vote to get included into the next build - Shane is aware and is working to publish the PIP that will x-ref this.

h5law commented 11 months ago

@msmania

I should have noticed this earlier. Does this approach, changing NewSessionNodes, cause service disruption?

So to be clear this change will not cause service distruption as no nodes are able to stake higher than the 15 limit. However if the DAO decides to change this parameter in future then PNF should release a statement and people will have to change their max chains stakes to the new value before it happens on chain or else they will not be able to join the session.

Even in a transition phase before all node runners adapt their nodes to a new MaxChains, we should keep the service quality as is. To do that, I think the right approach would be a node with excessive chains still serve relays, but it won't earn service rewards. What do you think?

I think that blocking service rewards is a good way to go. I think this would just be a check in the claim/proof lifecycle code where funds are distributed which just disables the sending of funds if the node is "overstaked". I can find this and add this in soon

msmania commented 11 months ago

So to be clear this change will not cause service distruption as no nodes are able to stake higher than the 15 limit. However if the DAO decides to change this parameter in future then PNF should release a statement and people will have to change their max chains stakes to the new value before it happens on chain or else they will not be able to join the session.

What I'm saying is, if all nodes hosting a chain have staked 15 chains, after the parameter is changed to a value smaller than 15, that chain stops working because a dispatch request enters an infinite loop. Even if a half of nodes staked have staked excessive chains, a dispatch request gets slower because of more iterations in the loop.

Of course we release a statement and encourage node runners to restake, and of course they're incentivized to do so, but still, we cannot expect all node runners are quick and responsive. QoS shouldn't be relay on this kind of off-chain human activity.

msa6867 commented 10 months ago

I have been out of the loop for a while, so I apologize for the very late input. The updated approach deviates from the approach that was voted on (enforce at session). I recommend going back to a session-enforced approach, but rather than doing an enforcement check as to whether len (chain list) is LE MaxChain, simply ignore any entry in the node’s chain ID list beyond the first MaxChain entries when selecting nodes for a session. That should solve most of the concerns raised by @msmania (about transition stability if not enough nodes drop chain list below maxchain ahead of a parameter update) without moving enforcement to rewards. Enforcing at rewards is problematic for reasons discussed by myself and @RawthiL on the forum thread.

Please note that if you decide to go with this approach (ie, ignore entries beyond first MaxChain), then it will be good to have full awareness of how the database is ordered.... is the list ordered by data entry, or is it stored in numerically-ascending order of values, etc. Expected behavior both in testnest and mainnet will be slightly different depending on the answer to this question.

POKT-Discourse commented 10 months ago

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/pip-34-fix-maxchains-paramater-fmp/4805/8

moatus commented 10 months ago

A much simpler approach would be the following. Pass the current MaxChain parameter to session selection code and update the session selection code so as to effectively ignore any chain ID in a node's list of chain IDs beyond the first MaxChain entries. The only downside is that enforcement would be done at the level of portal rather than protocol, so in a multi-portal environment, you would need to make sure each portal enforces this.

@msa6867 You suggest that enforcement of MaxChains should be on the gateways themselves, but that is much harder to enforce. Systems would have to be created to audit sessions and gateway traffic to ensure that gateways are enforcing a network wide rule. The network already has a consensus layer, to ensure that rules are being followed, so it makes way more sense to utilize it for MaxChains, as was it's intended design.

I personally felt that the network's own consensus layer made more sense then creating an off-chain mechanism of enforcement (which is essentially and off-chain consensus mechanism). This is why I did not pursue the gateway enforcement route and believe in this current route. The DAO has already voted on a consensus level change, so I would suggest you take alternative suggestions to the DAO/forum first. This PR is specifically focused on a consensus level fix so any discussion on an alternative gateway level mechanism should happen over there.

POKT-Discourse commented 10 months ago

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/pip-34-fix-maxchains-paramater-fmp/4805/10

RawthiL commented 10 months ago

Really? why have you changed the mechanism after the vote started @h5law ?

Enabling an starvation attack and adding complexity to the RTTM calculation is not the right call. Enforcing on the session generation is the right way to go. I think that the issue brought up by @msmania can be addressed in an other way. The chances that no node runner will stake on a chain are really low. Why would this happen if the node runner already has the working blockchain node before the change? Also, we have Community Chains and bootstrap nodes to quickly fix this. These "abandoned" chains will mean constant and uncontested POKT for the first node runner that stakes.

msmania commented 10 months ago

@msmania do you think using the session context for only the param check. I have it this way but using the current context for the blockheight (for the flag check only)

Not in RewardForRelays. I was thinkig of rejecting a claim in ValidateClaim if the sender has excessive chains. It's very simple.

Further, I am not sure why we are focused on consensus-breaking changes when this should be fixable without breaking consensus.

Updating the session selection code is consensus-breaking anyway.

h5law commented 10 months ago

@RawthiL Ideally I would leave the change to node selection, however I think the claim validation check is better to avoid a (unlikely) bad state.

@msmania I have moved the chain check to the ValidateClaim method.

For anyone else the PR is small I would appreciate if you have comments about a specific line or area of the code could you leave your comment on that line, if you can, or at least link to it - this would be appreciated on my end just to make addressing things easier when context switching from Shannon.

@msa6867 I think your idea about gateway level enforcement is not ideal, this should be enforced at the protocol level. Anything that changes: session selection, rewards or claims, most things in regards to what we are talking about will be consensus breaking. About this touching numerous files a lot of the changes are simply my LSP formatting the code as it wasnt formatted properly before (gopls) this is in fact a very small PR.

@moatus @RawthiL I realise I made the change during the vote that moved away from session selection, I am following @msmania lead here. I think the vote was more focused on enforcing the parameter at the protocol level. Whether enforced through joining a session or being able to claim for it, the result is the same (no change at the moment). If the param were to change then we would want no issues. Enforcement at the session level opens a potential risk to QoS levels, enforcement at claim validation seems more appropriate for both apps and servicers.

Any feedback is welcomed this change should be vetted and the best approach should be taken

msa6867 commented 10 months ago

I have edited my comment above to clean it up and get rid of the negative tone. Sorry about that. Also Ignore what I said about enforcing at gateway... still at session but with a methodology of ignoring all but first MaxChain entries... This mod should take care most of the QoS state concern raised by @msmania while allowing enforcement to still be at session.

@h5law, yeah I saw upon going back that most of this was formatting, lol.

h5law commented 10 months ago

@msmania @moatus @msa6867 @RawthiL WDYT of the first n selection logic?

RawthiL commented 10 months ago

@h5law I think that the problem (which is not a real problem IMO) will persist, as the order of the chains in the stake command hardly follows any order. All the chains that are not on the "N" first positions will suffer the same problem as if the node was unstaked. The "problem" is not solved, its impact is only reduced.

Also, it would be interesting to know if the order of the vector elements is preserved when written and read during the marshaling/unarshaling of the json data (I'm a complete ignorant here, but I heard some concerns). If the read/write order is not strict than this "read n first chains" is not possible to implement.

My concern is that even with picking the "N" top chains, we will keep carrying nodes staked for more chains that what is allowed, creating non-compliances in the data for a reason that is not relevant enough. We should go for the original commit, there was no problem there, lazy node runners are not to be trusted for high QoS.

msmania commented 10 months ago

First of all, thank you all for considering and discussing the concern I raised. I believe it's necessary discussion. And I really appreciate your diligence and patience @h5law.

@msmania @moatus @msa6867 @RawthiL WDYT of the first n selection logic?

For this logic, I agree with @RawthiL. Besides his points, I think it even deincentivizes node runners to restake their nodes. If their 15-chains nodes still earn rewards, there is no reason for them to pay fees and reduce the chains.

I like this idea in RawthiL's post.

We can plan the enforcement of the max chain parameter. The same way we do with any other protocol update, we observe the chain data, when we are sure that we have at least 24 nodes on each chain with only one chain staked, we make the change.

When we decrease the param, before submitting a gov change tx, we can wait until the majority of nodes are restaked and make sure NewSessionNodes returns within a reasonable time. Having this off-chain operation is not ideal, but we do if it's needed. With that, I agree with the first approach, not picking nodes with excessive chains in NewSessionNodes.

msa6867 commented 10 months ago

I agree with what @RawthiL is saying. We have a pretty tight-knit community of node runners that are well connected to the foundation and each other. It should not at all be a problem to signal the intention to lower the MaxChain value and get voluntary movement to the new level ahead of the actual parameter change, at least to sufficient level to ensure a smooth transition for apps, even if a long-tail of less-connected node runners see their nodes blocked at transition..

moatus commented 10 months ago

Thanks for the update @h5law. I'm not a coder myself so I can't comment on the actual code, but if the logic follows what has been stated, then I believe this is good in my book.

As @msmania mentioned, we can have a coordinated approach prior to reducing chains. Granted this should happen only happen once (maybe twice depending on the DAO's decision) so making sure it's coordinated shouldn't be difficult. It's in the best interest of node runners to make sure their nodes are staked in the right order.

At the end of the day, any node runner wanting returns will need to update their stake, so no reason to try and work around that fact. If a provider is going to neglect their order for some reason, that is on them. We shouldn't need to put any more guard rails that what we already have.

I think this approach address all past concerns. Thanks for the update 👍

h5law commented 10 months ago

Final update 🤞🏼 In a bid to keep things simple I reverted back to the original commit's logic. It seems we are in agreement that this doesn't pose as much a problem as previously thought, as we will coordinate the release as we do all others.

I chose to revert the logic simply because it makes the code simpler. Simple is easier to understand and this logic really makes the most sense I believe, I don't see a need to overcomplicate it.

I appreciate all your guys' (@msmania @RawthiL @moatus @msa6867) inputs and hope this final form is satisfactory.

RawthiL commented 10 months ago

@h5law It would be nice if you comment this rollback in the forum too, so we have the subject settled for all audiences (not everyone reads the PR comments)

POKT-Discourse commented 10 months ago

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/pip-34-fix-maxchains-paramater-fmp/4805/24

h5law commented 10 months ago

Added new tests but CI fails for another reason, I have no idea why this error occurs and its fully out of scope of this PR to attend to it. Expanding the timeout fixed it previously but this behaviour is strange and should not occur. Besides that @Olshansk @msmania this PR is ready to go

msmania commented 10 months ago

Added new tests but CI fails for another reason, I have no idea why this error occurs and its fully out of scope of this PR to attend to it. Expanding the timeout fixed it previously but this behaviour is strange and should not occur. Besides that @Olshansk @msmania this PR is ready to go

Hmm, what failed is TestRPC_QueryUnconfirmedTxs, where I think one transaction was processed before the test ran UnconfirmedTxs, which has nothing to do with this change anyway. I'm fine with ignoring irrelevant test failures.

POKT-Discourse commented 8 months ago

This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:

https://forum.pokt.network/t/rc-0-11-1-upgrade-and-hi/5012/1

pokt-network / pocket-core