Closed h5law closed 9 months ago
Thank you @h5law for this first contribution!
@msmania Could you take a look at this review when you have a chance?
Can you add some description to explain why we need this? Reviewpad covers what this patch does, but it doesn't cover why part. And if this is to unblock Gandalf, please mention it too.
Thank you for adding the description. Given that this upgrade is EnforceMaxChains, probably we may want to enforce app's MaxChains too as an opportunistic change? The change would be to add a similar check in Relay.Validate
and ValidateClaim
.
@h5law - can you link this PR to https://github.com/orgs/pokt-network/projects/143?pane=issue&itemId=38406491
I think you're solving for it.
This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:
https://forum.pokt.network/t/pip-32-fix-maxchains-paramater-fmp/4805/1
@msmania I have updated the PR to check the app has not exceeded max chains when validating relays and during claim validation. Is this what you had in mind?
@msmania I have updated the PR to check the app has not exceeded max chains when validating relays and during claim validation. Is this what you had in mind?
Yes, the logic is something like that, but you need to use MaxChains
in the AppKeeper (here). There are two parameters pos/MaximumChains
and application/MaximumChains
.
Just FYI @Olshansk - during the last builder call PNF decided this PR needs a DAO vote to get included into the next build - Shane is aware and is working to publish the PIP that will x-ref this.
@msmania
I should have noticed this earlier. Does this approach, changing
NewSessionNodes
, cause service disruption?
So to be clear this change will not cause service distruption as no nodes are able to stake higher than the 15 limit. However if the DAO decides to change this parameter in future then PNF should release a statement and people will have to change their max chains stakes to the new value before it happens on chain or else they will not be able to join the session.
Even in a transition phase before all node runners adapt their nodes to a new
MaxChains
, we should keep the service quality as is. To do that, I think the right approach would be a node with excessive chains still serve relays, but it won't earn service rewards. What do you think?
I think that blocking service rewards is a good way to go. I think this would just be a check in the claim/proof lifecycle code where funds are distributed which just disables the sending of funds if the node is "overstaked". I can find this and add this in soon
So to be clear this change will not cause service distruption as no nodes are able to stake higher than the 15 limit. However if the DAO decides to change this parameter in future then PNF should release a statement and people will have to change their max chains stakes to the new value before it happens on chain or else they will not be able to join the session.
What I'm saying is, if all nodes hosting a chain have staked 15 chains, after the parameter is changed to a value smaller than 15, that chain stops working because a dispatch request enters an infinite loop. Even if a half of nodes staked have staked excessive chains, a dispatch request gets slower because of more iterations in the loop.
Of course we release a statement and encourage node runners to restake, and of course they're incentivized to do so, but still, we cannot expect all node runners are quick and responsive. QoS shouldn't be relay on this kind of off-chain human activity.
I have been out of the loop for a while, so I apologize for the very late input. The updated approach deviates from the approach that was voted on (enforce at session). I recommend going back to a session-enforced approach, but rather than doing an enforcement check as to whether len (chain list) is LE MaxChain, simply ignore any entry in the node’s chain ID list beyond the first MaxChain entries when selecting nodes for a session. That should solve most of the concerns raised by @msmania (about transition stability if not enough nodes drop chain list below maxchain ahead of a parameter update) without moving enforcement to rewards. Enforcing at rewards is problematic for reasons discussed by myself and @RawthiL on the forum thread.
Please note that if you decide to go with this approach (ie, ignore entries beyond first MaxChain), then it will be good to have full awareness of how the database is ordered.... is the list ordered by data entry, or is it stored in numerically-ascending order of values, etc. Expected behavior both in testnest and mainnet will be slightly different depending on the answer to this question.
This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:
https://forum.pokt.network/t/pip-34-fix-maxchains-paramater-fmp/4805/8
A much simpler approach would be the following. Pass the current MaxChain parameter to session selection code and update the session selection code so as to effectively ignore any chain ID in a node's list of chain IDs beyond the first MaxChain entries. The only downside is that enforcement would be done at the level of portal rather than protocol, so in a multi-portal environment, you would need to make sure each portal enforces this.
@msa6867 You suggest that enforcement of MaxChains should be on the gateways themselves, but that is much harder to enforce. Systems would have to be created to audit sessions and gateway traffic to ensure that gateways are enforcing a network wide rule. The network already has a consensus layer, to ensure that rules are being followed, so it makes way more sense to utilize it for MaxChains, as was it's intended design.
I personally felt that the network's own consensus layer made more sense then creating an off-chain mechanism of enforcement (which is essentially and off-chain consensus mechanism). This is why I did not pursue the gateway enforcement route and believe in this current route. The DAO has already voted on a consensus level change, so I would suggest you take alternative suggestions to the DAO/forum first. This PR is specifically focused on a consensus level fix so any discussion on an alternative gateway level mechanism should happen over there.
This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:
https://forum.pokt.network/t/pip-34-fix-maxchains-paramater-fmp/4805/10
Really? why have you changed the mechanism after the vote started @h5law ?
Enabling an starvation attack and adding complexity to the RTTM calculation is not the right call. Enforcing on the session generation is the right way to go. I think that the issue brought up by @msmania can be addressed in an other way. The chances that no node runner will stake on a chain are really low. Why would this happen if the node runner already has the working blockchain node before the change? Also, we have Community Chains and bootstrap nodes to quickly fix this. These "abandoned" chains will mean constant and uncontested POKT for the first node runner that stakes.
@msmania do you think using the session context for only the param check. I have it this way but using the current context for the blockheight (for the flag check only)
Not in RewardForRelays
. I was thinkig of rejecting a claim in ValidateClaim
if the sender has excessive chains. It's very simple.
Further, I am not sure why we are focused on consensus-breaking changes when this should be fixable without breaking consensus.
Updating the session selection code is consensus-breaking anyway.
@RawthiL Ideally I would leave the change to node selection, however I think the claim validation check is better to avoid a (unlikely) bad state.
@msmania I have moved the chain check to the ValidateClaim
method.
For anyone else the PR is small I would appreciate if you have comments about a specific line or area of the code could you leave your comment on that line, if you can, or at least link to it - this would be appreciated on my end just to make addressing things easier when context switching from Shannon.
@msa6867 I think your idea about gateway level enforcement is not ideal, this should be enforced at the protocol level. Anything that changes: session selection, rewards or claims, most things in regards to what we are talking about will be consensus breaking. About this touching numerous files a lot of the changes are simply my LSP formatting the code as it wasnt formatted properly before (gopls) this is in fact a very small PR.
@moatus @RawthiL I realise I made the change during the vote that moved away from session selection, I am following @msmania lead here. I think the vote was more focused on enforcing the parameter at the protocol level. Whether enforced through joining a session or being able to claim for it, the result is the same (no change at the moment). If the param were to change then we would want no issues. Enforcement at the session level opens a potential risk to QoS levels, enforcement at claim validation seems more appropriate for both apps and servicers.
Any feedback is welcomed this change should be vetted and the best approach should be taken
I have edited my comment above to clean it up and get rid of the negative tone. Sorry about that. Also Ignore what I said about enforcing at gateway... still at session but with a methodology of ignoring all but first MaxChain entries... This mod should take care most of the QoS state concern raised by @msmania while allowing enforcement to still be at session.
@h5law, yeah I saw upon going back that most of this was formatting, lol.
@msmania @moatus @msa6867 @RawthiL WDYT of the first n selection logic?
@h5law I think that the problem (which is not a real problem IMO) will persist, as the order of the chains in the stake command hardly follows any order. All the chains that are not on the "N" first positions will suffer the same problem as if the node was unstaked. The "problem" is not solved, its impact is only reduced.
Also, it would be interesting to know if the order of the vector elements is preserved when written and read during the marshaling/unarshaling of the json data (I'm a complete ignorant here, but I heard some concerns). If the read/write order is not strict than this "read n first chains" is not possible to implement.
My concern is that even with picking the "N" top chains, we will keep carrying nodes staked for more chains that what is allowed, creating non-compliances in the data for a reason that is not relevant enough. We should go for the original commit, there was no problem there, lazy node runners are not to be trusted for high QoS.
First of all, thank you all for considering and discussing the concern I raised. I believe it's necessary discussion. And I really appreciate your diligence and patience @h5law.
@msmania @moatus @msa6867 @RawthiL WDYT of the first n selection logic?
For this logic, I agree with @RawthiL. Besides his points, I think it even deincentivizes node runners to restake their nodes. If their 15-chains nodes still earn rewards, there is no reason for them to pay fees and reduce the chains.
I like this idea in RawthiL's post.
We can plan the enforcement of the max chain parameter. The same way we do with any other protocol update, we observe the chain data, when we are sure that we have at least 24 nodes on each chain with only one chain staked, we make the change.
When we decrease the param, before submitting a gov change tx, we can wait until the majority of nodes are restaked and make sure NewSessionNodes
returns within a reasonable time. Having this off-chain operation is not ideal, but we do if it's needed. With that, I agree with the first approach, not picking nodes with excessive chains in NewSessionNodes
.
I agree with what @RawthiL is saying. We have a pretty tight-knit community of node runners that are well connected to the foundation and each other. It should not at all be a problem to signal the intention to lower the MaxChain value and get voluntary movement to the new level ahead of the actual parameter change, at least to sufficient level to ensure a smooth transition for apps, even if a long-tail of less-connected node runners see their nodes blocked at transition..
Thanks for the update @h5law. I'm not a coder myself so I can't comment on the actual code, but if the logic follows what has been stated, then I believe this is good in my book.
As @msmania mentioned, we can have a coordinated approach prior to reducing chains. Granted this should happen only happen once (maybe twice depending on the DAO's decision) so making sure it's coordinated shouldn't be difficult. It's in the best interest of node runners to make sure their nodes are staked in the right order.
At the end of the day, any node runner wanting returns will need to update their stake, so no reason to try and work around that fact. If a provider is going to neglect their order for some reason, that is on them. We shouldn't need to put any more guard rails that what we already have.
I think this approach address all past concerns. Thanks for the update 👍
Final update 🤞🏼 In a bid to keep things simple I reverted back to the original commit's logic. It seems we are in agreement that this doesn't pose as much a problem as previously thought, as we will coordinate the release as we do all others.
I chose to revert the logic simply because it makes the code simpler. Simple is easier to understand and this logic really makes the most sense I believe, I don't see a need to overcomplicate it.
I appreciate all your guys' (@msmania @RawthiL @moatus @msa6867) inputs and hope this final form is satisfactory.
@h5law It would be nice if you comment this rollback in the forum too, so we have the subject settled for all audiences (not everyone reads the PR comments)
This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:
https://forum.pokt.network/t/pip-34-fix-maxchains-paramater-fmp/4805/24
Added new tests but CI fails for another reason, I have no idea why this error occurs and its fully out of scope of this PR to attend to it. Expanding the timeout fixed it previously but this behaviour is strange and should not occur. Besides that @Olshansk @msmania this PR is ready to go
Added new tests but CI fails for another reason, I have no idea why this error occurs and its fully out of scope of this PR to attend to it. Expanding the timeout fixed it previously but this behaviour is strange and should not occur. Besides that @Olshansk @msmania this PR is ready to go
Hmm, what failed is TestRPC_QueryUnconfirmedTxs
, where I think one transaction was processed before the test ran UnconfirmedTxs
, which has nothing to do with this change anyway. I'm fine with ignoring irrelevant test failures.
This pull request has been mentioned on Pocket Network Forum. There might be relevant details there:
https://forum.pokt.network/t/rc-0-11-1-upgrade-and-hi/5012/1
Description
Human Summary
This PR addresses the blocker for the Gandalf proposal. It does not implement any of the changes proposed in this proposal but instead enforces the
MaxChains
parameter which is currently unforced to an extent. It would prevent a node from joining a session if they are staked to more chains than allowed. This enables the future changes to this parameter to be enforceable on-chain by making sure any node joining a session is abiding by this parameter.Fixes #1584
Summary generated by Reviewpad on 06 Dec 23 01:50 UTC
This pull request includes several changes that affect multiple files:
In the "claim.go" file:
In the "run" job:
In the "go.mod" file:
In the "x/pocketcore/types/service.go" file:
In the "codec.go" file:
In the "expectedKeepers.go" file:
In the "error_codes.go" file:
In the "session.go" file:
In the "service_test.go" file:
These changes aim to improve various aspects of the code, such as readability, maintainability, error handling, and usage of newer package versions. Please review these changes thoroughly.