There are only two significant complexities in implementing my proposal:
1. It must be possible for the tower to be informed that a vote on a given slot is a switch from one fork to another, and to be told, or to compute, the common ancestor (or, equivalently, the earliest voted-on slot that is on the dead fork).
2. It must be possible to save the history of slot-lockouts along with the tower in all places where the tower is persistently stored, and to load it back in when the tower is loaded. This would be easily achievable if the tower didn't have a frozen ABI, which I assume means that it cannot have serialized fields added to it. I would probably propose that the slot-lockouts history simply be appended to the existing serialized form of the tower in persistent storage, but I don't know if Solana Labs would like that idea.
Otherwise, the implementation is really trivial:
a. Create a simple data structure that stores a copy of the VecDeque<Lockout> of lockouts from the tower's vote_state immediately before each vote (a minimal sketch follows at the end of this comment)
Because the root is documented as being arbitrary within the local tower, I assume it's not necessary to also save and restore the root_slot in the tower's vote_state when switching forks.
The rough order of the maximum additional cost to store the slot-lockouts history is:
31 slots x 31 slots, which is 961 additional Lockout data structures stored. This will make the tower storage a lot larger, something to consider. It could be serialized in a compressed form, which would help with the size, but depending on the cost of I/O versus compression, that may not be the right trade-off.
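For concreteness, here is a minimal sketch of what that slot-lockouts history could look like (my own illustration with hypothetical names, not actual Tower code; it assumes the Lockout shape from VoteState and caps the history at the 31-entry tower depth, which gives the 31 x 31 worst case above):

```rust
use std::collections::VecDeque;

const MAX_LOCKOUT_HISTORY: usize = 31; // tower depth, hence the 31 x 31 worst case

#[derive(Clone)]
pub struct Lockout {
    pub slot: u64,
    pub confirmation_count: u32,
}

/// Snapshots of the tower's lockouts, taken immediately before each vote and
/// keyed by the slot being voted on.
#[derive(Default)]
pub struct LockoutHistory {
    entries: VecDeque<(u64, Vec<Lockout>)>,
}

impl LockoutHistory {
    /// Record the lockouts as they were just before voting on `vote_slot`.
    pub fn record(&mut self, vote_slot: u64, lockouts: &VecDeque<Lockout>) {
        if self.entries.len() == MAX_LOCKOUT_HISTORY {
            self.entries.pop_front();
        }
        self.entries
            .push_back((vote_slot, lockouts.iter().cloned().collect()));
    }

    /// Lockouts as they were immediately before the vote on `vote_slot`.
    /// Looking this up with the first slot voted on the abandoned fork gives
    /// the lockouts immediately after the vote on the common ancestor.
    pub fn before_vote_on(&self, vote_slot: u64) -> Option<&[Lockout]> {
        self.entries
            .iter()
            .find(|&&(slot, _)| slot == vote_slot)
            .map(|(_, lockouts)| lockouts.as_slice())
    }
}

fn main() {
    let mut history = LockoutHistory::default();
    // Lockouts after voting on slots 1 and 2 (confirmation counts 2 and 1).
    let lockouts: VecDeque<Lockout> = [(1, 2), (2, 1)]
        .into_iter()
        .map(|(slot, confirmation_count)| Lockout { slot, confirmation_count })
        .collect();
    history.record(3, &lockouts); // snapshot taken just before voting on 3
    assert_eq!(history.before_vote_on(3).map(|l| l.len()), Some(2));
}
```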
I think that testnet may be experiencing performance problems just due to the increased load of handling VoteStateUpdate instead of Vote.
Is there any evidence to support this claim? The only finding I have is that block sizes have increased, which https://github.com/solana-labs/solana/pull/26092 should solve.
No, I have none really; it's just an assumption, given that processing VoteStateUpdate looks far more complex than processing Vote, and that these transactions are the vast majority of transactions on testnet. Block sizes increasing due to the larger VoteStateUpdate transactions may also be a contributing factor.
Also -- can the lockouts from the tower just be taken from the actual Vote account for the validator at the bank of the common ancestor of the slot it's switching from and the slot it's switching to? Then nothing extra would need to be tracked at all.
This is accomplished most readily by having the tower save a copy of its lockouts right before each slot it votes on. Then when it switches forks, it finds the most recent slot voted on that is a common ancestor between the forks, and resets its vote_state's lockouts to the lockouts that existed immediately after the vote on that slot. Now it is in sync with the rest of the cluster, and applying the new slot to the tower for the new vote on the new fork does exactly what the rest of the cluster thinks it does.
There are many reasons why this doesn't work:
1. By resetting your tower to the common ancestor's tower, you ignore any accumulated lockout on your ancestors due to the wrong fork. Given the threshold check at depth 8, in theory this could be 8 lockout doublings you miss when resetting your tower.
2. Similarly, you ignore any situations in which ancestors were popped off. For example, in a network partition a validator could be stuck on a minority fork for a long time, failing the threshold check. Eventually this causes the common ancestor and further ancestors to be popped off the tower. When the partition is resolved and we switch to the main fork, we now erroneously adopt a tower that still has these ancestors.
3. Finally, the problem that led to this issue in the first place: out-of-order votes. When choosing your least-common-ancestor saved tower, you make this claim: "Now it is in sync with the rest of the cluster, and applying the new slot to the tower for the new vote on the new fork does exactly what the rest of the cluster thinks it does." If any votes landed out of order, this is false, and the validator is now stuck voting with votes that cannot be applied to the on-chain vote state until sufficient lockout expires.
Also -- can the lockouts from the tower just be taken from the actual Vote account for the validator at the bank of the common ancestor of the slot it's switching from and the slot it's switching to? Then nothing extra would need to be tracked at all.
This is something that carl and I discussed in the past - it is incredibly difficult. The issue is it is impossible to trust the validity of this vote state because you are disregarding any votes that were in transit waiting to land. Similarly the definition of common ancestor that you voted on becomes muddled here - what if that vote never landed? What if it landed out of order? Do you then use the block in which it landed? What if that block is on a different fork?
Let me know if this makes sense or you have any questions about the scenarios, happy to draw them out in depth if necessary.
If you're curious here's the initial discussion of the solution: discord://discord.com/channels/428295358100013066/489504501749907468/887686875777749032 Unfortunately I think the fire channel from last september has been axed so don't have any logs from there.
Thanks. Corrected link is:
https://discord.com/channels/428295358100013066/489504501749907468/887686875777749032
As far as I can tell, the discussion went straight from "there is a problem" to "what about syncing the cluster's tower state from the local tower state". I didn't see anyone suggest the opposite, which is what I am suggesting, and which I think solves the same issue in a much more direct way.
Consider a validator V which went off and voted on a fork. The partition included only three validators, two leaders (A and B) and the validator in question (V), which was never a leader.
Now consider that in this situation V could end up with local lockouts that don't match the cluster because of the example that started issue 20014 (i.e. local votes on 1 - 2 - 3, 3 is on dead fork, V's local lockouts are wrong with respect to the live fork).
This is the crux of the problem.
However, consider that the partition meant that no one on the network saw V's vote on 3 other than A, B, and V.
In order for A and B not to slash V, V must wait until its lockouts on 3 expire before it can switch to 6. So it does.
But once it waits long enough, no one else in the cluster even knows that the 3 vote existed.
Therefore, the effect on the cluster is the exact same as if V never voted on 3.
Therefore, V can act as if it never voted on 3.
Therefore, V can reset its tower state to exactly what it was after the vote on 2, because that is what it means to have never voted on 3.
The only thing V has to do is wait until lockouts expire on 3 so as not to be slashed by A or B. Then it can reset to the lockouts state immediately before 3 (which is the same as immediately after 2).
The cluster CANNOT see any difference between the situation where V went off and voted on a dead fork versus one in which it didn't. Therefore V cannot cause any problems by behaving like that's what happened, all the way up to resetting its tower state like I mentioned.
Therefore, V can act as if it never voted on 3.
By voting on 3 you increment the lockout on 1 and 2, if there was say some other fork coming off of 0 you'd now have to wait longer in order to switch to it.
Yeah, but I don't think you're understanding my proposal. You un-double the lockouts on 1 and 2 when switching forks away from the slot that caused you to double 1 + 2. That's what it means to "act as if it never voted on 3".
From the block chain's perspective V never voted on 3 and never doubled 1 + 2. Therefore, when V leaves that fork, it returns to the state equal to "never voted on 3 and never doubled 1 + 2".
Anyone who viewed V's vote on 3 could then slash V for this switch from 0.
... unless you're suggesting that if you leave a fork all votes on that fork are invalid, in which case lockouts are meaningless as all switches would be valid. V "never voted" on 3 so it should be able to vote on all other forks as well.
What do you mean by "lockouts are meaningless"? They are not meaningless. They prevent V from switching forks too soon, and votes showing that V did so can be used to slash V. To be clear, the evidence would not all be on-chain. Some of it would have to come from validators who observed V's votes on the dead fork, and can prove what happened, by providing that evidence in the form of vote transactions submitted by the offending validator. Some of that evidence will not be a direct lineage of the surviving fork because some of those transactions were for slots that don't exist on the surviving fork. But the evidence can still be gathered and presented by validators who saw it happen.
However, the lockout doublings that were done by V on the dead fork because of its vote on 3, are meaningless with respect to the switched-to fork, since the 3 vote didn't happen from the perspective of the block chain.
When you switch you provide a switching proof for validators that have observed both forks to verify and validate. In this case the lockouts on the blockchain are incorrect, as they were never doubled so all validators would then have to maintain a separate vote_state to track your true lockouts.
They already do maintain a separate vote_state to track your true lockouts. It's your vote account.
Also the lockouts on the block chain are correct, by definition. It's the local validator who is incorrect because it retained effects from votes on dead forks when switching to other forks.
so how do you suppose they're going to keep track of your true lockouts without a full voting history? You're suggesting pushing the problem of keeping track of accurate lockouts into another structure.
that's incorrect, the lockouts on the block chain are to encapsulate the voting behavior of the validator as closely as possible.
But they already do.
pub struct VoteState {
    // ...
    pub votes: VecDeque<Lockout>,
    // ...
}
VoteState is maintained by each validator, for every validator (including itself).
The local validator additionally has a degenerate version of VoteState stored in its Tower data structure to track its lockouts, including votes that it has cast but hasn't seen land on chain yet.
No they don't, VoteState is stored on chain and computed from the bank during replay.
You're just talking about the mechanism whereby validators track vote state of other validators. That doesn't change the fact that validators track/maintain the vote state of each validator.
Maybe we're just using different terminology and that is confusing things? I don't know.
In every Bank, the VoteState for every validator is stored. It includes the Lockouts for that validator. These Lockouts are computed as votes are replayed.
Therefore, every validator, by virtue of having Banks with the exact same contents of every other validator, has the same Lockouts state at every slot for every validator. And that is the canonical, exact, and correct lockouts for that validator at that slot.
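For concreteness, here is a simplified sketch of how replay derives those Lockouts from a stream of votes (my own illustration, not the actual vote program code): pop entries whose lockout has expired, push the new vote, and bump the confirmation count of any deeper entry, which is what doubles its lockout. It reproduces the "bank 6" table below.

```rust
struct Lockout {
    slot: u64,
    confirmation_count: u32,
}

impl Lockout {
    fn lockout(&self) -> u64 {
        2u64.pow(self.confirmation_count) // base lockout of 2, doubled per confirmation
    }
    fn last_locked_out_slot(&self) -> u64 {
        self.slot + self.lockout()
    }
}

/// Simplified replay of a single vote into a lockout stack (oldest vote first).
fn apply_vote(votes: &mut Vec<Lockout>, vote_slot: u64) {
    // 1. Pop any lockouts that have expired by the time of this vote.
    while votes
        .last()
        .map_or(false, |v| v.last_locked_out_slot() < vote_slot)
    {
        votes.pop();
    }
    // 2. Push the new vote.
    votes.push(Lockout { slot: vote_slot, confirmation_count: 1 });
    // 3. "Double" the lockouts of deeper entries whose confirmation count
    //    now lags their depth in the stack.
    let len = votes.len();
    for (i, v) in votes.iter_mut().enumerate() {
        let depth = (len - i) as u32;
        if v.confirmation_count < depth {
            v.confirmation_count += 1;
        }
    }
}

fn main() {
    let mut votes = Vec::new();
    for slot in 0..=5 {
        apply_vote(&mut votes, slot);
    }
    // Prints 0 | 64, 1 | 32, 2 | 16, 3 | 8, 4 | 4, 5 | 2, matching the bank-6 table.
    for v in &votes {
        println!("{} | {}", v.slot, v.lockout());
    }
}
```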
        |- 3 - 4 - 5 - 6
        |
0 - 1 - 2
|       |- 12 - 13
|- 15
Scenario: validator V votes on 0, 1, 2, 3, 4, 5, then decides to switch to 12, and finally to 15. Validator V should be slashed because it violated lockouts when switching from 11 to 15 (1's lockout has not expired).
With your proposed solution: The local tower for V after voting on 2 is
Slot | Lockout
--------------
0 | 8
1 | 4
2 | 2
Now assume that the votes for 0 - 5 all landed in block 6.
The on chain vote state for V in bank 6
Slot | Lockout
--------------
0 | 64
1 | 32
2 | 16
3 | 8
4 | 4
5 | 2
The vote state for V in bank 13, where the vote for 12 has landed:
Slot | Lockout
--------------
12 | 2
So with this information how can I validate that slashing has occurred? The switching proof for 6 -> 12 is correct. Using the proposed vote states, the switching proof for 12 -> 15 is also correct.
You might say, let us keep a chain of pointers to forks that V has switched to and validate lockouts from all banks at the tip of these forks. Then the vote state on bank 6 would show us slashing. Suspending disbelief for a moment that we don't prune dead forks and have all this information readily available in memory, there are still holes here to fill.
We could easily change this example such that the votes for 0-5 landed not in 6 but in some descendant; now should we use the vote_state at the time of the switch, or the vote_state of the current tip of the fork when evaluating? Use the max? Sure, but what if some alternative fork was popped off - how do we merge these vote_states?
I'm saying you simply don't have enough information to enforce lockout violations if you don't accurately track lockouts in the on-chain vote_state. You can propose other mechanisms, but the crux of it is that you must have enough information to know V's tower pre- and post-switch in order to slash. If you have this information, that means you are keeping an accurate tower for all other validators, and that is not possible without VoteStateUpdate, or at least Vote.
I'm sorry but I don't understand:
"Validator V should be slashed because it violated lockouts when switching from 11 to 15 (1's lockout has not expired)."
It didn't switch from 11 to 15, it switched from 12 to 15, it never voted on 11. 11 was never in its lockouts at all.
I don't see a lockout violation in your example.
OK wait a second, I think I got it. The lockout violation is because of the votes on 0, 1, 2, 3, 4, 5 -- the validator is locked on a fork that includes slot 2 until at least slot 18, and includes slot 1 until at least slot 33. So voting on 15 violates both of those.
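A quick arithmetic check of those two numbers (my own sketch, using the bank-6 vote state from the tables above and the usual doubling rule):

```rust
// A vote with confirmation_count c has lockout 2^c and keeps the validator
// locked onto that slot's fork until slot + 2^c.
fn last_locked_out_slot(slot: u64, confirmation_count: u32) -> u64 {
    slot + 2u64.pow(confirmation_count)
}

fn main() {
    // From the bank-6 vote state: slot 2 has lockout 16, slot 1 has lockout 32.
    assert_eq!(last_locked_out_slot(2, 4), 18); // locked out of non-descendants of 2 until slot 18
    assert_eq!(last_locked_out_slot(1, 5), 33); // locked out of non-descendants of 1 until slot 33
    // Slot 15 is earlier than both, so a vote on 15 (which excludes 1 and 2) violates lockout.
    println!("ok");
}
```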
sorry s/11/12 yeah voting on 15 violates slot 2 and 1 from the initial fork
So the data needed to detect this is there. The VoteState for 6 for V will show that it is locked out on 1 and 2 at the time that it casts the vote for 15.
Yeah but how do you get that vote state? With VoteStateUpdate it's all self contained.
Yeah but how do you verify that the self contained data is correct and not missing stuff?
If I have a validator that never even saw V's votes on 3 - 5 (because it never saw blocks 3 - 5), how do I know that it's lying if I next see a VoteStateUpdate from it that doesn't include the effects on lockouts of those votes?
If I never saw its votes on 3 - 5 then I can't verify a switching proof from 5 -> 12. If instead it submits a switching proof from 3 -> 12, then when the votes from 3 - 5 land, those votes violate lockout (there is no valid switching proof from 12 -> 3) and it will be slashed.
But the votes on 3 - 5 never land, it's a dead fork. I may never see them.
What's this switching proof from 5 -> 12? Why would V include that if it's trying to pretend that it never voted on 3 - 5?
It is going to pretend that it never voted on 3 - 5. So it's going to submit its vote for 5, then later submit its vote for 12, then later still submit whatever transactions it wants to that look like it never voted on 3 - 5 or 12.
If I vote on 3 - 5, at some point those votes will make it to a leader through the tpu or gossip. If you're saying that the votes never reach anyone, then it's the same as if it never voted.
Sure, even if it switches 2 -> 12 after having voted for 5, at some point the vote for 5 will make it to someone and be slashable.
So if I receive bogus VoteStateUpdates and believe that V's voting sequence actually went 0 - 1 - 2 - 12 - 15, then later when I see the original VoteStateUpdate transactions for slots 3 through 5, I will not just reject the slots 3 through 5 VoteStateUpdate because it's clearly an old tx for old slots, but will recognize them as one part of a slashable event?
Yeah because the vote for 5 will have the lockout for 1 at 32 which isn't possible with your current vote state update of 15
So it doesn't matter if it happened before or after; it's still a slashable vote, not an old vote. If it's an old vote then the lockouts should be consistent.
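If I'm reading that consistency condition right, it could be sketched roughly like this (my own illustration, not the actual vote program or any planned slashing check): every lockout in the older update that is still active at the newer update's vote slot must still be present in the newer update, with at least the same confirmation count.

```rust
#[derive(Clone, Copy)]
struct Lockout {
    slot: u64,
    confirmation_count: u32,
}

impl Lockout {
    fn last_locked_out_slot(&self) -> u64 {
        self.slot + 2u64.pow(self.confirmation_count)
    }
}

/// Rough consistency check between two towers claimed by the same validator.
fn lockouts_consistent(older: &[Lockout], newer: &[Lockout], newer_vote_slot: u64) -> bool {
    older
        .iter()
        .filter(|l| l.last_locked_out_slot() >= newer_vote_slot) // still active lockouts
        .all(|l| {
            newer
                .iter()
                .any(|n| n.slot == l.slot && n.confirmation_count >= l.confirmation_count)
        })
}

fn main() {
    // Older update: tower after voting on 5 on the dead fork (slot 1 locked until 33).
    let older = [
        Lockout { slot: 0, confirmation_count: 6 },
        Lockout { slot: 1, confirmation_count: 5 },
        Lockout { slot: 2, confirmation_count: 4 },
        Lockout { slot: 3, confirmation_count: 3 },
        Lockout { slot: 4, confirmation_count: 2 },
        Lockout { slot: 5, confirmation_count: 1 },
    ];
    // Newer update: tower claiming a history that skips 1 and 2, i.e. just slot 15.
    let newer = [Lockout { slot: 15, confirmation_count: 1 }];
    // Slot 1 is locked until 33 >= 15 but is missing from the newer tower:
    assert!(!lockouts_consistent(&older, &newer, 15)); // inconsistent => slashable evidence
}
```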
OK so I think I'm finally getting it and I really appreciate your patience. I guess that slash detection using my methodology requires keeping lots of fork history around (I had assumed it was already always kept around anyway, but I guess not). But VoteStateUpdate allows any two vote events to be detected as slashable regardless of the order they arrive and without requiring that the validator keep the fork data around.
This is basically what you were alluding to in your long comment, if I'm not mistaken.
Pretty much. I think you'd either have to keep all forks (and I'm not convinced that is actually enough) or do something similar to what mvines is doing and keep track of local towers by parsing vote transactions.
If you're doing the second method then you'd still have to deal with out-of-order votes, which I'm not sure your proposal handles.
As for processing concerns, I think you might be overestimating how long processing VoteStateUpdate takes in relation to Vote. VoteStateUpdate also allows us to do some optimizations in gossip and banking_stage during times of high spam by dropping extraneous votes from validators (we only care about the latest since votes are no longer incremental), which would also be a performance win.
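As a rough illustration of that dedup idea (my own sketch with made-up types, not the actual gossip or banking_stage code): since each VoteStateUpdate carries the full proposed tower, only the newest one per vote account needs to be kept; here recency is approximated by the last voted slot.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types.
type Pubkey = [u8; 32];

struct VoteStateUpdate {
    last_voted_slot: u64,
    // ... full proposed tower contents elided ...
}

/// Keep only the newest VoteStateUpdate per vote account; older ones are
/// redundant because each update already carries the whole proposed tower.
fn dedup_latest(
    updates: impl IntoIterator<Item = (Pubkey, VoteStateUpdate)>,
) -> HashMap<Pubkey, VoteStateUpdate> {
    let mut latest: HashMap<Pubkey, VoteStateUpdate> = HashMap::new();
    for (voter, update) in updates {
        let is_newer = latest
            .get(&voter)
            .map_or(true, |current| update.last_voted_slot > current.last_voted_slot);
        if is_newer {
            latest.insert(voter, update);
        }
    }
    latest
}

fn main() {
    let voter_a: Pubkey = [1; 32];
    let updates = vec![
        (voter_a, VoteStateUpdate { last_voted_slot: 10 }),
        (voter_a, VoteStateUpdate { last_voted_slot: 12 }),
        (voter_a, VoteStateUpdate { last_voted_slot: 11 }), // stale, arrived late
    ];
    let latest = dedup_latest(updates);
    assert_eq!(latest[&voter_a].last_voted_slot, 12);
}
```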
One point: VoteStateUpdate processing requires access to the VoteState of the vote account at the slot that the VoteStateUpdate tx is landing. check_update_vote_state_slots_are_valid() takes that vote state as an argument.
Therefore, your assertion that a VoteStateUpdate received later can detect a slashable event for a VoteStateUpdate received earlier, must also require that the fork history of the fork on which the VoteStateUpdate is landing, be available.
Slashing and switching proofs are yet to be implemented; I think there would be a separate pipeline to verify that, and you don't need the dead fork to verify that type of lockout violation.
"If you're doing the second method then you'd still have to deal with out of order votes which i'm not sure your proposal handles."
My proposal was only about keeping the local lockouts in sync with the cluster for every fork, not about detecting lockout violations. I thought that the original problem, the one alluded to as causing a crash last September, had nothing to do with slash detection, but with validators having incorrectly doubled lockouts in their tower, which then prevented them from switching forks when they should have been able to, leading to a mass failure to switch forks when enough validators got into this state.
If the goal is detecting lockout violations, then I have to think harder about how to do that without VoteStateUpdate, if possible.
Slashing and switching proofs are yet to be implemented; I think there would be a separate pipeline to verify that, and you don't need the dead fork to verify that type of lockout violation.
Yeah but you need that just to verify the validity of the VoteStateUpdate tx itself before you can even use it for anything. check_update_vote_state_slots_are_valid() requires it.
I might be incompetent in a lot of things but I think I can code a pathway using Option 🙃
Yeah I was saying that with out of order votes your solution doesn't keep local lockouts in sync with the cluster
Then when it switches forks, it finds the most recent slot voted on that is a common ancestor between the forks, and resets its vote_state's lockouts to the lockouts that existed immediately after the vote on that slot. Now it is in sync with the rest of the cluster, and applying the new slot to the tower for the new vote on the new fork does exactly what the rest of the cluster thinks it does.
If you had out of order votes this statement isn't true.
I think the confusion comes down to this:
In my view, lockout state is only dependent on the state of the fork in which the lockouts are evaluated. Anything that happened in another fork is not relevant to lockouts on this fork. Detecting slashing requires analyzing data from multiple forks, possibly continuously as votes arrive.
In your view, lockout state comes from the sum total of all forks. This single state of all lockouts is advanced by VoteStateUpdate, which must include lockout doublings from all forks that the validator ever voted on; not always including all lockouts from all forks in VoteStateUpdate risks slashing, because any VoteStateUpdate inconsistency needs only the two VoteStateUpdate transactions to demonstrate the lockout violation.
So when you say my proposal doesn't have correct lockouts when votes are received out of order, it's with respect to your definition of lockouts. With my definition of lockouts, vote order doesn't matter; resetting the lockouts state to the common ancestor's when switching forks will keep my validator's lockouts in sync with the on-chain VoteState (by definition). But again, it will require exterior mechanisms for detecting slashing.
Now the crux of the issue is, which is more efficient:
- VoteStateUpdate verifying lots of state on every vote (which includes the work of check_update_vote_state_slots_are_valid(), a necessary prerequisite to ensure that invalid VoteStateUpdate transactions are rejected)
- But VoteStateUpdate also makes slash detection super trivial

or

- Vote being a super simple and cheap transaction, but
- Detecting slashable votes requiring keeping history of all slots that have not achieved finality, and also handling out-of-order votes with regards to detecting lockout violations

I really feel like detecting slashing should be done off-validator. That really expensive processing doesn't need to be done by validators, who can readily just accept all votes and assume they are valid (like they are doing now). The threat of there being some watchdog group running slash detection, along with an actual "slash" transaction that accepts slash proofs and slashes if they are verified, would be enough to implement slashing, and it would take a lot of work (and complexity) out of validators.
        |- 3
0 - 1 - 2
        |- 5 - 6
You vote on 0, 1, 2; the tower you save is:
Slot | Lockout
0 | 8
1 | 4
2 | 2
You vote on 3 and then later you switch to 5, so you restore the above tower and then apply your vote on top. Your local tower now looks like this:
Slot | Lockout
0 | 8
1 | 4
5 | 2
It turns out that when your votes landed in the 5 fork, the vote for 2 came before the vote for 1, so your on-chain state actually looks like this, and now you have a discrepancy:
Slot | Lockout
5 | 2
Yeah, I'm open to the idea of slashing being done off validator. VoteStateUpdate ensures that your local tower is in sync with the on-chain vote state and minimizes the risk of you accidentally submitting a slashable vote due to consensus issues.
Slashing aside, with out-of-order votes it's possible to have a huge divergence between your local tower and the on-chain vote state, which means you can't vote until you expire enough of your local tower (what happened last September).
You can't switch from 3 to 5. When you vote for 3, you double 2's lockouts, so you now have:
Slot | Lockout
0 | 16
1 | 8
2 | 4
3 | 2
That means that you can't vote on a non-descendant of 3 until after slot 5.
So you vote on 6 instead. You reset to:
Slot | Lockout
0 | 8
1 | 4
2 | 2
Then apply 6:
Slot | Lockout
0 | 8
6 | 2
Sure, my b; make it slot 6 and make the initial chain pre-fork longer, still the same problem.
Now the crux of the issue is, which is more efficient:
- VoteStateUpdate verifying lots of state on every vote (which includes the work of check_update_vote_state_slots_are_valid(), a necessary prerequisite to ensure that invalid VoteStateUpdate transactions are rejected)
- But VoteStateUpdate also makes slash detection super trivial
or
- Vote being a super simple and cheap transaction, but
- Detecting slashable votes requiring keeping history of all slots that have not achieved finality, and also handling out-of-order votes with regards to detecting lockout violations
I really feel like detecting slashing should be done off-validator. That really expensive processing doesn't need to be done by validators, who can readily just accept all votes and assume they are valid (like they are doing now). The threat of there being some watchdog group running slash detection, along with an actual "slash" transaction that accepts slash proofs and slashes if they are verified, would be enough to implement slashing, and it would take a lot of work (and complexity) out of validators.
I think priority #1 is to make a vote program that can handle out-of-order votes. Then detecting slashing / lockouts etc. If you have a suggestion for something like that, I'm all ears.
Problem
VoteStateUpdate is a costly and complex solution to issue 20014 when simpler and more performant solutions are available.
Proposed Solution
The issue is that when a validator switches forks, its local vote_state's lockouts are not correct with respect to the switched-to fork. The example given is votes on 1 - 2 - 3 (dead fork, switched away from) - 6 (new fork, switched to). The rest of the cluster pops 1 - 2 since their lockouts were never doubled, but the local validator doesn't pop 1 - 2 because it thought it voted on 3 which doubled 1 - 2's lockouts.
The VoteStateUpdate solution is to get the rest of the cluster to sync to the local validator's lockouts by basically demonstrating that "see, I voted on 3, so my lockouts doubled, even though they didn't really because my vote on 3 is not part of the real block history, it's on a dead fork". The implementation of this is very complicated and performance-costly. AshwinSekar is trying to reduce the costs but it's likely still way too costly. I think that testnet may be experiencing performance problems just due to the increased load of handling VoteStateUpdate instead of Vote.
Instead I think the solution should be to have the local validator fix its lockouts when switching forks. It is possible for the validator to update its own local lockouts so that they reflect what the rest of the cluster has computed, instead of forcing the rest of the cluster to use the local validator's lockouts.
This is accomplished most readily by having the tower save a copy of its lockouts right before each slot it votes on. Then when it switches forks, it finds the most recent slot voted on that is a common ancestor between the forks, and resets its vote_state's lockouts to the lockouts that existed immediately after the vote on that slot. Now it is in sync with the rest of the cluster, and applying the new slot to the tower for the new vote on the new fork does exactly what the rest of the cluster thinks it does.
A validator still must observe proper lockouts when switching from the switched-from fork to the switched-to fork. It does this by retaining its lockouts while "still on that fork, waiting for lockouts to expire", and only updating the local vote_state with the saved lockouts as it is casting its first vote on a slot on the new fork. The validator has already ensured that it is safe to vote on the new fork before my proposed change has any effect on its vote_state's lockouts, so it can't be any less safe in terms of avoiding lockout violations, than the existing implementation.
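To make the proposal concrete, here is a minimal end-to-end sketch (my own illustration with hypothetical names, not actual validator code) that replays the 1 - 2 - 3 (dead fork) - 6 example above: snapshot the lockouts just before each vote, and on a fork switch restore the snapshot taken before the first vote on the abandoned fork (i.e. the lockouts immediately after the vote on the common ancestor) before applying the new vote.

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Lockout {
    slot: u64,
    confirmation_count: u32,
}

impl Lockout {
    fn lockout(&self) -> u64 {
        2u64.pow(self.confirmation_count)
    }
    fn last_locked_out_slot(&self) -> u64 {
        self.slot + self.lockout()
    }
}

#[derive(Default)]
struct LocalTower {
    lockouts: Vec<Lockout>,            // oldest vote first
    saved: HashMap<u64, Vec<Lockout>>, // vote slot -> lockouts just before that vote
}

impl LocalTower {
    /// Simplified vote application: snapshot the current lockouts, pop expired
    /// entries, push the new vote, and bump the confirmation count of deeper
    /// entries (the lockout doubling).
    fn record_vote(&mut self, vote_slot: u64) {
        self.saved.insert(vote_slot, self.lockouts.clone());
        while self
            .lockouts
            .last()
            .map_or(false, |l| l.last_locked_out_slot() < vote_slot)
        {
            self.lockouts.pop();
        }
        self.lockouts.push(Lockout { slot: vote_slot, confirmation_count: 1 });
        let len = self.lockouts.len();
        for (i, l) in self.lockouts.iter_mut().enumerate() {
            if (l.confirmation_count as usize) < len - i {
                l.confirmation_count += 1;
            }
        }
    }

    /// The proposed fix: when switching forks, first restore the lockouts saved
    /// just before the first vote on the abandoned fork, then vote on the new fork.
    fn switch_fork_and_vote(&mut self, first_abandoned_slot: u64, vote_slot: u64) {
        if let Some(snapshot) = self.saved.get(&first_abandoned_slot) {
            self.lockouts = snapshot.clone();
        }
        self.record_vote(vote_slot);
    }

    fn dump(&self) -> Vec<(u64, u64)> {
        self.lockouts.iter().map(|l| (l.slot, l.lockout())).collect()
    }
}

fn main() {
    let mut tower = LocalTower::default();
    for slot in [1u64, 2, 3] {
        tower.record_vote(slot);
    }
    // The vote on 3 (dead fork) doubled the lockouts on 1 and 2 locally.
    assert_eq!(tower.dump(), vec![(1, 8), (2, 4), (3, 2)]);

    // The cluster never saw the vote on 3, so from its point of view a vote on 6
    // pops 1 and 2. Restoring the pre-3 snapshot before voting on 6 matches that.
    tower.switch_fork_and_vote(3, 6);
    assert_eq!(tower.dump(), vec![(6, 2)]);
}
```

Under this sketch the local tower ends at 6 with lockout 2, which is what the rest of the cluster computes for this validator from the 1 - 2 - 6 history; the out-of-order-landing caveat discussed above is not addressed here.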