Feature Proposal: Revert VoteStateUpdate and use a different method for fixing issue 20014

bji commented 2 years ago

Problem

VoteStateUpdate is a costly and complex solution to issue 20014 when simpler and more performant solutions are available.

Proposed Solution

The issue is that when a validator switches forks, its local vote_state's lockouts are not correct with respect to the switched-to fork. The example given is votes on 1 - 2 - 3 (dead fork, switched away from) - 6 (new fork, switched to). The rest of the cluster pops 1 - 2 since their lockouts were never doubled, but the local validator doesn't pop 1 - 2 because it thought it voted on 3 which doubled 1 - 2's lockouts.
The VoteStateUpdate solution is to get the rest of the cluster to sync to the local validator's lockouts by basically demonstrating that "see, I voted on 3, so my lockouts doubled, even though they didn't really because my vote on 3 is not part of the real block history, it's on a dead fork". The implementation of this is very complicated and performance-costly. AshwinSekar is trying to reduce the costs but it's likely still way too costly. I think that testnet may be experiencing performance problems just due to the increased load of handling VoteStateUpdate instead of Vote.
Instead I think the solution should be to have the local validator fix its lockouts when switching forks. It is possible for the validator to update its own local lockouts so that they reflect what the rest of the cluster has computed, instead of forcing the rest of the cluster to use the local validators lockouts.
This is accomplished most readily by having the tower save a copy of its lockouts right before each slot it votes on. Then when it switches forks, it finds the most recent slot voted on that is a common ancestor between the forks, and resets its vote_state's lockouts to the lockouts that existed immediately after the vote on that slot. Now it is in sync with the rest of the cluster, and applying the new slot to the tower for the new vote on the new fork does exactly what the rest of the cluster thinks it does.

A validator still must observe proper lockouts when switching from the switched-from fork to the switched-to fork. It does this by retaining its lockouts while "still on that fork, waiting for lockouts to expire", and only updating the local vote_state with the saved lockouts as it is casting its first vote on a slot on the new fork. The validator has already ensured that it is safe to vote on the new fork before my proposed change has any effect on its vote_state's lockouts, so it can't be any less safe in terms of avoiding lockout violations, than the existing implementation.

bji commented 2 years ago

Sure my b make it slot 6 and make the initial chain pre fork longer, still same problem.

You can try, but you won't find a configuration that doesn't also violate lockouts. This is by definition. If the local validator waits until any slot it would switch to doesn't violate lockouts of where it is switching from, then it isn't violating lockouts. And it can't be violating lockouts on the fork it's switching to because the slot it's voting on is newer than any vote that it already cast that already didn't violate lockouts relative to that original fork.

AshwinSekar commented 2 years ago

Even if you vote on 6 your on chain and local state aren't in sync

bji commented 2 years ago

Even if you vote on 6 your on chain and local state aren't in sync

Why not? The chain had me at:

Slot | Lockout
0    | 8
1    | 4
2    | 2

after the vote on 2. After my vote on 6, it has me at:

Slot | Lockout
0    | 8
6    | 2

Which is what I also have.

AshwinSekar commented 2 years ago

On chain the vote for 2 landed before vote for 1

Slot | Lockout
0    | 4
2    | 2

Applying 1 failed, and then applying 6 pops off 2 and leaves 0 with a lockout of 4 instead.

bji commented 2 years ago

Then the vote for 1 can never land since it's old. And my validator, since it only resets to valid lockout states of VoteStates that are already on-chain, would never reset to a state that included 1 either.

bji commented 2 years ago

When I wrote this earlier:

"So you vote on 6 instead. You reset to:

Slot | Lockout 0 | 8 1 | 4 2 | 2 "

It was with the assumption that you are resetting to a valid on-chain state of lockouts. If 1 didn't land, it wouldn't be in those lockouts.

bji commented 2 years ago

It's not possible to have a VoteState locally that includes lockouts that didn't land on-chain. Since my proposal only syncs a validator's tower to VoteState that the chain is tracking for itself, it doesn't ever include votes that it saw in an order differently than the block chain recorded them.

AshwinSekar commented 2 years ago

You said you save your local lockouts before every vote so wouldn't you save 1? Or are you saying you're resetting to some remote lockouts? I think resetting to a remote lockout on switch has even more problems because then you're missing votes that haven't landed yet.

bji commented 2 years ago

You said you save your local lockouts before every vote so wouldn't you save 1? Or are you saying you're resetting to some remote lockouts? I think resetting to a remote lockout on switch has even more problems because then you're missing votes that haven't landed yet.

Forget the implementation I originally proposed, I realized later that it's equivalent to just syncing the local tower's lockouts to the on-chain lockouts at the slot of the common ancestor of the forks that are being switched between.

So if that's what you mean by "resetting to some remote lockouts", then yes that's what I mean.

The block chain records what actual votes my validator landed at every slot and what their effects are on lockouts of the VoteState at that slot. My proposal is that when I switch forks, my validator:

a) waits until the slot it's switching to doesn't violate lockouts of the fork it's switching from (which, since that's the fork it's currently on, matches the lockouts that the Tower is tracking for itself, which include all votes cast on that fork, even ones that may not ever land or did land on the fork) (by the way the current implementation already does, and must do, this, it's what the Tower is for)

b) Syncs Tower lockouts to the VoteState lockouts of my validator's vote account at the common ancestor of the fork it's switching from and the fork it's switching to right before the first vote cast on the new fork

(a) guarantees no lockout violations (b) guarantees that the validator's Tower lockouts stay in sync with the cluster

None of this helps with detecting slashing violations by other validators.

AshwinSekar commented 2 years ago

b) Syncs Tower lockouts to the VoteState lockouts of my validator's vote account at the common ancestor of the fork it's switching from and the fork it's switching to right before the first vote cast on the new fork

This misses any votes that didn't land yet in your common ancestor so your new vote state will be out of sync again. Regardless of how you wait you could always have votes in transit that don't land until you switch screwing up your sync.

Any way I think we're pretty far out of topic. If you have a proposal for an alternate to VoteStateUpdate that can handle out of order votes i'd be happy to look at a detailed implementation idea. or actual evidence that processing VoteStateUpdate is causing perf degradation.

bji commented 2 years ago

Any way I think we're pretty far out of topic. If you have a proposal for an alternate to VoteStateUpdate that can handle out of order votes i'd be happy to look at a detailed implementation idea. or actual evidence that processing VoteStateUpdate is causing perf degradation.

That's really confusing, because what I proposed handles out of order votes, and I just described why and how. And it's exactly the topic of this github issue. I think that by "handle out of order votes" you are again talking about detecting slashing? You must be, because what I propose isn't impacted by out of order votes at all.

Keep in mind that it's not just about performance, it's more about maintainability (VoteStateUpdate handling is large and complex) and especially the effect that has on subsequent changes.

solana-labs / solana

Feature Proposal: Revert VoteStateUpdate and use a different method for fixing issue 20014 #27473

Problem

Proposed Solution