paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

disputes: implement validator disabling #784

Open ordian opened 2 years ago

ordian commented 2 years ago

Once a dispute is concluded and an offence is submitted with DisableStrategy::Always, a validator will be added to the DisabledValidators list.

Implement on-chain and off-chain logic to ignore dispute votes for X sessions. Optionally, we can ignore backing and approval votes and remove the validator from the reserved validator set on the network level.

Possibly outdated: https://github.com/paritytech/polkadot-sdk/issues/785

⚠️ FOR THE MOST UP TO DATE INFO REFER TO: Disabling Project Board ⚠️

Possibly related paper here.

Goals for new validator disabling/Definition of Done

  1. Not affecting consensus - disabling can never become a security threat.
  2. Handling broken validators nicely (preventing continuous spam).
  3. Playing well with the existing disabling in Substrate.
  4. Making sure it never breaks.
  5. (Consequence of the above): we can enable slashing - safely and securely.

Timeline

As quickly as possible, definitely by the end of the year.

eskimor commented 2 years ago

Also disable validators who repeatedly vote against valid candidates. Disabling generally means that we should not accept any votes/statements from that validator for some time; those include:

In addition, depending on how quickly we disable a validator, it might already have raised thousands of disputes (if it disputes every single candidate for a few blocks). We should therefore also consider deleting already existing disputes (at the dispute-coordinator) in case one side of the dispute consists exclusively of disabled validators - so we apply disabling to already pending participations, not just new ones.

This might be tricky to get right (sounds like it could be racy). The reason we should at least think about this a bit is that so many disputes would delay finality for a significant amount of time, resulting in DoS.

Things to consider:

ordian commented 2 years ago

Also disable validators who repeatedly vote against valid candidates.

That's tracked in paritytech/polkadot-sdk#785 and is purely runtime changes.

How quickly do we disable?

We can disable as soon as a dispute (reaching threshold) concludes.

This might be tricky to get right (sounds like it could be racy)

Indeed. I'd be in favor of not complicating this unnecessarily.

eskimor commented 1 year ago

Just had a discussion with @ordian . So what is the point of disabling in the first place? It is mostly about avoiding service degradation due to some low number of misbehaving nodes (e.g. just one). There are other mechanisms in place which provide soundness guarantees even with such misbehaving nodes, but service quality might suffer for everybody (liveness).

On the flip side, with disabling, malicious actors could take advantage of bugs/subtle issues to get honest validators slashed and thus disabled. Therefore disablement, if done wrong, could actually lead to security/soundness issues.

With these two requirements together, we can conclude that we don't need perfect disablement; an effective rate limit for misbehaving nodes is enough to maintain service quality. Hence we should be able to limit the number of nodes disabled at any point in time to something like 10%, maybe 20% ... in any case to something less than 1/3 of the nodes. If this threshold is reached, we can (either by random choice or based on the amount of accumulated slashes, or both) enable some nodes again.

This way we do have the desired rate limiting characteristics, but at the same time make it unlikely that an attacker can get a significant advantage via targeted disabling.

Furthermore as this is about limiting the amount of service degradation a small number of nodes (willing to get slashed) can cause, it makes sense to only start disabling once a certain threshold in accumulated slashes is reached.

For the time being, we have no reason to believe that these requirements are any different for disabling in other parts of the system, like BABE. We should therefore double check that, and if it holds true, strive for a unified slashing/disabling system that is used everywhere throughout the stack in a consistent fashion.

eskimor commented 1 year ago
  1. Figure out a disabling strategy that limits the severity of honest nodes getting disabled.
  2. Keep the network functional in all cases: have enough validators enabled for grandpa to work.
  3. Expose an API to the node for retrieval of disabled validators.
  4. Don't accept statements/votes from disabled validators on the node and runtime side.
  5. Don't accept connections from disabled validators.
tdimitrov commented 1 year ago

I'll leave my thoughts on a strategy for validator disabling here so that we can discuss it and improve it further (unless it's total crap :hankey:).

When a validator gets slashed, it's disabled following these rules:

  1. The validator will be disabled for the rest of the session. In other words, the list of disabled validators will be cleared at each session start.
  2. No more than BYZANTINE_THRESHOLD validators are disabled at the same time. Otherwise we'll break the network.
  3. Each validator will have an offense score indicating how bad his offense was. I think it's safe to use the slash amount for this score. When we reach BYZANTINE_THRESHOLD disabled validators, we can re-enable a small offender so that we can disable a bigger one.
  4. If we reach a point where the total offense score is BYZANTINE_THRESHOLD * SLASH FOR SERIOUS OFFENSE we can force a new era, because we have got too many offenders in the active set.

Open questions:

eskimor commented 1 year ago

Reiterating Requirements:

  1. For re-enabling slashes for approval voters, we need disablement to be proportional to the slash.
  2. We would like to rate limit pretty quickly to avoid validators accumulating slashes too much in case of bugs/hardware faults.
  3. We need to make sure to never disable too many validators, as this would cause consensus issues. Target should be adjustable, but 10% seems like a reasonable number.

Requirement 2 conflicts with requirement 1, as a small slash would result in barely any rate limiting. On the flip side, if a node is misbehaving, it is definitely better to have it disabled and protect the network this way than to keep slashing the node over and over again for the same flaw.

Luckily there is a solution to these conflicting requirements: Having the disabling strictly proportional to the slash is only necessary once a significant number of nodes would get disabled, hence we can introduce another (lower) threshold on the number of slashed nodes: if we are below that threshold, we just disable all of them, regardless of the amount.

Meaning of Disabling

Disabled nodes will always be determined in the runtime, so we do have consensus. There should be an API for the node to retrieve the list of currently disabled nodes as per a given block. The effect will be that no data from a validator disabled in a block X should ever end up in block X+1. For simplicity and performance we will ignore things like relay parents of candidates; all that is relevant is the block being built. On the node side we do have forks, therefore we will ignore data from validators as long as a disabling block is in our view.

Runtime

Node

For all nodes being disabled in at least one head in our current view:

Affected subsystems:

If we wanted to go fully minimal on node-side changes, it should be enough to honor the disabled state in the dispute coordinator. Degradation in backing performance should be harmless, the approval subsystems are also robust against malicious actors, and filtering in the provisioner is strictly speaking redundant, as the filtering will also be performed in the runtime.

Disabling Strategy

We will keep a list of validators that have been slashed, sorted by slash amount. To determine which validators are going to be disabled for the current block, we do the following:

  1. We check whether the list of currently slashed validators is less than the lower threshold (see above); if so, all slashed validators go on the disabled list and we skip the remaining points.
  2. For each slashed validator, add it to the list of disabled validators randomly with a probability equal to their slash amount: 100% slash - always on the list, 10% slash - on the list 10% of the time, and so on.
  3. We check whether the list of disabled validators is less than 10% of all validators; if not, we randomly remove nodes from the disabled list until we reach the threshold.

I would suggest ignoring the slash amount in step 3 for simplicity, because:

  1. The higher the slash the higher the probability to be on the list to begin with, so we are already weighing based on slash.
  2. The protocols should be robust against a few rogue validators with nothing to lose.
  3. Having so many nodes disabled is an edge case that should never happen, and if it did it would very likely be due to a bug. Therefore, while 100% slashed nodes have nothing to lose, it is actually quite likely that less-slashed validators don't behave any better regardless.

Rule 1 protects the network from a single (or a low number of) rogue validators and also protects those validators from themselves: instead of getting slashed over and over again, they will end up being disabled for the whole session, giving operators time to react and fix their nodes. (See point 2 in the requirements.)

This means we will have two thresholds: one below which we always disable 100%, and one above which we start to randomly re-enable validators again (see the sketch below).
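
To make the interplay of the two thresholds concrete, here is a minimal sketch in Rust. All names and threshold values are illustrative assumptions, and it uses a local RNG purely for illustration (biasability of on-chain randomness, raised further below, is a separate concern):

```rust
use rand::Rng;

const LOWER_THRESHOLD: usize = 5; // below this, all slashed validators are disabled
const UPPER_RATIO: f64 = 0.10;    // never keep more than 10% of validators disabled

/// `slashed` holds (validator index, slash fraction in 0.0..=1.0),
/// sorted by slash amount.
fn select_disabled(slashed: &[(usize, f64)], n_validators: usize) -> Vec<usize> {
    let mut rng = rand::thread_rng();

    // Rule 1: few offenders -> disable all of them and skip the remaining rules.
    if slashed.len() < LOWER_THRESHOLD {
        return slashed.iter().map(|&(idx, _)| idx).collect();
    }

    // Rule 2: disable each offender with probability equal to its slash amount.
    let mut disabled: Vec<usize> = slashed
        .iter()
        .filter(|&&(_, slash)| rng.gen::<f64>() < slash)
        .map(|&(idx, _)| idx)
        .collect();

    // Rule 3: if we are above the upper threshold, randomly re-enable
    // validators until we are back below it (ignoring slash amounts,
    // as argued above).
    let max_disabled = (n_validators as f64 * UPPER_RATIO) as usize;
    while disabled.len() > max_disabled {
        let victim = rng.gen_range(0..disabled.len());
        disabled.swap_remove(victim);
    }
    disabled
}
```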

Disabling, eras, sessions, epochs

Information about slashes should be preserved until a new validator set is elected. With a newly elected validator set, we can drop information about slashed validators and start anew with no validators disabled.

If we settle on this approach, then this would be obsoleted by the proposed threshold system.

tdimitrov commented 1 year ago

Two questions/comments:

For all nodes being disabled in at least one head in our current view:

Why head in current view instead of 'slashed in finalized block'? To be proactive in case of finality stall?

And the second is related to the disabling strategy:

  1. We check whether the list of disabled validators is less than 10% of all validators; if not, we randomly remove nodes from the disabled list until we reach the threshold.

I think we should do this in two steps:

  1. Randomly remove nodes which are not big offenders (100% slash).
  2. If all the nodes in the list are big offenders - start removing them randomly too.
eskimor commented 1 year ago

I think we should do this in two steps:

1. Randomly remove nodes which are not big offenders (100% slash).

2. If **all** the nodes in the list are big offenders - start removing them randomly too.

Yes, we could do that, but I argued above that we should be able to keep it simple without any harm done.

Why head in current view instead of 'slashed in finalized block'? To be proactive in case of finality stall?

Yes. Given that attacks on disputes can trigger a finality stall, it would be really bad if attackers could avoid getting disabled by their very attack. At the same time, honest but malfunctioning nodes might already accumulate a significant amount of slash before getting disabled.

Sophia-Gold commented 1 year ago

  4. If we reach a point where the total offense score is BYZANTINE_THRESHOLD * SLASH FOR SERIOUS OFFENSE we can force a new era, because we have got too many offenders in the active set.

What are the other repercussions of forcing a new era? This sounds like a good idea, but I'm guessing it could break a lot of unrelated things. We should consider tooling as well.

[ ] Should we keep track of the offense score of a validator? For example: our disabled list is almost full. We add validator A for a small offense. Then validator B does something more severe, so we remove A and add B. Then validator A does something bad again. What should his offense score be - old score + new offense score, or just the new offense score? The latter makes more sense to me but it will require extra runtime storage.

I think we can just use the slashes as @eskimor suggested. But, yes, if a validator is disabled then reactivated then slashed again we need to recalculate the disabled list.

I am still a little uncomfortable with the notion of disabling validators who haven't been 100% slashed in order to protect them from bugs when they can always ask to have the slashes reversed by governance. My bias is towards handling it economically and increasing the slashing amount if we think repeated misbehavior would bring too much load on the network before a bad actor loses all their stake. However, this probably isn't compatible with the solution we came up with for time overruns (since we have to balance the overrun charge with the collective amount slashed from potentially as much as a byzantine threshold of approval checkers). I'll probably just have to accept this.

tdimitrov commented 1 year ago

What are the other repercussions of forcing a new era? This sounds like a good idea, but I'm guessing it could break a lot of unrelated things. We should consider tooling as well.

We discussed it yesterday. It's not a good idea. Starting a new era takes time and it's not safe to force it if we have got too many misbehaving validators. We won't do this.

eskimor commented 1 year ago

About the rate limiting, considering that we have that upper limit on disabled nodes: I think having a rate-limiting disabling strategy for lesser slashes makes sense and adds little to no complexity. It only makes sense with accumulating slashes though, or alternatively if we considered the slashes to be accumulative at least from the disabling strategy's perspective. Consider nodes that are not behaving equally badly, some being more annoying than others: we would disable them more and more until they are eventually silenced and the network resumes normal operation, while other nodes, having only minor occasional hiccups or even just one, would continue operating normally.

This also has the nice property that the growth of the disabling ratio for an individual node will automatically slow down, as there are fewer opportunities for the node to commit any offences. So to get disabled 100%, you really have to be particularly annoying.

About accumulating slashes:

We would like to protect the network from a low number of nodes going rogue, but once disputes are raised by more than just a couple of nodes it is not an isolated issue, but either an attack or more likely a network wide issue.

In case of an attack, it would then be good to have accumulating slashes; in case of a network wide issue, accumulating slashes would still do no real harm, if we can easily refund them - can we?

For isolated issues, nodes are protected from excessive slashing via disabling.

burdges commented 1 year ago

A priori, we should avoid randomness here since on-chain randomness is biasable. It makes analyzing this annoying and appears non-essential. I've not thought much about it though, so if it's easy then explain.

We can disable the most slashed nodes of course, which also remains biasable, but not for quite so long in theory.

Ideally, we should redo the slashing for the whole system, aka removing slashing spans ala https://github.com/w3f/research/blob/master/docs/Polkadot/security/slashing/npos.md, but that's a larger undertaking. We'd likely plan for bugs elsewhere in subsystems too, which inherently links this to the subsystems.

burdges commented 1 year ago

I am still a little uncomfortable with the notion of disabling validators who haven't been 100% slashed in order to protect them from bugs when they can always ask to have the slashes reversed by governance.

We want slashes to be minimal while still accomplishing their protocol goals. It avoids bad press, community drama, etc.

We do not know exactly what governance considers bugs, like what if the validator violates some obscure node spec rule. It's maybe even political, like based upon who requests a refund, who their ISP is, etc. In fact, there exist stakers like Parity and W3F who'd feel reluctant to request refunds for some borderline bugs.

tdimitrov commented 1 year ago

We will keep a list of validators that have been slashed, sorted by slash amount.

We are disabling only slashed validators? We won't disable anyone disputing a valid block or voting for an invalid block (unless they are a backer)?

eskimor commented 1 year ago

Yes, we only ever disable slashed validators. We do disable on disputing a valid block though, and we will also slash and disable for approving an invalid block, see paritytech/polkadot-sdk#635 ... but a suitable disabling strategy as discussed here is a prerequisite for the latter.

tdimitrov commented 1 year ago

And one more question regarding:

  1. For each slashed validator, add it to the list of disabled validators randomly with a probability equal to their slash amount: 100% slash - always on the list, 10% slash - on the list 10% of the time, and so on.

If there is space for all 100% slash and all 10% slash (in this case) - should we (a) add all 10% slashed validators to the set or (b) still add them with 10% probability (and potentially skip some validators)?

I think you meant (a) otherwise there is contradiction with:

  1. We check whether the list of currently slashed validators is less than the lower threshold (see above); if so, all slashed validators go on the disabled list and we skip the remaining points.
eskimor commented 1 year ago

No, it is (b) - point 1 was under the prerequisite that we are below the lower threshold. For point 2 and onwards this is not the case. The idea being: if there are only a few rogue validators having problems - just disable them and don't bother. It is not a security threat and keeping them silent is better for everybody.

tdimitrov commented 1 year ago

Yes, my bad. There is no contradiction. If we are at point 2, we are already above the limit.

Sophia-Gold commented 1 year ago

I like thinking of this as rate limiting instead of disabling. Something at least like (1/2)^percentage_slash so that a validator slashed 1% is only active every other block, 2% every 4th block. Probably steeper than this.

And then if we reach a concerning threshold of active validators, even just on average, we can slow the rate limiting. A special case is when it's so bad we need to reactivate validators that have been slashed 100%: they still shouldn't be allowed to back candidates and maybe not produce relay chain blocks either. We could generally have the slower rate limiting apply only to finality and not backing and block production.

The upside of this is it doesn't require randomness. However, the problem is we'd need to think about whether nodes are synced up in how they're rate limited. For example, if you have 10% of the network 50% rate limited that would be fine if the rate limiting is staggered, which is less likely in practice if we don't intentionally design it that way.
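
For illustration, a minimal sketch of such a deterministic rate limit, with a naive stagger by validator index to address the syncing concern; the function name and the offset scheme are assumptions, not a concrete proposal:

```rust
/// A validator slashed by `percent` is active only every 2^percent-th block:
/// 1% -> every 2nd block, 2% -> every 4th, matching (1/2)^percentage_slash.
fn is_active(block_number: u64, validator_index: u64, percent: u32) -> bool {
    let period = 2u64.saturating_pow(percent); // percent == 0 -> always active
    // Offset by the validator index so rate-limited validators are staggered
    // instead of all being inactive in the same slots.
    (block_number + validator_index) % period == 0
}
```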

tdimitrov commented 1 year ago

A special case is when it's so bad we need to reactivate validators that have been slashed 100%: they still shouldn't be allowed to back candidates and maybe not produce relay chain blocks either.

I think we can't do this. If we disable more than f validators - we'll break the security assumptions of the protocols. Not allowing them to back candidates is more or less equal to disabling them.

The upside of this is it doesn't require randomness.

Can you elaborate on this? How will we get by without randomness?

Sophia-Gold commented 1 year ago

I think we can't do this. If we disable more than f validators - we'll break the security assumptions of the protocols. Not allowing them to back candidates is more or less equal to disabling them.

This would just be choosing safety over liveness, no?

Can you elaborate on this? How will we get by without randomness?

We can rate limit deterministically, like in my example. Regardless of whether we do it deterministically or try to do it randomly, we do still probably need to assume all rate limited validators are sometimes inactive in the same slot -- which is even likely, because they may have been slashed for the same reason -- unless we try to intentionally stagger them and do some complicated bookkeeping around it. So if we want no more than 10% inactive then we'd probably have to back down on rate limiting when 10% are rate limited at all. Maybe that's not a problem.

tdimitrov commented 1 year ago

This would just be choosing safety over liveness, no?

We can put it this way. My main concern was that we were trying to handle a case when there are more than f byzantine nodes, but this is not entirely correct. f is related to all validators, not just the ones in the active set, right?

My concern with disabling too many validators is killing the network in case of a bug which is not an attack. If we sacrifice liveness aren't we killing any chances of governance to recover the network?

Something at least like (1/2)^percentage_slash so that a validator slashed 1% is only active every other block, 2% every 4th block. Probably steeper than this.

Yes I understand your idea for the disabling now. Thanks!

eskimor commented 1 year ago

Sophia and I discussed the possibility of letting disabled validators still participate in finality but not back any candidates and maybe also let them not produce blocks (if too many validators get disabled). The problem is, both complicate things:

  1. We also cannot block block production endlessly: if too many are blocked, malicious nodes have taken over and can, for example, influence randomness more than they should.
  2. Backing sounds less harmful at first, but we are planning to move more and more relay chain functionality to parachains so this can become a security problem as well.

To keep it simple, I would suggest to stick to the boolean kind of disabling we have right now. You can either be disabled at a given point in time or not, there are no other kinds of disabling, like disabled but still allowed to vote on finality and such.

Sophia-Gold commented 1 year ago

  2. Backing sounds less harmful at first, but we are planning to move more and more relay chain functionality to parachains so this can become a security problem as well.

Ah, right. We didn't discuss relay chain block production and I just added it in my comment here. Eventually everything we care about continuing on the relay chain will be on a system parachain so backing would be a problem as well.

tdimitrov commented 1 year ago

@kianenigma can you share your feedback on this?

Some context - we want to adapt the validator disabling strategy and expand it to parachain consensus. More or less it's specified in this comment. Do you see any problems with this? Will it play nicely with the rest of the slashing/consensus/etc code in substrate?

More specifically - are you comfortable with the new disabling strategy and making it the default (and only) one in substrate's staking pallet?

kianenigma commented 1 year ago

Luckily there is a solution to these conflicting requirements: Having the disabling strictly proportional to the slash is only necessary once a significant number of nodes would get disabled, hence we can introduce another (lower) threshold on the number of slashed nodes: if we are below that threshold, we just disable all of them, regardless of the amount.

If I read this correctly, this is easily doable via something like:

pub enum DisableStrategy {
    /// Independently of slashing, this offence will not disable the offender.
    Never,
    /// Only disable the offender if it is also slashed.
    WhenSlashed,
    /// Independently of slashing, this offence will always disable the offender.
    Always,
    /// Disable the offender if more than given percent of the set has already been disabled.
    AfterRatio(Perbill),
}

The changes in slashing.rs would need to check the number of already disabled validators in this era, and pass true or false based on this to add_offending_validator.

I don't know off the top of my head how to get this ratio, and see if we are below or above it, but in theory it should be possible. It might need a query to the session pallet (to which we already have an interface via SessionInterface) to get the number of total validators in the current session, and the ones that have been disabled.

We use the same logic to determine if we should trigger a new era or not, so it should be fine.
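
For illustration, a hedged sketch of what such a check might look like, reading AfterRatio as "keep disabling only while we are below the ratio" (the intent discussed above); the function and the `disabled`/`total` inputs are assumptions standing in for whatever SessionInterface would report, not existing code:

```rust
use sp_runtime::Perbill;

fn should_disable(
    strategy: &DisableStrategy,
    slashed: bool,
    disabled: u32, // number of already disabled validators this era
    total: u32,    // total validators in the current session
) -> bool {
    match strategy {
        DisableStrategy::Never => false,
        DisableStrategy::WhenSlashed => slashed,
        DisableStrategy::Always => true,
        // Keep disabling only while the disabled set is still below the
        // configured ratio of the whole validator set.
        DisableStrategy::AfterRatio(ratio) => Perbill::from_rational(disabled, total) < *ratio,
    }
}
```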

A question I have is if disablement is an "all or nothing" type of situation, or not? A validator in my mental model, as it stands now, has two main roles:

  1. Block authoring
  2. Parachain consensus

The current interface and implementation do not distinguish between the two and, by disabling a validator, deprive them of both duties.

burdges commented 1 year ago

We'll need more nuanced disabling here because we cannot really limit the number disabled. We should discuss a hybrid scheme:

If a backer's vote comes back invalid then

If an approval vote comes back invalid then

If a dispute comes back incorrect then

We've this insanely fancy slashing system that limits the total slashing, even under bugs which governance never refunds. We might nerf it though, and it's very sensitive to parameter choices, so yes we might screw up by leaning only upon it to limit damage from repeated slashing.

I still owe @kianenigma some more serious reevaluation of this, but it's worth discussing what approvals looks like, assuming the slashing system can continue to have damage limits itself. If this avoids us removing people then we can have a simpler analysis of protocols like grandpa and approvals, and we can avoid the case where an adversary exploits a bug to disable many people in particular.

We've no similar analysis issues with disabling whole parachains, but of course if you disable a critical system parachain then that's another problem. We could've selected parachains immune to disabling, and then just be extra careful with their code upgrades, or even make their code live in the relay chain code and/or transition them to new wasm engines slower.

eskimor commented 1 year ago

I think disabling parachains should be an orthogonal topic. Getting this right might be even more nuanced than a good validator disabling strategy. @burdges your proposal sounds pretty complex and the advantages are not really clear to me. I would really like to keep this as simple as possible, while maintaining reasonable security/liveness. We can for example adjust the above proposal so that we never re-enable a 100% slashed validator if there are any others to re-enable to reach the threshold (10%). But then this really should be good enough.

eskimor commented 1 year ago

In any scenario where we end up with rate limiting instead of full disablement, we should have rather large strides. They should be larger than DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION, maybe twice the size. So if a validator is disabled 50% of the time, he would be disabled for 20 blocks in a row and then re-enabled for 20 blocks in a row, instead of flip-flopping each block. This way, by the time the validator is enabled again, its dispute votes in that time frame would likely already be obsolete and we minimize the harm done.
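
A small sketch of what such strides could look like; STRIDE and the fraction encoding are assumptions, with 20 blocks matching the example above:

```rust
/// Stride size in blocks; assumed to be roughly twice
/// DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION.
const STRIDE: u64 = 20;

/// Disabled `num` out of every `den` strides, e.g. (1, 2) = disabled 50% of
/// the time: blocks 0..20 disabled, 20..40 enabled, and so on.
fn is_disabled_at(block_number: u64, num: u64, den: u64) -> bool {
    ((block_number / STRIDE) % den) < num
}
```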

Overkillus commented 1 year ago

Going back to some of the original requirements:

  1. For https://github.com/paritytech/polkadot-sdk/issues/635, we need disablement to be proportional to the slash.

I don't believe proportional disabling is needed at all, as explained here.

Requirement 2 conflicts with requirement 1, as a small slash would result in barely any rate limiting.

If 1 doesn't require proportional disabling then there is no longer any conflict between them and we can simply disable fully even for minor slashes, in the spirit of...

if a node is misbehaving, it is definitely better to have it disabled and protect the network this way than to keep slashing the node over and over again for the same flaw.

Which is essentially aiming to protect honest nodes with minimal slashing by fully disabling it.

We recently had a chat about it but I'd point it out here on record. Disabling on minor slashes and accumulating slashes should both provide enough security as a deterrent against repeated offences, but disabling for minor offences is more lenient for honest faulty nodes and that's why we prefer it. Ideally we'd have both disabling AND accumulating, as attackers can still commit multiple minor offences (for instance invalid-on-valid disputes) in the same block before they get punished and disabled, but the damage done should be minimal so it's not a huge priority.

burdges commented 1 year ago

We've parallel discussion in https://github.com/paritytech/polkadot-sdk/issues/635#issuecomment-1705789752 btw

Overkillus commented 1 year ago

Potential disabling strategy direction partially based on a call with @eskimor:

Write-up is based on the idea that: Disabling is (generally) not a security* requirement but a liveness** optimisation.

*Security = Invalid candidates cannot go through (or are statistically very improbable)

**Liveness = Valid candidates can go through (at a decent pace)

TLDR at the bottom!

Nature of Disabling

By disabling we mean directly setting some active validators as disabled which would prohibit them from participating in all or some parts of the consensus.

A simple argument for disabling is that if someone is already slashed 100% and they have nothing to lose, they could cause harm to the network and should be silenced.

It is worth noticing that even if we don't have ANY direct disabling in the system, slashed nodes will eventually get forced out of the active validator set due to insufficient funds at the next validator election. This will for all intents and purposes force them out (akin to disabling), simply with a higher latency (24-48h).

There are a few differences between disabling directly and getting forced out at the end of an era:

What happens with no direct disabling?

If there is no disabling a new type of an attacker has to be considered - a validator that is already slashed 100% and has nothing to lose. What damage can he cause during his window of activity?

  1. Liveness attacks:
     1.1 Break sharding (with mass no-shows or mass disputes): It forces everyone to do all the work, which affects liveness but doesn't kill it completely. The chain can progress at a slow rate.
     1.2 Mass invalid candidate backing: Spawns a lot of worthless work that needs to be done, but it is bounded by backing numbers. Honest backers will still back valid candidates and that cannot be stopped. Honest block authors will eventually select valid candidates, and even if disputed they will win and progress the chain.

  2. Security attacks:
     2.1 The best and possibly only way to affect security is by getting lucky in the approval process. Currently ~30 approvals are needed to pass approval voting, and if by chance all of them were malicious they could get a single candidate through. The chance of that is around 4*10^-15. The concern is that by not disabling attackers they could get significantly more tries. Assuming they can back invalid candidates on 50 cores for 48 hours straight and only those candidates get included, it still gives a 7*10^-9 chance of success, which is abysmal considering the cost (all malicious stake slashed). Their chances can get higher if we consider randomness manipulation in the 1/3 of cases where they are block authors. This COULD be a problem and it would necessitate further calculations (but that shouldn't be necessary based on later sections).
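
For reference, a rough reconstruction of the arithmetic behind these figures, assuming 1/3 of validators are malicious, ~30 approval checkers per candidate, 6-second blocks and 50 cores (all assumed parameters, not sourced from this thread):

$$P_{\text{single}} \approx (1/3)^{30} \approx 4.9 \times 10^{-15}$$

$$N_{\text{48h}} \approx \frac{48 \times 3600}{6} \text{ blocks} \times 50 \text{ cores} \approx 1.44 \times 10^6 \text{ candidates}$$

$$P_{\text{48h}} \approx N_{\text{48h}} \cdot P_{\text{single}} \approx 7 \times 10^{-9}$$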

Attacks 1.2 and 2.1 should generally be pretty futile for a solo attacker, while 1.1 could be possible with mass disputes even from a single attacker.

Eventually those attackers (after 1-2 eras, 24-48h) will get pushed out during the election. The cost was paid and someone was punished, simply with a relatively high latency.

Risk of disabling

The primary risk behind having any sort of disabling is that it is a double-edged sword: in case of any dispute bugs it could disable honest nodes, or be abused by attackers to specifically target honest nodes. Disabling honest nodes could tip the scales between honest and dishonest nodes and destabilise the protocol. Honest nodes being pushed out of consensus is primarily a problem for approval voting and disputes where a supermajority is required.

It is worth noting that this is fundamentally a defence-in-depth strategy, because if we assume disputes are perfect it should not be a real concern. In reality disputes are difficult to get right and non-determinism can happen, so defence in depth is crucial when handling those subsystems.

What about slashes with no direct disabling?

Slashing by itself is less of a problem due to the high latency of getting pushed out of the validator set. It still affects the honest slashed node in the short term (lost funds); if the slash was truly unjustified, governance should refund the tokens after an investigation. So generally, in the long term, no harm will be done. It gives 24-48 hours to react in those cases, which is at least a small buffer to escalate further if an attack pushing honest nodes out of consensus were to show up.

If we compare it with direct disabling, which can strike incredibly quickly (as quickly as a few disputes can conclude), it is possible to push out a significant portion of honest validators.

Mitigation

Firstly, how can we mitigate the risk of disabling?

Mitigating risks of disabling

We have a few options available in our arsenal.

Bounding disabling will be needed nevertheless and it's just a matter of what value to choose for it. The most obvious and bold one is simply 1/3.

When it comes to limiting the scope of disabling to address the risks, this part of the analysis gives us a direct hint:

Honest nodes being pushed out of consensus is primarily a problem for approval voting and disputes where a supermajority is required.

Based on the above we can make it so:

Mitigating the fundamental risks

Those are the risks that show up when there is no disabling at all so:

  1. Liveness attacks:
     1.1 Break sharding (with mass no-shows or mass disputes)
     1.2 Mass invalid candidate backing
  2. Security attacks:
     2.1 Getting lucky in approval voting

To address the above we can further alter the disabled status:

Disabling strategy:

Definition of validator disablement:

Summary

This disablement strategy allows for maximising the liveness benefits from direct disabling without any extra security concerns. Parachains should always stay live enough to be able to make at least some progress, and no forcing out of honest nodes is possible (at least in the short term - an era).

Practically it makes it so that even if there is some PVF nondeterminism that can be exploited, we have at least 24-48 hours until it starts forcing nodes out of consensus (approvals and disputes in that case), while maintaining the benefits of early mitigation through limited direct disabling.

Open questions

Should block production be blocked for disabled validators?

Are there any other side-effects of disablement in Substrate?

Can governance react fast enough in cases where honest validators are getting slashed and will be pushed out during the next election?

TLDR

Even if we don't have disabling, slashed validators get forced out at the end of an era (akin to getting disabled).

Direct disabling within the same era (low latency disabling) can aim to improve liveness and not touch security.

NPoS elections and getting forced out due to insufficient funds (high latency disabling) protect security in the long run by kicking out bad actors, while giving governance more time to react in edge cases.

Direct disabling (what happens directly after getting slashed):

Disable up to 1/3 of the network.

Thanks to the above, direct disabling, as a defence-in-depth measure, can never affect security by disabling too many validators.

ordian commented 1 year ago

Should block production be blocked for disabled validators?

Yes, this is already the case in Substrate. Since we don't have any punishment for producing an invalid block, we should not allow another waste of resources.

https://github.com/paritytech/polkadot-sdk/blob/c0a4ce1fc87b7b39bc307072e60db97ade3cd6be/substrate/frame/babe/src/lib.rs#L360-L365

1.1: Dispute voting is allowed but limited during disablement. To be precise, honest nodes do not respond to dispute statements from disabled validators.

Should we also reject dispute imports in the runtime where the only against votes are from the disabled validators? I guess this is less of an issue, since we need at least f+1 votes for the import to succeed, otherwise it's rejected in the runtime. And since we don't cast our vote, f+1 votes would be difficult to get (unless enough nodes are running an old validator version).

Overall this seems like a solid plan 👍

We should look into all other types of offences (e.g. beefy equivocations) and make sure this works well for them or if there are any other adjustments needed.

Can governance react fast enough in cases where honest validators are getting slashed and will be pushed out during the next election?

Slashing can be reverted easily, since it's not applied instantly, but I highly doubt that Governance can react quickly enough before the next session, especially on Kusama, so we should not rely on that.

What about re-enablement after being disabled? Assuming the queue might get full, we could re-enable a previously disabled validator.

eskimor commented 1 year ago

Slashing can be reverted easily, since it's not applied instantly, but I highly doubt that Governance can react quickly enough before the next session, especially on Kusama, so we should not rely on that.

Can we fix that? Note that Governance does not need to react within a session (except if we are forcing a new election, which we probably should not do then!), but only within an era. If we mitigated the risk of honest nodes (and stash) getting kicked out at the next election, we would be golden when it comes to security issues caused by non-determinism.

Overkillus commented 1 year ago

One point that I'll repeat from the above write-up is:

It is worth noting that this is fundamentally a defence-in-depth strategy, because if we assume disputes are perfect it should not be a real concern.

What I mean by that is that if slashing honest nodes is not possible, we have no real problems or concerns. The strategy above is simply aiming to mitigate the damage IF something goes wrong. And if it goes wrong, it means that our assumptions are broken (either nondeterminism or protocol loopholes slipped through) and it will be extremely hard to cover every base perfectly. Because it is a defence-in-depth strategy, we don't need it to cover everything perfectly; it should at least serve as damage control, reducing or delaying the aftermath.

Should we also reject dispute imports in the runtime where the only against votes are from the disabled validators? I guess this is less of an issue, since we need at least f+1 votes for the import to succeed, otherwise it's rejected in the runtime. And since we don't cast our vote, f+1 votes would be difficult to get (unless enough nodes are running an old validator version).

Shouldn't be an issue, so I'd keep it as it is. Crossing 1/3 will not be easy, as you pointed out.

Slashing can be reverted easily, since it's not applied instantly, but I highly doubt that Governance can react quickly enough before the next session, especially on Kusama, so we should not rely on that.

I don't think we should design the Polkadot protocol under the assumption that it has to work with Kusama parametrisation. The end goal is the best possible state of things on Polkadot; Kusama is there simply to test what can be tested. Kusama is more agile to promote quick testing, not to block us from deploying optimal parameters on Polkadot. My 2 cents at least.

but I highly doubt that Governance can react quickly enough before the next session,

Session? Do you mean era? Elections happen once per era AFAIK so that's 24h and NPoS elections are quite heavy to compute so they are computed some time in advance as well. If you are slashed at the end of an era I am genuinely not sure how it plays out with elections since it might invalidate some precomputed solutions. (Could that be an attack vector? By invalidating some solutions you could use your foreknowledge to make it so your solution wins the election... Need to investigate that more.)

In general I was assuming a 24-48h window for getting organically kicked out after an election. So 1 era buffer (although that might be unrealistic). It is a defence in depth measure so having this time is better than having none (with instant direct disabling) and it only becomes relevant if there's a protocol mistake or nondeterminism we didn't anticipate.

In general disputes should not happen. If a dispute happened, everyone should be on high alert and the fellowship / people invested in governance should immediately investigate. If you as a node operator believe you were unfairly slashed, you should swiftly raise an alarm bell. So the 24-48h seem like enough to at least realise something is going on.

What about re-enablement after being disabled? Assuming the queue might get full, we could re-enable a previously disabled validator.

I don't see a need to re-enable. What's the argument for it?

We should look into all other types of offences (e.g. beefy equivocations) and make sure this works well for them or if there are any other adjustments needed.

I started looking into it and compiling a list of offences. Didn't finish it yet but agree that having that would be very helpful. I'd appreciate some pointers into where to look into all of them. For instance BEEFY didn't even cross my mind.

eskimor commented 1 year ago

So the 24-48h seem like enough to at least realise something is going on.

It is most likely enough time to react, but possibly not enough time to have some action enacted on chain. Confirmation periods and stuff are quite long on Polkadot. Also, as Andronik pointed out, this could happen close to the end of an era + we might have things like a forced new era...

Therefore I think relying on Governance to be able to react fast enough is a non-starter, unfortunately. Not doing accumulative slashes + minimizing the occasions of 100% slashes as much as possible should get us pretty far.

So far mostly backers can be slashed 100%, but they can also way more easily be protected than approval voters, because we can have stricter limits on everything here.

With this I can already sleep quite well, we just need to make sure that the reasoning is documented in the guide and code, so we maintain this property over time.

ordian commented 1 year ago

I started looking into it and compiling a list of offences. Didn't finish it yet but agree that having that would be very helpful. I'd appreciate some pointers into where to look into all of them. For instance BEEFY didn't even cross my mind.

https://github.com/search?q=repo%3Aparitytech%2Fpolkadot-sdk%20report_offence&type=code

So far we are using the offences pallet for slashing of:

Session? Do you mean era? Elections happen once per era AFAIK so that's 24h and NPoS elections are quite heavy to compute so they are computed some time in advance as well. If you are slashed at the end of an era I am genuinely not sure how it plays out with elections since it might invalidate some precomputed solutions. (Could that be an attack vector? By invalidating some solutions you could use your foreknowledge to make it so your solution wins the election... Need to investigate that more.) In general I was assuming a 24-48h window for getting organically kicked out after an election. So 1 era buffer (although that might be unrealistic). It is a defence in depth measure so having this time is better than having none (with instant direct disabling) and it only becomes relevant if there's a protocol mistake or nondeterminism we didn't anticipate.

Summoning @paritytech/staking-core to answer election questions (when exactly they happen and at what point slashing is/isn't accounted for).


Some points from the call:

burdges commented 1 year ago
  • Backing is not allowed
  • Approval voting is allowed

These make perfect sense. I'd add disabling does not touch grandpa either.

  • Dispute voting is allowed but limited (not escalated when coming from disabled nodes)

We're happy here overall, but we do technically alter the security analysis here. We might identify & suggest other tweaks, like adjusting needed_approvals or tranche zero samples.

Should block production be blocked for disabled validators?

No.

We have a spam dispute, but it is likely some bug, not malice. We should address plausibly malicious block production activities elsewhere.

eskimor commented 1 year ago

More refinement as of today's call:

First of all @ordian would like to use the relay parent for determining the disabled state on the node side. The result would be that we use a different state on the node side as compared to the runtime. This should be "fine" though:

  1. For disputes: There is no checking in the runtime anyway. All we do is refrain from participation.
  2. For backing: We can be slightly out of sync, but there are no bad consequences: First, the relay parent in backing can only be an ancestor of the current leaf, so we are strictly using an older (or identical) state than the runtime. This means for enabled -> disabled transitions we might do redundant work for a few blocks in statement distribution. No harm done. For a disabled -> enabled transition (very rare), a node would in the worst case effectively stay disabled a couple more blocks -> also fine, especially since this only happens in an absolute corner case where we are already outside of byzantine assumptions. Enabling on era boundaries does not matter, as we are clearing backed candidates on session boundaries anyway.

State Pruning

State pruning should not be an issue for backing, because we only prune state on the canonical chain something like 250 blocks after finality. For abandoned forks this happens sooner, but we also don't care whether backing is successful or not on dead forks.

For disputes pruning is a problem, but the problem with era changes goes even deeper: Let's assume a validator is no longer in the active set after an era change. With the current system we now have absolutely no way of disabling such a validator (as the runtime no longer knows about it). Hence if it starts disputing only directly after the era change, it can cause a significant volume of spam: It can dispute all candidates on the unfinalized chain until the era change. If finality is not lagging, these might "only" be 100-200 disputes. If finality is lagging more, it could be thousands.

Using the relay parent here helps a bit, as then, at least if the validator started disputing before the era change, we would preserve the disabled state. Still, as described above (disputing only starting after the era change), this does not fully alleviate the problem and is furthermore brittle, as we might not be able to retrieve state for blocks of abandoned forks.

Given that disabling for disputes only has an effect on the node side anyway (the runtime is protected by importing only confirmed disputes), this can be mitigated by letting the dispute coordinator disable nodes itself. It could for example keep disabled nodes of the previous session (as of the latest leaf it has seen of that session) until finality (including DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION) has reached the current session.

For what state to use, it seems using the state of current leaves is more robust for disputes (+ persistence of previous session state until finality). For statement-distribution, either should be fine ... although using a different method only there would be odd.

@ordian @Overkillus @tdimitrov

ordian commented 1 year ago

If we agree to never re-enable a validator within a session, I'd propose we introduce a mapping of disabled_validators: SessionIndex -> BTreeSet<ValidatorIndex>, which is being updated on every active leaf by simply adding new disabled validators to the list. Then backing, statement-distribution and disputes use that API instead.
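
A possible shape for that node-side cache; the type and field names below are illustrative assumptions, not proposed identifiers:

```rust
use std::collections::{BTreeSet, HashMap};

type SessionIndex = u32;
type ValidatorIndex = u32;

/// Node-side view of disabled validators, per session.
#[derive(Default)]
struct DisabledValidators {
    per_session: HashMap<SessionIndex, BTreeSet<ValidatorIndex>>,
}

impl DisabledValidators {
    /// On every active leaf, merge in the disabled validators the runtime
    /// reports for that leaf's session. Entries are only ever added,
    /// matching the "never re-enable within a session" assumption.
    fn note_active_leaf(
        &mut self,
        session: SessionIndex,
        disabled: impl IntoIterator<Item = ValidatorIndex>,
    ) {
        self.per_session.entry(session).or_default().extend(disabled);
    }

    fn is_disabled(&self, session: SessionIndex, validator: ValidatorIndex) -> bool {
        self.per_session
            .get(&session)
            .map_or(false, |set| set.contains(&validator))
    }
}
```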

If that sounds reasonable, there are a few details to resolve:

Overkillus commented 1 year ago

If we agree to never re-enable a validator within a session

I think we want to push this even further: I don't think we ever need to re-enable a validator within an era. Enabling them after a session will lead to them committing the offence again if they are buggy or malicious. The scope of disabling should be eras, as the validator sets change in the scope of eras.

This might slightly alter the storage considerations. We can follow-up on the disputes call.

eskimor commented 1 year ago

If we agree to never re-enable a validator within a session, I'd propose we introduce a mapping of disabled_validators: SessionIndex -> BTreeSet<ValidatorIndex>, which is being updated on every active leaf by simply adding new disabled validators to the list. Then backing, statement-distribution and disputes use that API instead.

I am not sure how this resolves the issue with era changes I described.

In particular I am not sure which problem you want to solve with this at all. :thinking: .. Do you want to change the runtime API to only get us newly disabled nodes?

I might be missing something obvious ... it is quite late already. :sleepy:

ordian commented 1 year ago

In particular I am not sure which problem you want to solve with this at all.

The implementation simplicity, ensuring we use only one strategy consistently across backing, statement-distribution and disputes on the node side. This API is meant to be stored and used on the node side only.

I am not sure how this resolves the issue with era changes I described.

It doesn't. Persisting disabled state for the next era should help though.

Overkillus commented 11 months ago

Mini-guide for the current version of the disabling strategy:

ordian commented 11 months ago

Whether or not we re-enable disabled validators, I'd argue we need special handling for disputes and not just rely on on-chain state.

  1. First, imagine there's a way to disable honest validators without getting slashed (e.g. timed disputes). Then you vote invalid without getting disabled in the runtime. (re-enabling is only useful in this case if they back invalid candidates)
  2. If disputed candidate is in the previous era, you can't even disable malicious validators (they might be not in the active set anymore).

For dispute disabling it would be easier to use relay_parent state (if possible) + lost_dispute: HashMap<SessionIndex, LruSet<ValidatorIndex>>, where the LRU set stores indices of validators who recently lost a dispute. It can be bounded by byzantine_threshold per session or even n_validators. We can prune this map after dispute_period.

For statement-distribution, as it is only an optimization and the main filtering will be done in the runtime, by using relay_parent we could filter out more valid backing statements with re-enabling, creating a very short parachain liveness issue (not a big deal), or fewer - also not a big deal. By using a union of leaves this is also possible, but less or more likely depending on out-of-sync issues. I think the latter is slightly preferred.

Overkillus commented 11 months ago

First, imagine there's a way to disable honest validators without getting slashed (e.g. timed disputes). Then you vote invalid without getting disabled in the runtime. (re-enabling is only useful in this case if they back invalid candidates)

First of all, attackers need to pull off a successful time dispute attack and get 1/3 of the network slashed (probably a minuscule amount, if any at all, depending on whether we have time dispute countermeasures). If they succeed they can:

Then you vote invalid without getting disabled in the runtime.

I assume what you mean is they vote invalid on valid candidates, AKA start spam disputes. Yes, in that case they would not get disabled. They could continue spamming disputes, effectively breaking sharding. The chain should still be secure, although extremely slow. The biggest problem is that the true perpetrator of the attack (the collator suggesting the carefully timed block) gets away scot-free, but this simply loops back into our time dispute countermeasures, which might be needed to protect against it. If malicious guys would pay for this, there is little gain except lowering our liveness temporarily. While liveness is suffering, we can investigate and check how exactly they managed to slash honest nodes and whether we can protect against that specific flavour of nondeterminism in the future. Security-wise we should be good.

If disputed candidate is in the previous era, you can't even disable malicious validators (they might be not in the active set anymore).

Is this an issue tho? Disabling is there to reduce damage done in the current era, so if they are already gone, there's not much more damage they can cause. Generally the main bulk of the punishment comes from the slash, which should still be applied even if they are no longer in the active validator set.

Overkillus commented 11 months ago

And pulling off a time dispute attack where you are not even slashed means that it must be the collator attack variant (we have 3 main flavours of time dispute attacks: malicious collators, backers or checkers).

Time dispute attacks organised by malicious collators are hard to pull off. They would need to construct a block that takes less than 2s on at least 2 out of 5 backers (one of the reasons why I'm strongly opposed to lowering the backing requirement), and then the same block would need to miraculously take more than 12s on many (1/3 in fact) of the approval checkers. While not statistically impossible, this is the least probable time dispute attack.

eskimor commented 11 months ago

First, imagine there's a way to disable honest validators without getting slashed (e.g. timed disputes).

I think we slash even in time disputes - or at least we can now. (Slashes are deferred, validators don't lose nominators, no chilling, ...)

If disputed candidate is in the previous era, you can't even disable malicious validators (they might be not in the active set anymore).

Yep.

For the node-side disabling data structures, I don't think the suggested one cuts it. I would propose the following (pseudocode):

lost_disputes: LruMap<SessionIndex, HashSet<ValidatorIndex>>

(lru size is session window)

We need this map for two reasons:

  1. For handling offenses that happened in a past session. (The runtime cannot disable validators which might no longer exist in the current session, yet they might still raise disputes for old session candidates.)
  2. Using the relay parent is awesome because of determinism, but with async backing and increasing the allowed depth of relay parents, we are getting slow in applying any disabling.

Now, on receiving a dispute message, what we would be doing is the following:

  1. We receive a dispute message for some session.
  2. We look up the disabled state based on the relay parent - if the block does not exist, the disabled state does not matter, because we would only participate if the dispute is confirmed anyway. If it exists, we check how many validators are disabled.
  3. If disabled as of (2), don't participate. If not disabled based on (2), check the count of validators disabled in (2): if below the threshold, also check lost_disputes for the session of the disputed candidate - if found, don't participate. If (2) is already above the threshold, ignore lost_disputes. (We never want to risk consensus.)

On concluding disputes, we add losing validators to lost_disputes.

With this strategy we are covering both (1) and (2), without risking any consensus issues. TL;DR: Only use the node-side set if the disabled set in the runtime is not too large already.
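
A minimal sketch of that decision logic, with placeholder types and an assumed byzantine-threshold computation (none of these names are from the actual codebase):

```rust
use std::collections::HashSet;

type ValidatorIndex = u32;

/// Decide whether to participate in a dispute raised by `raiser`.
/// `relay_parent_disabled` is None if the relay parent block is unknown.
fn should_participate(
    relay_parent_disabled: Option<&HashSet<ValidatorIndex>>,
    lost_disputes_in_session: &HashSet<ValidatorIndex>,
    raiser: ValidatorIndex,
    n_validators: usize,
) -> bool {
    let Some(disabled) = relay_parent_disabled else {
        // Block not found: the disabled state does not matter, since we
        // would only participate in confirmed disputes anyway.
        return true;
    };
    // Disabled as of the relay parent: don't participate.
    if disabled.contains(&raiser) {
        return false;
    }
    // Only consult the node-side set while the on-chain disabled set is
    // below the byzantine threshold; never risk consensus by over-disabling.
    let byzantine_threshold = n_validators.saturating_sub(1) / 3;
    if disabled.len() < byzantine_threshold && lost_disputes_in_session.contains(&raiser) {
        return false;
    }
    true
}
```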

eskimor commented 11 months ago

Is this an issue tho?

Yes. There will always be at least 10 blocks (with forks and lacking finality maybe even significantly more) full of candidates of the previous session(s) that can be disputed. With 100 cores, we are talking about > 1000 candidates.

Now this validator can still dispute those candidates, even if no longer live in the current session. This can be quite a significant number of disputes, and with our current 0% slash it would go completely unpunished:

  1. The runtime cannot disable that guy in the current session any more, because it is no longer active.
  2. The validator would not lose out on any rewards at all if he does it right at the end of the era, where he can no longer get disabled.

Now with the above algorithm the guy would still go unpunished, but at least the harm to the network would be minimized. This would actually be an argument for >0% slashes.

Overkillus commented 9 months ago

This is the most current design of the disabling strategy: https://github.com/paritytech/polkadot-sdk/pull/2955

Overall state: done, only validator re-enabling is missing. We can deploy without it but are awaiting an audit before deployment.