prysmaticlabs / prysm

Go implementation of Ethereum proof of stake
https://www.offchainlabs.com
GNU General Public License v3.0

Check on-chain for attestations at validator launch #7985

Closed potuz closed 3 years ago

potuz commented 3 years ago

🚀 Feature Request

Description

In the event of a corrupted validator.db, an easy and cheap safeguard would be to grab the attestations from the last 3 epochs from the beacon node and check whether the validating indexes appear there. This would prevent a double attestation in most cases. The check can run only at launch time, and only if the validator.db does not show any activity for the last/current epoch, so the impact on long-running validators will be minimal.

I'd like to see what the team thinks of possible design patterns before submitting a PR, but I think this can be as easy as pulling the last 3 epochs' blocks from the node and comparing the attestations against the indexes of the validating accounts, then stopping if something is found on-chain that is not in the validator.db, with a possible flag --disable-launchtime-check or whatever to override this behavior.
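By way of illustration, a minimal Go sketch of that loop — BeaconClient, Block, Attestation and checkRecentAttestations are all hypothetical stand-ins for whatever node API the validator already uses, not Prysm's actual types:

    // Hypothetical types: stand-ins for the real node API.
    package validator

    import "fmt"

    const slotsPerEpoch = 32

    // Attestation pairs a committee with the bitfield of members who signed.
    type Attestation struct {
        Committee       []uint64 // validator indices in committee order
        AggregationBits []bool   // bit i set => Committee[i] attested
    }

    // Block is the subset of a beacon block this check needs.
    type Block struct {
        Attestations []Attestation
    }

    // BeaconClient abstracts fetching a block by slot from the beacon node.
    type BeaconClient interface {
        BlockAtSlot(slot uint64) (*Block, error)
    }

    // checkRecentAttestations scans the blocks of the last `lookback` epochs
    // and errors out if any of our indices already attested on-chain.
    func checkRecentAttestations(c BeaconClient, headSlot uint64, ours map[uint64]bool, lookback uint64) error {
        start := uint64(0)
        if span := lookback * slotsPerEpoch; headSlot > span {
            start = headSlot - span
        }
        for slot := start; slot <= headSlot; slot++ {
            blk, err := c.BlockAtSlot(slot)
            if err != nil || blk == nil {
                continue // skipped slot or fetch failure: nothing to check
            }
            for _, att := range blk.Attestations {
                for i, signed := range att.AggregationBits {
                    if signed && ours[att.Committee[i]] {
                        return fmt.Errorf("validator %d already attested at slot %d, refusing to start", att.Committee[i], slot)
                    }
                }
            }
        }
        return nil
    }

With lookback = 3 this is exactly the launch-time safeguard described above: the caller refuses to start signing if the function returns an error.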

Additional Info

Discussion in Discord: https://discord.com/channels/476244492043812875/730447046930071672/782041664327909417

terencechain commented 3 years ago

I will advocate for no flag. We should reduce the number of flags for both the beacon node and the validator. More flags == more doc support == more room for errors and unnecessary friction.

What if the db doesn't have previous-epoch info but the block attestations do?

Can you also run this idea by the eth2 discord? It would be good to get opinions from other implementer teams.

potuz commented 3 years ago

Thanks for the tag, I wasn't allowed. I think that if the db doesn't have info from the previous epoch but the block has it, the validator should stop and assume either that the db is corrupted (had a dirty stop, for example) or that there's another validator running the same keys. In both cases you should stop to prevent such attestations from happening. If you don't want to add an extra flag, that is fine too. In principle, if it was a corrupt db, 3 epochs into the future you'd be able to relaunch; and if you had another validator attesting, then you still don't want to launch anyway. So I am not against your suggestion of not having this flag.

terencechain commented 3 years ago

A flag might be useful if the user understands the risk and still intentionally wants to proceed. Example:

  1. Epoch 100: attest on computer A, stop the validator.
  2. Epoch 101: start the validator on computer B with a fresh DB.

I'm not sure how applicable the above case really is, but it's a trade-off that's worth looking at.

potuz commented 3 years ago

@arnetheduck pointed me to this Nimbus implementation:

https://github.com/status-im/nimbus-eth2/pull/1915

which is pretty much the same as what I am suggesting, except that they default to 2 epochs instead of 3. I haven't read it in detail because I don't know their code, but an implementation detail (that I admittedly might be reading wrong) is that they wait and listen for 2 epochs before they start broadcasting, instead of asking the network for the last 2 epochs' blocks and checking the attestations.

prestonvanloon commented 3 years ago

Is there a reason for 2 or 3 epochs? Technically, you could be slashed as far back as the weak subjectivity period. (I think)

potuz commented 3 years ago

Is there a reason for 2 or 3 epochs? Technically, you could be slashed as far back as the weak subjectivity period. (I think)

Yes, but this should only happen in a forked environment. It seems like a good compromise, since this check will run on every launch of the validator. Note that this is not meant to be a replacement for the slashing protection, but rather a safeguard against another client using the same keys; these are the only slashings we've seen so far.

potuz commented 3 years ago

In addition to the Nimbus implementation, here's Lighthouse's: https://github.com/sigp/lighthouse/pull/2230

Also, I think this newer issue can be safely closed, since it is most likely covered by any solution to this one:

https://github.com/prysmaticlabs/prysm/issues/8513

nisdas commented 3 years ago

Ok thanks, shifting discussion over to #8513.

potuz commented 3 years ago

Ok thanks, shifting discussion over to #8513.

I would've closed the newer issue instead, as this one actually has a discussion.

nisdas commented 3 years ago

Ah sorry, misread this. Porting the message over to this issue.

nisdas commented 3 years ago

I wrote a short design doc on something with similar goals a while back: https://hackmd.io/@prysmaticlabs/B16P-Bbgd

However, the main difference here is that instead of listening in on gossip, we just search all attestations from the last N epochs and check whether there is a conflicting one. Having something that 'waits' for a while is a possible liveness fault, as in the event of a restart the validator would miss an attestation in that particular epoch.

potuz commented 3 years ago

I wrote a short design doc on something with similar goals a while back: https://hackmd.io/@prysmaticlabs/B16P-Bbgd

I like this approach with the endpoint to check for sanity. I just wanted to point out that there is an economic trade-off between security and the price paid by not attesting at launch. While it's true that all slashings so far would have been prevented by this method of looking back a few epochs and gathering attestations, it worries me a little in the long term. A validator is much more likely to be slashed in a forked/non-finalizing scenario. So far we have seen validators running multiple instances of the same keys without being slashed for a long time, because they were attesting on precisely the same data; this is less likely to hold in a forked scenario. In that situation we will be dropping blocks, and attestations from different forks may not be included later, so that when we request the attestations for the last n epochs and don't find any from our key, we may in fact have been attesting anyway on a different fork.

Having said that, I think the approach of looking back is probably better anyway: even when forked, we would expect at least one attestation to be included in later blocks on our current fork, even if it was made on a different fork at the time. I have no intuition, nor hard numbers, about how many attestations actually went missing on Medalla.

quangld commented 3 years ago

A flag might be useful if the user understands the risk and still intentionally wants to proceed. Example:

  1. Epoch 100: attest on computer A, stop the validator.
  2. Epoch 101: start the validator on computer B with a fresh DB.

I love this idea: it's easier to implement.

ahadda5 commented 3 years ago

I will implement the change based on this design.

potuz commented 3 years ago

I wanted to leave here a method that I think has many advantages that outweigh the minor disadvantages I enumerate below. In short, the method consists of:

  1. The validator client launches. Let TARGET be the last target epoch of its attestations and HEAD be the current head epoch.
  2. If HEAD - TARGET < 2, continue operation as normal; there is nothing to do here (discussion below).
  3. Otherwise, grab the beacon states for the last three epochs HEAD-i with i = 0, 1, 2 and check for balance changes. If there is any positive balance change, stop the validator and abort operation (sketched below).
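As a rough illustration of steps 2 and 3, here is a minimal Go sketch; StateReader and checkDoppelganger are hypothetical names standing in for whatever RPC the client actually uses to read balances out of recent beacon states, not Prysm's API:

    package validator

    import "fmt"

    // StateReader returns a validator's balance in the state at a given epoch.
    type StateReader interface {
        BalanceAtEpoch(epoch, validatorIdx uint64) (uint64, error)
    }

    // checkDoppelganger aborts launch if the balance rose over the last three
    // epochs: an idle validator's balance can only stay flat or fall, so any
    // increase means someone attested with our key. A stricter variant would
    // compare against -2*base_reward instead of 0, as discussed below.
    // Assumes head >= target.
    func checkDoppelganger(r StateReader, head, target, idx uint64) error {
        if head-target < 2 {
            return nil // step 2: our own recent attestations could explain a change
        }
        prev, err := r.BalanceAtEpoch(head-2, idx)
        if err != nil {
            return err
        }
        for _, epoch := range []uint64{head - 1, head} {
            cur, err := r.BalanceAtEpoch(epoch, idx)
            if err != nil {
                return err
            }
            if cur > prev { // positive delta => possible doppelganger
                return fmt.Errorf("validator %d: balance rose at epoch %d, aborting launch", idx, epoch)
            }
            prev = cur
        }
        return nil
    }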

This will stop any validator with a doppelganger running somewhere that has an included attestation with target greater than TARGET, except in the following cases:

  a) The network is highly forked and the doppelganger's attestations were only included on a different fork, so they produce no positive balance change on the chain we query.
  b) The doppelganger attested but missed both head and target votes, so its balance change was not positive. In current mainnet, each event happens 0.5% of the time, so this will happen with probability 2*10^-5.

The problem in a) is not what this issue tries to solve: running two validators on different computers while the network is highly forked is dangerous anyway. We are simply trying to prevent slashings by operator mistakes. The problem in b) is completely negligible given that 10^5 is the order of magnitude of all active validators now; for this to happen while also running on two machines at the same time, the probability is just infinitesimal. b) can be completely mitigated by replacing the positive-balance check with balance_change > -2 * base_reward, but I think this is an unnecessary burden that will also need to be changed in Altair.

This method, on the other hand, has several advantages.

One argument that could be made is that if the user is running hundreds of validators on two machines, this method will be slower, since it requires 3 states, while the method of looking back through blocks will likely find an attestation already in the very first block fetched. But this argument is flawed: in the vast majority of relaunches such a user will be launching on only one machine, and the block method will then require grabbing 64 blocks and looping over the bitfields in each one to check that nothing has attested.

ahadda5 commented 3 years ago

Good, and definitely much faster than the brute-force search through attestations. Questions (throughout we assume the user is setting the global flag to check for duplicates):

  1. If we are checking 3 epochs back, do we check Balance@(epoch-3) != Balance@(epoch-2), then Balance@(epoch-2) != Balance@(epoch-1), then Balance@(epoch-1) != Balance@currentEpoch?
    Let's consider the possibilities here.
    Possibility A: a duplicate is active somewhere else. In such a case it is being rewarded and we can catch it.
    Possibility B: a duplicate is active but caused surround/surrounded votes, resulting in a valid slashing. This logic will catch that too, which is better than only checking whether the balance is going up. Activity elsewhere means the balance goes up most of the time and only rarely down?
    Possibility C: the validator was stopped in the past three epochs and then we start it. If the balance changed, we have a duplicate (Possibility A or B). If the balance stayed the same (less a bit for inactivity: how do we catch that? Can we determine the inactivity penalty across N epochs precisely?) for true inactivity, our logic will start the validator as intended.

  2. Are balance checks done down to the nearest Wei, or in base-reward multiples? Please unpack your statement:

    can be completely mitigated by replacing the positive-balance check with balance_change > -2 * base_reward

  3. In terms of balance checks: the validator makes an RPC call to the beacon node to retrieve the balance across those epochs. Is this done individually for each index (key)? If the balance is changing, do we stop as explained in 1?

  4. Can you explain this: a 0.5% probability, isn't 0.5% ^ 2 epochs = 0.25%?

    In current mainnet, each event happens 0.5% of the time, so this will happen with probability 2*10^-5.

nisdas commented 3 years ago

@potuz

HEAD - TARGET < 2: in this case the validator already attested in this or the previous epoch and our own attestation might be included; neither this method nor the method of going through blocks will be able to distinguish whether there is another validator running or this very same validator submitted the previous attestation. Regardless, the user has in this case just restarted the validator, and if he/she is using the same key on two machines while launching and restarting, he should have been slashed already.

This would be a fair assumption I guess; if the gap is less than 2, then going forward and skipping the check would be fine.

The problem in a) is not what this issue tries to solve: running two validators on different computers while the network is highly forked is dangerous anyway. We are simply trying to prevent slashings by operator mistakes. The problem in b) is completely negligible given that 10^5 is the order of magnitude of all active validators now; for this to happen while also running on two machines at the same time, the probability is just infinitesimal. b) can be completely mitigated by replacing the positive-balance check with balance_change > -2 * base_reward, but I think this is an unnecessary burden that will also need to be changed in Altair.

Slashings by operator mistake do happen, and they are much more likely in cases where the network is highly forked. For that reason we should default to the most secure method, which would be to check against -2 * base_reward. If the network is highly forked, having a different head/target is much more likely. As you have noted, we will again have to change this for Altair to account for the changed validator-rewards accounting.

On the cost of this, I would disagree that it is cheaper: iterating through the attestations in each block is much cheaper than regenerating state. All the methods proposed so far require us to regenerate states and then use them; we have cached committees from shuffling, so we can simply use those to check whether each validator has attested or not.

I am not against having the balance check instead, but it will require changes again in Altair, and any future fork that changes validator accounting in any way will have to be handled correctly here. The attestation check is a more 'direct', in-protocol way to validate a particular validator's participation, whereas the balance check is more indirect.

nisdas commented 3 years ago
  1. Can you explain this: a 0.5% probability, isn't 0.5% ^ 2 epochs = 0.25%?

@ahadda5 0.5% ^ 2 does not equal 0.5% * 0.5.

ahadda5 commented 3 years ago

@nisdas ahh correct, 2.5 * 10^-5
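Spelled out (in LaTeX notation), since the percent sign hides a factor of 10^-2 that gets squared along with everything else:

    (0.5\%)^2 = (5 \times 10^{-3})^2 = 2.5 \times 10^{-5} = 0.0025\%,
    \qquad\text{whereas}\qquad
    0.5\% \times 0.5 = 2.5 \times 10^{-3} = 0.25\%.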

potuz commented 3 years ago

@ahadda5

If we are checking 3 epochs back, do we check Balance@(epoch-3) != Balance@(epoch-2), then Balance@(epoch-2) != Balance@(epoch-1), then Balance@(epoch-1) != Balance@currentEpoch?

No, I am checking currentEpoch - i for i = 0, 1, 2, so only three states instead of four.

Possibility C: the validator was stopped in the past three epochs and then we start it. If the balance changed, we have a duplicate (Possibility A or B). If the balance stayed the same (less a bit for inactivity: how do we catch that? Can we determine the inactivity penalty across N epochs precisely?) for true inactivity, our logic will start the validator as intended.

No, if the validator was stopped less than 2 epochs ago (I mentioned 2 instead of three in my message), then if the balance changed we cannot infer that we have a duplicate, because attestations for HEAD-2 are still reflected in the balance change at the HEAD transition.

Are balance checks done down to the nearest Wei, or in base-reward multiples? Please unpack your statement.

If you miss head and target and your attestation is included in the next slot, you obtain -1/8 * base_reward. In the worst case you miss both and it is included at 31 slots' distance, and you get essentially -base_reward (plus a tiny bit). All this needs to be multiplied by the participation rate. On the other hand, if you are inactive you get -3 * base_reward, which is strictly lower than those numbers. So checking that you got strictly more than -3 * base_reward indicates that you were active on a different client.
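A worked version of those numbers, assuming phase-0 accounting at full participation, with B = base_reward, inclusion delay d in slots, and 7/8 the attester's share of the inclusion reward (a sketch of the accounting as described above, not spec-exact):

    \Delta_{\mathrm{active}}(d) \approx
      \underbrace{-2B}_{\text{head, target missed}}
      + \underbrace{B}_{\text{source}}
      + \underbrace{\tfrac{7}{8}\,\tfrac{B}{d}}_{\text{inclusion delay}}
    \;\Rightarrow\;
    \Delta_{\mathrm{active}}(1) = -\tfrac{1}{8}B, \qquad
    \Delta_{\mathrm{active}}(31) \approx -B,

    \Delta_{\mathrm{inactive}} = -3B \;<\; \Delta_{\mathrm{active}}(d)
    \quad \text{for all } d \le 32,

so any observed balance change strictly above -3B implies the key was active somewhere.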

@nisdas

Slashings by operator mistake do happen, and they are much more likely in cases where the network is highly forked. For that reason we should default to the most secure method, which would be to check against -2 * base_reward. If the network is highly forked, having a different head/target is much more likely.

The cost of computing the actual lower bound adds an unnecessary complication: you need to factor in the participation rate and the base reward, which are slot- and fork-dependent. This would require going block by block, which defeats the purpose of my proposal and does not give you any real benefit. Instead, what you can do is simply decide to go back N epochs from head, where N depends on the current finality: if finality is at 2, you choose N=2, what I said above for balance > 0 applies, and you catch active validators with probability 1 - 2*10^{-5}. If finality is higher, you simply increase N accordingly, so that an active validator would have voted for the right fork in some of the last epochs with high enough probability.

nisdas commented 3 years ago

The cost of computing the actual lower bound adds an unnecessary complication: you need to factor in the participation rate and the base reward, which are slot- and fork-dependent. This would require going block by block, which defeats the purpose of my proposal and does not give you any real benefit. Instead, what you can do is simply decide to go back N epochs from head, where N depends on the current finality: if finality is at 2, you choose N=2, what I said above for balance > 0 applies, and you catch active validators with probability 1 - 2*10^{-5}. If finality is higher, you simply increase N accordingly, so that an active validator would have voted for the right fork in some of the last epochs with high enough probability.

This could be worth a try, although it would still be vulnerable to future changes in the spec. After thinking it through, I am more amenable to using this method.

ahadda5 commented 3 years ago

@nisdas ok, I will create another pull request. Please explain the following in the new design:

  1. Is N preset or determined programmatically? How is N changed based on finality?
  2. Do we check whether the balance changed across epochs (>, < or !=)? How is this checked?

    No, if the validator was stopped less than 2 epochs ago (I mentioned 2 instead of three in my message), then if the balance changed we cannot infer that we have a duplicate, because attestations for HEAD-2 are still reflected in the balance change at the HEAD transition.

What I'll do is write up a detailed design and come back before I start coding.

nisdas commented 3 years ago

@ahadda5 There isn't a need to redo this in a new pull request. You essentially just need to change what the beacon RPC method does: previously it was checking all attestations in blocks for a validator's index; now you just need to check the validator's balance in the relevant states. Everything else stays the same.

Is N preset or determined programmatically? How is N changed based on finality?

N is basically the number of epochs since the validator's last recorded history. Ex: the validator was last observed in epoch 8 and the current epoch is 12, so we check epochs 9-12.

Do we check whether the balance changed across epochs (>, < or !=)? How is this checked?

You just need to check whether the balance delta for the validator is more than 0 across the N epochs.

What I'll do is write up a detailed design and come back before I start coding.

Sure thing; this doesn't change the basic structure of the feature much. The check is boxed into the beacon RPC method.

potuz commented 3 years ago

For some reason replies to this thread got buried in my mailbox:

N is basically the number of epochs since the validator's last recorded history. Ex: the validator was last observed in epoch 8 and the current epoch is 12, so we check epochs 9-12.

I think this is wrong: in the current situation, having a validator offline for several epochs will cost a lot unnecessarily, which is what my proposal wants to avoid. I want to set N=2 if the last finalized checkpoint is HEAD-2 and increase it otherwise. For example, you can set N = (HEAD - FINALIZED_EPOCH - 1)^3 + 1, and/or increase the exponent if you feel unsafe.
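For illustration, that formula as a tiny Go function (names hypothetical):

    package main

    import "fmt"

    // lookbackEpochs implements N = (HEAD - FINALIZED_EPOCH - 1)^3 + 1:
    // N = 2 when finality is at HEAD-2, growing quickly as finality lags.
    func lookbackEpochs(head, finalized uint64) uint64 {
        lag := head - finalized - 1
        return lag*lag*lag + 1
    }

    func main() {
        fmt.Println(lookbackEpochs(100, 98)) // finality at HEAD-2 => N = 2
        fmt.Println(lookbackEpochs(100, 95)) // finality 5 epochs behind => N = 65
    }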

nisdas commented 3 years ago

I think this is wrong: in the current situation, having a validator offline for several epochs will cost a lot unnecessarily, which is what my proposal wants to avoid. I want to set N=2 if the last finalized checkpoint is HEAD-2 and increase it otherwise. For example, you can set N = (HEAD - FINALIZED_EPOCH - 1)^3 + 1, and/or increase the exponent if you feel unsafe.

I think we might be referring to different things here. I was referring to the historical lookback when a validator starts up, while you are referring to the doppelganger waiting period.

potuz commented 3 years ago

I think this is wrong: in the current situation, having a validator offline for several epochs will cost a lot unnecessarily, which is what my proposal wants to avoid. I want to set N=2 if the last finalized checkpoint is HEAD-2 and increase it otherwise. For example, you can set N = (HEAD - FINALIZED_EPOCH - 1)^3 + 1, and/or increase the exponent if you feel unsafe.

I think we might be referring to different things here. I was referring to the historical lookback when a validator starts up, while you are referring to the doppelganger waiting period.

No, indeed I'm referring to the lookback as well, since I want to avoid requesting blocks/states.

ahadda5 commented 3 years ago

@potuz for the case where HEAD - TARGET < 2, why can't we employ the signed-block search approach? It guarantees that we detect duplicate activity. For HEAD - TARGET > 2, the duplicate attestations/proposals up to HEAD-2 are reflected in the balance and we can rely solely on balances.

potuz commented 3 years ago

@potuz for the case where HEAD - TARGET < 2, why can't we employ the signed-block search approach? It guarantees that we detect duplicate activity. For HEAD - TARGET > 2, the duplicate attestations/proposals up to HEAD-2 are reflected in the balance and we can rely solely on balances.

That is my whole proposal: to look at balance changes and nothing else! But you are not going to see balances in the blocks; balances are in the beacon states.

EDIT: Ohh, apologies, I misread the statement: you mean to use signed blocks for HEAD - TARGET < 2. I think that if the last target was HEAD, then it may be impossible to discern a duplicate instance from our own instance. If the last target was HEAD-1, then it may be possible to see a vote for HEAD and call that a duplicate entry. I am not sure it's worth the trouble implementing this.

nisdas commented 3 years ago

No, indeed I'm referring to the lookback as well, since I want to avoid requesting blocks/states.

Gotcha. I guess this is configurable; we can test it out to see which parameter value would be desirable.