iurii-ssv opened 2 days ago
I've been looking into QBFT and the proposer duty a bit deeper recently; I might be missing some context, but I think there is a liveness issue in the proposer runner (it has probably been discussed in the past, but it's best to document such findings so they serve as docs if nothing else). Namely, once the proposer runner has `reconstructed partial RANDAO signatures`, it immediately makes a blocking call to the Beacon node (`GetBeaconBlock`), which can take an arbitrary amount of time to finish (production logs sometimes show this request lasting 1-3s). It doesn't even matter whether a timeout is set on this call: there is no reason to block here unless the proposer runner is going to be the leader of the QBFT consensus that comes next (and even if he is the leader, blocking here is suboptimal, as I describe below). Liveness suffers because while the proposer runner is blocked he can't participate in QBFT at all - he prevents himself from executing the next stage of the duty even though he might never need the Beacon block data he is waiting for (that is, if he never needs to lead a round, or if he uses a justified prepared value when leading round changes instead of his own value). Roughly, the current flow looks something like the sketch below.
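A hypothetical sketch of that flow - all names and signatures here are illustrative, not taken from the actual codebase:

```go
// QBFT cannot start until the blocking Beacon call returns, so the
// operator processes no consensus messages in the meantime.
package runner

import "context"

type BeaconNode interface {
	// Stand-in for the real GetBeaconBlock request to the Beacon node.
	GetBeaconBlock(ctx context.Context, slot uint64) ([]byte, error)
}

func onRANDAOQuorum(ctx context.Context, bn BeaconNode, slot uint64) ([]byte, error) {
	// Blocking call: production logs show 1-3s here, during which this
	// operator cannot vote in QBFT even as a non-leader.
	return bn.GetBeaconBlock(ctx, slot)
}
```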
So, instead of blocking on `GetBeaconBlock`, we want to make an asynchronous call (perhaps even with retries, but these would need to be fast) and use the data if/when it arrives. I haven't studied the relevant code in detail, but we'll need to start the `QBFTInstance` and block its execution only if the proposer runner turns out to be the leader for round 1 or any of the rounds that follow (plus we'll need to be able to update `Instance.StartValue` once the Beacon data eventually comes, obviously in a concurrency-safe manner). This way we will be able to participate in the QBFT consensus ASAP, just not as a leader. This is the low-hanging fruit; a sketch follows.
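A minimal sketch of what I mean, assuming hypothetical names (`Instance`, `SetStartValue`, `startDuty` are mine, not the real API):

```go
// Start the instance right away and feed the block value in when it arrives.
package runner

import (
	"context"
	"sync"
)

type Instance struct {
	mu         sync.Mutex
	startValue []byte
}

// SetStartValue updates the proposal value in a concurrency-safe manner;
// only a leader ever has to wait for it to be set.
func (i *Instance) SetStartValue(v []byte) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.startValue = v
}

func startDuty(ctx context.Context, inst *Instance, fetch func(context.Context) ([]byte, error)) {
	// Fire off the Beacon request without blocking QBFT participation.
	go func() {
		block, err := fetch(ctx) // e.g. wraps GetBeaconBlock, perhaps with fast retries
		if err != nil {
			return // we can still vote on other leaders' proposals
		}
		inst.SetStartValue(block)
	}()
	// ...the instance starts processing incoming QBFT messages immediately,
	// and blocks on startValue only if this operator has to propose.
}
```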
We can do better than that, though. I see 2 potential options for what we can do if the proposer runner wants to be the leader but can't get data from the Beacon node:

1) The leader yields his round voluntarily: instead of starting QBFT with a proposal (since he doesn't have a block from the Beacon node), the leader does a sort of round-change. The main difference, I think, is that every node can verify the validity of such a round change immediately, just by checking the initiator's signature plus that he is indeed the leader (for the target slot and current round) when receiving such a message - hence it should be pretty fast to change the leader this way, and it should also work when the proposer runner is chosen to be the leader during a round change. Note: we would also need to (a) store the yield with `Instance.State` to make sure the yielded leader doesn't attempt to propose in the yielded round, and (b) add `yieldJustifications` to pass along with the Propose message so that the leader(s) who took over can justify proposing in a round they weren't meant to.

2) Operators can broadcast blocks fetched from the Beacon node (`GetBeaconBlock`) to the other operators in their cluster (those who are expected to perform the same proposer duty) as soon as they fetch them. This way an operator doesn't depend 100% on his own Beacon node to provide a block - he can take any of the blocks sent by peers in the same cluster and propose it as his own. There are options here, but the general idea is: an operator can wait a certain amount of time for the Beacon request to execute, let's say 500ms, and once this time is exceeded he can fall back to a block provided by peer(s); see the sketch after this list. We will also need to verify peer blocks with `ProposedValueCheckF` before proposing them as our own (to ensure a peer isn't suggesting we sign a block for a slot we've already signed another block for, so we can't be slashed; note, we do this check for blocks we get from the Beacon node too - as we should). In terms of security nothing really changes, I think, and this should probably work both for blinded blocks and full data blocks.

We can even try to combine both approaches (although that doesn't seem to be any better than just doing approach 2: if the leader can't receive any peer messages with data to propose, it's unlikely he'll do better at sending "yield" messages out; I don't know enough about libp2p to make a conclusion on this). WDYT?
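A rough sketch of option 2's fallback logic - again, all names are illustrative, and `check` stands in for `ProposedValueCheckF`:

```go
// Wait up to 500ms for our own Beacon node, then accept the first
// peer-provided block that passes the same slashing-safety check we
// run on our own blocks.
package runner

import (
	"context"
	"time"
)

func blockToPropose(
	ctx context.Context,
	ownBlock <-chan []byte, // resolved by our async GetBeaconBlock call
	peerBlocks <-chan []byte, // blocks broadcast by peers in the same cluster
	check func([]byte) error, // stands in for ProposedValueCheckF
) ([]byte, error) {
	timer := time.NewTimer(500 * time.Millisecond)
	defer timer.Stop()

	fallback := false
	var pending [][]byte // peer blocks seen before the fallback deadline
	for {
		select {
		case b := <-ownBlock:
			if err := check(b); err == nil { // we check our own block too
				return b, nil
			}
		case b := <-peerBlocks:
			if !fallback {
				pending = append(pending, b) // keep, but prefer our own node for now
				continue
			}
			if err := check(b); err == nil { // never sign an unchecked peer block
				return b, nil
			}
		case <-timer.C:
			fallback = true // Beacon node too slow, fall back to peer blocks
			for _, b := range pending {
				if err := check(b); err == nil {
					return b, nil
				}
			}
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}
```

Buffering peer blocks that arrive before the deadline is deliberate: we don't want to throw away data we might need 500ms later.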
Btw, it seems other duty types suffer from a similar issue (although they probably have more time allocated for execution than the proposer runner does).
And to go one step further:

Another related issue that we can solve here (see "low hanging fruit" from the message above) is that in the current implementation (even if Beacon nodes are reasonably fast) QBFT instances start at "slightly" different times. This essentially creates the undesirable effect of round timeout(s) kicking in at different moments for different operators (this has also been mentioned in https://github.com/ssvlabs/SIPs/pull/37). This adversely affects the QBFT consensus that follows (the larger the differences in Instance start times are, the worse it affects QBFT, I imagine).

I think this can be solved quite easily just by starting the instance on the `reconstructed partial RANDAO signatures` event (or maybe it's just better to simply start it at an exact wall-clock time - something like "duty target slot time - 4s", for example), and by anchoring the round 1 deadline to the "got block to propose" event. But probably additionally we also need to take into account that, for an Ethereum proposer, round 2 might have to end prematurely (before its designated round 2 QBFT timeout) just because it's too late to broadcast a signed Ethereum block by then (the cutoff is probably somewhere in the range of 2.5s-4s), thus it might not even make sense to try and start round 2.
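For illustration, a sketch of the wall-clock variant, assuming mainnet-style slot timing (12s slots counted from genesis; the -4s offset is just the example above, and the function names are mine):

```go
// Every operator computes the same absolute start time for the instance,
// so round timeouts line up across the cluster too.
package runner

import "time"

const slotDuration = 12 * time.Second

// instanceStartTime returns "duty target slot time - 4s" for a given slot.
func instanceStartTime(genesis time.Time, slot uint64) time.Time {
	return genesis.Add(time.Duration(slot) * slotDuration).Add(-4 * time.Second)
}

// waitForInstanceStart sleeps until the shared start time, instead of
// starting whenever the local Beacon node happens to respond.
func waitForInstanceStart(genesis time.Time, slot uint64) {
	if d := time.Until(instanceStartTime(genesis, slot)); d > 0 {
		time.Sleep(d)
	}
}
```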