Closed karolinagrzeszkiewicz closed 7 months ago
This merged PR introduces the event handling feature for hotstuff-rs
, as well as logging rewritten as a set of possible default event handlers. The major contributions include:
events.rs
: definitions of key blockchain and protocol eventsevent_bus.rs
: EventHandlers
struct for storing handlers for each event type, including a method to add default logging handlers, event_bus
thread for receiving events from algorithm
and sync_server
threads and triggering the execution of appropriate handlers stored in its instance of the EventHandlers
structlogging.rs
: Logger
trait with a required method get_logger()
which returns an object's default handler for logging, implementations of the Logger
trait for all event types from events.rs
replica.rs
: log_events: bool
and a vector of handlers for each event type passed as parameters to Replica.start()
, creating an EventHandlers
struct based on provided handlers and log_events
, if the struct instance is not empty then Replica.start()
launches an extra thread event_bus
for triggering the execution of event handlers upon receiving eventsalgorithm.rs
, state.rs
, sync_server.rs
: relevant functions now publish events by sending them on the event_publisher
end of the channel. The events are published after the corresponding action is completed, in particular events that modify the block tree are only published after the event takes effect, i.e, the change is written to the block tree (rather than a write batch)The PR merged above implements a builder pattern for replica, now a replica can be started as follows:
let replica =
ReplicaBuilder :: builder()
.app(app)
.pacemaker(pacemaker)
.network(network)
.kv_store(kv_store)
.configuration(configuration)
.on_insert_block(on_insert_block)
.on_commit_block(on_commit_block)
//...
.build()
.start();
Where on_<event>
setters are optional, but if used they should be passed arguments of type HandlerPtr<event_type>
, e.g. the on_insert_block
setter takes an argument of type HandlerPtr<InsertBlockEvent>
, which is the same as Box<dyn Fn(&InsertBlockEvent) + Send>
. Therefore, the argument must be a pointer to the closure, rather than the closure itself (the popinter can be created with Box::new(handler)
where handler is a closure that implements Fn(&event_type)
). This is because Fn(&T)
is a trait which can be implemented by different types thus with size unknown at compile time, and so cannot be stored in a struct's field which should have a size known at compile time.
Note that even though we pass only one closure pointer per event, the closure can in fact be defined to combine multiple different closures. Hence, passing one closure can be equivalent to passing multiple closures.
However, internally the handlers for each event are stored as a vector Vec<HandlerPtr<event_type>>
which can include (1) (optionally) a custom handler passed to the replica builder, and (2) (optionally, if log_events
is set to true
in configuration) a predefined log handler for the event (from logging.rs
). This is to avoid combining the user-provided closure with the logging closure by dereferencing ("unboxing") the user-provided closure, just to "box" the resulting closure again after combing the functionalities of the two.
Other changes:
ed25519-dalek
update, event handlers and the builder pattern for replica (this required including the rand_core
feature for ed25519-dalek
in Cargo.toml
, and adding rand_core
to dependencies)log::info
in all loggers in logging.rs
chain_id
method from App
(now this is passed via Configuration
)Implemented the program behavior for the following parameters, now passed via Configuration
:
sync_request_limit
, sync_response_timeout
: now the config parameters replace the corresponding methods of the Pacemaker
traitprogress_msg_buffer_capacity
: now this config parameter is stored as a field of ProgressMessageStub
(as msg_buffer_capacity
), along with msg_buffer_size
which keeps track of the number of bytes of the values stored in the msg_buffer
. To avoid overloading the buffer, now the recv
method of ProgressMessageStub
checks if adding a given message to the buffer would result in overload, and if so it either drops the message (if it is of a higher or equal view than that of a highest-viewed message in the current buffer) or (otherwise) removes just the number of highest-viewed messages in the buffer that need to be removed to make space for the new message.validate_block_for_sync
method in App, so that apps can distinguish between validate block requests in sync vs. progress modechain_id
to Configuration, removed from App
safe_qc
takes replica's chain_id
as an extra parameter and checks whether qc.chain_id == chain_id
Some thoughts on implementing sync_trigger_timeout
:
If no "progress" happens for the duration of time defined by t = sync_trigger_timeout
, we'd like the replica to trigger sync. For this we need to define progress, or find an event such that if the event doesn't happen for t
then this means no progress happened for t.
Claim: "progress" during t
implies updating highest_qc
(at least once) during t
(.i.e. highest_qc
not updated during t
implies no progress during t
, in which case we should trigger sync)
Proof sketch: If I (replica) haven't updated my highest_qc
for big enough t, then one of the following must be the case:
highest_qc
is for a higher view than all of the highest_qc
s I received or collected during t
I obtained QC's via either collecting votes as a leader or receiving them from other replicas, but I rejected those QCs, because:
In all of these cases, I was not only unable to update my highest_qc
, but also to insert a block or vote for a proposal or a nudge, meaning that all proposal and nudges during t
were unacceptable for me so I made no progress.
Note: t
should be big enough, in particular t >> view_timeout
Issue: In case (2) clearly no progress was made, but does it mean I need to sync? I am somehow far ahead of everyone else, so do they need to catch up with me or do I need to catch up with them? Yet my highest_qc
is valid which means the majority voted for its block, but perhaps the validator set was updated in the meantime, all other replicas are new and they don't know of this qc. However, they must have received it via new_view
or sync, therefore for long enough t
(2) should not happen.
I think this, combined with not triggering sync upon receiving QC from the future, solves the issue from issue #12. Now even if a lagging replica enters sync with a peer that is in the process of validating a block b
, and hence only receives the blocks up to b.parent
, it will not trigger sync when it receives a "future" QC for b
because other replicas are ahead. Instead it will update its highest QC to the QC for b
(because "progress" was made), and insert b
into its block tree as soon as the proposal message for b
is returned from the cache and the proposal is validated.
There are multiple issues with the approach described above:
It has liveness issues, e.g., in a minimal example with only two validators, the second one started later and hence being a few views behind the first one, both might fail to obtain a QC, and sync with each other (this was the case when running default_pacemaker_view_sync_integration_test
) instead of waiting until their views are in sync (which is guaranteed by the current DefaultPacemaker
's view synchronization mechanism)
It is too strict and doesn't take into account the case when the quorum is only about to be formed, when most of validators are just starting they might not be updating their highest_qc
because they are out of sync with each other, but they might be making progress in the sense of inserting valid blocks that might be committed later once there is a quorum in sync
*(EDIT: not sure if this is an issue): Our previous sync mechanism which relied on triggering sync when a replica sees a QC from the future enforced some monotonicity between the replica that syncs and the replicas it should sync with, i.e. the latter are ahead in terms of view number. However, with failing to update highest_qc
as a marker of "lagging" behind we fail to obtain that monononicity, e.g., if 1/2 replicas are failing to update highest_qc
, and so does the other 1/2 replicas, we don't know which replicas should sync with which replicas.
This commit implements the following solution for sync_trigger_timeout
:
Progress =
Indeed progress implies one of the three, since progress involves extending the blockchain with new blocks in a safe manner. At least this is the kind of "progress" we are concerned with for the sake of the sync mechanism, which involves a replica catching up with the state of the blockchain (and the highest_qc
, but this is of secondary concern since now that we do not discard QC's from higher views it is easier for a replica to stay up-to-date with highest_qc
's even when it is behind on view number of blocks).
Nudge needs to be considered too, since validator-set-updating blocks take 3 voting rounds to commit during which no other blocks might be proposed but progress on the validator-set-updating block is made.
Lack of progress as defined above would indicate that received proposals/nudges/blocks are not acceptable i.e,:
Note: unlike the old sync mechanism based on receiving QC from the future, this makes it possible for a replica that's ahead in terms of view number to sync with the majority of replicas making progress at lower view numbers
Note: this approach also solves issue #12
Implementation:
Whenever progress is made, the algorithm thread's SyncTriggerTimer
is updated
SyncTriggerTimer.last_update
is checked via the timeout()
method after every iteration of the loop in execute view. If timeout()
is true, we return Err(ShouldSync)
which triggers the sync, but we do not update the timer since if there was actually a need to sync then the timer will be updated during sync (case 3 of definition of "progress").
The decision to call the timeout()
method after every iteration of the loop in execute view is motivated by the fact that after sync we'd like the replica to get a chance to update the timer (if an acceptable proposal or nudge is received) and hence check if there's still need to sync. This is useful in cases when sync is triggered e.g., because of lack of progress after starting the protocol, but no blocks are received by the syncing replica and hence timer is not updated via sync. Then, we give the replica the chance to update the timer by returning to the progress protocol execution, rather than triggering a cycle of sync which would violate liveness.
However, it is not the best design for cases when the sync server of the sync peer is not responding or malicious and hence the replica fails to insert any new blocks via sync - in this case we'd ideally want to trigger a sync with a new peer immediately after, but realistically a replica can't distinguish between the cases of not inserting new blocks (1) because peers made no progress either, and (2) because the peer is malicious/not responsive. So all things considered, I think this is the best possible solution.
TODO:
sync_trigger_timeout
should be, in general it should be significantly longer than the timeout for given ("normal") view.highest_qc
for a while, say for 5 views, then view_timeout
= $2^{5} = 32$ for the 5th view after highest_qc
update. highest_qc
, sync will be triggered at the end of the 5th view. This also means that a new replica on start will trigger sync after 32 seconds (unless it is a leader, then it might propose its own block and insert, so it will take a bit longer)hotstuff-rs
0.3 changesSingle interface for the user to set the values of parameters and group them in a single struct, without having to pass an excessive number of arguments to public functions
pub struct Configuration {
pub me: SigningKey,
pub chain_id: ChainID,
pub sync_trigger_timeout: Duration,
pub sync_request_limit: u32,
pub sync_response_timeout: Duration,
pub progress_msg_buffer_capacity: u64,
pub log_events: bool,
}
chain_id
sync_request_limit
, sync_response_timeout
start
method: me
sync_trigger_timeout
, progress_message_buffer_capacity
, log_events
This feature allows the user to register event handlers for a pre-defined set of events upon starting the replica, such that whenever an event is published, appropriate handlers are executed.
Implementation:
Pre-defined events
Events, defined as an enum in events.rs
:
pub enum Event {
// Events that change persistent state.
InsertBlock(InsertBlockEvent),
CommitBlock(CommitBlockEvent),
PruneBlock(PruneBlockEvent),
UpdateHighestQC(UpdateHighestQCEvent),
UpdateLockedView(UpdateLockedViewEvent),
UpdateValidatorSet(UpdateValidatorSetEvent),
// Events that involve broadcasting/sending a Progress Message.
Propose(ProposeEvent),
Nudge(NudgeEvent),
Vote(VoteEvent),
NewView(NewViewEvent),
// Events that involve receiving a Progress Message.
ReceiveProposal(ReceiveProposalEvent),
ReceiveNudge(ReceiveNudgeEvent),
ReceiveVote(ReceiveVoteEvent),
ReceiveNewView(ReceiveNewViewEvent),
// Progress mode events.
StartView(StartViewEvent),
ViewTimeout(ViewTimeoutEvent),
CollectQC(CollectQCEvent),
// Sync mode events.
StartSync(StartSyncEvent),
EndSync(EndSyncEvent),
ReceiveSyncRequest(ReceiveSyncRequestEvent),
SendSyncResponse(SendSyncResponseEvent),
}
Event subscriber implemented as an optional separate thread
event_bus
thread is started if the user registered at least one handlerevent_bus
thread is started on replica.start(...)
algorithm
and sync_server
threads by sending them on the event_publisher
end of the channelevent_bus
thread receives them on the event_subscriber
end of the channel and calls the appropriate user-defined event handlersJustification:
algorithm
and sync_server
threads with computation not relevant to the protocol execution, and algorithm
and sync_server
threads, receiving them via a Multiple Producer Single Consumer channelEvent handlers defined as pointers to closures
i.e., type HandlerPtr<T> = Box<dyn Fn(&T) + Send>
, where T = one of Event struct types defined in events.rs
Justification:
It is necessary to store pointers, since Fn(&T)
is a trait and its heap size is not known statically. Moreover, the Send
trait is required so that the closure pointers can be passed from the thread that takes user input and starts the replica to the event_bus
thread.
EventHandlers
struct for storing handlers as vectors for each event type
The following methods are defined for the struct:
add_logging_handlers
: a method to add default logging handlers, fire_handlers
: a method for receiving events from algorithm and sync_server
threads and triggering the execution of appropriate handlers stored in its instance of the EventHandlers
structis_empty
: a method to check if there are no handlers stored in EventHandlers
, in which case the event_bus
thread is not startedJustification:
EventHandlers struct
instance is defined based on user input in Replica's start
method, but they never have to be directly accessed again since this is handled by EventHandlers
and its methodsThe events are published after the corresponding action is completed
In particular events that modify the block tree are only published after the event takes effect, i.e, the change is written to the block tree (rather than a write batch).
Consequence:
insert_block
and commit_block
had to be refactored to enable publishing events after the changes in the write batch are written to the BlockTree.
Justification:
BlockTree
, and those updates only happen once the changes in the write batch are written to the BlockTree
BlockTree
upon being notified about an event the updates it, in which case the BlockTree
must be up-to-date with what the handler assumes.With this feature logging becomes optional and logs implemented as event handlers. The user can register pre-defined logging handlers simply by setting log_events = true
in Configuration
.
Implementation:
All logs are log::info
Justification:
All pre-defined events, and hence logs, provide information on the progress of the protocol, i.e., events expected from a safe and live execution of the hotstuff-rs
protocol.
Logging handlers defined by implementing a Logger
trait for each event type
In logging.rs
we define:
pub(crate) trait Logger {
fn get_logger() -> Box<dyn Fn(&Self) + Send>;
}
and implement the trait for each event type.
Justification:
This gives us a uniform and clean way of obtaining a logging handler for each event with <event>::get_logger()
. It gives us more type safety and makes refactoring or future extensions of the event set much easier and safer.
Now users can start the replica with a user-friendly builder pattern, and likewise define the configuration with a builder pattern, for example in tests/test.rs
:
let configuration =
Configuration :: builder()
.me(keypair)
.chain_id(0)
.sync_request_limit(10)
.sync_trigger_timeout(Duration::new(40, 0))
.sync_response_timeout(Duration::new(3, 0))
.progress_msg_buffer_capacity(10000)
.log_events(true)
.build();
let replica =
ReplicaSpec :: builder()
.app(NumberApp::new(tx_queue.clone()))
.pacemaker(pacemaker)
.network(network)
.kv_store(kv_store)
.configuration(configuration)
.on_insert_block(insert_block_handler)
.build()
.start()
Note:
on_<event>
setters can be used to register a handler for a given event and are optionalReplicaSpec
struct cannot be constructed and hence, the replica cannot be startedImplementation:
Extra ReplicaSpec
struct to hold all parameters required to start a replica with:
ReplicaSpec
fields from the setters and their arguments, andstart
method which returns Replica<K>
Why TypedBuilder?
ReplicaSpec
can be constructed, but for this use case it is not necessary. For example, by being able to call a single on_<event>
setter multiple times to add more handler elements to the vector, but as mentioned above this is not necessary because multiple handlers can be equivalent to one handlerReplicaSpec
's on_<event>
fields store vectors of handler pointers
on_<event>
setters set the corresponding fields to a singleton vector containing a pointer to the handler passed as argument
Justification:
EventHandlers
struct instance from the ReplicaSpec
fields, and optionally add the logging handlers by calling add_logging_handlers
on the EventHandlers
struct instance, thus separating the process of adding user-defined handlers from the process of adding pre-defined handlersevent_bus
thread, but the closure would still be owned by the spawning thread. This could be solved with a design that passes the user-defined handlers directly to the event_bus
thread so that it owns them, but this design does not work well for us since we first need to combine the user-defined and logging handlers and check if they are empty before spawning the event_bus
thread in case they're not empty.validate_block_for_sync
method of the App
traitWith this feature the user can define different method for validating a block for progress mode (validate_block
) and for sync mode (in this case called validate_block_for_sync
). Now both validate_block
and validate_block_for_sync
are methods of the App
trait, with the same type signature.
ed25519_dalek
upgrade for improved security of Public Key Cryptography and type-safety:Context:
ed25519_dalek
2.0.0 comes with improved security of Public Key cryptography, and for us also improved type safety, as the new implementation of VerifyingKey
allows us to use it in public interfaces. Previously, we used PublicKeyBytes
, an alias for[u8; 32]
, which might lead to type errors that cannot be detected by the compiler. For example, CryptoHash
is an alias for the same type.
Implementation:
Updated ed25519_dalek
to version 2.0.0 (latest version)
Consequence:
This also required importing the rand_core
feature for ed25519-dalek
and rand_core
crate in Cargo.toml
, because it is required to generate ed25519-dalek
keypairs in tests/test.rs
Replaced PublicKeyBytes
by ed25519-dalek::VerifyingKey
(aliased as PublicKey
) in all public interfaces
In particular replacing the key type in hashmaps and hashsets by PublicKey, concretely ValidatorSetUpdates
and ValidatorSet
.This is possible because ed25519-dalek::VerifyingKey
implements the Hash
trait.
Key design decision:
Defined ValidatorSetUpdatesBytes
and ValidatorSetBytes
, visible only within the crate, in addition to the public ValidatorSetUpdates
and ValidatorSet
, and define Into
and TryFrom
implementations to allow conversion between the structs using PublicKey
and the struct using PublicKeyBytes
.
Justification:
This is necessary because PublicKey
does not implement BorshSerialize
and BorshDeserialize
, so e.g., upon reading ValidatorSet
from BlockTree
, we need to first deserialize it into ValidatorSetBytes
, and then convert it to ValidatorSet
using try_from
. Likewise, when writing to the BlockTree
we first call into
, then serialize.
Consequence:
Improved type-safety. try_from
can return ed25519-dalek::SignatureError
when the PublicKeyBytes
do not correspond to a valid ed25519-dalek::VerifyingKey
, which is different from the std::io
errors possibly returned when deserializing and ensures that replicas only store cryptographically valid PublicKey
s in their memory.
Solution:
Progress =
1. Receiving an acceptable proposal (i.e. a proposal a replica can vote for and a block it can insert), or
2. Receiving an acceptable nudge, or
3. Receiving an acceptable block via sync
Indeed progress implies one of the three, since progress involves extending the blockchain with new blocks in a safe manner. At least this is the kind of "progress" we are concerned with for the sake of the sync mechanism, which involves a replica catching up with the state of the blockchain (and the highest_qc
, but this is of secondary concern since now that we do not discard QC's from higher views it is easier for a replica to stay up-to-date with highest_qc
's even when it is behind on view number of blocks).
Nudge needs to be considered too, since validator-set-updating blocks take 3 voting rounds to commit during which no other blocks might be proposed but progress on the validator-set-updating block is made.
Lack of progress as defined above would indicate that received proposals/nudges/blocks are not acceptable i.e,:
Assuming no Byzantine attack happened, each of the above implies that the replica's knowledge of the protocol execution is outdated and thus, the replica needs to sync.
Note:
Unlike the old sync mechanism based on receiving QC from the future, this makes it possible for a replica that's ahead in terms of view number to sync with the majority of replicas making progress at lower view numbers
Implementation:
SyncTriggerTimer
is updatedSyncTriggerTimer
's timeout()
method called after every iteration of the loop in execute_view
to check if sync should be triggeredtimeout()
is true:
Err(ShouldSync)
which triggers the sync, Why call the timeout()
method after every iteration of the loop in execute view?
Limitation:
This is not the best design for cases when the sync server of the sync peer is not responding or malicious and hence the replica fails to insert any new blocks via sync - in this case we'd ideally want to trigger a sync with a new peer immediately after failed sync.
However, realistically a replica can't distinguish between the cases of not inserting new blocks (1) because peers made no progress either, and (2) because the peer is malicious/not responsive. So all things considered, I think this is the best possible solution.
_Why not "progress = updating highest_qc
"?_
highest_qc
is not a precondition for for accepting proposals, but having up-to-date blocks in the BlockTree
ishighest_qc
" approach leads to triggering unnecessary sync cycles at the start of the protocol when replicas views are not synchronised and no QC's are collected or sent (this is in fact what happens in one of our test cases)Usage:
Deciding how big sync_trigger_timeout
can be is important and needs to consider the Pacemaker
's view synchronization mechanism via the view_timeout
method. In general, it should be bigger than expected view_timeout.
For our DefaultPacemaker
, view_timeout
= $2^{curView - highestQCView}$ seconds so it is biggest when a replica hasn't updated its highest_qc
for a while. Say for 5 views, then view_timeout
= $2^{5} = 32$ for the 5th view after highest_qc
update.
Context:
Previously, out message buffer could overflow, as described in issue #1
Solution:
Make sure that the buffer doesn't grow beyond the user-defined message_buffer_capacity
by checking if adding a given message to the buffer would result in overload, and
Implementation:
progress_msg_buffer_capacity: now this config parameter is stored as a field of ProgressMessageStub (as msg_buffer_capacity), along with msg_buffer_size which keeps track of the number of bytes of the values stored in the msg_buffer
Two extra fields for ProgressMessageStub
(in addition to msg_buffer
):
msg_buffer_capacity
: user-defined progress_message_buffer_capacitymsg_buffer_size
: keeps track of the number of bytes of the values (but not keys) stored in the msg_buffer
, i.e., for every (sender, msg)
added to the buffer the msg_buffer_size
is incremented by mem::size_of::<PublicKey>() as u64 + Self::size_of_msg(&msg)
Two extra methods for ProgressMessageStub
:
remove_from_overloaded_buffer
: given a buffer and number of bytes to remove, it removes highest-viewed messages from the buffer until at least the required number of bytes is removedsize_of_progress_msg
: computes the number of bytes required to store a particular ProgressMessage, by pattern-matching on the enum and calling mem::size_of
Now ProgressMessageStub
's recv
method, upon receiving a message of higher view than current view, also checks if messages need to be removed from the buffer to make space for the new message and performs the removal if needed.
commit_block
Reason:
Previous version is non-tail-recursive and hence vulnerable to stack overflow, see issue #2
Implementation:
Implemented commit_block
with iterators and closures, concretely by defining an iterator over blocks and folding over it, which is logically equivalent to tail recursion (but not computationally, because of Rust compiler)
Justification:
In Rust iterators are "the way". They are:
Limitation:
Because we need to commit blocks from oldest to newest (to apply app and validator set updates in correct order), and because we can only obtain the uncommitted blocks
from newest to oldest (by obtaining a block's parent starting from the inserted block), we need to reverse the latter.
This requires storing a vector of uncommitted_blocks
(since single-ended iterators cannot be reversed), but since it is stored on the heap and goes out of scope after the function returns it is fine. Then we use an iterator over uncommitted_blocks
, which only stores pointers to the heap memory on stack, so it should avoid stack overflow.
chain_id
Context:
Upon receiving a QC, replicas do not check if it matches their chain_id
Solution:
Now safe_qc
takes replica's chain_id
as an extra parameter and checks whether qc.chain_id == chain_id
locked_view
after receiving a nudge with prepareQC (issue #9 )Context:
Whenever we receive a prepareQC for a validator-set-updating block b
, b.justify
should be locked, but previously it wasn't.
The reasons for updating the lockedQC (which we think is equivalent to our locked_view
) are described in the HotStuff paper, but in short locks guard safety of the commit and guarantee liveness. The invariant is that if a value is locked it will be committed eventually, and also honest replicas know of a QC for the value.
Solution and Implementation:
Whenever we receive a prepareQC for a validator-set-updating block b
, set locked_view = b.justify.view
if b.justify.view > locked_view
. This applies to (1) receiving a nudge with prepare phase QC, (2) receiving new view with a highest_qc
from the prepare phase, and (3) collecting a QC for the prepare phase.
pending_validator_set_updates
(issue #15)Context:
Previously pending_validator_set_updates
called on a block's hash did not return None in case the deserializing failed, instead it panicked, and in case of success returned Some(...)
.
Solution:
Because of how is_some()
and is_none()
are called on pending_validator_set_updates(&qc.block)
in safe_qc
, the expected behaviour is that pending_validator_set_updates
returns None when there are no updates and the deserializng fails.
Hence, if deserializing fails we return None
.
tests/test.rs
to account for the updates described aboveblock_exists
since the BlockTree
already has a contains(&self, block: &CryptoHash) -> bool
methodI raised one protocol-significant comment on the pull request, as well as a few style comments.
The definition of “progress” used in this pull request is “receiving a proposal, nudge, or sync response that causes a block to be inserted into the block tree, or a QC to be set as the highest QC”.
I suggested that this definition could be made stricter to only admit blocks that have a block height higher than the previously highest block.
Fn(Event)
for the event handler registration methods, as opposed to a Box(dyn Fn(Event))
.size_of_progress_msg
as a method of ProgressMessage
, as opposed to an associated function of ProgressMessageStub
.PublicKey
for VerifyingKey
.Scenario 1:
highest_qc
highest_qc.block
as a parent thus extending the list of its children but never committing any of the childrenScenario 2:
highest_qc.block
is a validator-set-updating block with a prepareQC for it being the minority's highest_qc
Scenario 3 (indistinguishable from scenario 1):
Scenario 4 (indistinguishable from scenario 2):
That solves the issue with scenarios 1 and 2, and maybe we can make it work for 3 and 4?
Progress =
1. Receiving an acceptable proposal for a block **that extends the blockchain**
2. Receiving an acceptable nudge **with a new QC**
3. Receiving an acceptable block **that extends the blockchain via sync**
where:
b'
extends the blockchain if and only if parent_block.is_none() || block_tree.children(parent_block.unwrap()).is_none()
(where parent_block
is the parent of b
)qc
is new if and only if the highest_qc
is updated to qc
Implies that in scenarios 3 and 4 replicas will also try to sync, in particular at the start of the protocol when replicas are generally out of sync
timeout()
is checked at the end of each iteration in execute_view
, this will not trigger endless sync cyclessync_response_timeout
, (3) returning to execute_view
for the subsequent view, checking if it made progress by receiving one message, and if not then entering sync mode againsync_response_timeout
and sync_trigger_timeout
min(sync_response_timeout, remaining_time_in_view)
but this is very restrictive and effectively makes sync mode part of progress mode...To avoid sync in scenarios 3 and 4 altogether, we need to answer the question: What is the difference between scenarios 1 and 3 (or 2 and 4) from the perspective of the "minority replica"?
As described above, we check if there is sync trigger timeout after every iteration of the loop in execute_view. However, the implication of this is that in the scenario when replicas' views are out of sync, and eventually they synchronise and vote on a proposal (on_receive_proposal
), the leader will not get a chance to collect these votes (via on_receive_vote
), but rather they will go into sync mode, and hence never collect the votes.
Receiving the above-mentioned proposal will not update the timer, since the proposed block does not extend the blockchain, being one of multiple children of the parent block (in the beginning of the protocol, the genesis block), but the first one which can obtain a QC because the validators are finally in sync. This is why sync is triggered again.
Check if there is sync trigger timeout upon every view timeout.
Unfortunately this means checking if there is sync trigger timeout quite rarely, but it seems to be necessary if we want to go with this definition of "progress" without compromising liveness?
Alternatively, we can treat the children of the genesis block differently (i.e. consider any valid proposal that extends genesis as progress), although in theory this scenario can happen for other blocks too.
Define progress as:
Progress =
1. Receiving an acceptable proposal for a block **that extends the blockchain**
2. Receiving an acceptable nudge **with a new QC**
3. Receiving an acceptable block **that extends the blockchain via sync**
where a block block
extends the blockchain if and only if:
block_tree.children(&block.justify.block).is_none() ||
(block_tree.children(&block.justify.block).is_some() && block_tree.children(&block.justify.block).unwrap().is_empty())
and a QC qc
is new if and only if a replica's highest_qc
becomes updated to qc
in response to receiving the nudge.
On each of these events we update the sync trigger timer.
We checker the value of the timer, and trigger sync if the value exceeds sync_trigger_timeout
, upon every view timeout.
Definition of progress:
Implies that sync will be triggered when replicas' views are out of sync and there is no quorum ahead (scenarios 3 and 4 here), but after giving it some thought I concluded that this is fine, i.e., (1) is necessary for any timeout-based trigger mechanism (when sync is triggered in response to lack of progress, as opposed to sync being triggered in response to some event, like receiving QC from the future), since when views are out of sync replicas are indeed not making progress, and (2) in general should not compromise liveness since:
sync_response_timeout
is much smaller than the view_timeout
for a replica that hasn't made progress for at least sync_trigger_timeout
(this should generally be the case, e.g., in tests we set sync_response_timeout
equal to 3 seconds), or (2) sync peers respond fastChecking the timer on every view timeout:
As described here if we check the timer (and potentially trigger sync) after every iteration of the loop in execute_view
, in other words after processing each progress message then:
Moreover, since sync_trigger_timeout
should be bigger than expected view_timeout
(as described here) it's okay for the timer to be checked as rarely as after every view timeout
Conceptually, this solution means that we are giving a given view a chance to bring us progress, and if that doesn't happen for at least 1 view (or more views, depending on sync_trigger_timeout
and for how many views a replica hasn't updated its highest_qc
) then we trigger timeout (perhaps a number-of-views-based, rather than timeout-based sync trigger mechanism could work too)
Overall, I think this timeout-based sync trigger mechanism is a bit slower than the previous event-based one, based on receiving QC from the future, but I believe that, unlike the previous sync trigger mechanism, it should not lead to violations of liveness.
Those are very valid concerns, especially the concern that sync_trigger_timeout
is meaningless as a parameter if we only trigger sync after view_timeout
which might be much bigger. I agree that it is better to move this change to hotstuff-rs 0.4 - in hope that Cogsworth might resolve the problems with timeout-based sync trigger, but also to earn more time to make sure that our new sync trigger mechanism doesn't violate liveness.
One more concern I have about the solution above is that it makes liveness dependent on the config parameters set by the user. However, in general we would like that the protocol satisfies liveness regardless of the config parameters set by the user - the values of these parameters can affect the speed at which the protocol execution moves forward, but eventual progress should be ensured regardless of the values of the parameters. And if we can't achieve that, then we should enforce the desired relations between the config parameters through the code.
Also, in case we really want to resolve issue #12 in hotstuff-rs 0.3, one very quick (but not very neat) solution might be to change the definition of "future", as far as receiving a QC from the future is concerned, from highest_qc.view > cur_view
to highest_qc.view > cur_view + 1
here. I haven't really thought it through though.
FYI: chain_id
of Genesis QC always being 0 regardless of what’s provided by the user is intended behavior.
Ok, got it, I thought it's unintended. I guess the reason for that is to avoid passing chain_id
as an argument to genesis_qc()
or Replica's initialize
method?
The reason I did this is that the fullnode tests were failing because with hotstuff-rs 0.3 in safe_qc
we are checking of the replica's chain_id
and the chain_id
of the qc match - with the genesis chain_id
hardcoded as 0 they don't, and all proposals for the first block fail. However, the alternative solution to this problem is not to check the chain_id
of the genesis_qc
in safe_qc
, so I can do this instead.
commit_block
and delete_branch
commit_block
and delete_branch
are non-tail-recursive and hence, potentially prone to stack overflow, as described in issue #2 and issue #23
Re-write commit_block
and delete_branch
in iterative style.
Clearly, recursive commit_block
and delete_branch
are cleaner and easier to reason about. Moreover, by claim (1) below commit_block
can only commit a maximum of 3 blocks at a time, and by claim (2) delete_branch
can only by called on branches of a maximum depth of 3, giving us reasons to believe that stack overflow is unlikely.
However, the maximum breadth of a branch extending from a sibling of the pruned block is unbounded. This is because a replica is free to insert arbitrarily many children for a block supported by a QC as long as they are distinct and this is enabled by the user's App::produce_block
implementation. When the replicas are honest and their views are in sync this should not happen. Nevertheless, if a replica or a group of replicas lack a quorum present in the same view required to make progress, they will keep inserting children (with the last block supported by a QC as the parent) until the quorum is formed. I think we should account for this scenario - even though it's our responsibility to design the protocol to avoid such long periods of stagnation, I think we should hope for the best and design for the worst (protocol performance).
This affects both commit_block
and delete_branch
, because commit_block
calls delete_siblings
and delete_siblings
calls delete_branch
. Even if we keep commit_block
non-tail-recursive, and rewrite delete_branch
in iterative style, a stack overflow from commit_block
might still happen because of the accumulation of stack frames due to the number of operations required to delete a branch that has a lot of children.
Even though I prefer clean non-tail-recursive code, I think performance should come first, even if the risk is small...
Claim 1: By the protocol specification commit_block
can commit a maximum of 3 blocks
Proof:
By induction on the block height of the inserted block (since commit_block is only called from insert_block).
Base case: If we are inserting a block at height $\leq 4$ this is trivial.
Inductive step: Suppose we are inserting a block, call it $b_5$ at some height $h$. WLOG assume it's parent is $b4$ etc. so that we get a subchain $b{5} \rightarrow b{4} \rightarrow b{3} \rightarrow b{2} \rightarrow b{1}$. Then:
Case 1: $b_{5}.justify$ is in generic phase
Then commit_block
is called on $b_{2}$, and:
insert_block
on $b_{4}$ called commit_block
on $b_{1}$. So calling commit_block
on $b{2}$ commits only $b{2}$insert_block
on $b_{4}$ called commit_block
on $b{3}$, also committing $b{2}$. So calling commit_block
on $b_{2}$ does not commit any blocks.Case 2: $b_{5}.justify$ is in commit phase
Then commit_block
is called on $b_{4}$, and:
insert_block
on $b_{4}$ called commit_block
on $b_{1}$. So calling commit_block
on $b{4}$ commits $b{2}, b{3}, b{4}$insert_block
on $b_{4}$ called commit_block
on $b_{3}$. So calling commit_block
on $b{4}$ commits only $b{4}$.Q.E.D
Claim 2: By the protocol specification any "speculative" branch can have a depth of maximum 3.
speculative branch = branch which has a sibling of a pruned block as its root
Proof:
For the sake of contradiction, assume that $b{4} \rightarrow b{3} \rightarrow b{2} \rightarrow b{1}$ is a subchain of a branch of depth 4 with $b_{1}$, a sibling of a committed and pruned block $b$ (i.e., about to be pruned), as its root.
Then on inserting $b_{4}$ either:
insert_block
$b_{1}$ was committed, and hence $b$ was deleted. This contradicts the fact that $b$ is about to be committed.commit_block
on $b{3}$ also $b{1}$ was committed, since it is at a lower height. Again, contradiction.Q.E.D
The last commit reverts the fix for pending_validator_set_updates
. As explained here pending_validator_set_updates
has been falsely labelled as exhibiting incorrect behavior. In short, the old version correctly returns None
when getting the value for the respective key from the key-value store fails, and throws an error when deserializing the obtained value fails.
This affects both commit_block and delete_branch, because commit_block calls delete_siblings and delete_siblings calls delete_branch. Even if we keep commit_block non-tail-recursive, and rewrite delete_branch in iterative style, a stack overflow from commit_block might still happen because of the accumulation of stack frames due to the number of operations required to delete a branch that has a lot of children.
Quick comment from me: delete_siblings
calls delete_branch
inside a for loop, and a single iteration of the for loop completes only after the delete_branch
call completes. Similarly, delete_branch
also calls itself from within a for loop. In other words, we will completely delete a single branch (which has a max depth of 2) before “moving on” to deleting another branch, so the stack frames will not accumulate, no?
Don’t forget documentation.
This affects both commit_block and delete_branch, because commit_block calls delete_siblings and delete_siblings calls delete_branch. Even if we keep commit_block non-tail-recursive, and rewrite delete_branch in iterative style, a stack overflow from commit_block might still happen because of the accumulation of stack frames due to the number of operations required to delete a branch that has a lot of children.
Quick comment from me:
delete_siblings
callsdelete_branch
inside a for loop, and a single iteration of the for loop completes only after thedelete_branch
call completes. Similarly,delete_branch
also calls itself from within a for loop. In other words, we will completely delete a single branch (which has a max depth of 2) before “moving on” to deleting another branch, so the stack frames will not accumulate, no?
From my understanding this would have been the case if delete_siblings
was not called by a recursive function, by the reasons you have mentioned. However, if delete_siblings
is called on the result of a recursive call to commit_block
in commit_block
, then all computation performed by delete_siblings
, even if it's a for loop, is accumulated on the stack. Basically, the threat of stack overflow in commit_block
might arise not because of the depth of recursion (which is at most 3), but because of the number of stack allocations on each recursion level before we get to the "base case" and stack frames start getting popped. If we call commit_block
on $b{3}$ where the uncommitted subchain is $b{3} \rightarrow b{2} \rightarrow b{1}$, then until commit_block
is called on $b_{1}$ all computation associated with calling commit_block
on $b{1}$ and $b{2}$ will be pushed to the stack.
So I guess if our reasoning here is correct there are two solutions:
commit_block
iterative.**This is fine assuming that a branch extending from a sibling of a pruned block has a limited breadth, otherwise by the same logic as above the stack frames of the for loop will accumulate because the for loop takes a result of the recursive call. I think it's very unlikely that the breadth at this level is very big, but possible, I'd need to think about this.
This commit includes the following:
ed25519-dalek
version 2.0.0 and related shift from exposingPublicKeyBytes
as validator identifiers to usingPublicKey
(ed25519-dalek::VerifyingKey
) insteadcommit_block
(in response to Issue #2)pending_validator_set_updates
(Issue #15)logging.rs
, the bug appearing due toPublicKeyBytes
andCryptoHash
being type aliases for the same typeComment on (2): The previous version of
commit_block
was non-tail-recursive, which means potential stack overflow as more and more stack frames are added and there is nothing the compiler can do about it. In some other languages, implementing a tail-recursive version instead would avoid the stack overflow problem and be optimised by the compiler to performance equivalent to that of a loop (Tail Call Optimisation). However, Rust does not implement Tail Call Optimisations,. Moreover, it has iterators (and closures) which are marginally faster than loops, and folding over an iterator is logically equivalent to tail-recursion. Hence, my solution involves folding over an iterator with a closure that updates the state thereby committing a corresponding block, from oldest blocks to newest blocks.