Implement a journal, to persist and restore the algorithms' state?

afck commented 5 years ago

For Parity (https://github.com/poanetwork/parity-ethereum/issues/181) and other applications, we need to make it possible to persist the state of consensus to disk. That allows a node to e.g. restart without becoming faulty.

We will add a journaling feature to hbbft: Each Step returned by the algorithms will contain journal log entries that represent the internal state change. These can be persisted, and even after a crash, the full state can be recovered by replaying the journal. That way, a node can continue to operate without being marked as faulty or skipping an epoch.

To keep the journal from growing indefinitely, the entries' epoch number should be user-visible: The upper layer can always delete all entries from previous epochs.

~~So let's make all algorithm structs implement the serde traits (Serialize and Deserialize).~~

afck commented 5 years ago

On the other hand, that makes it easy to mistakenly persist secret keys and key shares, since those are part of the algorithm structs. Also, NetworkInfo is shared between the subalgorithms, and the default Deserialize implementation would create a lot of duplicates.

We should probably encourage the user to store the NetworkInfo — and in particular the secret keys — separately, so they can take measures to protect them.

So we could implement a serializable state representation for each algorithm, which wouldn't include the NetworkInfo.

Alternatively maybe we could add a journal to Step, containing individual state change logs. That would allow persisting the state update in every turn, and restoring the state from a journal even after a crash.

dforsten commented 5 years ago

As discussed privately the "journal" function would be best.

An essential feature for "production-ready" status.

afck commented 5 years ago

Conceptually, the event handler methods (handle_input, handle_message) take:

the old state and NetworkInfo (&mut self),
the event (user input or message),
an RNG (&mut R where R: Rng),

and they output:

the new state (&mut self),
the algorithm's output, the fault log and outgoing messages (Step).

The journal would be added to Step as a sequence of entries that fully define the difference between the old state and the new state.

The naive way of implementing that is not an option: If we try to go through the code and create a journal entry in every one of the hundreds of places where we modify the mutable algorithm state, we'll never get it right, and the code wouldn't be maintainable; every change where we forget to add or update the corresponding journal entry creation would break it. I see three possible ways forward:

1. Only implement checkpoints.

Instead, just make sure that the event handlers are all fully deterministic. In fact, this should already be the case!

So the user can already emulate journaling functionality by:

persisting all incoming messages and user inputs, in the order in which they were handled, and
using a deterministic RNG (probably initialized with a truly random seed, and possibly reseeded periodically, as long as all those seeds are persisted, too).

Then after a crash or restart, the algorithm's state can be restored by replaying all inputs and using the same seeds. The output, outgoing messages and fault logs can all be thrown away, since they have already been handled.

To avoid having to store all of this forever, it would suffice to make the algorithm state serializable. Then the log can periodically be deleted and replaced with the current snapshot.

I suspect that this option would actually not be very wasteful: hbbft's message types don't contain much redundant information, so in normal operation the log might not even require more space than journal entries would. (To avoid storing messages that didn't change the state, we could add some flag to Step.) If that's true, I don't think there is a good justification for the other options:

2. Diff the states after event handling.

All handle_event and handle_message implementations could clone the old state, and, before returning, produce log entries based on the difference between the old and new states. That way, at least we wouldn't have to touch the internal logic too much.

However, it can be wasteful to clone the full state every time. In some places, this could be optimized a bit, e.g. we'd certainly put the HoneyBadger contributions, which can be large, into Rcs instead of cloning them. But there would still be a lot of cloning going on…

3. Encapsulate the algorithm states.

This would be the most invasive change, but it would avoid the cloning:

Separate all the algorithms' states, and make their fields private. They can only be modified by calling state.apply(&journal_entry). That way we are forced to always produce the correct journal entries. We can make them #[must_use], too: They must always be added into the Step.

This would mean almost a complete rewrite, though.

dforsten commented 5 years ago

Interesting approach using deterministic replay of hbbft input to get back to the state before shutdown.

In Parity we already will have to persist outgoing, but unsent, messages - persisting incoming messages as long as a hbbft epoch is active should be similar in implementation.

afck commented 5 years ago

I don't think it makes sense to implement snapshot functionality for all the one-shot subalgorithms. And for HoneyBadger it suffices to make a "snapshot" on each epoch change, which essentially consists of only the epoch number.

So I'm actually not sure we need to add any functionality to hbbft at all. Here's what I think should already work:

Persist each incoming message and each input to HoneyBadger before sending out the messages from the returned Step. For each input, use a new PRNG initialized with a random seed, and store the seed.
Whenever epoch number n completes (block #n is imported), all messages msg with msg.epoch() <= n can be deleted, as well as all inputs for epochs <= n. The current epoch number n + 1 needs to be stored on disk.
After a crash or restart, initialize with HoneyBadgerBuilder::new(network_info).epoch(n + 1).build() (and any additional parameters that were set), and make it handle all saved inputs and messages with epoch >= n again. For the inputs, use PRNGs initialized with the stored seeds.

afck commented 5 years ago

Please reopen if there's anything we need to add to hbbft to support restarts and crash recovery.

poanetwork / hbbft