I don't think the other validators should queue any messages for a removed node. If node Bob is removed in epoch 10 and added back in epoch 12, there's no danger of Bob handling any messages from epoch 12 before his removal. Also, messages from epoch 11 never need to reach Bob at all.
The only problem that might occur with the current implementation is that another node, Alice, already sent `EpochStarted(12)` to Bob before he disconnected, and won't send it again when he reconnects.
So the solution is to either:

- answer every `EpochStarted` that contains a new epoch (or even just a new era), i.e. one that we didn't know about for this particular peer yet, with our own `EpochStarted`, even if that's duplicated, or
- resend `EpochStarted` whenever a node reconnects.

I'd prefer the first option (sketched below), since it doesn't complicate the API and keeps the abstract algorithm more independent of the networking.
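A minimal sketch of the first option, assuming a simplified `peer_epochs` map and hypothetical `Message` and `SenderQueue` stand-ins rather than the real sender queue types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the sender queue's message and state types.
#[derive(Debug, PartialEq)]
enum Message {
    EpochStarted(u64),
}

#[derive(Default)]
struct SenderQueue {
    our_epoch: u64,
    peer_epochs: HashMap<String, u64>,
}

impl SenderQueue {
    /// Option 1: if the announced epoch is new for this peer, record it and
    /// answer with our own `EpochStarted`, even if that duplicates an
    /// announcement the peer has already seen.
    fn on_epoch_started(&mut self, sender: &str, epoch: u64) -> Option<Message> {
        let known = self.peer_epochs.get(sender).copied();
        if known.map_or(true, |e| e < epoch) {
            self.peer_epochs.insert(sender.to_string(), epoch);
            return Some(Message::EpochStarted(self.our_epoch));
        }
        None // Nothing new for this peer, so no reply is needed.
    }
}

fn main() {
    let mut alice = SenderQueue {
        our_epoch: 12,
        ..Default::default()
    };
    // A reconnecting Bob announces epoch 12: Alice answers with her own epoch.
    assert_eq!(
        alice.on_epoch_started("bob", 12),
        Some(Message::EpochStarted(12))
    );
    // A duplicate announcement gets no reply.
    assert_eq!(alice.on_epoch_started("bob", 12), None);
}
```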
I just tried the following, which I think implements the first point, but it didn't fix the `net_dynamic_hb` failure. :pensive:
```diff
--- a/src/sender_queue/mod.rs
+++ b/src/sender_queue/mod.rs
@@ -207,19 +207,16 @@ where
     /// Handles an epoch start announcement.
     fn handle_epoch_started(&mut self, sender_id: &D::NodeId, epoch: D::Epoch) -> DaStep<Self> {
-        self.peer_epochs
-            .entry(sender_id.clone())
-            .and_modify(|e| {
-                if *e < epoch {
-                    *e = epoch;
-                }
-            })
-            .or_insert(epoch);
-        if !self.remove_participant_if_old(sender_id) {
-            self.process_new_epoch(sender_id, epoch)
-        } else {
-            Step::<D>::default()
+        if self.peer_epochs.get(sender_id) == Some(&epoch) {
+            return Step::<D>::default();
+        }
+        self.peer_epochs.insert(sender_id.clone(), epoch);
+        if self.remove_participant_if_old(sender_id) {
+            return Step::<D>::default();
         }
+        let msg = Message::EpochStarted(self.algo.epoch());
+        let step = Target::All.message(msg).into();
+        self.process_new_epoch(sender_id, epoch).join(step)
     }
```
> The only problem that might occur with the current implementation is that another node, Alice, already sent `EpochStarted(12)` to Bob before he disconnected, and won't send it again when he reconnects.
I think the problem is rather that other messages also get sent to the node that is about to restart in the same epoch, and those messages go missing when the node restarts. So I don't think that resending only `EpochStarted` would fix that.
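To illustrate why resending only `EpochStarted` can't recover those messages, here is a toy model (hypothetical types, not hbbft's): anything delivered to the old instance disappears when the node restarts with fresh state.

```rust
// Toy model: a node whose inbox is dropped on restart.
#[derive(Default, Debug)]
struct Node {
    inbox: Vec<String>,
}

impl Node {
    fn deliver(&mut self, msg: &str) {
        self.inbox.push(msg.to_string());
    }

    /// Restarting replaces the instance; the old inbox is lost with it.
    fn restart(self) -> Node {
        Node::default()
    }
}

fn main() {
    let mut bob = Node::default();
    // Other nodes send epoch messages to the instance that is about to stop.
    bob.deliver("Agreement, epoch 5");
    bob.deliver("Broadcast, epoch 5");
    let bob = bob.restart();
    // Resending `EpochStarted(5)` afterwards cannot bring these back:
    assert!(bob.inbox.is_empty());
}
```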
Ah, sorry, I probably misunderstood then. This issue is related to #365 and not to the `net_dynamic_hb` failure, right? (Because in the test, no node is re-added in the same epoch in which it leaves, I think.)
This is related to `net_dynamic_hb` and to the failures in the now-decommissioned old DHB test.
> Because in the test, no node is re-added in the same epoch in which it leaves, I think.
That's right. And that is why the restarted node sometimes doesn't output a batch in that epoch: some of the messages from the other nodes went to the stopped instance on that node, so the restarted instance never receives them.
Sorry, I'm still confused: if a node leaves in epoch 10, it will never send an `EpochStarted(12)` before leaving. Also, it will only shut down once it has output batch 10 (otherwise it wouldn't know it has to shut down). So if it rejoins in epoch 12, how could any of the relevant messages already have been sent to it before it left?
In `net_dynamic_hb` the chosen node leaves in epoch 4 and restarts with a join plan in epoch 5. Before restarting, though, it sends an `EpochStarted` message for epoch 5, and other nodes send some messages for epoch 5 to it. Then it restarts and never gets the epoch 5 messages that it had received before restarting.
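A self-contained toy model of that interleaving (hypothetical names, not the test's API): the announcement triggers the flush before the restart, so the flushed messages land in the doomed instance.

```rust
use std::collections::HashMap;

// Toy model of a remaining peer's per-node outgoing queue.
#[derive(Default)]
struct Peer {
    // Epoch-tagged messages queued per remote node.
    outqueue: HashMap<&'static str, Vec<(u64, &'static str)>>,
}

impl Peer {
    fn queue(&mut self, to: &'static str, epoch: u64, msg: &'static str) {
        self.outqueue.entry(to).or_default().push((epoch, msg));
    }

    /// On `EpochStarted(epoch)` from a peer, flush everything queued for it
    /// up to and including that epoch.
    fn on_epoch_started(&mut self, from: &'static str, epoch: u64) -> Vec<&'static str> {
        let queued = self.outqueue.entry(from).or_default();
        let (send, keep): (Vec<_>, Vec<_>) = queued.drain(..).partition(|&(e, _)| e <= epoch);
        *queued = keep;
        send.into_iter().map(|(_, msg)| msg).collect()
    }
}

fn main() {
    let mut alice = Peer::default();
    alice.queue("bob", 5, "Agreement, epoch 5");
    // Bob announces epoch 5 *before* restarting with the join plan...
    let delivered = alice.on_epoch_started("bob", 5);
    assert_eq!(delivered, vec!["Agreement, epoch 5"]);
    // ...so the message reaches the instance that is about to be dropped,
    // and the restarted instance never sees it.
}
```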
Ah, right, that's it! So we just need to not send `EpochStarted(5)` if we know we just got removed?
I think so!
We should be careful though not to send `EpochStarted(6)` and then restart! :)
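A minimal sketch of that guard, assuming a hypothetical `removed_in` field that records the removal epoch (not the sender queue's real state):

```rust
// Suppress the epoch announcement while this node knows it has been removed
// and has not yet restarted from a `JoinPlan`.
struct SenderQueue {
    epoch: u64,
    /// Set when the node learns it stops being a validator as of this epoch.
    removed_in: Option<u64>,
}

impl SenderQueue {
    fn should_announce_epoch(&self) -> bool {
        // Neither `EpochStarted(5)` nor `EpochStarted(6)` goes out while the
        // node has reached its removal epoch and is waiting for a `JoinPlan`.
        self.removed_in.map_or(true, |removal_epoch| self.epoch < removal_epoch)
    }
}

fn main() {
    let removed = SenderQueue { epoch: 5, removed_in: Some(5) };
    assert!(!removed.should_announce_epoch());
    let active = SenderQueue { epoch: 5, removed_in: None };
    assert!(active.should_announce_epoch());
}
```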
The sender queue currently does not correctly handle the edge case of removing a validator and rejoining the same validator in the next epoch. The bug is as follows: the removed validator announces the epoch in which it is no longer a validator, and other nodes start sending messages for that epoch. If we then start a ballot for adding the same node back as a validator in that next epoch, we can no longer correctly restart the removed validator with the join plan, because it has already processed some of the messages for that epoch.
With the current implementation of the sender queue, this use case requires additional processing of messages, e.g. replaying those next-epoch messages on the restarted node. There are better options.
Messages to the removed validator should be queued correctly on all other validators, starting from the epoch in which it is no longer a validator. If the removed validator is going to rejoin in that epoch, it must not send the `EpochStarted` message for that epoch until it receives a `JoinPlan` out of band and restarts. Once restarted, it sends out the `EpochStarted` message, and the other validators send the queued messages to it.

This solution changes how removed validators are treated: they no longer become observers, since they don't send any messages until they receive a `JoinPlan`.

If you think there are better solutions, or that we simply have to keep the removed validator as an observer by default, let's discuss it in this issue.
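A rough end-to-end sketch of this flow, with toy types standing in for the real validators and messages (not hbbft's API): peers queue messages from the removal epoch onward, and the removed node stays silent until a `JoinPlan` arrives out of band.

```rust
use std::collections::HashMap;

/// What a remaining validator keeps for the removed peer.
#[derive(Default)]
struct RemainingValidator {
    queued: HashMap<u64, Vec<&'static str>>, // messages keyed by epoch
}

/// The removed validator's side: no `EpochStarted` until it has a `JoinPlan`.
struct RemovedValidator {
    join_plan_epoch: Option<u64>,
}

impl RemovedValidator {
    fn epoch_started(&self) -> Option<u64> {
        self.join_plan_epoch
    }
}

fn main() {
    let mut alice = RemainingValidator::default();
    // Bob stops being a validator as of epoch 12; Alice queues his messages.
    alice.queued.entry(12).or_default().push("Agreement, epoch 12");

    let mut bob = RemovedValidator { join_plan_epoch: None };
    // Bob is silent: not even an observer, since he sends nothing yet.
    assert_eq!(bob.epoch_started(), None);

    // A `JoinPlan` for epoch 12 arrives out of band; Bob restarts and announces.
    bob.join_plan_epoch = Some(12);
    let epoch = bob.epoch_started().expect("restarted with a join plan");

    // Alice now releases everything queued for that epoch.
    let delivered = alice.queued.remove(&epoch).unwrap_or_default();
    assert_eq!(delivered, vec!["Agreement, epoch 12"]);
}
```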