I don't think the other validators should queue any messages for a removed node. If node Bob is removed in epoch 10 and added back in epoch 12, there's no danger of Bob handling any messages from epoch 12 before his removal. Also, messages from epoch 11 never need to reach Bob at all.
The only problem that might occur with the current implementation is that another node, Alice, already sent `EpochStarted(12)` to Bob before he disconnected, and won't send it again when he reconnects.
So the solution is to either:

- answer every `EpochStarted` that contains a new epoch (or even just a new era), i.e. one that we didn't know about for this particular peer yet, with our own `EpochStarted`, even if that's duplicated, or
- resend `EpochStarted` whenever a node reconnects.

I'd prefer the first option (sketched below), since it doesn't complicate the API and keeps the abstract algorithm more independent of the networking.
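A minimal sketch of the first option, assuming a simplified `peer_epochs` map and hypothetical `Message` and `SenderQueue` stand-ins rather than the real sender queue types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the sender queue's message and state types.
#[derive(Debug, PartialEq)]
enum Message {
    EpochStarted(u64),
}

#[derive(Default)]
struct SenderQueue {
    our_epoch: u64,
    peer_epochs: HashMap<String, u64>,
}

impl SenderQueue {
    /// Option 1: if the announced epoch is new for this peer, record it and
    /// answer with our own `EpochStarted`, even if that duplicates an
    /// announcement the peer has already seen.
    fn on_epoch_started(&mut self, sender: &str, epoch: u64) -> Option<Message> {
        let known = self.peer_epochs.get(sender).copied();
        if known.map_or(true, |e| e < epoch) {
            self.peer_epochs.insert(sender.to_string(), epoch);
            return Some(Message::EpochStarted(self.our_epoch));
        }
        None // Nothing new for this peer, so no reply is needed.
    }
}

fn main() {
    let mut alice = SenderQueue {
        our_epoch: 12,
        ..Default::default()
    };
    // A reconnecting Bob announces epoch 12: Alice answers with her own epoch.
    assert_eq!(
        alice.on_epoch_started("bob", 12),
        Some(Message::EpochStarted(12))
    );
    // A duplicate announcement gets no reply.
    assert_eq!(alice.on_epoch_started("bob", 12), None);
}
```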
I just tried the following, which I think implements the first point, but it didn't fix the `net_dynamic_hb` failure. :pensive:
```diff
--- a/src/sender_queue/mod.rs
+++ b/src/sender_queue/mod.rs
@@ -207,19 +207,16 @@ where
     /// Handles an epoch start announcement.
     fn handle_epoch_started(&mut self, sender_id: &D::NodeId, epoch: D::Epoch) -> DaStep<Self> {
-        self.peer_epochs
-            .entry(sender_id.clone())
-            .and_modify(|e| {
-                if *e < epoch {
-                    *e = epoch;
-                }
-            })
-            .or_insert(epoch);
-        if !self.remove_participant_if_old(sender_id) {
-            self.process_new_epoch(sender_id, epoch)
-        } else {
-            Step::<D>::default()
+        if self.peer_epochs.get(sender_id) == Some(&epoch) {
+            return Step::<D>::default();
+        }
+        self.peer_epochs.insert(sender_id.clone(), epoch);
+        if self.remove_participant_if_old(sender_id) {
+            return Step::<D>::default();
         }
+        let msg = Message::EpochStarted(self.algo.epoch());
+        let step = Target::All.message(msg).into();
+        self.process_new_epoch(sender_id, epoch).join(step)
     }
```
> The only problem that might occur with the current implementation is that another node, Alice, already sent `EpochStarted(12)` to Bob before he disconnected, and won't send it again when he reconnects.
I think the problem is rather that other messages also get sent to the node that is about to restart in the same epoch, and those messages go missing when the node restarts. So I don't think that resending only `EpochStarted` would fix that.
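To illustrate why resending only `EpochStarted` can't recover those messages, here is a toy model (hypothetical types, not hbbft's): anything delivered to the old instance disappears when the node restarts with fresh state.

```rust
// Toy model: a node whose inbox is dropped on restart.
#[derive(Default, Debug)]
struct Node {
    inbox: Vec<String>,
}

impl Node {
    fn deliver(&mut self, msg: &str) {
        self.inbox.push(msg.to_string());
    }

    /// Restarting replaces the instance; the old inbox is lost with it.
    fn restart(self) -> Node {
        Node::default()
    }
}

fn main() {
    let mut bob = Node::default();
    // Other nodes send epoch messages to the instance that is about to stop.
    bob.deliver("Agreement, epoch 5");
    bob.deliver("Broadcast, epoch 5");
    let bob = bob.restart();
    // Resending `EpochStarted(5)` afterwards cannot bring these back:
    assert!(bob.inbox.is_empty());
}
```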
Ah, sorry, I probably misunderstood then. This issue is related to #365 and not to the `net_dynamic_hb` failure, right? (Because in the test, no node is re-added in the same epoch in which it leaves, I think.)
This is related to `net_dynamic_hb` and to the failures in the now-decommissioned old DHB test.
> Because in the test, no node is re-added in the same epoch in which it leaves, I think.
That's right. And that is why the restarted node sometimes doesn't output a batch in that epoch: some of the messages from the other nodes went to the stopped instance on that node, so the restarted instance never receives them.
Sorry, I'm still confused: if a node leaves in epoch 10, it will never send an `EpochStarted(12)` before leaving. Also, it will only shut down once it has output batch 10 (otherwise it wouldn't know it has to shut down). So if it rejoins in epoch 12, how could any of the relevant messages already have been sent to it before it left?
In `net_dynamic_hb` the chosen node leaves in epoch 4 and restarts with a join plan in epoch 5. Before restarting, though, it sends an `EpochStarted` message for epoch 5, and other nodes send some messages for epoch 5 to it. Then it restarts and never gets the epoch 5 messages that it had received before restarting.
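A self-contained toy model of that interleaving (hypothetical names, not the test's API): the announcement triggers the flush before the restart, so the flushed messages land in the doomed instance.

```rust
use std::collections::HashMap;

// Toy model of a remaining peer's per-node outgoing queue.
#[derive(Default)]
struct Peer {
    // Epoch-tagged messages queued per remote node.
    outqueue: HashMap<&'static str, Vec<(u64, &'static str)>>,
}

impl Peer {
    fn queue(&mut self, to: &'static str, epoch: u64, msg: &'static str) {
        self.outqueue.entry(to).or_default().push((epoch, msg));
    }

    /// On `EpochStarted(epoch)` from a peer, flush everything queued for it
    /// up to and including that epoch.
    fn on_epoch_started(&mut self, from: &'static str, epoch: u64) -> Vec<&'static str> {
        let queued = self.outqueue.entry(from).or_default();
        let (send, keep): (Vec<_>, Vec<_>) = queued.drain(..).partition(|&(e, _)| e <= epoch);
        *queued = keep;
        send.into_iter().map(|(_, msg)| msg).collect()
    }
}

fn main() {
    let mut alice = Peer::default();
    alice.queue("bob", 5, "Agreement, epoch 5");
    // Bob announces epoch 5 *before* restarting with the join plan...
    let delivered = alice.on_epoch_started("bob", 5);
    assert_eq!(delivered, vec!["Agreement, epoch 5"]);
    // ...so the message reaches the instance that is about to be dropped,
    // and the restarted instance never sees it.
}
```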
Ah, right, that's it! So we just need to not send `EpochStarted(5)` if we know we just got removed?
I think so!
We should be careful though not to send `EpochStarted(6)` and then restart! :)
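A minimal sketch of that guard, assuming a hypothetical `removed_in` field that records the removal epoch (not the sender queue's real state):

```rust
// Suppress the epoch announcement while this node knows it has been removed
// and has not yet restarted from a `JoinPlan`.
struct SenderQueue {
    epoch: u64,
    /// Set when the node learns it stops being a validator as of this epoch.
    removed_in: Option<u64>,
}

impl SenderQueue {
    fn should_announce_epoch(&self) -> bool {
        // Neither `EpochStarted(5)` nor `EpochStarted(6)` goes out while the
        // node has reached its removal epoch and is waiting for a `JoinPlan`.
        self.removed_in.map_or(true, |removal_epoch| self.epoch < removal_epoch)
    }
}

fn main() {
    let removed = SenderQueue { epoch: 5, removed_in: Some(5) };
    assert!(!removed.should_announce_epoch());
    let active = SenderQueue { epoch: 5, removed_in: None };
    assert!(active.should_announce_epoch());
}
```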
The sender queue currently does not correctly handle the edge case of removing a validator and rejoining the same validator in the next epoch. The bug is as follows: the removed validator announces the epoch in which it is no longer a validator, and other nodes start sending messages for that epoch. If we then start a ballot for adding the same node back as a validator in that next epoch, we can no longer correctly restart the removed validator with the join plan, because it has already processed some of the messages for that epoch.
With the current implementation of the sender queue, this use case requires additional processing of messages, e.g. replaying those next-epoch messages on the restarted node. There are better options.
Messages to the removed validator should be queued correctly on all other validators, starting from the epoch in which it is no longer a validator. If the removed validator is going to rejoin in that epoch, it must not send the `EpochStarted` message for that epoch until it receives a `JoinPlan` out of band and restarts. Once restarted, it sends out the `EpochStarted` message, and the other validators send the queued messages to it.

This solution changes how removed validators are treated: they no longer become observers, since they don't send any messages until they receive a `JoinPlan`.

If you think there are better solutions, or that we simply have to keep the removed validator as an observer by default, let's discuss it in this issue.
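A rough end-to-end sketch of this flow, with toy types standing in for the real validators and messages (not hbbft's API): peers queue messages from the removal epoch onward, and the removed node stays silent until a `JoinPlan` arrives out of band.

```rust
use std::collections::HashMap;

/// What a remaining validator keeps for the removed peer.
#[derive(Default)]
struct RemainingValidator {
    queued: HashMap<u64, Vec<&'static str>>, // messages keyed by epoch
}

/// The removed validator's side: no `EpochStarted` until it has a `JoinPlan`.
struct RemovedValidator {
    join_plan_epoch: Option<u64>,
}

impl RemovedValidator {
    fn epoch_started(&self) -> Option<u64> {
        self.join_plan_epoch
    }
}

fn main() {
    let mut alice = RemainingValidator::default();
    // Bob stops being a validator as of epoch 12; Alice queues his messages.
    alice.queued.entry(12).or_default().push("Agreement, epoch 12");

    let mut bob = RemovedValidator { join_plan_epoch: None };
    // Bob is silent: not even an observer, since he sends nothing yet.
    assert_eq!(bob.epoch_started(), None);

    // A `JoinPlan` for epoch 12 arrives out of band; Bob restarts and announces.
    bob.join_plan_epoch = Some(12);
    let epoch = bob.epoch_started().expect("restarted with a join plan");

    // Alice now releases everything queued for that epoch.
    let delivered = alice.queued.remove(&epoch).unwrap_or_default();
    assert_eq!(delivered, vec!["Agreement, epoch 12"]);
}
```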