onflow / flow-go


[BFT] Investigate why consensus nodes fail DKG #6729

Open durkmurder opened 1 week ago

durkmurder commented 1 week ago

Context

While addressing review comments for https://github.com/onflow/flow-go/pull/6632, specifically this comment: https://github.com/onflow/flow-go/pull/6632#discussion_r1838670355, I wasn't able to resolve it without further time investment.

Leaving some details for future investigation:

Reference commit: https://github.com/onflow/flow-go/pull/6632/commits/14e37fae103c704cabf565dfbd0aa0825f5ac898

This doesn't happen every time, but I can reproduce it quite reliably simply by re-running the TestRecoverEpoch integration test.

The first DKG fails as planned because we stop a collection node. After entering the recovery epoch, we expect the DKG to succeed and that we will be able to enter the epoch following the recovery epoch. In this test setup, epochs are numbered as follows:

The DKG process always starts when the setup phase begins for epoch 1; the problem manifests itself later. Some error logs observed during the DKG:

This log is always reported when the DKG fails, and it is reported by every node.

{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}

I have often seen such logs from all the nodes; it seems like the nodes are flagging each other:

{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:22.200603269Z","message":"participant 0 (this node) is flagging participant (index=2, node_id=4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b) because: building a complaint"}

Flagging a participant leads to the following log:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:30.513419339Z","message":"participant 2 (this node) is disqualifying participant (index=0, node_id=086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91) because: there are 2 complaints, they exceeded the threshold 1"}

Additionally, there is extra flagging:

{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:33.278872521Z","message":"participant 0 (this node) is flagging participant (index=1, node_id=3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba) because: complaint received after the complaint timeout"}

In the end, each node reports the same error:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","engine":"dkg_reactor","error":"Joint-Feldman failed because the disqualified participants number is high: 2 disqualified, threshold is 1, size is 3","time":"2024-11-15T17:13:38.790678258Z","message":"node 4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b with index 2 failed DKG locally"}

This leads to submitting an empty DKG result:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:38.791592325Z","message":"submitting empty dkg result because I completed the DKG unsuccessfully"}

Possible reasons for the failed DKG, based on the logs:

Definition of done

jordanschalm commented 1 week ago

Of these errors, I am most concerned about:

{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}

This indicates that:

I can see two possible reasons for this:

1. The receiver Consensus Node has a lagging local state.

2. The receiver Consensus Node failed to properly tear down its DKG engine for epoch 1 (even after the DKG for epoch 2 had already started).

Suspected Cause

After looking through the code, I suspect that Controller.Shutdown(), which tears down the DKG instance broker, is not being called when we enter and then recover from EFM.

Update: however, it isn't clear why.
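
For illustration, a minimal sketch of the lifecycle we would expect if Controller.Shutdown() were called correctly: the previous instance's controller (and with it, its broker) is torn down before the next instance starts, so a stale broker cannot keep rejecting messages tagged with the new instance ID. Only Controller and Shutdown() come from the discussion above; the surrounding reactor wrapper is hypothetical:

```go
// Controller stands in for flow-go's DKG controller; only Run and
// Shutdown matter for this sketch.
type Controller struct{ instanceID string }

func (c *Controller) Run()      { /* drive the DKG protocol */ }
func (c *Controller) Shutdown() { /* tear down the instance broker */ }

// reactor is a hypothetical wrapper showing the expected ordering:
// shut down the old epoch's controller before starting the next one.
// The suspicion above is that this Shutdown call is skipped when we
// enter and then recover from EFM.
type reactor struct{ current *Controller }

func (r *reactor) startDKG(next *Controller) {
	if r.current != nil {
		r.current.Shutdown()
	}
	r.current = next
	go r.current.Run()
}
```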