Open durkmurder opened 1 week ago
Of these errors, I am most concerned about:
{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}
This indicates that:
I can see two possible reasons for this:
After looking through the code, I suspect Controller.Shutdown() -- which tears down the DKG instance broker -- is not being called when we enter, then recover from EFM.
Update: however, it isn't clear why.
Controller.Shutdown() is called when transitioning into EndState.
EndState is entered when Controller.End() is called.
Controller.End() is called by ReactorEngine.end(...).
ReactorEngine.end(...) is scheduled to be called on the final view of phase 3 of the DKG here.
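To make the suspicion concrete, here is a minimal, self-contained Go sketch of how a broker that is never shut down could produce the "wrong DKG instance" error above. Everything in it (Message, Broker, Bus and their methods) is hypothetical and made up for illustration. It is not the flow-go implementation, and it only models the idea that a broker bound to one DKG instance keeps receiving messages meant for the next one.

```go
package main

import "fmt"

// Message is a hypothetical DKG message tagged with its DKG instance ID.
type Message struct {
	DKGInstanceID string
}

// Broker is a hypothetical stand-in for a broker bound to one DKG instance.
type Broker struct {
	instanceID string
}

// Handle rejects any message that belongs to a different DKG instance.
func (b *Broker) Handle(m Message) error {
	if m.DKGInstanceID != b.instanceID {
		return fmt.Errorf("wrong DKG instance. Got %s, want %s", m.DKGInstanceID, b.instanceID)
	}
	return nil
}

// Bus is a hypothetical message bus that fans out to every subscribed broker.
type Bus struct {
	brokers []*Broker
}

// Subscribe registers a broker to receive all future messages.
func (n *Bus) Subscribe(b *Broker) {
	n.brokers = append(n.brokers, b)
}

// Unsubscribe models what a shutdown is assumed to do: stop delivery to b.
func (n *Bus) Unsubscribe(b *Broker) {
	kept := make([]*Broker, 0, len(n.brokers))
	for _, other := range n.brokers {
		if other != b {
			kept = append(kept, other)
		}
	}
	n.brokers = kept
}

// Deliver hands m to every subscribed broker and collects the errors.
func (n *Bus) Deliver(m Message) []error {
	var errs []error
	for _, b := range n.brokers {
		if err := b.Handle(m); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	bus := &Bus{}
	old := &Broker{instanceID: "dkg-flow-localnet-1"}
	bus.Subscribe(old)

	// The next DKG starts while the old broker is still subscribed.
	next := &Broker{instanceID: "dkg-flow-localnet-2"}
	bus.Subscribe(next)

	msg := Message{DKGInstanceID: "dkg-flow-localnet-2"}
	for _, err := range bus.Deliver(msg) {
		fmt.Println("bad message:", err)
	}

	// Once the old broker is torn down, the same message is accepted cleanly.
	bus.Unsubscribe(old)
	fmt.Println("errors after shutdown:", len(bus.Deliver(msg)))
}
```

In this model, the stale broker is the only source of the error, which matches the log where a node wanting dkg-flow-localnet-1 receives a message for dkg-flow-localnet-2. Whether the real broker lifecycle behaves this way is exactly what still needs to be confirmed.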
Context
While addressing review comments on https://github.com/onflow/flow-go/pull/6632, specifically https://github.com/onflow/flow-go/pull/6632#discussion_r1838670355, I wasn't able to resolve this one without further time investment.
Leaving some details for future investigation:
Reference commit: https://github.com/onflow/flow-go/pull/6632/commits/14e37fae103c704cabf565dfbd0aa0825f5ac898
This doesn't happen every time, but I can reproduce it quite reliably simply by re-running the TestRecoverEpoch integration test. The first DKG fails as planned because we stop a collection node. After entering the recovery epoch, we expect the DKG to succeed so that we can enter the epoch after the recovery. In this test setup, epochs are numbered as follows:
The DKG process is always started when the setup phase starts for epoch 1; the problem manifests itself later. Some error logs during the DKG:
This log is always reported when DKG fails, and it is reported for every node.
I have often seen such logs from all nodes; it seems the nodes are flagging each other:
Flagging a participant leads to the following log:
Additionally there is extra flagging:
In the end, each node reports the same error:
This leads to submitting an empty DKG result:
Possible reasons for failed DKG based on logs:
Definition of done