onflow / flow-go


[BFT] Investigate why consensus nodes fail DKG #6729

Open durkmurder opened 1 week ago

durkmurder commented 1 week ago

Context

While addressing review comments for https://github.com/onflow/flow-go/pull/6632, specifically this comment: https://github.com/onflow/flow-go/pull/6632#discussion_r1838670355, I wasn't able to resolve it without further time investment.

Leaving some details for future investigation:

Reference commit: https://github.com/onflow/flow-go/pull/6632/commits/14e37fae103c704cabf565dfbd0aa0825f5ac898

This doesn't happen every time, but I can reproduce it quite reliably simply by re-running the TestRecoverEpoch integration test.

The first DKG fails as planned because we stop a collection node. After entering the recovery epoch, we expect the DKG to succeed and that we will be able to enter the epoch following the recovery epoch. In this test setup, epochs are numbered as follows:

The DKG process always starts when the setup phase begins for epoch 1; the problem manifests itself later. Some error logs observed during the DKG:

This log is always reported when the DKG fails, and it is reported by every node.

{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}

I have often seen such logs from all the nodes; it seems like the nodes are flagging each other:

{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:22.200603269Z","message":"participant 0 (this node) is flagging participant (index=2, node_id=4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b) because: building a complaint"}

Flagging a participant leads to the following log:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:30.513419339Z","message":"participant 2 (this node) is disqualifying participant (index=0, node_id=086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91) because: there are 2 complaints, they exceeded the threshold 1"}

Additionally, there is extra flagging:

{"level":"warn","node_role":"consensus","node_id":"086ee2bbea1aa36181523595acffd096ce7557aca80bde6ee2e5be1bae690b91","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:33.278872521Z","message":"participant 0 (this node) is flagging participant (index=1, node_id=3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba) because: complaint received after the complaint timeout"}

In the end, each node reports the same error:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","engine":"dkg_reactor","error":"Joint-Feldman failed because the disqualified participants number is high: 2 disqualified, threshold is 1, size is 3","time":"2024-11-15T17:13:38.790678258Z","message":"node 4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b with index 2 failed DKG locally"}

This leads to submitting an empty DKG result:

{"level":"warn","node_role":"consensus","node_id":"4d254e362657f592af16708e6aadf6232d689e4a072e4369f4841860d9de732b","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-2","time":"2024-11-15T17:13:38.791592325Z","message":"submitting empty dkg result because I completed the DKG unsuccessfully"}

Possible reasons for the failed DKG, based on the logs:

Definition of done

jordanschalm commented 1 week ago

Of these errors, I am most concerned about:

{"level":"error","node_role":"consensus","node_id":"3ec567126474f024f2fff57b1e5e626e803ec56f3664c10ef8e49360dafc40ba","component":"dkg_broker","dkg_instance_id":"dkg-flow-localnet-1","error":"wrong DKG instance. Got dkg-flow-localnet-2, want dkg-flow-localnet-1","time":"2024-11-15T17:13:15.06699973Z","message":"bad message"}

This indicates that:

I can see two possible reasons for this:

1. The receiver Consensus Node has a lagging local state.

2. The receiver Consensus Node failed to properly tear down its DKG engine for epoch 1 (even after the DKG for epoch 2 had already started).

Suspected Cause

After looking through the code, I suspect that Controller.Shutdown(), which tears down the DKG instance broker, is not being called when we enter and then recover from EFM.

Update: however, it isn't clear why.
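
For illustration, a minimal sketch of the lifecycle we would expect if Controller.Shutdown() were called correctly: the previous instance's controller (and with it, its broker) is torn down before the next instance starts, so a stale broker cannot keep rejecting messages tagged with the new instance ID. Only Controller and Shutdown() come from the discussion above; the surrounding reactor wrapper is hypothetical:

```go
// Controller stands in for flow-go's DKG controller; only Run and
// Shutdown matter for this sketch.
type Controller struct{ instanceID string }

func (c *Controller) Run()      { /* drive the DKG protocol */ }
func (c *Controller) Shutdown() { /* tear down the instance broker */ }

// reactor is a hypothetical wrapper showing the expected ordering:
// shut down the old epoch's controller before starting the next one.
// The suspicion above is that this Shutdown call is skipped when we
// enter and then recover from EFM.
type reactor struct{ current *Controller }

func (r *reactor) startDKG(next *Controller) {
	if r.current != nil {
		r.current.Shutdown()
	}
	r.current = next
	go r.current.Run()
}
```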