poanetwork / hbbft

An implementation of the paper "Honey Badger of BFT Protocols" in Rust. This is a modular library of consensus.
Other
357 stars 96 forks source link

Remove a random subset of validators in net_dynamic_hb #385

Closed RicoGit closed 5 years ago

RicoGit commented 5 years ago

related issue #374

do_drop_and_re_add chooses at random at least 1 node for removing from the network and then re-adding removed nodes back. Cluster always remain correct. In other words before and after removing validators network correctness condition (N=3f+1) is always satisfied.

Removed nodes can be as faulty nodes as well as correct ones. You can see in the log of the test how much is actually nodes will be removed.

Max number of nodes for removing is 8, was chosen 2 nodes
Will remove and re-add nodes {1, 2}
RicoGit commented 5 years ago

Running tests on the CI server caught an error:

thread 'drop_and_re_add' panicked at 'crank: node failed to process step: Fault { reported_by: 3, faulty_id: 7, fault_kind: HbFault(SubsetFault(BaFault(DuplicateBVal))) }', libcore/result.rs:1009:5

Trying to reproduce the error locally I caught another error:

thread 'drop_and_re_add' panicked at 'crank: network queue empty', src/libcore/option.rs:1008:5

Second error appears when only one node left in the network after removing nodes. I'm trying to figure out what's wrong.

vkomenda commented 5 years ago

@RicoGit, to reproduce locally, add the proptest seed from the failing test run on Travis to your local seeds.

RicoGit commented 5 years ago

Should I add a static test case for the failed inputs? like this:

#[test]
fn drop_and_re_add_one_node_left() {
    let cfg = TestConfig {
        dimension: NetworkDimension::new(4, 0),
        total_txs: 20,
        batch_size: 10,
        contribution_size: 1,
        seed: [151, 234, 50, 31, 109, 65, 28, 122, 82, 93, 226, 143, 185, 27, 195, 133]
    };
    do_drop_and_re_add(cfg)
}
vkomenda commented 5 years ago

No, just add the seed to net_dynamic_hb.proptest-regressions.

vkomenda commented 5 years ago

This one: https://travis-ci.org/poanetwork/hbbft/jobs/499202391#L2980.

RicoGit commented 5 years ago

I've got it. Awesome! Thanks! But local test (with the seed from CI) produces a different error.

vkomenda commented 5 years ago

Have a look at crank error variants if your local failure is 'crank: network queue empty'. Check that the removed nodes finished sending all their messages before removal.

afck commented 5 years ago

Does the second error always happen whenever we're left with only a single node? That one may be related to the test (or the test net framework) and not the production code: Of course there are no messages in the queue if there's only one node. A single node will always immediately deliver a batch whenever we provide input. If it's difficult to make the test handle that case, I'd be happy with making that a TODO for a separate PR, and restricting the test to at least two nodes for now.

The first error is strange, though. We are a bit aggressive with our fault reports and sending the same BVal twice wouldn't break anything in production; but it still shouldn't happen in theory… :thinking:

RicoGit commented 5 years ago

"restricting the test to at least two nodes for now." - Yes! It fixed all the tests. Thanks! I've run a test about ten times for all known seeds. "I'd be happy with making that a TODO" - I'll make an issue for that. ok?

afck commented 5 years ago

Great, thanks! @vkomenda: Feel free to merge once you're happy with it.

vkomenda commented 5 years ago

@afck, are you OK with versioning tests/net_dynamic_hb.proptest-regressions? It's reasonable but wasn't done before.

afck commented 5 years ago

Ah, right, that file should be removed, since with a single node it always fails anyway. Good catch!