Closed RicoGit closed 5 years ago
Running tests on the CI server caught an error:
thread 'drop_and_re_add' panicked at 'crank: node failed to process step: Fault { reported_by: 3, faulty_id: 7, fault_kind: HbFault(SubsetFault(BaFault(DuplicateBVal))) }', libcore/result.rs:1009:5
Trying to reproduce the error locally I caught another error:
thread 'drop_and_re_add' panicked at 'crank: network queue empty', src/libcore/option.rs:1008:5
Second error appears when only one node left in the network after removing nodes. I'm trying to figure out what's wrong.
@RicoGit, to reproduce locally, add the proptest seed from the failing test run on Travis to your local seeds.
Should I add a static test case for the failed inputs? like this:
#[test]
fn drop_and_re_add_one_node_left() {
let cfg = TestConfig {
dimension: NetworkDimension::new(4, 0),
total_txs: 20,
batch_size: 10,
contribution_size: 1,
seed: [151, 234, 50, 31, 109, 65, 28, 122, 82, 93, 226, 143, 185, 27, 195, 133]
};
do_drop_and_re_add(cfg)
}
No, just add the seed to net_dynamic_hb.proptest-regressions
.
I've got it. Awesome! Thanks! But local test (with the seed from CI) produces a different error.
Have a look at crank error variants if your local failure is 'crank: network queue empty'
. Check that the removed nodes finished sending all their messages before removal.
Does the second error always happen whenever we're left with only a single node? That one may be related to the test (or the test net framework) and not the production code: Of course there are no messages in the queue if there's only one node. A single node will always immediately deliver a batch whenever we provide input. If it's difficult to make the test handle that case, I'd be happy with making that a TODO for a separate PR, and restricting the test to at least two nodes for now.
The first error is strange, though. We are a bit aggressive with our fault reports and sending the same BVal
twice wouldn't break anything in production; but it still shouldn't happen in theory… :thinking:
"restricting the test to at least two nodes for now." - Yes! It fixed all the tests. Thanks! I've run a test about ten times for all known seeds. "I'd be happy with making that a TODO" - I'll make an issue for that. ok?
Great, thanks! @vkomenda: Feel free to merge once you're happy with it.
@afck, are you OK with versioning tests/net_dynamic_hb.proptest-regressions
? It's reasonable but wasn't done before.
Ah, right, that file should be removed, since with a single node it always fails anyway. Good catch!
related issue #374
do_drop_and_re_add
chooses at random at least 1 node for removing from the network and then re-adding removed nodes back. Cluster always remain correct. In other words before and after removing validators network correctness condition (N=3f+1) is always satisfied.Removed nodes can be as faulty nodes as well as correct ones. You can see in the log of the test how much is actually nodes will be removed.