hyperledger-archives / fabric

THIS IS A READ-ONLY historic repository. Current development is at https://gerrit.hyperledger.org/r/#/admin/projects/fabric . pull requests not accepted
https://gerrit.hyperledger.org/
Apache License 2.0
1.17k stars 1.01k forks source link

Test performed on the pull request https://github.com/openblockchain/obc-peer/pull/636 failed #646

Closed rameshthoomu closed 8 years ago

rameshthoomu commented 8 years ago

Issue: One of the test scenario failed in behave tests.

Summary: Latest code pulled from the https://github.com/openblockchain/obc-peer/pull/636 pull request and performed the behave test.

Failing scenarios: peer_basic.feature:133 chaincode example02 with 4 peers and 1 obcca, issue #567 -- @1.3 Consensus Options 0 features passed, 1 failed, 0 skipped 7 scenarios passed, 1 failed, 0 skipped 96 steps passed, 1 failed, 7 skipped, 0 undefined Took 3m15.279s

Behave_#518Failed.txt go test.txt

srderson commented 8 years ago

This is a timeout. I hit these all the time on my machine. I increased the timeouts in PR #634

srderson commented 8 years ago

Looks like PR #634 was merged. @rameshthoomu could you retest please?

rameshthoomu commented 8 years ago

I pulled the #634 build today and ran the behave tests. I Could see "Scenario Outline: chaincode example02 with 4 peers and 1 obcca, issue #567 -- @1.3 Consensus Options" scenario is with 90 secs timeout.

BDDTimeout#634_Failed.txt

tuand27613 commented 8 years ago

@rameshthoomu , can you

rameshthoomu commented 8 years ago

I ran the specific failed scenario @1.3 Consensus Options and no issues observed. As requested by @tuand27613, I will perform the tests for each scenario and will share the results.

srderson commented 8 years ago

@kchristidis @jyellick @corecode On my machine the following behave test "Scenario Outline: chaincode example02 with 4 peers and 1 obcca, issue #567" consistently fails when using docker-compose-4-consensus-classic.yml or docker-compose-4-consensus-batch.yml. It deploys the chaincode, but then fails at "Then I wait up to "<WaitTime>" seconds for transaction to be committed to peers:". Any ideas?

This isn't every time. Sometimes they run quickly, but it happens a lot of the time.

jeffgarratt commented 8 years ago

@jyellick @kchristidis @srderson Wanted to make sure we get your guys input for setting these timeouts, as we should probably define exceeding them as a defect. Thoughts?

kchristidis commented 8 years ago

@srderson @jeffgarratt Same here. 1.2 and 1.3 almost always fail in the Then I wait up to "30" seconds for transaction to be committed to peers part. Will ping @jeffgarratt on Slack to see what we can do about this.

kchristidis commented 8 years ago

Both of these scenarios also fail with the timeout set to 60 seconds.

srderson commented 8 years ago

for me, sometimes it's fast and other times it hangs indefinitely, regardless of what I set the timeout to. Is there a potential bug in consensus? If there is a bug, I wonder why we don't see it outside of the behave tests.

kchristidis commented 8 years ago

I think the bug in consensus scenario is unlikely (for this specific thing...) because for other folks the behave tests run flawlesssly all the time.

tuand27613 commented 8 years ago

a reminder for @corecode to post his firewall settings.

He mentioned yesterday that he got behave working after fiddling with his Linux firewall settings. He was getting grpc transport failed/404/timeouts and other problems

tuand27613 commented 8 years ago

also, @rameshthoomu noted above that running the single test works as opposed to running the whole suite.

Ramesh, did you try running the other tests individually like we talked about ?

rameshthoomu commented 8 years ago

Yes @tuand27613 , I ran the tests individually and whole suite multiple times but didn't see test failures today.

kchristidis commented 8 years ago

@tuand27613: Do you confirm that you don't see any test failure in the behave tests?

tuand27613 commented 8 years ago

the behave tests work for me. I tried on both my thinkpad and my mac.

Note that @rameshthoomu and @corecode can run the behave tests now.

@kchristidis , when you get a chance, can you run the failing tests individually ? behave -n scenarioName

srderson commented 8 years ago

behave -n "chaincode example02 with 4 peers and 1 obcca, issue #567" fails for me with a timeout on docker-compose-4-consensus-batch.yml. I edited the feature file to remove the other compose files and it still fails.

srderson commented 8 years ago

Attaching logs from behave -n "chaincode example02 with 4 peers and 1 obcca, issue #567" with only docker-compose-4-consensus-batch.yml enabled. Each log file is a docker container VP.

log1.txt log2.txt log3.txt log4.txt

corecode commented 8 years ago

first feedback: let's please not log full transaction contents. on vp0 there is a complete base64 encoded chaincode (17MB worth). :)

corecode commented 8 years ago

from the consensus side it seems like there is a network partition, and replicas can only communicate with vp0 (log4). The reason for this is that pairwise links only get established step by step, when new peers are discovered. By that time, the replicas have already received the pre-prepare from the leader, and they already broadcasted their messages (which only arrived at the leader, vp0).

The leader discovers that something is wrong, and initiates a view-change. However, there seems to be a bug, and the other replicas, despite having received the pre-prepare, did not start a timeout. That's why they do not join the view change, and everything is stuck, because everyone (except vp0) behaves as if there was no request.

@kchristidis didn't we wait for the arrival of all replicas before starting consensus?

kchristidis commented 8 years ago

@corecode: no, that would have been necessary if the names of the replicas were not pre-determined in advance. In our case they are; vp2 is going to be replica 2 so we had concluded that waiting for N connections wasn't necessary.

Seems like we'll have to do this nonetheless though. Working on this now.

kchristidis commented 8 years ago

@corecode, @srderson (and anyone else who might be interested): Commit https://github.com/kchristidis/obc-peer/commit/64d064ffa64d7987644964817699f91dd02e0f24 should take care of any issues you see in obc-batch. It forces all replicas to wait for a fully connected network before running consensus, so we don't run into the issue that @corecode brought up earlier.

If you agree with the code, we can apply it to classic and sieve if necessary.

Do note that this code (a) will allow us to get rid of the flaky vpX naming scheme, but it also (b) exposes us to a Byzantine attack where if a node refuses to connect to the network, the network will be stuck. (If the network doesn't become fully connected momentarily, it won't kick off the consensus process.)

Let me know if behave for obc-batch runs smoothly with this patch.

@corecode: Can you look into the bug that prevents the view-change timeout in the replicas that received the pre-prepare?

corecode commented 8 years ago

I think your fix is just obscuring a bug we have. What about adding

    if !instance.timerActive {
        instance.startTimer(instance.requestTimeout)
    }

after https://github.com/openblockchain/obc-peer/blob/master/openchain/consensus/obcpbft/pbft-core.go#L651

That should then trigger a view change after some time.

srderson commented 8 years ago

Test - run all behave tests with default timeouts

  1. Baseline (no patches) - Scenario Outline: chaincode example02 with 4 peers and 1 obcca, issue #567 fails due to timeout with docker-compose-4-consensus-batch.yml
  2. Applying only https://github.com/kchristidis/obc-peer/commit/64d064ffa64d7987644964817699f91dd02e0f24 from @kchristidis - All behave tests pass
  3. Applying only suggestion from @corecode in comment https://github.com/openblockchain/obc-peer/issues/646#issuecomment-183640197 - All behave tests pass.
corecode commented 8 years ago

fantastic!

kchristidis commented 8 years ago

@corecode @srderson: Not so fast, hold on, writing a lengthy response now.

kchristidis commented 8 years ago

@corecode @srderson:

https://github.com/kchristidis/obc-peer/commit/acd7c5b669a7db6207c0e70d36e9658d0424a65d implements @corecode's suggested change.

While batch (and classic) run without issues, sieve fails, even with the behave timeout bumped to 30 seconds. (In the commit above I pushed a change so that behave --tags=wip runs the sieve test only and allows you to inspect the Docker containers afterwards.)

Upon inspecting the logs of the Docker containers, vp2 cannot verify vp3's message in the verify-set: 15:48:13.572 [consensus/obcpbft] validateVerifySet -> WARN 1c0 verify-set invalid: Could not verify message from vp3 (unknown peer)

This makes sense because the link to vp3 is established later: 15:48:14.853 [peer] RegisterHandler -> DEBU 21d registered handler with key: name:"vp3"

vp2 eventually sends a viewChange message: 15:48:23.390 [consensus/obcpbft] sendViewChange -> INFO 261 Replica 2 sending view-change, v:1, h:0, |C|:1, |P|:0, |Q|:0

...which brings no change since the rest of the network runs fine (no quorum on that request).

So this behave test will fail but I don't see consensus doing anything wrong. vp2 is Byzantine so we can't expect to see that transaction to be committed to it.

We either need to rewrite the test so that we get the write response from 2f+1 peers, or apply my previous patch (with all the baggage that this brings).

@corecode let me know your thoughts on this.

srderson commented 8 years ago

@kchristidis I may have applied @corecode's suggestion incorrectly. I put it inside the if statement directly after line 651. It looks like you put it outside the if statement in https://github.com/kchristidis/obc-peer/commit/acd7c5b669a7db6207c0e70d36e9658d0424a65d. @corecode which is correct?

kchristidis commented 8 years ago

Unless I'm missing something, the precise location shouldn't matter much as long as it happens within recvPrePrepare. The issue I describe above however should occur regardless of where that statement is located.

kchristidis commented 8 years ago

Waiting a confirmation from @corecode on this. My proposal is to put https://github.com/kchristidis/obc-peer/commit/acd7c5b669a7db6207c0e70d36e9658d0424a65d in, and rewrite the behave tests so that if 2f+1 (i.e. 3) VPs have the transaction posted, we're good. Let me know if you think I'm missing anything.