Closed rameshthoomu closed 8 years ago
This is a timeout. I hit these all the time on my machine. I increased the timeouts in PR #634
Looks like PR #634 was merged. @rameshthoomu could you retest please?
I pulled the #634 build today and ran the behave tests. I Could see "Scenario Outline: chaincode example02 with 4 peers and 1 obcca, issue #567 -- @1.3 Consensus Options" scenario is with 90 secs timeout.
@rameshthoomu , can you
chaincode example02 with 4 peers and 1 obcca, issue #567 -- @1.3 Consensus Options
test and gather the vp0,1,2,3 logs ?I ran the specific failed scenario @1.3 Consensus Options and no issues observed. As requested by @tuand27613, I will perform the tests for each scenario and will share the results.
@kchristidis @jyellick @corecode On my machine the following behave test "Scenario Outline: chaincode example02 with 4 peers and 1 obcca, issue #567"
consistently fails when using docker-compose-4-consensus-classic.yml
or docker-compose-4-consensus-batch.yml
. It deploys the chaincode, but then fails at "Then I wait up to "<WaitTime>" seconds for transaction to be committed to peers:".
Any ideas?
This isn't every time. Sometimes they run quickly, but it happens a lot of the time.
@jyellick @kchristidis @srderson Wanted to make sure we get your guys input for setting these timeouts, as we should probably define exceeding them as a defect. Thoughts?
@srderson @jeffgarratt Same here. 1.2
and 1.3
almost always fail in the Then I wait up to "30" seconds for transaction to be committed to peers
part. Will ping @jeffgarratt on Slack to see what we can do about this.
Both of these scenarios also fail with the timeout set to 60 seconds.
for me, sometimes it's fast and other times it hangs indefinitely, regardless of what I set the timeout to. Is there a potential bug in consensus? If there is a bug, I wonder why we don't see it outside of the behave tests.
I think the bug in consensus scenario is unlikely (for this specific thing...) because for other folks the behave
tests run flawlesssly all the time.
a reminder for @corecode to post his firewall settings.
He mentioned yesterday that he got behave working after fiddling with his Linux firewall settings. He was getting grpc transport failed/404/timeouts and other problems
also, @rameshthoomu noted above that running the single test works as opposed to running the whole suite.
Ramesh, did you try running the other tests individually like we talked about ?
Yes @tuand27613 , I ran the tests individually and whole suite multiple times but didn't see test failures today.
@tuand27613: Do you confirm that you don't see any test failure in the behave
tests?
the behave tests work for me. I tried on both my thinkpad and my mac.
Note that @rameshthoomu and @corecode can run the behave
tests now.
@kchristidis , when you get a chance, can you run the failing tests individually ? behave -n scenarioName
behave -n "chaincode example02 with 4 peers and 1 obcca, issue #567"
fails for me with a timeout on docker-compose-4-consensus-batch.yml
. I edited the feature file to remove the other compose files and it still fails.
first feedback: let's please not log full transaction contents. on vp0 there is a complete base64 encoded chaincode (17MB worth). :)
from the consensus side it seems like there is a network partition, and replicas can only communicate with vp0 (log4). The reason for this is that pairwise links only get established step by step, when new peers are discovered. By that time, the replicas have already received the pre-prepare from the leader, and they already broadcasted their messages (which only arrived at the leader, vp0).
The leader discovers that something is wrong, and initiates a view-change. However, there seems to be a bug, and the other replicas, despite having received the pre-prepare, did not start a timeout. That's why they do not join the view change, and everything is stuck, because everyone (except vp0) behaves as if there was no request.
@kchristidis didn't we wait for the arrival of all replicas before starting consensus?
@corecode: no, that would have been necessary if the names of the replicas were not pre-determined in advance. In our case they are; vp2
is going to be replica 2
so we had concluded that waiting for N
connections wasn't necessary.
Seems like we'll have to do this nonetheless though. Working on this now.
@corecode, @srderson (and anyone else who might be interested): Commit https://github.com/kchristidis/obc-peer/commit/64d064ffa64d7987644964817699f91dd02e0f24 should take care of any issues you see in obc-batch
. It forces all replicas to wait for a fully connected network before running consensus, so we don't run into the issue that @corecode brought up earlier.
If you agree with the code, we can apply it to classic
and sieve
if necessary.
Do note that this code (a) will allow us to get rid of the flaky vpX
naming scheme, but it also (b) exposes us to a Byzantine attack where if a node refuses to connect to the network, the network will be stuck. (If the network doesn't become fully connected momentarily, it won't kick off the consensus process.)
Let me know if behave
for obc-batch
runs smoothly with this patch.
@corecode: Can you look into the bug that prevents the view-change timeout in the replicas that received the pre-prepare?
I think your fix is just obscuring a bug we have. What about adding
if !instance.timerActive {
instance.startTimer(instance.requestTimeout)
}
after https://github.com/openblockchain/obc-peer/blob/master/openchain/consensus/obcpbft/pbft-core.go#L651
That should then trigger a view change after some time.
Test - run all behave tests with default timeouts
Scenario Outline: chaincode example02 with 4 peers and 1 obcca, issue #567
fails due to timeout with docker-compose-4-consensus-batch.yml
fantastic!
@corecode @srderson: Not so fast, hold on, writing a lengthy response now.
@corecode @srderson:
https://github.com/kchristidis/obc-peer/commit/acd7c5b669a7db6207c0e70d36e9658d0424a65d implements @corecode's suggested change.
While batch
(and classic
) run without issues, sieve
fails, even with the behave
timeout bumped to 30 seconds. (In the commit above I pushed a change so that behave --tags=wip
runs the sieve
test only and allows you to inspect the Docker containers afterwards.)
Upon inspecting the logs of the Docker containers, vp2
cannot verify vp3
's message in the verify-set
:
15:48:13.572 [consensus/obcpbft] validateVerifySet -> WARN 1c0 verify-set invalid: Could not verify message from vp3 (unknown peer)
This makes sense because the link to vp3
is established later:
15:48:14.853 [peer] RegisterHandler -> DEBU 21d registered handler with key: name:"vp3"
vp2
eventually sends a viewChange
message:
15:48:23.390 [consensus/obcpbft] sendViewChange -> INFO 261 Replica 2 sending view-change, v:1, h:0, |C|:1, |P|:0, |Q|:0
...which brings no change since the rest of the network runs fine (no quorum on that request).
So this behave
test will fail but I don't see consensus
doing anything wrong. vp2
is Byzantine so we can't expect to see that transaction to be committed to it.
We either need to rewrite the test so that we get the write response from 2f+1
peers, or apply my previous patch (with all the baggage that this brings).
@corecode let me know your thoughts on this.
@kchristidis I may have applied @corecode's suggestion incorrectly. I put it inside the if statement directly after line 651. It looks like you put it outside the if statement in https://github.com/kchristidis/obc-peer/commit/acd7c5b669a7db6207c0e70d36e9658d0424a65d. @corecode which is correct?
Unless I'm missing something, the precise location shouldn't matter much as long as it happens within recvPrePrepare
. The issue I describe above however should occur regardless of where that statement is located.
Waiting a confirmation from @corecode on this. My proposal is to put https://github.com/kchristidis/obc-peer/commit/acd7c5b669a7db6207c0e70d36e9658d0424a65d in, and rewrite the behave
tests so that if 2f+1
(i.e. 3) VPs have the transaction posted, we're good. Let me know if you think I'm missing anything.
Issue: One of the test scenario failed in behave tests.
Summary: Latest code pulled from the https://github.com/openblockchain/obc-peer/pull/636 pull request and performed the behave test.
Failing scenarios: peer_basic.feature:133 chaincode example02 with 4 peers and 1 obcca, issue #567 -- @1.3 Consensus Options 0 features passed, 1 failed, 0 skipped 7 scenarios passed, 1 failed, 0 skipped 96 steps passed, 1 failed, 7 skipped, 0 undefined Took 3m15.279s
Behave_#518Failed.txt go test.txt