hyperledger-archives / fabric

THIS IS A READ-ONLY historic repository. Current development is at https://gerrit.hyperledger.org/r/#/admin/projects/fabric . pull requests not accepted
https://gerrit.hyperledger.org/
Apache License 2.0

One behave test scenario fails with 4 CPUs, fine with 2 CPUs #1922

Open gongsu832 opened 8 years ago

gongsu832 commented 8 years ago

Description

On a zLinux guest (debian 8) with 4 CPUs, behave test fails for either "peer_basic.feature:846 chaincode example02 with 4 peers and 1 membersrvc, test crash fault -- @1.1 Consensus Options" or "peer_basic.feature:968 chaincode example02 with 4 peers, two stopped". All tests pass with 2 CPUs (after taking 2 CPUs offline). Full logs for the two failures attached.

behave.zip

Describe How to Reproduce

make behave with 4 CPUs.
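Since the failure depends on the CPU count, it can help to confirm how many CPUs the guest really has online before each run. The following is a minimal sketch, assuming a Linux/glibc system where `_SC_NPROCESSORS_ONLN` and `_SC_NPROCESSORS_CONF` are available. The function names are illustrative and are not part of fabric or behave.

```c
#include <stdio.h>
#include <unistd.h>

/* CPUs currently online (the number the scheduler can actually use). */
long online_cpus(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* CPUs configured on the system, including any that were taken offline. */
long configured_cpus(void) {
    return sysconf(_SC_NPROCESSORS_CONF);
}

/* Print both counts so they can be recorded next to the behave log. */
void print_cpu_summary(void) {
    printf("online=%ld configured=%ld\n", online_cpus(), configured_cpus());
}
```

Calling `print_cpu_summary()` before `make behave` records whether the run used the 4-CPU or the 2-CPU configuration.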

gongsu832 commented 8 years ago

Forgot to mention the code version:

commit 35326c25f99b038286a58330fdef87d23fe5f473
Merge: 0d05bf3 a626e43
Author: Binh Q Nguyen <binhn@us.ibm.com>
Date:   Sat Jun 18 08:57:17 2016 -0400

    Merge pull request #1877 from jyellick/keep-state-if-can-execute

    Stabilize PBFT under stress with periodic viewchange
rameshthoomu commented 8 years ago

Same issue observed in #1886

tuand27613 commented 8 years ago

I can't reproduce this on my Vagrant box. @gongsu832, could you run behave -D logs=y and attach the container logs?

gongsu832 commented 8 years ago

@tuand27613 The latest commit 72a7cbf9d3f49ee79d71c494f8aef916b7376251 now adds an additional dependency behave-grpc to target behave-deps, which runs the command sudo pip install -q 'grpcio==0.13.1'. It fails on zLinux (both Debian 8 and RHEL 7.2) with a message similar to the following:

# pip install -q 'grpcio==0.13.1'
Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-UkqENb/grpcio/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-86V1pw-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-UkqENb/grpcio/

Can I temporarily skip this dependency (by touching build/behave/.grpc-dummy)? Thanks.

gongsu832 commented 8 years ago

@tuand27613 Removing "-q" in the pip install command yields more output. When compiling grpcio-0.13.1.tar.gz, it failed with:

    In file included from ./third_party/boringssl/include/openssl/asn1.h:68:0,
                     from ./third_party/boringssl/include/openssl/rsa.h:62,
                     from ./src/core/security/json_token.h:38,
                     from ./src/core/security/credentials.h:43,
                     from src/core/security/client_auth_filter.c:43:
    ./third_party/boringssl/include/openssl/bn.h:161:2: error: #error "Must define either OPENSSL_32_BIT or OPENSSL_64_BIT"
     #error "Must define either OPENSSL_32_BIT or OPENSSL_64_BIT"
      ^
    ./third_party/boringssl/include/openssl/bn.h:222:44: error: unknown type name 'BN_ULONG'

Looks like s390x (unsurprisingly) isn't recognized as OPENSSL_64_BIT.
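The error pattern above is typical of a header that maps known compiler architecture macros to a word-size macro and stops with `#error` for anything else. The sketch below illustrates that pattern with s390x added. The `DEMO_*` names and `demo_word_t` are hypothetical and are not boringssl's actual definitions; `__s390x__` is the macro GCC predefines for 64-bit s390. The second error (`unknown type name 'BN_ULONG'`) is presumably a follow-on, because the word type would be defined under the same conditional.

```c
#include <stdint.h>

/* Hypothetical macro names, used only for illustration. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__) || \
    defined(__powerpc64__) || defined(__s390x__)
#define DEMO_64_BIT
#elif defined(__i386__) || defined(_M_IX86) || defined(__arm__) || \
    defined(__s390__)
#define DEMO_32_BIT
#else
#error "Must define either DEMO_32_BIT or DEMO_64_BIT"
#endif

/* The word type depends on the macro chosen above, so an unrecognized
 * architecture leaves it undefined and triggers follow-on errors. */
#if defined(DEMO_64_BIT)
typedef uint64_t demo_word_t;
#else
typedef uint32_t demo_word_t;
#endif

/* Width in bits of the word type selected at compile time. */
int demo_word_bits(void) {
    return (int)(sizeof(demo_word_t) * 8);
}
```

Because `__s390x__` is tested in the 64-bit branch first, a 64-bit s390 build never reaches the 32-bit branch.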

gongsu832 commented 8 years ago

@tuand27613 I reverted to 35326c25f99b038286a58330fdef87d23fe5f473 so I can run the behave tests. Here is the behave run log along with the container logs.

behave.zip

jeffgarratt commented 8 years ago

@gongsu832 is there any way to get this to install properly on s390x? FYI @rameshthoomu

gongsu832 commented 8 years ago

@jeffgarratt I looked at this briefly. I downloaded boringssl. Fixing the OPENSSL_64_BIT error and getting it to compile is easy. The problem is that when I run the tests that come with boringssl, several fail (most likely due to endianness issues). So fixing boringssl will probably take a nontrivial amount of time. After that, getting grpcio to pick up the fixed boringssl is another hurdle.
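To confirm that a given host is big endian before attributing test failures to byte order, a small runtime check is enough. This is a minimal sketch, with a function name chosen for illustration only; it is not taken from boringssl.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host and 0 on a big-endian host.
 * It inspects the first byte in memory of the 32-bit value 1. */
int host_is_little_endian(void) {
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On s390x this is expected to return 0, while on x86_64 it returns 1.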

rameshthoomu commented 8 years ago

@vpaprots also observed the same issue while installing the grpcio package.

jkirke commented 8 years ago

I am encountering issues installing grpcio on both z systems. I will continue to look at it but wanted to update the issue with the current results.

pip install grpcio

On 148.100.105.200:

      File "build/bdist.linux-s390x/egg/setuptools/command/build_ext.py", line 187, in build_extension
        _build_ext.build_extension(self, ext)
      File "/usr/lib64/python2.7/distutils/command/build_ext.py", line 498, in build_extension
        depends=ext.depends)
      File "/usr/lib64/python2.7/distutils/ccompiler.py", line 574, in compile
        self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
      File "/usr/lib64/python2.7/distutils/unixccompiler.py", line 132, in _compile
        raise CompileError, msg
    CompileError: command 'gcc' failed with exit status 1

On 148.100.107.97:

    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-ivQZoN/grpcio/

rameshthoomu commented 8 years ago

@jkirke: Below is the command you have to use to install the grpcio package:

pip install -U 'grpcio==0.13.1'

jkirke commented 8 years ago

Same result using that command.

rameshthoomu commented 8 years ago

Created new issue #1956. Please log your comments in the new issue.

gongsu832 commented 8 years ago

As I mentioned above in response to @jeffgarratt , boringssl (which grpcio depends on) is just not written with big endian in mind. I tracked down one of the ec_test failures to the file crypto/ec/p256-64.c. This is an excerpt from the beginning of the file:

/* bin32_to_felem takes a little-endian byte array and converts it into felem
 * form. This assumes that the CPU is little-endian. */
static void bin32_to_felem(felem out, const u8 in[32]) {
  out[0] = *((const u64 *)&in[0]);
  out[1] = *((const u64 *)&in[8]);
  out[2] = *((const u64 *)&in[16]);
  out[3] = *((const u64 *)&in[24]);
}

/* smallfelem_to_bin32 takes a smallfelem and serialises into a little endian,
 * 32 byte array. This assumes that the CPU is little-endian. */
static void smallfelem_to_bin32(u8 out[32], const smallfelem in) {
  *((u64 *)&out[0]) = in[0];
  *((u64 *)&out[8]) = in[1];
  *((u64 *)&out[16]) = in[2];
  *((u64 *)&out[24]) = in[3];
}

As you can see, the code explicitly assumes the CPU is little endian, which is rather strange. You'd expect something better from Google. I fixed this particular case, so ec_test now proceeds further, but it still fails in other places.

Until boringssl is properly fixed for big endian (and assuming no other package that grpcio depends on has similar endianness problems), grpcio is not going to work properly on big endian, even if you manage to get it installed (i.e., compiled).
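One common way to remove this kind of assumption is to assemble each 64-bit word from individual bytes with shifts, which gives the same result on any host byte order. The sketch below shows that technique. It is not the fix that was actually applied to boringssl, and the names and the plain `uint64_t[4]` limb layout are assumptions made for illustration. Byte-wise access also avoids the unaligned and type-punned reads implied by casting a byte pointer to a 64-bit pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Read 8 bytes as a little-endian 64-bit value, whatever the host order. */
static uint64_t load_le64(const uint8_t *in) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | in[i];
    }
    return v;
}

/* Write a 64-bit value as 8 little-endian bytes. */
static void store_le64(uint8_t *out, uint64_t v) {
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Convert a 32-byte little-endian array into four 64-bit limbs,
 * least significant limb first. */
void bin32_to_limbs(uint64_t out[4], const uint8_t in[32]) {
    for (size_t i = 0; i < 4; i++) {
        out[i] = load_le64(in + 8 * i);
    }
}

/* Serialise four 64-bit limbs into a 32-byte little-endian array. */
void limbs_to_bin32(uint8_t out[32], const uint64_t in[4]) {
    for (size_t i = 0; i < 4; i++) {
        store_le64(out + 8 * i, in[i]);
    }
}
```

On a little-endian CPU this produces the same values as the pointer-cast version, so the change would only alter behaviour on big-endian hosts such as s390x.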

gongsu832 commented 8 years ago

OK I fixed boringssl so it passes all tests on zLinux and I managed to install grpcio 0.13.1. The behave tests now fail on a different scenario:

peer_basic.feature:1097 verify reconnect of disconnected peer, issue #1851 -- @1.1 Composition options

All tests pass on 2 CPUs. Logs attached.

behave.zip

PS. @tuand27613 This is in the behave run log:

      ['Starting', 'vp0', '...']
      ['ESC[1AESC[2K']
      ['Starting', 'vp0', '...', 'done']
      ['ESC[1B']
      Containers started:
      ['bddtests_vp0_1']

It appears that one of the containers failed to start. But I thought you fixed this problem a while ago.

jkirke commented 8 years ago

Good news on getting past the grpcio 0.13.1 issue. What do I need to do to get this updated on the zLinux build systems?

tuand27613 commented 8 years ago

@gongsu832, it looks like both containers are up, or at least behave thinks [vp1, vp0] are up. Can you try turning on DoNotDecompose, running only the @issue_1851 test, and seeing what docker says?

['Starting', 'vp0', '...']
      ['']
      ['Starting', 'vp0', '...', 'done']
      ['']
      Containers started: 
      ['bddtests_vp0_1']
      dockerComposeService = vp0
      container bddtests_vp0_1 has env = ['CORE_PEER_ID=vp0', 'CORE_LOGGING_LEVEL=DEBUG', 'CORE_PEER_DISCOVERY_TOUCHPERIOD=1s', 'CORE_VM_ENDPOINT=http://172.17.0.1:2375', 'CORE_PEER_ADDRESSAUTODETECT=true', 'CORE_PEER_DISCOVERY_PERIOD=1s', 'PATH=/opt/go/bin:/opt/gopath/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'LD_LIBRARY_PATH=/opt/rocksdb:', 'GOROOT=/opt/go', 'GOPATH=/opt/gopath']

      After starting, the container service list is = ['vp1', 'vp0']
      Requesting path = http://172.17.0.3:5000/network/peers
jeffgarratt commented 8 years ago

@gongsu832 @tuand27613 Hey @gongsu832, please let me know if I can assist. Perhaps we can do a hangout and I can help you troubleshoot.

jeffgarratt commented 8 years ago

@ghaskins @rameshthoomu @jkirke @gongsu832 I think we are good to go on accepting this PR as it appears @gongsu832 and @jkirke will be able to resolve z related issues. @rameshthoomu agrees with this assessment, please let me know if any other concerns @ghaskins . Thanks.

gongsu832 commented 8 years ago

@tuand27613 I uncommented DoNotDecompose for @issue_1851:

      Requesting path = http://172.17.0.2:5000/network/peers

      After stoping, the container service list is = ['vp1']
      Requesting path = http://172.17.0.3:5000/network/peers

        ']
      ['Starting', 'vp0', '...', 'done']
      ['']
      Containers started:
      ['bddtests_vp0_1']
      dockerComposeService = vp0
      container bddtests_vp0_1 has env = ['CORE_PEER_ID=vp0', 'CORE_LOGGING_LEVEL=DEBUG', 'CORE_PEER_DISCOVERY_TOUCHPERIOD=1s', 'CORE_VM_ENDPOINT=http://172.17.0.1:2375', 'CORE_PEER_ADDRESSAUTODETECT=true', 'CORE_PEER_DISCOVERY_PERIOD=1s', 'PATH=/opt/go/bin:/opt/gopath/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'LD_LIBRARY_PATH=/opt/rocksdb:', 'GOROOT=/opt/go', 'GOPATH=/opt/gopath']

      After starting, the container service list is = ['vp1', 'vp0']
      Requesting path = http://172.17.0.3:5000/network/peers

And indeed both containers are running:

root@debian2:/opt/openchain/src/github.com/hyperledger/fabric/bddtests# docker ps -a
CONTAINER ID        IMAGE                     COMMAND             CREATED              STATUS              PORTS               NAMES
fa119e85b9d6        hyperledger/fabric-peer   "peer node start"   About a minute ago   Up About a minute                       bddtests_vp1_1
a55f3a38a3f6        hyperledger/fabric-peer   "peer node start"   2 minutes ago        Up About a minute                       bddtests_vp0_1

Yet the scenario still fails with the same message:

2016/06/25 04:30:10 grpc: ClientConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.17.0.2:30303: getsockopt: connection refused"; Reconnecting to "vp0:30303"

in the vp1 log. If you want to take a look yourself, you can ssh root@debian2.watson.ibm.com without a password. The machine is the one you used before and has your public key. The Hyperledger code is under /opt/openchain/src/github.com/hyperledger/fabric.
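A "connection refused" on 172.17.0.2:30303 while the container shows as running means nothing is accepting connections on that port at that address. A quick way to check this independently of the peer is a plain TCP connect probe. The sketch below assumes POSIX sockets and IPv4, and `tcp_probe` is a hypothetical helper, not part of fabric. Note that a blocking `connect` can wait a long time if the address is unreachable rather than refusing.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to open a TCP connection to ip:port.
 * Returns  0 if the connection is accepted,
 *         -1 if connect() fails (for example, connection refused),
 *         -2 if ip is not a valid dotted-quad IPv4 address,
 *         -3 if the socket cannot be created. */
int tcp_probe(const char *ip, unsigned short port) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        return -2;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -3;
    }

    int rc = connect(fd, (struct sockaddr *)&addr, sizeof addr);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

For the case above, `tcp_probe("172.17.0.2", 30303)` run from the Docker host would show whether vp0 is listening. The helper takes an IP address, so a service name such as `vp0` has to be resolved first.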

gongsu832 commented 8 years ago

@jkirke I'm maintaining a fork of grpc so that it can pick up the fixed boringssl. To install on zLinux, pick a directory where you want to clone it and do the following:

   # git clone https://github.com/gongsu832/grpc.git
   # cd grpc
   # git submodule update --init
   # pip install -r requirements.txt
   # git checkout tags/release-0_13_1
   # GRPC_PYTHON_BUILD_WITH_CYTHON=1 pip install .
jkirke commented 8 years ago

Thank you. I tried this out this morning on both build systems. They both failed to install with the following:

    python_build/temp.linux-s390x-2.7/third_party/boringssl/crypto/bytestring/asn1_compat.o -fvisibility=hidden -pthread -std=gnu99
    gcc: error: third_party/boringssl/crypto/bytestring/asn1_compat.c: No such file or directory
    gcc: fatal error: no input files
    compilation terminated.
    creating tmp
    creating tmp/tmp47U7QF
    gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -march=z196 -mtune=zEC12 -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -march=z196 -mtune=zEC12 -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -c /tmp/tmp47U7QF/a.c -o tmp/tmp47U7QF/a.o
    Traceback (most recent call last):
      File "", line 1, in
      File "/tmp/pip-2eIdKr-build/setup.py", line 258, in
        test_runner=TEST_RUNNER,
      File "/usr/lib64/python2.7/distutils/core.py", line 152, in setup
        dist.run_commands()
      File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
        self.run_command(cmd)
      File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "build/bdist.linux-s390x/egg/setuptools/command/install.py", line 61, in run
      File "/usr/lib64/python2.7/distutils/command/install.py", line 563, in run
        self.run_command('build')
      File "/usr/lib64/python2.7/distutils/cmd.py", line 326, in run_command
        self.distribution.run_command(command)
      File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "/usr/lib64/python2.7/distutils/command/build.py", line 127, in run
        self.run_command(cmd_name)
      File "/usr/lib64/python2.7/distutils/cmd.py", line 326, in run_command
        self.distribution.run_command(command)
      File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
        cmd_obj.run()
      File "build/bdist.linux-s390x/egg/setuptools/command/build_ext.py", line 54, in run
      File "/usr/lib64/python2.7/distutils/command/build_ext.py", line 339, in run
        self.build_extensions()
      File "/tmp/pip-2eIdKr-build/src/python/grpcio/commands.py", line 259, in build_extensions
        "Failed build_ext step:\n{}".format(formatted_exception))
    commands.CommandError: Failed build_ext step:
    Traceback (most recent call last):
      File "/tmp/pip-2eIdKr-build/src/python/grpcio/commands.py", line 254, in build_extensions
        build_ext.build_ext.build_extensions(self)
      File "/usr/lib64/python2.7/distutils/command/build_ext.py", line 448, in build_extensions
        self.build_extension(ext)
      File "build/bdist.linux-s390x/egg/setuptools/command/build_ext.py", line 187, in build_extension
        _build_ext.build_extension(self, ext)
      File "/usr/lib64/python2.7/distutils/command/build_ext.py", line 498, in build_extension
        depends=ext.depends)
      File "/usr/lib64/python2.7/distutils/ccompiler.py", line 574, in compile
        self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
      File "/usr/lib64/python2.7/distutils/unixccompiler.py", line 132, in _compile
        raise CompileError, msg
    CompileError: command 'gcc' failed with exit status 4

----------------------------------------

Command "/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-2eIdKr-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-a2YFnO-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-2eIdKr-build/
[root@dmlinux1rhel72 grpc]#

gongsu832 commented 8 years ago

@jkirke Did you do git checkout tags/release-0_13_1?

jkirke commented 8 years ago

I thought I did but I must have missed that step. Sorry for the false alarm. Both build systems now have grpc. Thank you for your help.

tuand27613 commented 8 years ago

@jeffgarratt @ramesh see gongsu's comment above.

Is this a docker config issue ?

gongsu832 commented 8 years ago

@tuand27613 I recreated all the fabric-* images and now the test passes! Grrrh, I hate docker! :-)

jkirke commented 8 years ago

Well, that is good news, sort of.