hyperledger-archives / fabric

THIS IS A READ-ONLY historic repository. Current development is at https://gerrit.hyperledger.org/r/#/admin/projects/fabric . Pull requests are not accepted.

system / peer behavior appears to be dependent on logging levels -- below debug, things start to fail #614

Closed: kletkeman closed this issue 8 years ago

kletkeman commented 8 years ago

8 days ago, a logging change was made to split out several components for separate logging by level. The levels were also changed from full upper case to lower case. Note that comments regarding a global environment variable override were left in the yaml file, which has caused some wasted time already. The override of course does not actually work any more.

The new yaml categories have defaults that are much different from the previous default of DEBUG. They are:

peer:      info
status:    warning
stop:      warning
login:     warning
vm:        warning
chaincode: warning

If you build a new peer with these values set, you will get strange behavior on deploy, invoke, and query. The logs tend to contain two types of errors.

1) A log that has a permission error while trying to spin up "/go/bin", which should normally have the chaincode folder at the end but does not. Thus it cannot be found and the launch fails. This appears to happen somewhere in the spin-up code for the docker container, where the container must be launched if necessary to service the request. Normally this path works fine. I have also observed it in working code, although I cannot tell what triggered it since my queries were working. This is what it looks like:

20:55:21.490 [consensus/noops] doTransactions -> DEBU 06a Executing batch of 1 transactions
20:55:21.490 [chaincode] getArgsAndEnv -> DEBU 06b Executable is /go/bin/
20:55:21.490 [chaincode] launchAndWaitForRegister -> DEBU 06c start container: dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248
20:55:21.490 [container] lockContainer -> DEBU 06d waiting for container(dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248) lock
20:55:21.490 [container] lockContainer -> DEBU 06e got container (dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248) lock

Note that, when trying to use anything less than debug everywhere, these show up a lot more often and little seems to work.

2) A sequence where errors creating an aufs mount happen frequently. When I build with the default logging levels, I seem to see these on pretty much every transaction:

20:54:38.305 [consensus/noops] broadcastConsensusMsg -> DEBU 032 Broadcasting CONSENSUS
20:54:38.300 [rest] Deploy -> INFO 031 Successfuly deployed chainCode: 2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248.
20:54:38.321 [consensus/noops] doTransactions -> DEBU 033 Executing transactions
20:54:38.321 [consensus/noops] doTransactions -> DEBU 034 Starting TX batch with timestamp: seconds:1454619278 nanos:221146309
20:54:38.321 [consensus/noops] doTransactions -> DEBU 035 Executing batch of 1 transactions
20:54:38.326 [chaincode] getArgsAndEnv -> DEBU 036 Executable is /go/bin/chaincode_freight_contract_demo_1
20:54:38.327 [chaincode] DeployChaincode -> DEBU 037 deploying chaincode dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248
20:54:38.328 [container] lockContainer -> DEBU 038 waiting for container(dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248) lock
20:54:38.328 [container] lockContainer -> DEBU 039 got container (dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248) lock
20:54:38.328 [container] newDockerClient -> INFO 03a Creating VM with endpoint: http://172.17.0.1:4243
20:54:51.027 [container] build -> DEBU 03b Created image: dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248
20:54:51.027 [container] createContainer -> DEBU 03c Create container: dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248
20:54:51.039 [container] unlockContainer -> DEBU 03d container lock deleted(dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248)
20:54:51.039 [state] TxBegin -> DEBU 03e txBegin() for txUuid [2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248]
20:54:51.043 [chaincode] getArgsAndEnv -> DEBU 03f Executable is /go/bin/chaincode_freight_contract_demo_1
20:54:51.043 [chaincode] launchAndWaitForRegister -> DEBU 040 start container: dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248
20:54:51.043 [container] lockContainer -> DEBU 041 waiting for container(dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248) lock
20:54:51.043 [container] lockContainer -> DEBU 042 got container (dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248) lock
20:54:51.043 [container] newDockerClient -> INFO 043 Creating VM with endpoint: http://172.17.0.1:4243
20:54:51.052 [container] start -> DEBU 044 start-could not start container API error (500): Cannot start container dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248: Error getting container 1977bf95272dd308175d6cdebeb1267e003fc75bd0688a5f8679ac4b896cf83d from driver aufs: error creating aufs mount to /var/lib/docker/aufs/mnt/1977bf95272dd308175d6cdebeb1267e003fc75bd0688a5f8679ac4b896cf83d: invalid argument
20:54:51.052 [container] unlockContainer -> DEBU 045 container lock deleted(dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248)
20:54:51.052 [chaincode] LaunchChaincode -> DEBU 046 launchAndWaitForRegister failed Error starting container: API error (500): Cannot start container dev-vp1-2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248: Error getting container 1977bf95272dd308175d6cdebeb1267e003fc75bd0688a5f8679ac4b896cf83d from driver aufs: error creating aufs mount to /var/lib/docker/aufs/mnt/1977bf95272dd308175d6cdebeb1267e003fc75bd0688a5f8679ac4b896cf83d: invalid argument
20:54:51.052 [state] TxFinish -> DEBU 047 txFinish() for txUuid [2df04ea6a5fbdc1ef25c9a38b8827431858220a3fd58d88dfc52aeac60071f93424dced3a4a1c2c8d71648d3a8c69e535c0ab34f043b1e85c414c4414d47a248], txSuccessful=[false]


Other observations:

Various combinations of behaviors result from tweaking one or more of these, but it is hard to figure out since by definition there is almost no logging going on. I was able to follow the flow in the logs when I turned on peer and chaincode debug logging, but queries were still persistently failing. When I turned everything back on using the YAML file (and fully rebuilt my environment from vagrant up), all was well again. I have done this twice, and my results have been consistently bad when using limited debug logging (peer alone, and with chaincode) and consistently good when setting everything to debug.

So somewhere in the code, a behavioral dependency on logging levels appears to exist, as incredible as that might sound.

bcbrock commented 8 years ago

@kletkeman - When you say that "comments regarding a global environment variable override were left in the yaml file", which comments on which lines? Is it possible that when you merged the changes into openchain.yaml there was a conflict? I believe all of the comments related to logging are correct. Setting the env. var. OPENCHAIN_LOGGING_LEVEL does work (for me). The level specifications were always case-insensitive.
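
For example (a sketch only; it assumes the peer binary picks the variable up at startup, and reuses the docker run pattern shown later in this thread), the override can be set on the command line or passed into the peer container with -e like the other OPENCHAIN_* variables:

OPENCHAIN_LOGGING_LEVEL=debug ./obc-peer peer
docker run --rm -it -e OPENCHAIN_LOGGING_LEVEL=debug -e OPENCHAIN_VM_ENDPOINT=http://172.17.0.1:4243 -e OPENCHAIN_PEER_ID=vp1 -e OPENCHAIN_PEER_ADDRESSAUTODETECT=true openchain-peer obc-peer peer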

I don't quite follow your problem descriptions, but other people are also seeing crashes based on logging levels inside of Docker: see #612. The person who submitted #612 commented in a meeting that they speculated it was due to race conditions caused by faster execution when no logging was happening (if I understood correctly).

binhn commented 8 years ago

@jyellick could you investigate? Note the related #612 mentioned by bcbrock.

muralisrini commented 8 years ago

Tried these steps with the latest and could not recreate.

@kletkeman: am I running the test the same way you are? We first need to identify what the differences are.

These are the steps:

All running in vagrant with no changes to openchain.yaml. Cleaned docker and created new peer image.

Peer: docker run --rm -it -e OPENCHAIN_VM_ENDPOINT=http://172.17.0.1:4243 -e OPENCHAIN_PEER_ID=vp1 -e OPENCHAIN_PEER_ADDRESSAUTODETECT=true openchain-peer obc-peer peer

Deploy: OPENCHAIN_PEER_ADDRESS=172.17.0.2:30303 ./obc-peer chaincode deploy -p github.com/openblockchain/obc-peer/openchain/example/chaincode/chaincode_example02 -c '{"Function":"init", "Args": ["a","100", "b", "200"]}'

Query: OPENCHAIN_PEER_ADDRESS=172.17.0.2:30303 ./obc-peer chaincode query -n bb540edfc1ee2ac0f5e2ec6000677f4cd1c6728046d5e32dede7fea11a42f86a6943b76a8f9154f4792032551ed320871ff7b7076047e4184292e01e3421889c -c '{"Function":"query", "Args": ["a"]}'
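
For anyone reproducing, the same sequence can be wrapped in a small script (a sketch; <chaincode-name> stands in for whatever hash the deploy step prints):

#!/bin/bash
# Reproduction sketch, run from the obc-peer directory inside the vagrant VM
export OPENCHAIN_PEER_ADDRESS=172.17.0.2:30303
./obc-peer chaincode deploy -p github.com/openblockchain/obc-peer/openchain/example/chaincode/chaincode_example02 -c '{"Function":"init", "Args": ["a","100", "b", "200"]}'
# Substitute the name returned by the deploy command for <chaincode-name>
./obc-peer chaincode query -n <chaincode-name> -c '{"Function":"query", "Args": ["a"]}'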

I see the example02 container running. Snippet from its logs:

2016/02/07 00:44:36 [d7e042c2]Sending GET_STATE Query Response:{"Name":"a","Amount":"100"}
2016/02/07 00:44:36 [d7e042c2]Received message RESPONSE from shim
2016/02/07 00:44:36 [d7e042c2]Handling ChaincodeMessage of type: RESPONSE(state:ready)
2016/02/07 00:44:36 [d7e042c2]before send
2016/02/07 00:44:36 [d7e042c2]after send
2016/02/07 00:44:36 [d7e042c2]Received RESPONSE, communicated (state:ready)
2016/02/07 00:44:36 [d7e042c2]GetState received payload RESPONSE
2016/02/07 00:44:36 [d7e042c2]Query completed. Sending QUERY_COMPLETED

bcbrock commented 8 years ago

Another thought: A couple of weeks ago I had what may be a similar problem using the Docker 1.7 package distributed with Fedora 23/ppc64le. One day an unknown change in the OBC code made Docker flat-out stop working: Docker claimed the chaincode program did not exist in the container when it obviously did. I convinced myself that this was a bug in Docker, so I did not open an issue here, both because of the nature of the failure and because either of the following two workarounds solved the problem:

  1. Using BTRFS as the filesystem underlying Docker 1.7 rather than the default
  2. Using Docker 1.10 built from the latest sources, running on the default filesystem

You might try either of these approaches to see what happens.

jyellick commented 8 years ago

Per discussions in today's scrum, this seems like it might be related to unsynchronized ledger access, will update as more information becomes available.

bcbrock commented 8 years ago

Based on my observations noted above, the fact that #612 only occurs when using Docker for the OBC peer, and the experience I had just now, my money is on a bug in Docker.

I had been running stress tests all afternoon on an X86 server when I started getting messages like the following:

18:40:25.601 [container] start -> DEBU 041 start-could not start container API error (500): Cannot start container dev-vp1-299a50b2770a788fd93f7df4c2ce0587748d02fd8184f256c6ac4d1c85fab50dfc2f9d4a0c956309be8acfdea6d24db39c5a5b33fd39a223303ee6a29beee0ed: Error getting container b09e2b7d38a5f6910903a4e541cc230ea63505a5c58beca15881e859f290861c from driver aufs: error creating aufs mount to /var/lib/docker/aufs/mnt/b09e2b7d38a5f6910903a4e541cc230ea63505a5c58beca15881e859f290861c: invalid argument

These errors were permanent and OBC would not run. Deleting all Docker images and restarting the Docker service did not fix this issue. The only way to fix this problem and allow OBC to function again was to stop Docker, delete /var/lib/docker, and start Docker again.
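
For reference, that recovery sequence amounts to something like the following (a sketch; the service commands assume the Ubuntu 14.04-style init setup used in the vagrant image discussed later in this thread):

sudo service docker stop
sudo rm -rf /var/lib/docker    # removes all images, containers, and volumes
sudo service docker start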

jyellick commented 8 years ago

@bcbrock If recovery required deleting /var/lib/docker then I'd agree, this seems like it's definitely a docker bug. This also matches the original post, where recovery required rebuilding the vagrant image.

@binhn @srderson @ghaskins Since you three have been most involved in the development of the Vagrant image, what do you think the right course of action here is?

It seems like we could:

  1. Switch the Docker storage backend from aufs to btrfs
  2. Update Docker to a newer release (1.10)

Of those options, my personal inclination would be to try switching to the btrfs backend first. The btrfs backend seems positioned to become the default storage backend in the future, and I suspect it is getting more fixes and testing. The downside here is that some older Linux distributions don't support the btrfs filesystem easily, and I assume we'd like the Vagrant image to reflect a standard deployment.

It might not be a bad idea to pick up docker 1.10 as well for any other fixes we might need, but depending on docker 1.10 might also limit the distributions where obc can successfully deploy.

ghaskins commented 8 years ago

@jyellick with the caveat that I don't have a good feel for what constraints we might face from a production/operations perspective, my inclination as a developer is to favor newer, stable releases over staying with older (often crufty) ones. So I have no specific objections to your proposals.

christo4ferris commented 8 years ago

I wonder which version of aufs we are using? A recent CVE mitigation updated 10.04 to a version of aufs that had a bug that affected docker and other container-based systems such as Cloud Foundry. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043

jyellick commented 8 years ago

My vagrant image shows the following:

vagrant@vagrant-ubuntu-trusty-64:~$ modinfo aufs
filename:       /lib/modules/3.13.0-55-generic/kernel/ubuntu/aufs/aufs.ko
alias:          fs-aufs
version:        3.13-20140303
description:    aufs -- Advanced multi layered unification filesystem
author:         Junjiro R. Okajima <aufs-users@lists.sourceforge.net>
license:        GPL
srcversion:     BE5BD47A94A27F96459B7B9
depends:        
intree:         Y
vermagic:       3.13.0-55-generic SMP mod_unload modversions 
signer:         Magrathea: Glacier signing key
sig_key:        13:DF:A5:11:F1:4D:87:06:C6:AC:E2:7E:0C:6A:7D:14:69:B9:EA:7F
sig_hashalgo:   sha512
parm:           brs:use <sysfs>/fs/aufs/si_*/brN (int)
vagrant@vagrant-ubuntu-trusty-64:~$ uname -a
Linux vagrant-ubuntu-trusty-64 3.13.0-55-generic #92-Ubuntu SMP Sun Jun 14 18:32:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
vagrant@vagrant-ubuntu-trusty-64:~$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"

According to the bug report, executing:

docker run -it --rm akihirosuda/test18180

will hang on affected configurations. In our vagrant image, this execution appears to hang at the sendfile(2) check, indicating that we are susceptible to this bug:

[INFO] Checking whether hitting docker#18180.
....................................................................................................
[INFO] OK. not hitting docker#18180.
[INFO] Checking whether sendfile(2) is killable.
[INFO] If the container hangs up here, you are still facing the bug that linux@296291cd tried to fix.

My inclination would still be to migrate off of aufs, because the Linux kernel community has rather directly indicated that aufs will never become a part of the mainline kernel. Still, updating aufs might fix our issues and be slightly less disruptive. Curious to hear thoughts from @binhn or @srderson.

srderson commented 8 years ago

I agree with @ghaskins that we should be moving towards newer stable releases. Is btrfs stable at this point? I was reading the docker site, and it stated: "However, many of its features are still under heavy development and users should consider it a fast-moving target".

jyellick commented 8 years ago

@binhn Is this something that we want to push into master, or should this be targeted for a development branch? Hopefully upgrading docker and moving to the btrfs storage backend will not destabilize anything, but we have obviously been doing all our testing against the current docker version and aufs.

binhn commented 8 years ago

@jyellick I think we should create a separate branch (your branch) and we can rerun these tests before we commit to master. This problem seems to stop @bcbrock dead, so it will happen on the Bluemix service. Can we get the branch up today so that we can at least spend 2 days testing on it? Thanks

jyellick commented 8 years ago

For moving to btrfs, we could just create a loopback-mounted file with the btrfs filesystem, but that is likely to really hurt our performance, so I'm investigating adding a second disk via Vagrant. This seems a little trickier than I'd hoped, but doable. Once the disk is present, switching the storage backend should be pretty direct. I'll try to put something together in the next 8 hours or so.
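
For anyone experimenting ahead of that change, switching an existing machine's storage backend would look roughly like this (a sketch only; the /dev/sdb device name, the DOCKER_OPTS mechanism, and the storage-driver flag are assumptions, and the rm wipes all existing images and containers):

sudo service docker stop
sudo mkfs.btrfs -f /dev/sdb                    # format the extra disk with btrfs
sudo rm -rf /var/lib/docker
sudo mkdir -p /var/lib/docker
sudo mount /dev/sdb /var/lib/docker
echo 'DOCKER_OPTS="--storage-driver=btrfs"' | sudo tee -a /etc/default/docker
sudo service docker start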

muralisrini commented 8 years ago

@kletkeman, @jyellick, et al.... at the risk of adding confusion to this discussion on recent failures, let me add this bug #648 I found.

I think at least some of the problems might be related to this bug, hit when users attempted to "clean up" docker containers and retry. Sometimes it works and sometimes not ... perhaps the container was not deleted after the last deploy, or the container was deleted and deployed again, or there was a genuine error during chaincode development that caused more of this cleanup/retry process.

Even if there are other docker issues, please do check whether some of the issues belong to this bug. If we don't face any issues after #648 is fixed, then there probably are no docker issues we need to worry about at present.

jyellick commented 8 years ago

I've put together a commit which updates the dev env to use btrfs and updates docker to 1.10: https://github.com/jyellick/obc-dev-env/commit/649f6c47149db08b1c039d8dc0b61b66b9b4d0e7 (note: this is not intended to be pull-ready, as it includes a commit to make it easier to use the local box image).

First cd to baseimage, and execute make, then make install. The make will rebuild the base image and will take quite a while, maybe around 30 minutes.

Next, back in the obc-dev-env root, execute vagrant destroy and then vagrant up. This will take a few minutes.

Finally, vagrant ssh as usual.
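
Condensed, the steps above are (a sketch; paths are relative to the obc-dev-env checkout):

cd baseimage
make            # rebuilds the base image; can take around 30 minutes
make install
cd ..
vagrant destroy
vagrant up
vagrant ssh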

I've tested with a known-good level of code from master, and all go tests and all behave tests pass after this update.

I tried testing with the more recent behave tests but received some failures; I am re-downloading the original image now and will test it when it completes.

Edit:

The same failures seem to occur at the older level of docker with aufs, so there's nothing obvious to be concerned about with the update. @kletkeman can you try to reproduce in this new vagrant environment, or provide instructions for me to do so?

ghaskins commented 8 years ago

@jyellick +1

As an aside, I'll look into whether there is some kind of trick (like "profiles") that eases the transition between pulling the remote baseimage and utilizing the locally generated one (rather than needing to hack the Vagrantfile). Not that it's a huge deal, but I am curious.

ghaskins commented 8 years ago

It looks like we can probably do something like

if ENV['USE_LOCAL_OBC_BASEIMAGE'] != "true"
  config.vm.box_version = "0.1.1"
end

I'll put a PR together

jyellick commented 8 years ago

@ghaskins Great idea, would definitely make this sort of testing easier.

One quick thing I forgot to mention last night: this has only been tested on a Linux host. I did my best to make sure the new disk stuff would be portable, but I have not been able to test on Mac or Windows.

kletkeman commented 8 years ago

@jyellick I'll take a shot at it today ...

kletkeman commented 8 years ago

@jyellick I apologize, but I just was not able to get to testing your stuff yesterday. I presume it is in the 22 that were pushed to master last night, and I will be working with all that stuff shortly ... just building the peer right now.

jyellick commented 8 years ago

@kletkeman This isn't yet in master; you'll still need to go through the process described in my comment above. Note that this is a change to the obc-dev-env, not to obc-peer, as it updates the version of Docker and moves the storage backend from aufs to btrfs.

@binhn asked for people to try testing these changes yesterday via slack, but I've not heard any feedback from anyone.

jyellick commented 8 years ago

@kletkeman You can now get the docker update more easily; in your obc-dev-env simply run:

vagrant destroy
git pull
vagrant up

And you should pick up the new vagrant image.

jyellick commented 8 years ago

@kletkeman Are you still observing this behavior, or can this issue be closed?

With default logging levels, all behave tests pass for me. Is there some other recreation method I should attempt?

kletkeman commented 8 years ago

@jyellick Thanks for reminding me. No, I have not seen this in a long time and I agree that it has been fixed.

jyellick commented 8 years ago

Thanks @kletkeman, closing now.