hyperledger-archives / fabric

THIS IS A READ-ONLY historic repository. Current development is at https://gerrit.hyperledger.org/r/#/admin/projects/fabric . pull requests not accepted
https://gerrit.hyperledger.org/
Apache License 2.0
1.17k stars 1.01k forks source link

peer crashed during state transfer #1865

Closed yacovm closed 8 years ago

yacovm commented 8 years ago

PBFT classic, 4 nodes, no security

commit a94f5a7f559a4b7459e6ba37419e44bf335db81b

Scenario: invoked 5000 transactions of:

{
  "jsonrpc": "2.0",
  "method": "invoke",
  "params": {
      "type": 1,
      "chaincodeID":{
          "name":"ID"
      },
      "ctorMsg": {
         "function":"invoke",
         "args":["a", "b","1000"]
      }
  },
  "id": 3
}

All were OK, but the 4 peers chain size were different and they were doing state transfer. vp2 crashed during that.

08:07:09.306 [buckettree] computeCryptoHash -> DEBU 54984 Computing crypto-hash for bucket [level=[0], bucketNumber=[1]] by merging [2] children
08:07:09.306 [state] GetHash -> DEBU 54985 Exit - GetHash()
08:07:09.306 [state] GetHash -> DEBU 54986 Enter - GetHash()
08:07:09.306 [buckettree] ComputeCryptoHash -> DEBU 54987 Enter - ComputeCryptoHash()
08:07:09.306 [consensus/statetransfer] func1 -> DEBU 54988 name:"vp2"  successfully synced from block 197 to block 191
08:07:09.306 [consensus/statetransfer] syncBlocks -> DEBU 5498a name:"vp2"  returned from sync with block 191 and state hash a6a6e4177e17ac874c4d41c9de8b01aebe95a5daccf26aca8c1da6a0fac3af8c77fceb6993f1748a05a053a66b7ddd1e2e801518800a1792003e23ca7b94d81d
08:07:09.306 [consensus/statetransfer] syncBlockchainToCheckpoint -> DEBU 5498b name:"vp2"  replying to blockSyncReq on reply channel with : %!s(<nil>)
08:07:09.306 [consensus/statetransfer] attemptStateTransfer -> DEBU 5498c name:"vp2"  state transfer thread continuing
08:07:09.306 [state] GetHash -> DEBU 5498d Enter - GetHash()
08:07:09.306 [buckettree] ComputeCryptoHash -> DEBU 5498e Enter - ComputeCryptoHash()
08:07:09.306 [buckettree] ComputeCryptoHash -> DEBU 54989 Returing existing crypto-hash as recomputation not required
08:07:09.306 [state] GetHash -> DEBU 54990 Exit - GetHash()
08:07:09.306 [indexes] addIndexDataForPersistence -> DEBU 54991 Indexing block number [198] by hash = [e5df3977cd42bc97bb13c91527aebeebf43ffbab922c5f5cbdcb7c5b894cd4f3d312a633f16975b565eabae7516085844c35da00e7ec309ab0f3af3dd67f2ac7]
08:07:09.306 [buckettree] ComputeCryptoHash -> DEBU 5498f Returing existing crypto-hash as recomputation not required
08:07:09.306 [state] AddChangesForPersistence -> DEBU 54992 state.addChangesForPersistence()...start
08:07:09.306 [state] GetHash -> DEBU 54993 Exit - GetHash()
08:07:09.306 [buckettree] getAffectedBuckets -> DEBU 54994 Adding changed bucket [level=[9], bucketNumber=[879153]]
08:07:09.306 [buckettree] getAffectedBuckets -> DEBU 54995 Adding changed bucket [level=[9], bucketNumber=[101582]]
08:07:09.306 [buckettree] getAffectedBuckets -> DEBU 54996 Changed buckets are = [[level=[9], bucketNumber=[879153] level=[9], bucketNumber=[101582]]]
08:07:09.306 [consensus/statetransfer] stateThread -> ERRO 54997 name:"vp2"  believed its state for block 191 to be valid, but its hash (4ca5f67b976263f30687fdee0a84d43ef91f9b5f1eac62ba9e82806d9c4fc60fdee7d0dfa1bd4d837bb8cc745860ece48d414d7602eac623f33c0541fabe5552) did not match the recovered blockchain's (a6a6e4177e17ac874c4d41c9de8b01aebe95a5daccf26aca8c1da6a0fac3af8c77fceb6993f1748a05a053a66b7ddd1e2e801518800a1792003e23ca7b94d81d)
08:07:09.306 [consensus/handler] Errored -> WARN 54999 state transfer reported error for block 0, seqNo 0: name:"vp2"  believed its state for block 191 to be valid, but its hash (4ca5f67b976263f30687fdee0a84d43ef91f9b5f1eac62ba9e82806d9c4fc60fdee7d0dfa1bd4d837bb8cc745860ece48d414d7602eac623f33c0541fabe5552) did not match the recovered blockchain's (a6a6e4177e17ac874c4d41c9de8b01aebe95a5daccf26aca8c1da6a0fac3af8c77fceb6993f1748a05a053a66b7ddd1e2e801518800a1792003e23ca7b94d81d)
08:07:09.306 [consensus/statetransfer] syncStateSnapshot -> DEBU 5499a name:"vp2"  attempting to retrieve state snapshot from recovery from [name:"vp3"  name:"vp0" ]
08:07:09.307 [consensus/statetransfer] tryOverPeers -> DEBU 5499b name:"vp2"  in tryOverPeers, using peerIDs: [name:"vp3"  name:"vp0" ]
08:07:09.307 [consensus/statetransfer] func1 -> DEBU 5499c name:"vp2"  is initiating state recovery from name:"vp3" 
08:07:09.306 [buckettree] addDataNodeChangesForPersistence -> DEBU 54998 Adding data node with value = []byte{0x31, 0x39, 0x33, 0x30, 0x30, 0x30}
08:07:09.307 [buckettree] ClearWorkingSet -> DEBU 5499d Enter - ClearWorkingSet()
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0xa89438]

goroutine 2975 [running]:
panic(0xd5aec0, 0xc8200140a0)
    /opt/go/src/runtime/panic.go:464 +0x3e6
github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree.(*StateImpl).addDataNodeChangesForPersistence(0xc820269130, 0xc8203da020)
    /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree/state_impl.go:240 +0xd8
github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree.(*StateImpl).AddChangesForPersistence(0xc820269130, 0xc8203da020, 0x0, 0x0)
    /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree/state_impl.go:231 +0x98
github.com/hyperledger/fabric/core/ledger/statemgmt/state.(*State).AddChangesForPersistence(0xc820269220, 0xc6, 0xc8203da020)
    /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/statemgmt/state/state.go:289 +0xcd
github.com/hyperledger/fabric/core/ledger.(*Ledger).CommitTxBatch(0xc82028dc20, 0xbce6c0, 0xc82deaeb00, 0xc820076218, 0x1, 0x1, 0xc820076230, 0x1, 0x1, 0xc8213386b8, ...)
    /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/ledger.go:166 +0x429
github.com/hyperledger/fabric/consensus/helper.(*Helper).CommitTxBatch(0xc8202cc460, 0xbce6c0, 0xc82deaeb00, 0xc8213386b8, 0x3, 0x8, 0xc820217d40, 0x0, 0x0)
    /opt/gopath/src/github.com/hyperledger/fabric/consensus/helper/helper.go:213 +0x20a
github.com/hyperledger/fabric/consensus/obcpbft.(*obcClassic).execute.func1(0xc82da3fb80, 0x15f, 0x15f, 0xc3, 0xc8202dd5c0)
    /opt/gopath/src/github.com/hyperledger/fabric/consensus/obcpbft/obc-classic.go:147 +0x57f
created by github.com/hyperledger/fabric/consensus/obcpbft.(*obcClassic).execute
    /opt/gopath/src/github.com/hyperledger/fabric/consensus/obcpbft/obc-classic.go:150 +0x5d
yacovm commented 8 years ago

Now another error, in latest build ( eaec0807bcdfe31d010395851d7b2c9bef0cf93a ), vp0 crashed.

 NumChildren={1}
Childern crypto-hashes:
childNumber={1}, cryptoHash={0e3ef872a6f4b9e3b1264f74152512a6bb7e2fbda74e838e1872a8bc72b8cc282c147650082958e3be55d847d2d62a75a5cb62b30428b693574d583004cbd5d5}
]
panic: runtime error: index out of range

goroutine 37 [running]:
panic(0xd5aec0, 0xc820014050)
        /opt/go/src/runtime/panic.go:464 +0x3e6
github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree.(*bucketNode).mergeBucketNode(0xc837c5b3c0, 0xc837880240)
        /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree/bucket_node.go:73 +0x32b
github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree.(*StateImpl).processBucketTreeDelta(0xc82026d0e0, 0x0, 0x0)
        /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree/state_impl.go:156 +0x44b
github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree.(*StateImpl).ComputeCryptoHash(0xc82026d0e0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/statemgmt/buckettree/state_impl.go:115 +0x114
github.com/hyperledger/fabric/core/ledger/statemgmt/state.(*State).GetHash(0xc82026d1d0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/statemgmt/state/state.go:237 +0x11d
github.com/hyperledger/fabric/core/ledger.(*Ledger).GetTempStateHash(0xc820291d40, 0x0, 0x0, 0x0, 0x0, 0x0)
        /opt/gopath/src/github.com/hyperledger/fabric/core/ledger/ledger.go:212 +0x42
github.com/hyperledger/fabric/core/peer.(*PeerImpl).GetCurrentStateHash(0xc82026d9f0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /opt/gopath/src/github.com/hyperledger/fabric/core/peer/peer.go:660 +0xc7
github.com/hyperledger/fabric/core/peer/statetransfer.(*StateTransferState).attemptStateTransfer(0xc820202380, 0xc83282de48, 0xc83282de50, 0xc83282de58, 0xc83282de47, 0x0, 0x0)
        /opt/gopath/src/github.com/hyperledger/fabric/core/peer/statetransfer/statetransfer.go:883 +0xcaa
github.com/hyperledger/fabric/core/peer/statetransfer.(*StateTransferState).stateThread(0xc820202380)
        /opt/gopath/src/github.com/hyperledger/fabric/core/peer/statetransfer/statetransfer.go:937 +0x2c1
created by github.com/hyperledger/fabric/core/peer/statetransfer.NewStateTransferState
        /opt/gopath/src/github.com/hyperledger/fabric/core/peer/statetransfer/statetransfer.go:381 +0x52

In bucket_node.go:

func (bucketNode *bucketNode) mergeBucketNode(anotherBucketNode *bucketNode) {
    if !bucketNode.bucketKey.equals(anotherBucketNode.bucketKey) {
        panic(fmt.Errorf("Nodes with different keys can not be merged. BaseKey=[%#v], MergeKey=[%#v]", bucketNode.bucketKey, anotherBucketNode.bucketKey))
    }
    for i, childCryptoHash := range anotherBucketNode.childrenCryptoHash {
        if !bucketNode.childrenUpdated[i] {
            bucketNode.childrenCryptoHash[i] = childCryptoHash
        }
    }
}

I guess the other bucket's children crypto hash is too big for the childrenUpdated.

manish-sethi commented 8 years ago

@yacovm did you change the bucket tree configurations in core.yaml after inserting a few transactions?

yacovm commented 8 years ago

no

yacovm commented 8 years ago

I'll run some more tests, if I still won't be able to reproduce it after 1400 runs I'll close this because it may be related to a container image being created while an already existing database of a prior commit was present in the FS. Does anyone know if when I make images (via the Makefile) - it copies the /var/hyperledger/production/db to the docker image?

manish-sethi commented 8 years ago

I think that the reason of the first panic trace (panic: runtime error: invalid memory address or nil pointer dereference) above is because of concurrent modifications to state - one by the state transfer thread and another by the transaction execution thread. The logs above the panic trace suggest that state transfer is in progress and the panic stack trace contains the call consensus/helper.(*Helper).CommitTxBatch. @jyellick is it possible in the code that state transfer and transaction execution happen concurrently?

For the second panic trace (panic: runtime error: index out of range) also, I looked at the code again and again, modifying state in parallel could be one of the potential causes. @yacovm do you have full logs or dump of all the active threads so we can verify this.

jyellick commented 8 years ago

@manish-sethi In code at commit a94f5a7, it was possible for executions and state transfer to modify the state concurrently, this was addressed in 67396654dfc881219db395e02c1a57ff0b8f3e24, this would additionally have been included in eaec080.

I will note, that state transfer may fill in gaps of blocks in the blockchain concurrently, but state modifications should now be serialized.

manish-sethi commented 8 years ago

Thanks @jyellick. @yacovm so first panic is fixed. For the second, I would need debug logs + dump of other active goroutines to be able to infer.

yacovm commented 8 years ago

Can't reproduce. I'm closing it and will re-open if it'll pop up again.