synechron-finlabs / quorum-maker

Utility to create and monitor Quorum nodes
Apache License 2.0

Cluster's master node crashed, Server panics and not able to start the node again #116

Open TheBlockchainDevNeeraj opened 5 years ago

TheBlockchainDevNeeraj commented 5 years ago

I have a three-node cluster. After 7500+ transactions, the master node suddenly crashed and now will not start again. The stack trace follows; please suggest what is wrong:

rpc call eth_coinbase() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_coinbase() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_coinbase() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_blockNumber() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_blockNumber() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_getBlockByNumber() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
2019/08/12 13:07:22 http: panic serving 172.17.0.1:60370: runtime error: invalid memory address or nil pointer dereference
goroutine 315 [running]:
net/http.(*conn).serve.func1(0xc420185b80)
    /usr/local/go/src/net/http/server.go:1726 +0xd0
panic(0x7e6900, 0xc5edb0)
    /usr/local/go/src/runtime/panic.go:502 +0x229
github.com/ybbus/jsonrpc.(*RPCResponse).GetObject(0x0, 0x7af020, 0xc42029a780, 0x8e, 0x0)
    /go/src/github.com/ybbus/jsonrpc/jsonrpc.go:609 +0x26
github.com/synechron-finlabs/quorum-maker-nodemanager/client.(*EthClient).GetBlockByNumber(0xc4200619f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/client/EthClient.go:154 +0x201
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).latestBlockDetails(0xc420051250, 0x7fff2ca86e81, 0x16, 0xc4203ccf80, 0xc42004ebb8)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/service/NodeService.go:1026 +0x100
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).LatestBlockHandler(0xc420051250, 0x8a7d20, 0xc4202a6460, 0xc4203e4600)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/service/NodeServiceHandler.go:442 +0x51
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).LatestBlockHandler-fm(0x8a7d20, 0xc4202a6460, 0xc4203e4600)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/main.go:78 +0x48
net/http.HandlerFunc.ServeHTTP(0xc420051b70, 0x8a7d20, 0xc4202a6460, 0xc4203e4600)
    /usr/local/go/src/net/http/server.go:1947 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc420126180, 0x8a7d20, 0xc4202a6460, 0xc42031b600)
    /go/src/github.com/gorilla/mux/mux.go:212 +0xcd
net/http.serverHandler.ServeHTTP(0xc420090ea0, 0x8a7d20, 0xc4202a6460, 0xc42031b600)
    /usr/local/go/src/net/http/server.go:2694 +0xbc
net/http.(*conn).serve(0xc420185b80, 0x8a80a0, 0xc42006b0c0)
    /usr/local/go/src/net/http/server.go:1830 +0x651
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:2795 +0x27b
rpc call eth_blockNumber() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_blockNumber() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_getBlockByNumber() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
2019/08/12 13:07:22 http: panic serving 172.17.0.1:60796: runtime error: invalid memory address or nil pointer dereference
goroutine 333 [running]:
net/http.(*conn).serve.func1(0xc420300140)
    /usr/local/go/src/net/http/server.go:1726 +0xd0
panic(0x7e6900, 0xc5edb0)
    /usr/local/go/src/runtime/panic.go:502 +0x229
github.com/ybbus/jsonrpc.(*RPCResponse).GetObject(0x0, 0x7af020, 0xc420175b80, 0x8e, 0x0)
    /go/src/github.com/ybbus/jsonrpc/jsonrpc.go:609 +0x26
github.com/synechron-finlabs/quorum-maker-nodemanager/client.(*EthClient).GetBlockByNumber(0xc4200619f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/client/EthClient.go:154 +0x201
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).latestBlockDetails(0xc420051250, 0x7fff2ca86e81, 0x16, 0xc4203fb080, 0xc42004ebb8)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/service/NodeService.go:1026 +0x100
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).LatestBlockHandler(0xc420051250, 0x8a7d20, 0xc420272b60, 0xc42031ba00)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/service/NodeServiceHandler.go:442 +0x51
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).LatestBlockHandler-fm(0x8a7d20, 0xc420272b60, 0xc42031ba00)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/main.go:78 +0x48
net/http.HandlerFunc.ServeHTTP(0xc420051b70, 0x8a7d20, 0xc420272b60, 0xc42031ba00)
    /usr/local/go/src/net/http/server.go:1947 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc420126180, 0x8a7d20, 0xc420272b60, 0xc42031b800)
    /go/src/github.com/gorilla/mux/mux.go:212 +0xcd
net/http.serverHandler.ServeHTTP(0xc420090ea0, 0x8a7d20, 0xc420272b60, 0xc42031b800)
    /usr/local/go/src/net/http/server.go:2694 +0xbc
net/http.(*conn).serve(0xc420300140, 0x8a80a0, 0xc42025bf40)
    /usr/local/go/src/net/http/server.go:1830 +0x651
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:2795 +0x27b
rpc call eth_coinbase() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
rpc call eth_coinbase() on http://localhost:22000: Post http://localhost:22000: dial tcp 127.0.0.1:22000: connect: connection refused
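
A note on reading the trace above: the GetObject frame is entered with 0x0 as its first argument, right after a run of "connection refused" errors. That pattern is consistent with a failed RPC call returning a nil response that is then used without a check, which would make the panic a symptom of the node on port 22000 already being down rather than the cause of the crash. The Go sketch below illustrates a guard against that pattern. It is not the nodemanager's actual code: rpcResponse, call, and getBlockByNumber are hypothetical stand-ins written for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// rpcResponse is a hypothetical stand-in for a JSON-RPC response type.
type rpcResponse struct {
	Result map[string]interface{}
}

// GetObject copies the result into out. It dereferences its receiver,
// so calling it on a nil *rpcResponse panics with a nil pointer dereference.
func (r *rpcResponse) GetObject(out *map[string]interface{}) error {
	*out = r.Result
	return nil
}

// call simulates an RPC round trip. When the node is down it returns
// a nil response together with an error (an assumption for this sketch).
func call(nodeUp bool, method string) (*rpcResponse, error) {
	if !nodeUp {
		return nil, errors.New("rpc call " + method + "(): connection refused")
	}
	return &rpcResponse{Result: map[string]interface{}{"number": "0x1d4c"}}, nil
}

// getBlockByNumber checks both the error and the response before using
// the response, so an unreachable node yields an error instead of a panic.
func getBlockByNumber(nodeUp bool) (map[string]interface{}, error) {
	resp, err := call(nodeUp, "eth_getBlockByNumber")
	if err != nil {
		return nil, fmt.Errorf("node unreachable: %w", err)
	}
	if resp == nil {
		return nil, errors.New("empty RPC response")
	}
	var block map[string]interface{}
	if err := resp.GetObject(&block); err != nil {
		return nil, err
	}
	return block, nil
}

func main() {
	if _, err := getBlockByNumber(false); err != nil {
		fmt.Println("handled:", err)
	}
	if block, err := getBlockByNumber(true); err == nil {
		fmt.Println("block number:", block["number"])
	}
}
```

A guard like this would only keep the node manager's HTTP handler alive; the node itself would still need to be brought back up.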

TheBlockchainDevNeeraj commented 5 years ago

I looked into the Constellation logs, which read: gas limit reached

Is this related?

zhjzcbm commented 5 years ago

Has it been solved?

TheBlockchainDevNeeraj commented 5 years ago

No, I still see this issue every time I cross the 6K or 7K block count. Can you think of any reason for this? I suspect something is wrong with node syncing, as it sometimes shows an error when syncing the Raft ID.

zhjzcbm commented 5 years ago

I used raft.remove() to remove the faulty node and then rejoined it to the network to solve this problem.

TheBlockchainDevNeeraj commented 5 years ago

My master node was down, but the slaves were working fine and showed 2 nodes in the network. Can I remove the master node from the slaves? Even if that is possible, I don't know how to do it. Can you please tell me the command to remove the node using that method? Then I will be able to rejoin on my own :smile:

zhjzcbm commented 5 years ago

Use geth to attach to the geth.ipc file, which is in your node directory under node/qdata/:

geth attach ./geth.ipc

If geth is not installed, you need to install it. Download the source from https://github.com/jpmorganchase/quorum.git

You also need make:

sudo apt install make -y (or: sudo yum install make -y)

Then build and install geth:

cd quorum

make geth

sudo cp ./build/bin/geth /usr/bin/

Once you are in the console via geth attach ./geth.ipc, use raft to view the node information, then run

raft.remove(node ID)

to remove the faulty node.

Do all of this on a healthy node.
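
For anyone scripting this instead of typing into the console: the Quorum Raft API documents the console calls as raft.cluster and raft.removePeer(raftId), with JSON-RPC equivalents raft_cluster and raft_removePeer. The Go sketch below only builds the request bodies and sends nothing. The raft ID 1 is a placeholder, buildRequest is a helper invented for this example, and whether the raft API is reachable over a given endpoint depends on how the node was started.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a minimal JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
	ID      int           `json:"id"`
}

// buildRequest is a hypothetical helper that serialises one request.
func buildRequest(id int, method string, params ...interface{}) (string, error) {
	if params == nil {
		params = []interface{}{} // encode as [] rather than null
	}
	b, err := json.Marshal(rpcRequest{JSONRPC: "2.0", Method: method, Params: params, ID: id})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Step 1: list the cluster to find the raft ID of the faulty node.
	list, err := buildRequest(1, "raft_cluster")
	if err != nil {
		panic(err)
	}
	fmt.Println(list)

	// Step 2: remove the faulty node; 1 is a placeholder raft ID.
	remove, err := buildRequest(2, "raft_removePeer", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(remove)
}
```

As in the steps above, these requests would be directed at a healthy node, not at the one that crashed.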

fullkomnun commented 4 years ago

Just ran into the same issue yesterday while running a Tessera-based Quorum network of 3 nodes, generated by quorum-maker (with some amendments to tessera-config.json), as part of integration testing with the Testcontainers docker-compose module. Running on macOS.

The crash of node3:

Waiting for Node 1 to deploy NetworkManager contract...
{"level":"info","msg":"Node Manager listening on :22004...","time":"2020-01-21T17:03:24Z"}
{"level":"info","msg":"Adding whitelisted IPs","time":"2020-01-21T17:03:28Z"}
rpc call eth_getBlockByNumber() on http://localhost:22000: Post http://localhost:22000: EOF
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x703b86]

goroutine 19 [running]:
github.com/ybbus/jsonrpc.(*RPCResponse).GetObject(0x0, 0x7af020, 0xc420149040, 0x5c, 0x0)
    /go/src/github.com/ybbus/jsonrpc/jsonrpc.go:609 +0x26
github.com/synechron-finlabs/quorum-maker-nodemanager/client.(*EthClient).GetBlockByNumber(0xc420061b60, 0xc420272810, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/client/EthClient.go:154 +0x201
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).getContracts(0xc420051250, 0x7ffc2cecff21, 0x16)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/service/NodeService.go:1440 +0x736
github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).ContractCrawler.func1(0xc420360000, 0xc420051250, 0x7ffc2cecff21, 0x16)
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/service/NodeService.go:1427 +0xa1
created by github.com/synechron-finlabs/quorum-maker-nodemanager/service.(*NodeServiceImpl).ContractCrawler
    /go/src/github.com/synechron-finlabs/quorum-maker-nodemanager/service/NodeService.go:1424 +0x70

This crash happened (somewhat) consistently when running a bunch of tests on the same underlying network. For no apparent reason, I cannot reproduce it now. Running the same test with only 2 nodes seems to eliminate the issue.

Any insights regarding the reason for this crash?

fullkomnun commented 4 years ago

I think I might have solved this mystery.

The Raft consensus used by quorum-maker is sensitive to clock-sync issues, and there were known time-drift issues on macOS. Making sure the latest version of Docker for Mac is installed and adding a volume binding such as /etc/localtime:/etc/localtime:ro to docker-compose.yml seems to eliminate these issues.

Resources:
https://stackoverflow.com/questions/22800624/will-docker-container-auto-sync-time-with-the-host-machine
https://www.docker.com/blog/addressing-time-drift-in-docker-desktop-for-mac/