tendermint / tmkms

Key Management service for Tendermint Validator nodes
Apache License 2.0
142 stars 42 forks source link

tmkms v0.7.0 hang #389

Closed creamjun closed 4 years ago

creamjun commented 4 years ago

no error, just hang in tmkms v0.7.0

After tmkms stops first, I get the following error from gaiad:

E[2020-01-01|06:07:02.792] SignerClient::SignVote                       module=privval err="read tcp 127.0.0.1:26666->127.0.0.1:47828: i/o timeout: endpoint read timed out"
E[2020-01-01|06:07:02.792] Error signing vote                           module=consensus height=248863 round=0 vote="Vote{48:68F5BBEACEF1 248863/00/1(Prevote) DA64AD3E2B74 000000000000 @ 2019-12-31T21:06:59.79003623Z}" err="read tcp 127.0.0.1:26666->127.0.0.1:47828: i/o timeout: endpoint read timed out"
E[2020-01-01|06:07:05.792] SignerListener: Ping timeout                 module=privval
E[2020-01-01|06:07:08.793] SignerClient::GetPubKey                      module=privval err="endpoint connection timed out"
E[2020-01-01|06:07:08.797] CONSENSUS FAILURE!!!                         module=consensus err="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 883 [running]:\nruntime/debug.Stack(0xc000c8f358, 0x1047dc0, 0x1f50a00)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func2(0xc002aa6700, 0x13aa1e0)\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:602 +0x57\npanic(0x1047dc0, 0x1f50a00)\n\t/usr/local/go/src/runtime/panic.go:679 +0x1b2\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).signAddVote(0xc002aa6700, 0x3cc02, 0xc00f754b20, 0x20, 0x20, 0x1, 0xc00f754b80, 0x20, 0x20, 0xb)\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1745 +0x585\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).enterPrecommit(0xc002aa6700, 0x3cc1f, 0x0)\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1155 +0x12ec\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc002aa6700, 0xc004c2adc0, 0xc0020a20c0, 0x28, 0xc000c8faa8, 0x4466f7, 0xc002aa6798)\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1663 +0xdfa\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc002aa6700, 0xc004c2adc0, 0xc0020a20c0, 0x28, 0xcc2f5edad59daf76, 0x109, 0x101)\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1527 +0x59\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc002aa6700, 0x15612c0, 0xc00fda7e48, 0xc0020a20c0, 0x28)\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:691 +0x252\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc002aa6700, 0x0)\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:633 +0x6eb\ncreated by github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart\n\t/home/jun/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:334 +0x13a\n"
tarcieri commented 4 years ago

This looks potentially related to #385

Can you post the KMS-side logs, preferably with debug logging enabled?

Which signing backend are you using?

creamjun commented 4 years ago

This looks potentially related to #385

Can you post the KMS-side logs, preferably with debug logging enabled?

nothing logs in kms. I'll post if there's a problem after debug logging enabled.

Which signing backend are you using?

yubihsm

tarcieri commented 4 years ago

nothing logs in kms.

Are you running it with tmkms start -v to enable verbose logging?

What do you see immediately before it stops?

Can you perhaps try running it under strace? What OS are you using?

creamjun commented 4 years ago

nothing logs in kms.

Are you running it with tmkms start -v to enable verbose logging?

tmkms did not start with -v when an error occurred. now restarted with -v. next time I'm having trouble with tmkms, I'll post log.

What do you see immediately before it stops?

Before stopping, there is right sign log. (signed PreVote & PreCommit)

Can you perhaps try running it under strace? What OS are you using?

OS : CentOS7

tarcieri commented 4 years ago

OS : CentOS7

For what it's worth, we also use CentOS 7 and don't experience this issue, so it doesn't seem to be specific to that.

creamjun commented 4 years ago

OS : CentOS7

For what it's worth, we also use CentOS 7 and don't experience this issue, so it doesn't seem to be specific to that.

more info.. i using 3 chain per 1 hsm. (cosmos, terra, kava)

creamjun commented 4 years ago

@tarcieri again, this time tmkms hang without log. (although given the -v option)

What's special is that there is a "yubihsm: session ...." log every second, and that log hasn't been taken five seconds before tmkms hang.

tarcieri commented 4 years ago

Any chance you can try running it under strace?

creamjun commented 4 years ago

I didn't think of strace. next time I will use strace to record the log.

creamjun commented 4 years ago

Since the v0.7.1 update, it doesn't seem to be a problem anymore. Even if prevote occurs continuously, it works wall by reconnecting. thank you.