palomachain / paloma

The fast blockchain messenger protocol
Apache License 2.0
291 stars 135 forks source link

ERR CONSENSUS FAILURE!!! err="non-recoverable error when signing vote after upgrade to Cosmos SDK 0.50.5 #1102

Closed taariq closed 7 months ago

taariq commented 7 months ago

What is happening?

Section description Provide as much context as you can. Give as much context as you can to make it easier for the developers to figure what is happening.
2:13PM ERR CONSENSUS FAILURE!!! err="non-recoverable error when signing vote (15277751/0)" module=consensus stack="goroutine 837 [running]:
runtime/debug.Stack()
    /opt/hostedtoolcache/go/1.21.8/x64/src/runtime/debug/stack.go:24 +0x5e
github.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:800 +0x46
panic({0x2d03080?, 0xc00a719cc0?})
    /opt/hostedtoolcache/go/1.21.8/x64/src/runtime/panic.go:914 +0x21f
github.com/cometbft/cometbft/consensus.(*State).signVote(0xc001961500, 0x2, {0xc007d79180, 0x20, 0x20}, {0xe0eb5c?, {0xc007d791a0?, 0xc00ca1f2ec?, 0x3035160?}}, 0xc00126af00)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2388 +0x639
github.com/cometbft/cometbft/consensus.(*State).signAddVote(0xc001961500, 0x0?, {0xc007d79180, 0x20, 0x20}, {0x1?, {0xc007d791a0?, 0xc008c41ec0?, 0x20?}}, 0xc00126af00)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2439 +0x212
github.com/cometbft/cometbft/consensus.(*State).enterPrecommit(0xc001961500, 0xe91eb7, 0x0)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:1536 +0x1337
github.com/cometbft/cometbft/consensus.(*State).addVote(0xc001961500, 0xc000f6f040, {0xc002df0690, 0x28})
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2296 +0x186f
github.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc001961500, 0xc000f6f040, {0xc002df0690?, 0xc00ca1fc08?})
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2056 +0x26
github.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc001961500, {{0x3c0ebc0, 0xc0009e8a90}, {0xc002df0690, 0x28}})
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:928 +0x3ce
github.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc001961500, 0x0)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:835 +0x3d1
created by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 321
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:397 +0x10c"

2:13PM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/home/validator/.paloma/data/cs.wal/wal
2:13PM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/home/validator/.paloma/data/cs.wal/wal
2:13PM INF Timed out dur=3000 height=15277751 module=consensus round=0 step=RoundStepPropose
2:13PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=4 numToDial=6

Paloma and pigeon versions and logs

Section description Write down paloma version. Write down pigeon version. Copy and paste pigeon config file as well as relevant ENV variables.

paloma: 1.13.0 pigeons: 1.11.0

How to reproduce?

Section description Please write detailed steps of what you were doing for this bug to appear.

Unable to repro with other validators. Validator unable to sign blocks.

What is the expected behaviour?

Section description If you know, please write down what is the expected behaviour. If you don't know, that's ok. We can have a discussion in comments.

Sign blocks without error.

frangrech commented 7 months ago

This is the full error we are seeing:

2:13PM ERR CONSENSUS FAILURE!!! err="non-recoverable error when signing vote (15277751/0)" module=consensus stack="goroutine 837 [running]:
runtime/debug.Stack()
    /opt/hostedtoolcache/go/1.21.8/x64/src/runtime/debug/stack.go:24 +0x5e
github.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:800 +0x46
panic({0x2d03080?, 0xc00a719cc0?})
    /opt/hostedtoolcache/go/1.21.8/x64/src/runtime/panic.go:914 +0x21f
github.com/cometbft/cometbft/consensus.(*State).signVote(0xc001961500, 0x2, {0xc007d79180, 0x20, 0x20}, {0xe0eb5c?, {0xc007d791a0?, 0xc00ca1f2ec?, 0x3035160?}}, 0xc00126af00)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2388 +0x639
github.com/cometbft/cometbft/consensus.(*State).signAddVote(0xc001961500, 0x0?, {0xc007d79180, 0x20, 0x20}, {0x1?, {0xc007d791a0?, 0xc008c41ec0?, 0x20?}}, 0xc00126af00)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2439 +0x212
github.com/cometbft/cometbft/consensus.(*State).enterPrecommit(0xc001961500, 0xe91eb7, 0x0)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:1536 +0x1337
github.com/cometbft/cometbft/consensus.(*State).addVote(0xc001961500, 0xc000f6f040, {0xc002df0690, 0x28})
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2296 +0x186f
github.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc001961500, 0xc000f6f040, {0xc002df0690?, 0xc00ca1fc08?})
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:2056 +0x26
github.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc001961500, {{0x3c0ebc0, 0xc0009e8a90}, {0xc002df0690, 0x28}})
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:928 +0x3ce
github.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc001961500, 0x0)
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:835 +0x3d1
created by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 321
    /home/runner/go/pkg/mod/github.com/cometbft/cometbft@v0.38.5/consensus/state.go:397 +0x10c"

2:13PM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/home/validator/.paloma/data/cs.wal/wal
2:13PM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/home/validator/.paloma/data/cs.wal/wal
2:13PM INF Timed out dur=3000 height=15277751 module=consensus round=0 step=RoundStepPropose
2:13PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=4 numToDial=6

The node catches up to the current height, but immediately halts once it is unable to sign. It afterwards shows peering info/errors. It happens irrespective of which height it is trying to sign.

verabehr commented 7 months ago

Also noting that this issue is not causing the validator to get jailed as far as I can tell

byte-bandit commented 7 months ago

@frangrech Are you using a remote signer? Are you using the release build or are you building from source?

frangrech commented 7 months ago

Yes, tmkms.

taariq commented 7 months ago

@frangrech COO! COO! Heads up. If you are using a remote signer like tmkms on Paloma with the upgrade to cosmos SDK 0.50.5, please make sure you are upgraded on the signer as well. https://github.com/iqlusioninc/tmkms/tags

taariq commented 7 months ago

According to @frangrech

we tried updating tmkms to v0.14.0, but the issue is still persisting. We're getting the same error as I previously reported.

taariq commented 7 months ago

https://github.com/cometbft/cometbft/issues/2711#issuecomment-2037466222 we need to upgrade and implement vote extensions

taariq commented 7 months ago

Superseded by https://github.com/palomachain/paloma/issues/1105