vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0
37 stars 19 forks source link

[Bug]: Panic during checkpoint tests #9275

Closed jgsbennett closed 10 months ago

jgsbennett commented 10 months ago

Problem encountered

We saw this panic during a test run. See slack thread: https://vegaprotocol.slack.com/archives/C03K403SZMM/p1693561610707909 It's really not clear what caused it, or if it's repeatable. There's weird things going on during this test run, so it's possibly caused by something devops-related. Raising in case somebody is able to deduce any more than me from the stack.

panic: invalid node wallet configurations, the genesis validator mapping differ to the wallets imported by the nodewallet [recovered]
    panic: invalid node wallet configurations, the genesis validator mapping differ to the wallets imported by the nodewallet

goroutine 112 [running]:
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).Run.func1.1()
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:154 +0x59
panic({0x4859960, 0xc001213c80})
    /usr/local/go/src/runtime/panic.go:884 +0x213
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x8?, 0x8?, {0x0?, 0x0?, 0xc00171c620?})
    /jenkins/GOPATH/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0016212b0, {0xc000df5600, 0x8, 0x8})
    /jenkins/GOPATH/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*Logger).Panic(0x62ea65c?, {0x52d2c70?, 0x210?}, {0xc000df5600, 0x8, 0x8})
    /jenkins/GOPATH/pkg/mod/go.uber.org/zap@v1.24.0/logger.go:258 +0x59
code.vegaprotocol.io/vega/core/validators.(*Topology).checkValidatorDataWithSelfWallets(0xc00123afc0, {{0xc0014f2bc0, 0x40}, {0xc0014f2c00, 0x40}, 0x1, {0xc001257b60, 0x2a}, {0xc001257b90, 0x2c}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/validators/topology.go:622 +0xbd7
code.vegaprotocol.io/vega/core/validators.(*Topology).LoadValidatorsOnGenesis(0xc00123afc0, {0x591a900, 0xc0015616e0}, {0xc0013b6000, 0x12f58, 0x14000})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/validators/topology.go:578 +0x2dd
code.vegaprotocol.io/vega/core/genesis.(*Handler).OnGenesis(0xc001732d00, {0x591a900, 0xc0015616e0}, {0x79a3ee0?, 0x4858aa0?, 0x0?}, {0xc0013b6000, 0x12f58, 0x14000})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/genesis/handler.go:64 +0x3bd
code.vegaprotocol.io/vega/core/processor.(*App).OnInitChain(0xc001238f00, {{0x15842892, 0xedc83ac86, 0x0}, {0xc00173df20, 0xb}, 0xc000d30660, {0xc000d45f00, 0x5, 0x5}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/processor/abci.go:715 +0x4ea
code.vegaprotocol.io/vega/core/blockchain/abci.(*App).InitChain(0xc001704210, {{0x15842892, 0xedc83ac86, 0x0}, {0xc00173df20, 0xb}, 0xc000d30660, {0xc000d45f00, 0x5, 0x5}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/blockchain/abci/abci.go:39 +0xbb
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*appW).InitChain(0x18?, {{0x15842892, 0xedc83ac86, 0x0}, {0xc00173df20, 0xb}, 0xc000d30660, {0xc000d45f00, 0x5, 0x5}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/app_wrapper.go:60 +0x7d
github.com/tendermint/tendermint/abci/client.(*localClient).InitChainSync(0xc0014120c0, {{0x15842892, 0xedc83ac86, 0x0}, {0xc00173df20, 0xb}, 0xc000d30660, {0xc000d45f00, 0x5, 0x5}, ...})
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/abci/client/local_client.go:272 +0x118
github.com/tendermint/tendermint/proxy.(*appConnConsensus).InitChainSync(0xc00033ac40?, {{0x15842892, 0xedc83ac86, 0x0}, {0xc00173df20, 0xb}, 0xc000d30660, {0xc000d45f00, 0x5, 0x5}, ...})
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/proxy/app_conn.go:77 +0x55
github.com/tendermint/tendermint/consensus.(*Handshaker).ReplayBlocks(_, {{{0xb, 0x1}, {0x508dbdb, 0x7}}, {0xc00173df20, 0xb}, 0x1, 0x0, {{0x0, ...}, ...}, ...}, ...)
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/consensus/replay.go:319 +0xf18
github.com/tendermint/tendermint/consensus.(*Handshaker).Handshake(0xc00159ed80, {0x5944890, 0xc00134e4e0})
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/consensus/replay.go:268 +0x3d4
github.com/tendermint/tendermint/node.doHandshake({_, _}, {{{0xb, 0x0}, {0x508dbdb, 0x7}}, {0xc00173df20, 0xb}, 0x1, 0x0, ...}, ...)
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/node/node.go:329 +0x1b8
github.com/tendermint/tendermint/node.NewNode(0xc001786500, {0x5913050, 0xc0000ed7c0}, 0xc000dc0170, {0x58f6900, 0xc00029d680}, 0x304c5a5261594769?, 0x436b4f674475612f?, 0xc000dc01b0, {0x5918910, ...}, ...)
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/node/node.go:779 +0x597
code.vegaprotocol.io/vega/core/blockchain/abci.NewTmNode({{0x0}, 0x1, 0x1, 0x0, 0x0, {0xc0016a9d60, 0xa}, {{0x0}, {0xc0002a0c90, 0x15}}, ...}, ...)
    /jenkins/workspace/common/system-tests-wrapper/vega/core/blockchain/abci/tm_node.go:83 +0x725
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).startABCI(0xc001712480, 0x0?, {0x593e2e0, 0xc000e7c020}, {0x7ffd7cd9cfa2, 0x53}, {0x0?, 0x0?}, {0x0, 0x0})
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:426 +0x1b6
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).startBlockchain(0xc001712480, 0x0?, {0x7ffd7cd9cfa2?, 0x0?}, {0x0, 0x0}, {0x0, 0x0})
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:313 +0x3ab
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).Run.func1()
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:157 +0x99
created by code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).Run
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:148 +0x74f
panic: invalid node wallet configurations, the genesis validator mapping differ to the wallets imported by the nodewallet [recovered]
    panic: invalid node wallet configurations, the genesis validator mapping differ to the wallets imported by the nodewallet

goroutine 93 [running]:
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).Run.func1.1()
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:154 +0x59
panic({0x4859960, 0xc00153d020})
    /usr/local/go/src/runtime/panic.go:884 +0x213
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x8?, 0x8?, {0x0?, 0x0?, 0xc000f0b620?})
    /jenkins/GOPATH/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc001538000, {0xc0013b6000, 0x8, 0x8})
    /jenkins/GOPATH/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*Logger).Panic(0x62ea65c?, {0x52d2c70?, 0x210?}, {0xc0013b6000, 0x8, 0x8})
    /jenkins/GOPATH/pkg/mod/go.uber.org/zap@v1.24.0/logger.go:258 +0x59
code.vegaprotocol.io/vega/core/validators.(*Topology).checkValidatorDataWithSelfWallets(0xc00172aa80, {{0xc00007cbc0, 0x40}, {0xc00007cc40, 0x40}, 0x1, {0xc001624000, 0x2a}, {0xc001624030, 0x2c}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/validators/topology.go:622 +0xbd7
code.vegaprotocol.io/vega/core/validators.(*Topology).LoadValidatorsOnGenesis(0xc00172aa80, {0x591a900, 0xc000028150}, {0xc0027dc000, 0x11ff6, 0x12000})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/validators/topology.go:578 +0x2dd
code.vegaprotocol.io/vega/core/genesis.(*Handler).OnGenesis(0xc001646a40, {0x591a900, 0xc000028150}, {0x79a3ee0?, 0x4858aa0?, 0x0?}, {0xc0027dc000, 0x11ff6, 0x12000})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/genesis/handler.go:64 +0x3bd
code.vegaprotocol.io/vega/core/processor.(*App).OnInitChain(0xc000027800, {{0x15842892, 0xedc83ac86, 0x0}, {0xc0015c48a0, 0xb}, 0xc000d02660, {0xc001332d80, 0x5, 0x5}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/processor/abci.go:715 +0x4ea
code.vegaprotocol.io/vega/core/blockchain/abci.(*App).InitChain(0xc000f4ca50, {{0x15842892, 0xedc83ac86, 0x0}, {0xc0015c48a0, 0xb}, 0xc000d02660, {0xc001332d80, 0x5, 0x5}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/core/blockchain/abci/abci.go:39 +0xbb
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*appW).InitChain(0x18?, {{0x15842892, 0xedc83ac86, 0x0}, {0xc0015c48a0, 0xb}, 0xc000d02660, {0xc001332d80, 0x5, 0x5}, ...})
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/app_wrapper.go:60 +0x7d
github.com/tendermint/tendermint/abci/client.(*localClient).InitChainSync(0xc0016fd860, {{0x15842892, 0xedc83ac86, 0x0}, {0xc0015c48a0, 0xb}, 0xc000d02660, {0xc001332d80, 0x5, 0x5}, ...})
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/abci/client/local_client.go:272 +0x118
github.com/tendermint/tendermint/proxy.(*appConnConsensus).InitChainSync(0xc000324cc0?, {{0x15842892, 0xedc83ac86, 0x0}, {0xc0015c48a0, 0xb}, 0xc000d02660, {0xc001332d80, 0x5, 0x5}, ...})
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/proxy/app_conn.go:77 +0x55
github.com/tendermint/tendermint/consensus.(*Handshaker).ReplayBlocks(_, {{{0xb, 0x1}, {0x508dbdb, 0x7}}, {0xc0015c48a0, 0xb}, 0x1, 0x0, {{0x0, ...}, ...}, ...}, ...)
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/consensus/replay.go:319 +0xf18
github.com/tendermint/tendermint/consensus.(*Handshaker).Handshake(0xc0005f0d80, {0x5944890, 0xc0016e71e0})
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/consensus/replay.go:268 +0x3d4
github.com/tendermint/tendermint/node.doHandshake({_, _}, {{{0xb, 0x0}, {0x508dbdb, 0x7}}, {0xc0015c48a0, 0xb}, 0x1, 0x0, ...}, ...)
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/node/node.go:329 +0x1b8
github.com/tendermint/tendermint/node.NewNode(0xc00032cf00, {0x5913050, 0xc00032b4a0}, 0xc000ea1870, {0x58f6900, 0xc000eb0060}, 0x0?, 0x0?, 0xc000ea1910, {0x5918910, ...}, ...)
    /jenkins/GOPATH/pkg/mod/github.com/vegaprotocol/cometbft@v0.34.28-0.20230322133204-3d8588de736e/node/node.go:779 +0x597
code.vegaprotocol.io/vega/core/blockchain/abci.NewTmNode({{0x0}, 0x1, 0x1, 0x0, 0x0, {0xc0015c5b00, 0xa}, {{0x0}, {0xc000079710, 0x15}}, ...}, ...)
    /jenkins/workspace/common/system-tests-wrapper/vega/core/blockchain/abci/tm_node.go:83 +0x725
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).startABCI(0xc001541680, 0x0?, {0x593e2e0, 0xc000f0a360}, {0x7ffd658fafa2, 0x53}, {0x0?, 0x0?}, {0x0, 0x0})
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:426 +0x1b6
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).startBlockchain(0xc001541680, 0x0?, {0x7ffd658fafa2?, 0x0?}, {0x0, 0x0}, {0x0, 0x0})
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:313 +0x3ab
code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).Run.func1()
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:157 +0x99
created by code.vegaprotocol.io/vega/cmd/vega/commands/node.(*Command).Run
    /jenkins/workspace/common/system-tests-wrapper/vega/cmd/vega/commands/node/node.go:148 +0x74f
Error: maximum number of possible restarts has been reached: failed after waiting for binary /jenkins/workspace/common/system-tests-wrapper/networkdata/testnet/visor/visor1/current/vega [node --home /jenkins/workspace/common/system-tests-wrapper/networkdata/testnet/vega/node1 --tendermint-home /jenkins/workspace/common/system-tests-wrapper/networkdata/testnet/tendermint/node1 --nodewallet-passphrase-file /jenkins/workspace/common/system-tests-wrapper/networkdata/testnet/vega/node1/node-vega-wallet-pass.txt]: exit status 2
Usage:
  vegavisor run [flags]

Flags:
  -h, --help          help for run
      --home string   Path to visor home folder

Jake 1 hour ago https://jenkins.vega.rocks/job/common/job/system-tests-wrapper/2217/artifact/testnet/l[…]dator/visor-0-with-vega.stdout-2023-09-01T10%3A27%3A33Z.log

Jake 1 hour ago The last network never started for... some reason

Jake 1 hour ago Can't explain the other failures (yet), but that explains why the soak test never got past block 0 :smile:

Jake 1 hour ago https://jenkins.vega.rocks/job/common/job/system-tests-wrapper/2217/artifact/testnet/l[…]dator/visor-1-with-vega.stderr-2023-09-01T10%3A27%3A33Z.log

Jake 1 hour ago BANG

Jake 1 hour ago Panic

Jake 43 minutes ago Hmm, it's not that simple though... the tests were already in a pretty horrible state by that point... that panic seems to be a symptom of everything else that happened, maybe. And still doesn't explain why all the other nodes sit there on block 0?

Jake 42 minutes ago Is this a thing: 2023-09-01T10:28:48.119Z ERROR core.cfgwatcher config/watcher.go:205 unable to load configuration {"error": "couldn't read configuration file at /jenkins/workspace/common/system-tests-wrapper/networkdata/testnet/vega/node0/config/node/config.toml: empty file"}

William Westgarth 42 minutes ago weird, looks like files going missing or something

Jake 36 minutes ago OK, I think one of the tests bails because the restart doesn't happen fast enough: [2023-09-01T10:23:52.209Z] 2023-09-01 10:23:51.891:[INFO] [MainProcess] [MainThread] [ network.py]:[wait_for_data_node_ready]:[ 215] Starting calls to data node to see if ready. [2023-09-01T10:23:52.506Z] 2023-09-01 10:23:52.237:[INFO] [MainProcess] [MainThread] [ conftest.py]:[pytest_runtest_makereport]:[ 66] Marking test as failed for use in tearDown ... [2023-09-01T10:23:58.149Z] 2023-09-01 10:23:57.535:[INFO] [MainProcess] [MainThread] [ conftest.py]:[check_for_network_errors]:[ 139] The following nodes do not appear to be producing new blocks: ['testnet-nodeset-validators-0-validator', 'testnet-nodeset-validators-1-validator', 'testnet-nodeset-validators-2-validator', 'testnet-nodeset-validators-clef-3-validator', 'testnet-nodeset-validators-clef-4-validator', 'testnet-nodeset-full-5-full'] (I'd have to double-check how trustworthy that check is, since prior to that it claims data node was alive): [2023-09-01T10:23:52.506Z] 2023-09-01 10:23:52.308:[INFO] [MainProcess] [MainThread] [ network.py]:[ node_up]:[ 255] testnet-nodeset-full-5-full Block height > 0: 31 ... (I was exploring the idea that maybe the checkpoint restart didn't complete and then it got stop again by the next test too soon? But I'd say that doesn't match up

Jake 35 minutes ago Although block 31 - 32 is a 10 second block, which might have tricked the check...

Jake 31 minutes ago So that's not a problem... the test a bit later on also just gets a 10 second block... moving on

Jake 24 minutes ago Next test to fail is this one: test_checkpoint_after_key_rotations key rotation doesn't give the right event... The next test after that fails with what could be a slow block, but the vega logs look kind of OK, but I note the times quoted for the events are weird: [2023-09-01T10:25:47.147Z] 64: 2023-09-01T10:25:15.881158567Z (6.275924921035767 secs) [2023-09-01T10:25:47.147Z] 65: 2023-09-01T10:25:22.157083659Z (3.1232659816741943 secs) [2023-09-01T10:25:47.147Z] 77: 2023-09-01T10:25:37.746773394Z (2.7761120796203613 secs) That's claiming a really short time, but I see a 15 second gap there?!

Jake 23 minutes ago Then it's just more of the same, until eventually something does a checkpoint restart and the network doesn't come back up. That's the best I can fathom.

Observed behaviour

Panic

Expected behaviour

No Panics

Steps to reproduce

Unknown

Software version

77e769f768e73c78df33c79df6f11332593aed53

Failing test

No response

Jenkins run

No response

Configuration used

No response

Relevant log output

No response

wwestgarth commented 10 months ago

The events that lead the the panic are:

If the rotation event had been found the system tests would call lnlMod.sync_genesis_with_network() and all would be fine, but there is no recovery in any teardown that does this, so we wander into the next test with a bad network state.

So the panic is not the issue here. The unknown is why the key-rotation command didn't make it to core and this is difficult to know because 1) all the key rotation logs are set to debug 2) the network state is removed by the subsequent tests so I can't replay locally.

My suspicion is that its a timing issue? but I've got no evidence, so I'll just run the test in a loop a load to see if I can make it happen again.

gordsport commented 10 months ago

Closing as this cant be reproduced