vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0
37 stars 22 forks source link

Core hangs during chain replay (possibly due to TM upgrade) #5606

Closed wwestgarth closed 2 years ago

wwestgarth commented 2 years ago

Problem encountered

The network-infra test test_validator_performance_score_with_removed_validator timeouts because a restarted node hangs while replaying the chain. This test has been working fine up until a couple of days ago, the only significant change that has happened since then is the tendermint upgrade/internalisation.

Summary of what the test does (because its not much):

when the node starts up again it replays the chain from 0, but eventually stops catching up blocks. In the below logs we can see that the node should catch up to block-height ~2500 but freezes at 1678.

2022-07-22 09:31:06.694:[INFO] MainProcess network.py:wait_until_ready:353 ----- waiting for testnet-nodeset-validators-3-validator to replay chain, max timeout 2413 seconds
2022-07-22 09:31:07.695:[INFO] MainProcess network.py:core_connection:240 ----- connecting to testnet-nodeset-validators-3-validator
2022-07-22 09:31:07.701:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1502
2022-07-22 09:31:08.703:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1598
2022-07-22 09:31:09.705:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:10.706:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:11.708:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:12.709:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:13.711:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:14.712:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:15.713:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:16.715:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:17.717:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:18.719:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678
2022-07-22 09:31:19.720:[INFO] MainProcess network.py:wait_until_ready:357 ----- chain status 2 block-height 1678

in the core logs there is not much to go on. In core we leave the call to app.OnEndBlock() where we then get a single log from tendermint saying the block was executed. After that we just have aync calls coming in to checkTx, and calls into the statistics endpoint. It looks like we're stuck in tendermint somewhere, because the call to OnCommit never comes.

Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   rewards.rewards rewards/engine.go:198   OnEpochEvent
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   oracles oracles/engine.go:94    no subscriber matches the oracle data   {"pub-keys": [], "data": "vegaprotocol.builtin.timestamp:1658481827"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   execution   execution/engine.go:760 updating engine on new time update
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   processor   processor/abci.go:633   ABCI service BEGIN completed    {"current-timestamp": 1658481827517468365, "previous-timestamp": 1658481826993466685, "current-datetime": "2022-07-22T09:23:47.517468365Z", "previous-datetime": "2022-07-22T09:23:46.993466685Z", "height": 1678}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   validators  validators/topology_key_rotate.go:184   Trying to apply pending key rotations   {"currentBlockHeight": 1678}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   processor   processor/abci.go:606   leaving begin block {"at": "2022-07-22T09:31:09.459Z"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   processor   processor/abci.go:573   entering end block  {"at": "2022-07-22T09:31:09.459Z"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   processor   processor/abci.go:576   ABCI service END block completed    {"current-timestamp": 1658481827517468365, "previous-timestamp": 1658481826993466685, "current-datetime": "2022-07-22T09:23:47.517468365Z", "previous-datetime": "2022-07-22T09:23:46.993466685Z"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   spam    spam/engine.go:184  Spam protection EndOfBlock called   {"blockHeight": 1678}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   processor   processor/abci.go:574   leaving end block   {"at": "2022-07-22T09:31:09.459Z"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   INFO    tendermint  state/execution.go:371  executed block  {"height": 1678, "num_valid_txs": 0, "num_invalid_txs": 0}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   processor   processor/abci.go:750   entering checkTx    {"tid": "", "command": "Node Signature"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   validators  validators/topology.go:379  requested existing validator    {"validators": ["b280f1f942608a0c9eacd7c47852d1c6ad8b1282b8c6dde19e0964ca21b7bfa1", "ebdb3e68932f8e01890a13371d7faf5a7789fd06bf1f5eca6212313ff10036bb", "d56d8f71c02583ca48e763e14a59eaefe8723b9d87d0d66e95ed2ddb2e5575f8", "7de9edb06c7dddcd2f90a003eace11db2fa5667570821b68002b7dcb9add314a", "39b72fe0b207e02257bbebf9c2ba7f0ba3854bfa0373ac84e6b0b9814c110f18"], "pubkey": "b280f1f942608a0c9eacd7c47852d1c6ad8b1282b8c6dde19e0964ca21b7bfa1"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   processor   processor/abci.go:764   transaction passed checkTx  {"tid": "", "command": "Node Signature"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   DEBUG   validators  validators/topology.go:379  requested existing validator    {"validators": ["ebdb3e68932f8e01890a13371d7faf5a7789fd06bf1f5eca6212313ff10036bb", "d56d8f71c02583ca48e763e14a59eaefe8723b9d87d0d66e95ed2ddb2e5575f8", "7de9edb06c7dddcd2f90a003eace11db2fa5667570821b68002b7dcb9add314a", "39b72fe0b207e02257bbebf9c2ba7f0ba3854bfa0373ac84e6b0b9814c110f18", "b280f1f942608a0c9eacd7c47852d1c6ad8b1282b8c6dde19e0964ca21b7bfa1"], "pubkey": "b280f1f942608a0c9eacd7c47852d1c6ad8b1282b8c6dde19e0964ca21b7bfa1"}
Job: testnet-nodeset-validators-3-validator, Task: vega-validator-3: 2022-07-22T09:31:09.459Z   ERROR   notary.notary   notary/notary.go:250    could not send the transaction to tendermint

Failing pipelines where I have seen this happen while doing full runs on branchs: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5479/pipeline/269/ https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5502/pipeline/ https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5503/pipeline/269

Failing pipeline with no changes running the test in isolation: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5512/pipeline

This is an intermittent problem as something the test will pass and replay works.

Evidence

Logs

If applicable, add logs and/or screenshots to help explain your problem.

Additional context

Add any other context about the problem here including; system version numbers, components affected.

Definition of Done

ℹ️ Not every issue will need every item checked, however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

After Merging

wwestgarth commented 2 years ago

full logs of a failing case: failing-node.log

gordsport commented 2 years ago

The bump of version on tendermint seems to have resolved this - leaving open this sprint to keep an eye and if we do not see this issue again we can close

jeremyletang commented 2 years ago

@ze97286 @wwestgarth what do we think about closing this one for now?

ze97286 commented 2 years ago

closing this as it seems to be resolved, will reopen if we come across it again.