Closed: wwestgarth closed this 2 years ago
full logs of a failing case: failing-node.log
The tendermint version bump seems to have resolved this. Leaving it open this sprint to keep an eye on it; if we do not see the issue again, we can close it.
@ze97286 @wwestgarth what do we think about closing this one for now?
Closing this as it seems to be resolved; will reopen if we come across it again.
Problem encountered
The network-infra test
test_validator_performance_score_with_removed_validator
times out because a restarted node hangs while replaying the chain. This test was working fine until a couple of days ago; the only significant change since then is the tendermint upgrade/internalisation.

Summary of what the test does (because it's not much):
When the node starts up again it replays the chain from block 0, but eventually stops catching up. In the logs below we can see that the node should catch up to block height ~2500 but freezes at 1678.
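To make "stops catching up" concrete, here is a minimal sketch of how a stalled replay could be detected from periodic block-height samples. It is not taken from the system-tests code: the function name `find_stall`, the sample format, and the 30-second threshold are all assumptions for illustration, and the sample values only mirror the heights mentioned above.

```python
from typing import Optional, Sequence, Tuple


def find_stall(
    samples: Sequence[Tuple[float, int]],
    target_height: int,
    stall_seconds: float,
) -> Optional[int]:
    """Return the height at which replay stalled, or None if it did not stall.

    samples: (timestamp_seconds, block_height) pairs in chronological order
    (hypothetical format; e.g. collected by polling a node's height).
    A stall is reported when the height stays unchanged for at least
    stall_seconds while it is still below target_height.
    """
    last_height = None
    last_change = None
    for ts, height in samples:
        if height != last_height:
            # Height moved: remember when we last saw progress.
            last_height = height
            last_change = ts
        elif height < target_height and ts - last_change >= stall_seconds:
            # No progress for too long and we have not caught up yet.
            return height
    return None


# Illustrative samples only: the node reaches 1678 and never moves again.
samples = [(0, 900), (10, 1400), (20, 1678), (30, 1678), (40, 1678), (50, 1678)]
stalled_at = find_stall(samples, target_height=2500, stall_seconds=30)

# A healthy replay that reaches the target is not reported as stalled.
healthy = find_stall([(0, 900), (10, 1700), (20, 2500), (60, 2500)], 2500, 30)
```

A check like this would let the test fail fast with the stalled height in the error message instead of running until the overall timeout.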
There is not much to go on in the core logs. In core we leave the call to
app.OnEndBlock()
after which we get a single log from tendermint saying the block was executed. After that we only see async calls coming in to checkTx and calls to the statistics endpoint. It looks like we're stuck somewhere in tendermint, because the call to OnCommit never comes.

Failing pipelines where I have seen this happen while doing full runs on branches:
https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5479/pipeline/269/
https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5502/pipeline/
https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5503/pipeline/269
Failing pipeline with no changes running the test in isolation: https://jenkins.ops.vega.xyz/blue/organizations/jenkins/common%2Fsystem-tests/detail/system-tests/5512/pipeline
This is an intermittent problem: sometimes the test passes and the replay works.