Closed wwestgarth closed 1 year ago
Just to echo a conversation @wwestgarth and I were having on slack: Why is block one 50s length? Seems like if that didn't happen (does it make sense?) or if we made it impossible to require heartbeats on block 1, this issue might never exist. If there's some way to fix this, it saves us from having to make some pretty awful workarounds in the test code, where either we have to ~ double a wait from 12 epochs to 22 epochs, or attempt to detect that we have an behind node and swap which we're using for the test.
Problem encountered
An edge case cropped up on the overnight system-tests where a validator was missing a heartbeat. That series of events are:
v.expectedNextHash == ""
expectedNextHash
is different. We fail the heartbeat and setv.expectedNextHash == ""
againv.expectedNextHash == ""
, so then it also does not verify and is invalid.We do not sent
expectedNextHash
along with the heartbeat transaction so there is no way to know the difference between a late one to ignore because its already been expired, and a real one that just doesn't verify. We could add it and just ignore heartbeats whos hash mismatchedexpectedNextHash
because it will be stale.In this particular system-test "which gets delayed for whatever reason" is a bit interesting but not worth worrying about. Essentially a node is picked to send a heart beat in block 1. For some reason the block time between block 1 and 2 is +50s (I'm thinking its just TM waiting for all the nodes to start up and reach consensus) so the heartbeat expires. The node is then picked to send a new heartbeat in block 2. Original heartbeat comes in just because it was sitting an waiting in the mempool.
https://jenkins.ops.vega.xyz/job/common/job/system-tests/19570/testReport/junit/tests.validators/rewards_split_test/Call_tests___network_infra_validators_r_z___test_ersatz_reward_split/
Observed behaviour
A clear and concise description of how the system is behaving.
Expected behaviour
A clear and concise description of what you expected to happen.
System response
Describe what the system response was, include the output from the command, automation, or else.
Steps to reproduce
Manual
Steps to reproduce the behaviour manually:
Automation
Link to automation and explanation on how to run it to reproduce the problem/bug
Evidence
Logs
If applicable, add logs and/or screenshots to help explain your problem.
Additional context
Add any other context about the problem here including; system version numbers, components affected.
Definition of Done
Before Merging
After Merging
Done
if there is NO requirement for new system-tests