vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0
38 stars 22 forks source link

Validator heartbeats: an expired heartbeat invalidates the next one if it eventually comes in #7908

Closed wwestgarth closed 1 year ago

wwestgarth commented 1 year ago

Problem encountered

An edge case cropped up on the overnight system-tests where a validator was missing a heartbeat. That series of events are:

We do not sent expectedNextHash along with the heartbeat transaction so there is no way to know the difference between a late one to ignore because its already been expired, and a real one that just doesn't verify. We could add it and just ignore heartbeats whos hash mismatched expectedNextHash because it will be stale.

In this particular system-test "which gets delayed for whatever reason" is a bit interesting but not worth worrying about. Essentially a node is picked to send a heart beat in block 1. For some reason the block time between block 1 and 2 is +50s (I'm thinking its just TM waiting for all the nodes to start up and reach consensus) so the heartbeat expires. The node is then picked to send a new heartbeat in block 2. Original heartbeat comes in just because it was sitting an waiting in the mempool.

https://jenkins.ops.vega.xyz/job/common/job/system-tests/19570/testReport/junit/tests.validators/rewards_split_test/Call_tests___network_infra_validators_r_z___test_ersatz_reward_split/

Observed behaviour

A clear and concise description of how the system is behaving.

Expected behaviour

A clear and concise description of what you expected to happen.

System response

Describe what the system response was, include the output from the command, automation, or else.

Steps to reproduce

Manual

Steps to reproduce the behaviour manually:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Automation

Link to automation and explanation on how to run it to reproduce the problem/bug

Evidence

Logs

If applicable, add logs and/or screenshots to help explain your problem.

Additional context

Add any other context about the problem here including; system version numbers, components affected.

Definition of Done

ℹ️ Not every issue will need every item checked, however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

After Merging

jgsbennett commented 1 year ago

Just to echo a conversation @wwestgarth and I were having on slack: Why is block one 50s length? Seems like if that didn't happen (does it make sense?) or if we made it impossible to require heartbeats on block 1, this issue might never exist. If there's some way to fix this, it saves us from having to make some pretty awful workarounds in the test code, where either we have to ~ double a wait from 12 epochs to 22 epochs, or attempt to detect that we have an behind node and swap which we're using for the test.