vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0

Loading from snapshot at block height `1` fails #4904

Closed · wwestgarth closed this 2 years ago

wwestgarth commented 2 years ago

Problem encountered

In real life nobody is going to do this, because the whole point of snapshots is to avoid replaying a whole chain from the beginning.

I feel it is worth investigating though, because whatever the issue is here, it might also be possible at other snapshot heights.

I see two things happening when I try:

The first issue is a little concerning, because the code that sends a heartbeat recurses forever:

func (t *Topology) sendHeartbeat(ctx context.Context, hb *commandspb.ValidatorHeartbeat) {
    t.cmd.CommandSync(ctx, txn.ValidatorHeartbeatCommand, hb, func(err error) {
        if err != nil {
            t.log.Error("couldn't send validator heartbeat", logging.Error(err))
            // we do a simple call again for it
            t.sendHeartbeat(ctx, hb)
        }
    })
}

and it is probable that this could happen when loading from any snapshot. I've just not seen it yet because, with the static set of validators I'm testing with, no heartbeats need to be sent.

It also raises the question of why we are even taking a snapshot at block one. With `snapshot.interval.length` set to 5 we create snapshots at block heights 1, 6, 11, 16, when you would expect 5, 10, 15, 20.
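The observed heights would be consistent with an off-by-one in the interval check. Purely as a hypothetical illustration (I haven't traced the actual trigger condition in the code):

package main

import "fmt"

func main() {
	interval := 5 // snapshot.interval.length
	for height := 1; height <= 20; height++ {
		// suspected behaviour: fires at heights 1, 6, 11, 16
		if (height-1)%interval == 0 {
			fmt.Println("snapshot at height", height)
		}
		// the expected check would be height%interval == 0,
		// which fires at 5, 10, 15, 20
	}
}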


jeremyletang commented 2 years ago

Yeah, that's one annoying thing. Basically Tendermint does not indicate whether it's replaying or not, meaning that a node that thinks it's a validator will try to send transactions when replaying blocks. These transactions will be useless as the blocks were processed long ago.

Things would be much easier if the core could be aware of Tendermint replaying blocks, so we could avoid trying to send transactions etc., I suppose.
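Purely as a sketch of the idea (none of these names exist in the codebase, and how core would learn the chain tip is an open question, e.g. a Tendermint RPC status query at startup): if core knew the chain tip when the node started, it could treat every block below that height as a replayed one and suppress outbound transactions for it.

package main

import "fmt"

// replayGuard is a hypothetical helper, not Vega code: tipAtStartup is
// the chain height observed when the node came up, and any block below
// it is assumed to be a replayed block.
type replayGuard struct {
	tipAtStartup int64
}

func (g *replayGuard) replaying(currentHeight int64) bool {
	return currentHeight < g.tipAtStartup
}

func main() {
	g := replayGuard{tipAtStartup: 100}
	fmt.Println(g.replaying(42))  // true: still replaying history
	fmt.Println(g.replaying(101)) // false: past the startup tip, live blocks
}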

wwestgarth commented 2 years ago

> Yeah, that's one annoying thing. Basically Tendermint does not indicate whether it's replaying or not, meaning that a node that thinks it's a validator will try to send transactions when replaying blocks. These transactions will be useless as the blocks were processed long ago.
>
> Things would be much easier if the core could be aware of Tendermint replaying blocks, so we could avoid trying to send transactions etc., I suppose.

Agree. And I've seen similar in the logs before with other transactions from the evt-forwarder that can't be sent in, but it's mostly harmless because it'll be in the next block we're replaying anyway. The heartbeat stuff though just fills my screen with logs! I guess all we can really do right now is have a reasonable limit on how many times we try to send it.

jeremyletang commented 2 years ago

> Yeah, that's one annoying thing. Basically Tendermint does not indicate whether it's replaying or not, meaning that a node that thinks it's a validator will try to send transactions when replaying blocks. These transactions will be useless as the blocks were processed long ago. Things would be much easier if the core could be aware of Tendermint replaying blocks, so we could avoid trying to send transactions etc., I suppose.

> Agree. And I've seen similar in the logs before with other transactions from the evt-forwarder that can't be sent in, but it's mostly harmless because it'll be in the next block we're replaying anyway. The heartbeat stuff though just fills my screen with logs! I guess all we can really do right now is have a reasonable limit on how many times we try to send it.

I think the heartbeat may try to resend without delay or anything... which might not be best.

// sendHeartbeat sends the heartbeat transaction.
func (t *Topology) sendHeartbeat(ctx context.Context, hb *commandspb.ValidatorHeartbeat) {
    t.cmd.CommandSync(ctx, txn.ValidatorHeartbeatCommand, hb, func(err error) {
        if err != nil {
            t.log.Error("couldn't send validator heartbeat", logging.Error(err))
            // we do a simple call again for it
            t.sendHeartbeat(ctx, hb)
        }
    })
}

A small sleep / ticker / exponential backoff would do it, really.
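Along those lines, and also capping the number of attempts as suggested above, here's a minimal sketch of a bounded retry with exponential backoff; `retryWithBackoff` and its parameters are made up for illustration, not Vega code, and the real CommandSync is callback-based so it wouldn't be a drop-in:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls send until it succeeds, the context is cancelled,
// or maxAttempts is reached, doubling the wait between attempts.
func retryWithBackoff(ctx context.Context, maxAttempts int, initial time.Duration, send func(context.Context) error) error {
	delay := initial
	for attempt := 1; ; attempt++ {
		err := send(ctx)
		if err == nil {
			return nil
		}
		if attempt >= maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2 // exponential backoff
		}
	}
}

func main() {
	fails := 3 // simulate a command that fails a few times before succeeding
	err := retryWithBackoff(context.Background(), 5, 10*time.Millisecond, func(context.Context) error {
		if fails > 0 {
			fails--
			return errors.New("transaction rejected")
		}
		return nil
	})
	fmt.Println("result:", err)
}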