vegaprotocol / vega

A Go implementation of the Vega Protocol, a protocol for creating and trading derivatives on a fully decentralised network.
https://vega.xyz
GNU Affero General Public License v3.0

Loading from snapshot at block height `1` fails #4904

Closed · wwestgarth closed this 2 years ago

wwestgarth commented 2 years ago

Problem encountered

In real life nobody is going to do this, because the whole point of snapshots is to avoid replaying a whole chain from the beginning.

I feel it is worth investigating though, because whatever the issue is here, it might also be possible at other snapshot heights.

I see two things happening when I try:

The first issue is a little concerning, because the code that sends a heartbeat recurses forever:

func (t *Topology) sendHeartbeat(ctx context.Context, hb *commandspb.ValidatorHeartbeat) {
    t.cmd.CommandSync(ctx, txn.ValidatorHeartbeatCommand, hb, func(err error) {
        if err != nil {
            t.log.Error("couldn't send validator heartbeat", logging.Error(err))
            // we do a simple call again for it
            t.sendHeartbeat(ctx, hb)
        }
    })
}

and it is probable that this could happen when loading from any snapshot. I've just not seen it yet because, with the static set of validators I'm testing with, no heartbeats need to be sent.

It also raises the question of why we are even taking a snapshot at block one. With `snapshot.interval.length` set to 5 we create snapshots at block heights 1, 6, 11, 16, when you would expect 5, 10, 15, 20.
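The observed heights would be consistent with an off-by-one in the interval check. Purely as a hypothetical illustration (I haven't traced the actual trigger condition in the code):

package main

import "fmt"

func main() {
	interval := 5 // snapshot.interval.length
	for height := 1; height <= 20; height++ {
		// suspected behaviour: fires at heights 1, 6, 11, 16
		if (height-1)%interval == 0 {
			fmt.Println("snapshot at height", height)
		}
		// the expected check would be height%interval == 0,
		// which fires at 5, 10, 15, 20
	}
}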


jeremyletang commented 2 years ago

Yeah, that's one annoying thing. Basically Tendermint does not indicate whether it's replaying or not, meaning that a node that thinks it's a validator will try to send transactions when replaying blocks. These transactions will be useless as the blocks were processed long ago.

Things would be much easier if the core could be aware of Tendermint replaying blocks, so we could avoid trying to send transactions etc., I suppose.
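Purely as a sketch of the idea (none of these names exist in the codebase, and how core would learn the chain tip is an open question, e.g. a Tendermint RPC status query at startup): if core knew the chain tip when the node started, it could treat every block below that height as a replayed one and suppress outbound transactions for it.

package main

import "fmt"

// replayGuard is a hypothetical helper, not Vega code: tipAtStartup is
// the chain height observed when the node came up, and any block below
// it is assumed to be a replayed block.
type replayGuard struct {
	tipAtStartup int64
}

func (g *replayGuard) replaying(currentHeight int64) bool {
	return currentHeight < g.tipAtStartup
}

func main() {
	g := replayGuard{tipAtStartup: 100}
	fmt.Println(g.replaying(42))  // true: still replaying history
	fmt.Println(g.replaying(101)) // false: past the startup tip, live blocks
}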

wwestgarth commented 2 years ago

> Yeah, that's one annoying thing. Basically Tendermint does not indicate whether it's replaying or not, meaning that a node that thinks it's a validator will try to send transactions when replaying blocks. These transactions will be useless as the blocks were processed long ago.
>
> Things would be much easier if the core could be aware of Tendermint replaying blocks, so we could avoid trying to send transactions etc., I suppose.

Agree. And I've seen similar in the logs before with other transactions from the evt-forwarder that can't be sent in, but it's mostly harmless because it'll be in the next block we're replaying anyway. The heartbeat stuff though just fills my screen with logs! I guess all we can really do right now is have a reasonable limit on how many times we try to send it.

jeremyletang commented 2 years ago

> Yeah, that's one annoying thing. Basically Tendermint does not indicate whether it's replaying or not, meaning that a node that thinks it's a validator will try to send transactions when replaying blocks. These transactions will be useless as the blocks were processed long ago. Things would be much easier if the core could be aware of Tendermint replaying blocks, so we could avoid trying to send transactions etc., I suppose.

> Agree. And I've seen similar in the logs before with other transactions from the evt-forwarder that can't be sent in, but it's mostly harmless because it'll be in the next block we're replaying anyway. The heartbeat stuff though just fills my screen with logs! I guess all we can really do right now is have a reasonable limit on how many times we try to send it.

I think the heartbeat may try to resend without delay or anything... which might not be best.

// sendHeartbeat sends the heartbeat transaction.
func (t *Topology) sendHeartbeat(ctx context.Context, hb *commandspb.ValidatorHeartbeat) {
    t.cmd.CommandSync(ctx, txn.ValidatorHeartbeatCommand, hb, func(err error) {
        if err != nil {
            t.log.Error("couldn't send validator heartbeat", logging.Error(err))
            // we do a simple call again for it
            t.sendHeartbeat(ctx, hb)
        }
    })
}

A small sleep / ticker / exponential backoff would do it, really.
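Along those lines, and also capping the number of attempts as suggested above, here's a minimal sketch of a bounded retry with exponential backoff; `retryWithBackoff` and its parameters are made up for illustration, not Vega code, and the real CommandSync is callback-based so it wouldn't be a drop-in:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls send until it succeeds, the context is cancelled,
// or maxAttempts is reached, doubling the wait between attempts.
func retryWithBackoff(ctx context.Context, maxAttempts int, initial time.Duration, send func(context.Context) error) error {
	delay := initial
	for attempt := 1; ; attempt++ {
		err := send(ctx)
		if err == nil {
			return nil
		}
		if attempt >= maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2 // exponential backoff
		}
	}
}

func main() {
	fails := 3 // simulate a command that fails a few times before succeeding
	err := retryWithBackoff(context.Background(), 5, 10*time.Millisecond, func(context.Context) error {
		if fails > 0 {
			fails--
			return errors.New("transaction rejected")
		}
		return nil
	})
	fmt.Println("result:", err)
}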