Closed wwestgarth closed 2 years ago
Yea that's one annoying thing. Basically tendermint do not instruct if it's replaying or not. Meaning that a node thinking it's a validator will try to send transaction when replaying the block. These transaction will be useless as the blocks have been process for long already.
Things would be much easier if the core could be aware of tendermint replaying block, so we could avoid trying to send transaction etc I suppose.
Yea that's one annoying thing. Basically tendermint do not instruct if it's replaying or not. Meaning that a node thinking it's a validator will try to send transaction when replaying the block. These transaction will be useless as the blocks have been process for long already.
Things would be much easier if the core could be aware of tendermint replaying block, so we could avoid trying to send transaction etc I suppose.
Agree. And I've seen similar in the logs before with other transactions from the evt-forwarder that can't be sent in, but its mostly harmless because it'll be in the next block we're replaying from anyway. The heartbeat stuff though just fills my screen with logs! I guess all we can really do right now is to have a reasonable limit on how many times to try to send it.
Yea that's one annoying thing. Basically tendermint do not instruct if it's replaying or not. Meaning that a node thinking it's a validator will try to send transaction when replaying the block. These transaction will be useless as the blocks have been process for long already. Things would be much easier if the core could be aware of tendermint replaying block, so we could avoid trying to send transaction etc I suppose.
Agree. And I've seen similar in the logs before with other transactions from the evt-forwarder that can't be sent in, but its mostly harmless because it'll be in the next block we're replaying from anyway. The heartbeat stuff though just fills my screen with logs! I guess all we can really do right now is to have a reasonable limit on how many times to try to send it.
I think the heartbeat my try to resend without delay or anything... which might not be best.
185// sendHeartbeat sends the hearbeat transaction.
186func (t *Topology) sendHeartbeat(ctx context.Context, hb *commandspb.ValidatorHeartbeat) {
187 t.cmd.CommandSync(ctx, txn.ValidatorHeartbeatCommand, hb, func(err error) {
188 if err != nil {
189 t.log.Error("couldn't send validator heartbeat", logging.Error(err))
190 // we do a simple call again for it
191 t.sendHeartbeat(ctx, hb)
192 }
193 })
194}
A small sleep / ticker / exponential backoff would do really
Problem encountered
In real life this isn't going to be something people are going to be doing because the whole point of snapshots is that you can avoid replaying a whole chain from the beginning.
I feel it is worth investigating though because whatever issue is here might also be possible at another snapshot height.
I see two things happening when I try:
The first issue is a little bit of concern because the code the sends a heartbeat recurses forever:
and it is probable that this could happen in any snapshot. I've just not seen it yet because in the static set of validators I'm testing with, no heartbeats need to be sent.
It also opens the question of why are we even taking a snapshot at block one? With
snapshot.interval.length
set to5
we create snapshots at block-heights1, 6, 11, 16
when you would kind of expect5, 10, 15, 20
.Observed behaviour
A clear and concise description of how the system is behaving.
Expected behaviour
A clear and concise description of what you expected to happen.
System response
Describe what the system response was, include the output from the command, automation, or else.
Logs
If applicable, add logs and/or screenshots to help explain your problem.
Additional context
Add any other context about the problem here including; system version numbers, components affected.
Definition of Done
Before Merging
After Merging
Done
if there is NO requirement for new system-tests