Stale events silently drop user transactions

OlegMazurov commented 3 weeks ago

Description

With additional logging added to DefaultStaleEventDetector.addConsensusRound() to report stale events, I observe:

2024-09-05 18:37:17.135 33328    WARN  INVALID_EVENT_ERROR <platformForkJoinThread-2> DefaultStaleEventDetector: addConsensusRound gen: 110945  created: 2024-09-05T18:37:08.469161362Z  #txn: 109  threshold: 110953

This means that the event went stale about 9 seconds after it was created. The event contained 109 transactions, mostly user transactions, and all of them were silently dropped (only system transactions are resubmitted). However, the transaction records remained cached, so TransactionReceiptQueries would keep returning OK until transaction expiration, for another 171 seconds. Only then does the client get RECEIPT_NOT_FOUND. It then has to check the status of the transaction with the mirror node, only to find out that the transaction was never executed and has to be resubmitted by the client. All of this makes for a poor user experience. It also affects performance testing, since pending transactions decrease throughput.
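To make the timing concrete, here is a minimal Java sketch of the arithmetic behind the numbers above. It is only an illustration: the class and method names are hypothetical, the 180-second window is an assumption inferred from 9 s + 171 s, and the log timestamp is assumed to be in UTC.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of hedera-services.
final class StaleEventTimeline {
    // Assumption: receipts stay queryable for 180 s in total (9 s + 171 s above).
    static final Duration ASSUMED_RECEIPT_WINDOW = Duration.ofSeconds(180);

    // Time between event creation and the moment it was reported stale.
    static Duration staleDelay(Instant created, Instant detected) {
        return Duration.between(created, detected);
    }

    // Time during which receipt queries still answer although the transactions are gone.
    static Duration misleadingWindow(Duration staleDelay) {
        return ASSUMED_RECEIPT_WINDOW.minus(staleDelay);
    }

    public static void main(String[] args) {
        // "created" field copied from the log line.
        Instant created = Instant.parse("2024-09-05T18:37:08.469161362Z");
        // Log line timestamp, assumed to be UTC.
        DateTimeFormatter logFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
        Instant detected = LocalDateTime.parse("2024-09-05 18:37:17.135", logFormat)
                .toInstant(ZoneOffset.UTC);

        Duration delay = staleDelay(created, detected);
        System.out.println("stale after:       " + delay);
        System.out.println("misleading window: " + misleadingWindow(delay));
    }
}
```

With the values from the log line the delay comes out at roughly 8.7 seconds, which is the "9 seconds" quoted above after rounding.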

Steps to reproduce

Stale events and their effect were observed in a performance network (engnet1) when running the NftTransferLoadTest benchmark at ~10K TPS.

Additional context

No response

Hedera network

other

Version

v0.54.0-SNAPSHOT

Operating system

Linux

lpetrovic05 commented 3 weeks ago

Stale events cannot be avoided. What we can do is provide these events to the app, which can then choose to resubmit their transactions.
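A rough Java sketch of what that hand-off could look like. Everything here is hypothetical (`Txn`, `StaleEvent`, `Detector`, `selectForResubmission` are illustrative names, not the platform API), and it assumes an event counts as stale when its generation is below the threshold and it has not reached consensus, which is how the log line in the description reads.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the actual platform or app interfaces.
final class StaleEventResubmission {
    // Minimal stand-ins for a transaction and a stale event.
    record Txn(String id, boolean system, Instant expiry) {}
    record StaleEvent(long generation, List<Txn> transactions) {}

    // Platform side: instead of dropping a stale event, pass it to a callback.
    static final class Detector {
        private final Consumer<StaleEvent> onStale;

        Detector(Consumer<StaleEvent> onStale) {
            this.onStale = onStale;
        }

        // Assumed staleness rule: below the threshold and never reached consensus.
        void check(StaleEvent event, long threshold, boolean reachedConsensus) {
            if (!reachedConsensus && event.generation() < threshold) {
                onStale.accept(event);
            }
        }
    }

    // App side: pick the user transactions that are still worth resubmitting.
    // System transactions are skipped because they are already resubmitted today.
    static List<Txn> selectForResubmission(StaleEvent event, Instant now) {
        List<Txn> result = new ArrayList<>();
        for (Txn txn : event.transactions()) {
            if (!txn.system() && txn.expiry().isAfter(now)) {
                result.add(txn);
            }
        }
        return result;
    }
}
```

The app stays in control of the policy here: it could resubmit, or instead update the cached record so that receipt queries stop returning OK for a transaction that will never execute.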

rbair23 commented 3 weeks ago

I agree, this is something we should fix. It is covered in the design for consensus nodes proposal. I don't know the timeline for the fix, but we definitely should do this.

poulok commented 2 weeks ago

@rbair23, I have marked this ticket as high priority in the Platform Backlog project.

hashgraph / hedera-services