Performance: poor performance solution for rate measurement PcesReplayer.replayPces()

alex-kuzmin-hg commented 2 months ago

Description

Performance triage of Tom's experiment engnet2-20240906_175600 (see details below) found that the following rate measurement logic could be optimized:

https://perf.analytics.eng.hashgraph.io/permanent/engnet2-20240906_175600/reports/com.swirlds.platform.event.preconsensus.PcesReplayer.html#176

Experiment details:

The mix2k should contain the entire idle period plus the load period, including at least the first 15 minutes at the peak level.
Clients started to report PLATFORM_TRANSACTION_NOT_CREATED-like errors at roughly the 500 TPS level and above, so we lost about 1/5 to 1/4 of the intended traffic (aiming for 2k total). All smart contract transactions went through, however: gas usage reached 15M and then stayed just below it. Still, it looks like 2K is the limit for us given the elevated secC2C.

Steps to reproduce

This will be easy to reproduce once the perf/algorithm analysis runs in GHA (to do: on me). A performance code review can deduce the logic from the annotated sources.

Additional context

No response

Hedera network

No response

Version

v0.53

Operating system

Linux

alex-kuzmin-hg commented 1 month ago

@alex-kuzmin-hg To add more comments from the previous triage.

OlegMazurov commented 1 month ago

replayPces is another example of the inadequacy of the current health-monitor-based solution for back pressure. The high count value comes from a busy loop that burns CPU due to rateLimiter until the system becomes unhealthy. The limit is statically configured for all nodes and can't be made efficient without risking that the system blows up. Once the system becomes unhealthy, the 100ms delays between checks turn out to be another source of inefficiency. A proper adaptive back pressure mechanism is needed.
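
To make "adaptive" concrete, here is a minimal sketch of one possible shape for such a mechanism: an additive-increase/multiplicative-decrease (AIMD) controller that adjusts the replay rate from health observations instead of relying on a static limit. This is an illustration only, not the platform's code and not a design proposed in this thread. The class name, method names, and all parameters are hypothetical and do not exist in hedera-services.

```java
// Hypothetical sketch: AdaptiveRateSketch and its methods are illustrative
// names only and are not part of hedera-services.
final class AdaptiveRateSketch {
    private final double minRate;        // lowest rate allowed, events per second (must be > 0)
    private final double maxRate;        // highest rate allowed, events per second
    private final double increaseStep;   // added to the rate after each healthy observation
    private final double decreaseFactor; // multiplied into the rate after an unhealthy observation
    private double currentRate;

    AdaptiveRateSketch(double minRate, double maxRate, double increaseStep, double decreaseFactor) {
        if (minRate <= 0 || maxRate < minRate) {
            throw new IllegalArgumentException("require 0 < minRate <= maxRate");
        }
        if (increaseStep <= 0 || decreaseFactor <= 0 || decreaseFactor >= 1) {
            throw new IllegalArgumentException("require increaseStep > 0 and 0 < decreaseFactor < 1");
        }
        this.minRate = minRate;
        this.maxRate = maxRate;
        this.increaseStep = increaseStep;
        this.decreaseFactor = decreaseFactor;
        this.currentRate = minRate; // start conservatively and probe upward
    }

    /** Feeds one health observation into the controller and returns the new rate. */
    double onHealthSample(boolean healthy) {
        if (healthy) {
            // Additive increase: probe for spare capacity in small steps.
            currentRate = Math.min(maxRate, currentRate + increaseStep);
        } else {
            // Multiplicative decrease: back off quickly when the system is unhealthy.
            currentRate = Math.max(minRate, currentRate * decreaseFactor);
        }
        return currentRate;
    }

    /** Nanoseconds between events at the current rate, truncated toward zero. */
    long nanosBetweenEvents() {
        return (long) (1_000_000_000.0 / currentRate);
    }
}
```

A caller could pass the value from `nanosBetweenEvents()` to `java.util.concurrent.locks.LockSupport.parkNanos` between events, so the thread sleeps rather than spins, and the rate follows each node's observed health rather than a single value configured for all nodes. Whether AIMD is a good fit here is an open question; the step and factor would need tuning against real replay workloads.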

hashgraph / hedera-services