Open alex-kuzmin-hg opened 2 months ago
@alex-kuzmin-hg Adding more comments from the previous triage:
replayPces is another example of the inadequacy of the current health-monitor-based solution for back pressure. The high count value comes from a busy loop that burns CPU due to rateLimiter until the system becomes unhealthy. The limit is statically configured for all nodes and can't be made efficient without risking that the system blows up. Once the system becomes unhealthy, the 100ms delays between checks turn out to be another source of inefficiency.
A proper adaptive back pressure mechanism is needed.
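To make the idea concrete, here is a minimal sketch of one possible adaptive scheme: additive increase / multiplicative decrease (AIMD) driven by downstream queue depth. This is not the platform's code and not necessarily the mechanism proposed here. The class name `AdaptiveRateLimit`, its parameters, and the queue-depth signal are all hypothetical and chosen only for illustration.

```java
/**
 * Hypothetical sketch: adapts a replay rate limit from an observed
 * downstream queue depth instead of using one static limit for all nodes.
 * Uses AIMD: grow the rate slowly while healthy, cut it quickly on overload.
 */
final class AdaptiveRateLimit {
    private final double minRate;        // lower bound, events per second
    private final double maxRate;        // upper bound, events per second
    private final double increaseStep;   // additive increase per healthy sample
    private final double decreaseFactor; // multiplicative decrease, in (0, 1)
    private final int highWatermark;     // queue depth treated as overload
    private double currentRate;

    AdaptiveRateLimit(double minRate, double maxRate, double increaseStep,
                      double decreaseFactor, int highWatermark) {
        if (minRate <= 0 || maxRate < minRate) {
            throw new IllegalArgumentException("need 0 < minRate <= maxRate");
        }
        if (decreaseFactor <= 0 || decreaseFactor >= 1) {
            throw new IllegalArgumentException("decreaseFactor must be in (0, 1)");
        }
        this.minRate = minRate;
        this.maxRate = maxRate;
        this.increaseStep = increaseStep;
        this.decreaseFactor = decreaseFactor;
        this.highWatermark = highWatermark;
        this.currentRate = minRate; // start conservatively
    }

    /** Feeds the latest queue depth and returns the new permitted rate. */
    double update(int queueDepth) {
        if (queueDepth > highWatermark) {
            currentRate = Math.max(minRate, currentRate * decreaseFactor);
        } else {
            currentRate = Math.min(maxRate, currentRate + increaseStep);
        }
        return currentRate;
    }

    /** Nanoseconds the producer should wait between events at the current rate. */
    long parkNanosPerEvent() {
        return (long) (1_000_000_000L / currentRate);
    }
}
```

A producer could then call `java.util.concurrent.locks.LockSupport.parkNanos(limit.parkNanosPerEvent())` between events rather than spinning, so the wait shrinks or grows with the observed load instead of relying on a fixed limit and fixed 100ms health checks.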
Description
Performance triage of Tom's experiment engnet2-20240906_175600 (see details below) showed that the following rate measurement logic could be further optimized:
https://perf.analytics.eng.hashgraph.io/permanent/engnet2-20240906_175600/reports/com.swirlds.platform.event.preconsensus.PcesReplayer.html#176
Experiment details:
Grafana:
Steps to reproduce
This should be easy to reproduce once perf/algorithm analysis is available in GHA (to do: on me). A performance code review can deduce the logic from the annotated sources.
Additional context
No response
Hedera network
No response
Version
v0.53
Operating system
Linux