Open OlegMazurov opened 1 month ago
I used release/0.52 @ 010a5d3a in the example.
I don't know how much we can read into the results from a single node network. The most important function of the health monitor is to regulate gossip, and second is the regulation of event creation.
Since a single node network doesn't gossip, this test doesn't represent real-world behavior well. Consequently, the results look quite different for multi-node networks.
@OlegMazurov did you modify the health monitor configuration, or are you running with the same configuration as in `develop`? The configuration in `develop` was tuned for 32-node networks, and I can promise you it's not going to work well in a single-node environment. I suggest partnering with somebody who understands platform code if you aren't able to figure out a good configuration (@litt3 is the current resident expert).
One interpretation of this issue is exactly that: the health monitor has to be configured differently for various environments (single-node laptop, single-node large Linux box, 7-node network, 30-node network, etc.). It's going to be untenable and always sub-optimal in dynamic heterogeneous networks.
This is less a property of the health monitor, and more an emergent property of the hashgraph consensus algorithm and distributed systems. Gossip, event creation, and consensus function drastically differently depending on network size and configuration, and as a result provide workloads that are very different in the single node case vs. the many node case.
At the end of the day, a single node network is just not a very realistic way to test many important aspects of our system. Solutions that might work well on a single node often don't work with many nodes, and solutions that work well on many-node systems (like the health monitor) may not be optimal choices for single node setups. It's hard to deny that the health monitor change resulted in major performance improvements in larger networks, whereas many attempted solutions attacking the problem from a "single node in isolation" perspective were unable to make headway.
I agree there are certainly things that cannot be tested on a single-node network. I also agree with Oleg that the design and implementation we need is one that does not require different manual configuration changes to tune for different environments. Manual tuning is error prone and leads to less stable systems than automatic tuning. I think auto-tuning is feasible, and we need to work through ways to improve the design and implementation of the consensus node to permit this. We want a system with minimal tuning required.
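As a sketch of what auto-tuning could look like (a hypothetical controller, not an existing platform class), an additive-increase/multiplicative-decrease loop could derive an in-flight work limit from the observed backlog instead of from per-environment configuration:

```java
// Hypothetical sketch of an auto-tuned backpressure limit (not platform code):
// additive-increase / multiplicative-decrease driven by the observed backlog,
// so the same logic adapts to a laptop, a large box, or a 32-node network.
public class AimdLimiter {
    private int limit;
    private final int min;
    private final int max;

    public AimdLimiter(int initial, int min, int max) {
        this.limit = initial;
        this.min = min;
        this.max = max;
    }

    /** Feed the controller the current backlog periodically; returns the new limit. */
    public int update(int backlog, int backlogTarget) {
        if (backlog > backlogTarget) {
            limit = Math.max(min, limit / 2); // multiplicative decrease when unhealthy
        } else {
            limit = Math.min(max, limit + 1); // additive increase while healthy
        }
        return limit;
    }

    public int limit() {
        return limit;
    }
}
```

The point of the sketch is only that the control signal (backlog relative to a target) is measured at runtime, so no operator has to guess a static configuration per environment.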
Looks like the health monitor change finally went live.
@oleg, perhaps the health monitor is a little more efficient than you choose to believe? Personally, I'm not all that surprised. This is exactly the type of behavior I predicted ~10 months ago, and then subsequently observed during multiple rounds of testing.
The health monitor is significantly more efficient than the previous implementation, which was prone to throttling the platform fork-join pool down to the point of deadlock. It's not a surprise that things look better after that source of instability has been removed. The health monitor inefficiency claim is still valid, however, and there are many manifestations of it. The mere fact that the size of the platform fork-join pool had to be decreased to 8, and that we can't restore it to 48 without causing stability issues, is just more evidence.
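To illustrate the deadlock-like failure mode mentioned above (illustration only, not the platform's code): blocking calls inside a small fork-join pool can occupy every worker and stall all other submissions, and `ForkJoinPool.managedBlock` is the JDK mechanism that lets the pool spawn compensation threads so progress continues.

```java
import java.util.concurrent.*;

// Illustration only: with a tiny (heavily throttled) fork-join pool, two tasks
// that block on a latch would occupy both workers. Wrapping the wait in a
// ManagedBlocker tells the pool the workers are blocked, so it can create
// compensation threads and the third task still runs.
public class FjStarvationDemo {
    public static String demo() throws Exception {
        ForkJoinPool pool = new ForkJoinPool(2); // tiny pool, like a throttled platform pool
        CountDownLatch gate = new CountDownLatch(1);
        for (int i = 0; i < 2; i++) {
            pool.submit(() -> {
                try {
                    ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                        public boolean block() throws InterruptedException {
                            gate.await();
                            return true;
                        }
                        public boolean isReleasable() {
                            return gate.getCount() == 0;
                        }
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // With a plain gate.await() in the two tasks above, this submission
        // could never run: both workers would be stuck with no compensation.
        String result = pool.submit(() -> "progress").get(5, TimeUnit.SECONDS);
        gate.countDown();
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```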
Getting the platform thread pool back up to a larger number should actually be pretty simple. The transaction handling thread just needs to get a dedicated core that doesn't have to compete for CPU resources with lower priority tasks. Limiting the number of threads in the default platform thread pool was intended to be a bandaid fix until something like this could be done. I'm surprised nobody has done that yet.
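The JDK has no portable CPU-pinning API, so true core affinity requires OS tooling (e.g. `taskset` on Linux) or a native library. A rough in-JVM approximation of "a dedicated core for transaction handling" is to run pool workers at minimum thread priority and the handling thread at maximum priority. The names below are made up for illustration, and priority effects are OS-scheduler dependent:

```java
import java.util.concurrent.*;

// Sketch (not the platform's actual code): pool workers get the lowest
// priority while the transaction-handling thread gets the highest, so
// handling work competes less with background tasks for CPU time.
public class PriorityDemo {
    public static ForkJoinPool lowPriorityPool(int parallelism) {
        ForkJoinPool.ForkJoinWorkerThreadFactory factory = pool -> {
            ForkJoinWorkerThread t =
                    ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
            t.setPriority(Thread.MIN_PRIORITY); // background work yields CPU to handling
            return t;
        };
        return new ForkJoinPool(parallelism, factory, null, false);
    }

    public static Thread handlingThread(Runnable work) {
        Thread t = new Thread(work, "transaction-handler");
        t.setPriority(Thread.MAX_PRIORITY); // handling should not wait behind background tasks
        return t;
    }
}
```

Whether priorities alone are enough is exactly what's disputed in this thread; the sketch only shows the cheapest variant that needs no native code.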
This issue is about the health monitor's interference with operations to the extent that it keeps the transaction handling thread underutilized: transaction handling is no longer the bottleneck, as it should be. Providing more threads for the platform fork-join pool makes other components even faster (pre-handle being the most important one), which makes tasks pile up in front of transaction handling more frequently and more severely. Transaction handling idling has been confirmed in performance networks as well, so it is no longer a single-node artifact. With that idling in mind, the hope that binding the transaction handling thread to a CPU would help is not well founded. I ran manual binding experiments back when transaction handling was a true bottleneck at 100% utilization. The results were negative (no measurable effect), which is another argument that the proposed performance model is not adequate.
Have you combined the pinning of the transaction handling CPU with increasing the size of the platform thread pool?
I performed the pinning experiment when the platform thread pool was full size (48 threads) and the transaction handling thread was 100% busy (i.e., before the health monitor). Again, there was no measurable effect then, and I don't expect one now that the thread occasionally goes idle.
Problem
Below are metrics collected from a single node processing NftTransferLoadTest at full speed. Ideally, transaction handling, being the actual bottleneck, should be 100% busy all the time and the unhandled task queue should fluctuate around a steady number. That's not the case in this example: the health monitor discovers an unhealthy state with a time lag, and its reaction is too harsh, emptying the incoming queue for a prolonged period of time and allowing the handling thread to go idle.
Solution
Tuning the health monitor may help to increase throughput in the short run. A longer term solution requires a better mechanism to limit buffering unhandled tasks for transaction handling and the entire pipeline.
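A minimal sketch of the kind of bounded-buffering mechanism meant here (hypothetical, not platform code): a fixed-capacity queue in front of the handling thread makes producers block at the source the moment the backlog hits its cap, so backpressure is immediate and local rather than applied with a lag by a separate monitor.

```java
import java.util.concurrent.*;

// Hypothetical sketch: a bounded queue caps the unhandled backlog at the
// source. Producers block when the queue is full, so the handling thread
// stays busy as long as work exists and never sees a drained queue caused
// by a delayed, overly harsh monitor reaction.
public class BoundedPipeline {
    public static int run(int items, int capacity) throws InterruptedException {
        BlockingQueue<Integer> unhandled = new ArrayBlockingQueue<>(capacity);
        final int[] handled = {0};
        Thread handler = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    unhandled.take(); // handling thread works whenever tasks exist
                    handled[0]++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "transaction-handler");
        handler.start();
        for (int i = 0; i < items; i++) {
            unhandled.put(i); // blocks when the backlog reaches 'capacity'
        }
        handler.join();
        return handled[0];
    }
}
```

Extending this from one queue to the entire pipeline (pre-handle included) is the open design question the issue raises.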
Alternatives
No response