hashgraph / hedera-services

Crypto, token, consensus, file, and smart contract services for the Hedera public ledger
Apache License 2.0

Health monitor is not efficient #14422

Open OlegMazurov opened 1 month ago

OlegMazurov commented 1 month ago

Problem

Below are metrics collected from a single node processing NftTransferLoadTest at full speed. Ideally, transaction handling, being the actual bottleneck, should be 100% busy at all times, and the unhandled task queue should fluctuate around a steady level. That is not the case in this example: the health monitor detects the unhealthy state with a time lag, and its reaction is too harsh, emptying the incoming queue for a prolonged period and allowing the handling thread to go idle.

 time                   trans_per_sec unhealthyDuration TransactionHandler_unhandled_task_count TransactionHandler_busy_fraction
--------------------------------------------------------------------------------------------------------------------------------
2024-07-25 15:25:06 UTC 6441.49       0.200             3                                       0.743
2024-07-25 15:25:09 UTC 6309.36       0.000             1                                       0.720
2024-07-25 15:25:12 UTC 6711.45       1.100             63                                      0.951
2024-07-25 15:25:15 UTC 5758.57       0.200             38                                      1.000
2024-07-25 15:25:18 UTC 5132.63       3.200             47                                      1.000
2024-07-25 15:25:21 UTC 4167.24       6.200             37                                      1.000
2024-07-25 15:25:24 UTC 3537.85       2.300             31                                      1.000
2024-07-25 15:25:27 UTC 3083.83       0.000             19                                      0.928
2024-07-25 15:25:30 UTC 3916.83       0.000             1                                       0.916
2024-07-25 15:25:33 UTC 4591.21       0.000             14                                      0.895
2024-07-25 15:25:36 UTC 4770.38       0.000             5                                       0.884
2024-07-25 15:25:39 UTC 4898.66       0.000             11                                      0.912
2024-07-25 15:25:42 UTC 5509.99       0.500             36                                      1.000
2024-07-25 15:25:45 UTC 5189.37       0.000             2                                       0.917
2024-07-25 15:25:48 UTC 5616.71       0.000             9                                       0.886
2024-07-25 15:25:51 UTC 6059.27       0.000             0                                       0.998
2024-07-25 15:25:54 UTC 6000.89       0.000             0                                       0.624
2024-07-25 15:25:57 UTC 5880.47       0.000             0                                       0.445
2024-07-25 15:26:00 UTC 6030.48       0.000             8                                       0.843
2024-07-25 15:26:03 UTC 6029.67       0.000             1                                       0.956
2024-07-25 15:26:06 UTC 6273.28       0.000             2                                       0.953
2024-07-25 15:26:09 UTC 6356.39       0.000             1                                       0.852
2024-07-25 15:26:12 UTC 6229.92       0.300             43                                      0.888
2024-07-25 15:26:15 UTC 5957.35       0.000             18                                      1.000
2024-07-25 15:26:18 UTC 5992.18       0.000             1                                       0.917
2024-07-25 15:26:21 UTC 6115.42       0.000             7                                       0.732
2024-07-25 15:26:24 UTC 6087.66       0.000             1                                       0.828
2024-07-25 15:26:27 UTC 6297.59       0.000             1                                       0.861
2024-07-25 15:26:30 UTC 6276.00       0.000             1                                       0.578
2024-07-25 15:26:33 UTC 6208.90       0.000             1                                       0.679
2024-07-25 15:26:36 UTC 6344.00       0.000             17                                      0.755
2024-07-25 15:26:39 UTC 5551.75       2.400             40                                      1.000
2024-07-25 15:26:42 UTC 4988.60       1.900             50                                      1.000
2024-07-25 15:26:45 UTC 4575.24       0.000             30                                      1.000
2024-07-25 15:26:48 UTC 4900.30       0.000             28                                      1.000
2024-07-25 15:26:51 UTC 5115.38       0.000             19                                      1.000
2024-07-25 15:26:54 UTC 5181.09       0.900             60                                      1.000
2024-07-25 15:26:57 UTC 4645.69       0.000             8                                       1.000
2024-07-25 15:27:00 UTC 4971.97       0.000             2                                       0.606
2024-07-25 15:27:03 UTC 5178.77       0.000             1                                       0.562
2024-07-25 15:27:06 UTC 5336.48       1.000             62                                      0.853
2024-07-25 15:27:09 UTC 4618.90       0.000             20                                      1.000
2024-07-25 15:27:12 UTC 4871.70       0.000             17                                      0.915
2024-07-25 15:27:15 UTC 4871.62       0.000             20                                      1.000
2024-07-25 15:27:18 UTC 4990.09       0.000             1                                       0.761
2024-07-25 15:27:21 UTC 5079.62       0.000             1                                       0.574
2024-07-25 15:27:24 UTC 5145.22       0.000             0                                       0.657
2024-07-25 15:27:27 UTC 5203.69       0.000             1                                       0.549
2024-07-25 15:27:30 UTC 5249.63       0.000             0                                       0.528
2024-07-25 15:27:33 UTC 5160.38       0.000             1                                       0.454
2024-07-25 15:27:36 UTC 5267.93       0.000             0                                       0.894
2024-07-25 15:27:39 UTC 5360.08       0.000             0                                       0.618

Solution

Tuning the health monitor may help increase throughput in the short run. A longer-term solution requires a better mechanism for limiting the buffering of unhandled tasks, both for transaction handling and for the entire pipeline.
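One possible shape for such a mechanism is a bounded buffer that rejects (or blocks) submissions once a backlog limit is reached, pushing backpressure onto producers instead of draining the queue after the fact. A minimal sketch, with all names illustrative and not taken from the hedera-services codebase:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: a bounded task buffer that applies backpressure
 *  at the point of submission, so the backlog can never exceed the
 *  capacity and the consumer never sees a long idle period after a purge. */
public final class BoundedTaskBuffer<T> {
    private final BlockingQueue<T> queue;

    public BoundedTaskBuffer(final int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false when the buffer is full, signaling backpressure to
     *  the producer. A real implementation might instead block here
     *  with queue.put(task). */
    public boolean submit(final T task) {
        return queue.offer(task);
    }

    /** Consumer side: the handling thread drains tasks as fast as it can
     *  process them; returns null when the buffer is empty. */
    public T poll() {
        return queue.poll();
    }

    public int backlog() {
        return queue.size();
    }
}
```

The point of the sketch is that the limit is enforced continuously at submission time, rather than detected after the fact by a monitor that then over-corrects.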

Alternatives

No response

OlegMazurov commented 1 month ago

I used release/0.52 @010a5d3a in the example.

litt3 commented 1 month ago

I don't know how much we can read into the results from a single-node network. The most important function of the health monitor is to regulate gossip; the second is to regulate event creation.

Since a single node network doesn't gossip, this test doesn't represent real-world behavior well. Consequently, the results look quite different for multi-node networks.

cody-littley commented 1 month ago

@OlegMazurov did you modify the health monitor configuration, or are you running with the same configuration as in develop? The configuration in develop was tuned for 32 node networks, and I can promise you it's not going to work well in a single node environment. I suggest partnering with somebody who understands platform code if you aren't able to figure out a good configuration (@litt3 is the current resident expert).

OlegMazurov commented 1 month ago

One interpretation of this issue is exactly that: the health monitor has to be configured differently for each environment (single-node laptop, single-node large Linux box, 7-node network, 30-node network, etc.). That approach is going to be untenable, and always sub-optimal, in dynamic heterogeneous networks.

cody-littley commented 1 month ago

This is less a property of the health monitor, and more an emergent property of the hashgraph consensus algorithm and distributed systems. Gossip, event creation, and consensus function drastically differently depending on network size and configuration, and as a result provide workloads that are very different in the single node case vs. the many node case.

At the end of the day, a single-node network is just not a very realistic way to test many important aspects of our system. Solutions that might work well on a single node often don't work with many nodes, and solutions that work well on many-node systems (like the health monitor) may not be optimal for single-node setups. It's hard to deny that the health monitor change resulted in major performance improvements in larger networks, whereas many attempted solutions that attacked the problem from a "single node in isolation" perspective were unable to make headway.

rbair23 commented 4 weeks ago

I agree there are certainly things that cannot be tested on a single-node network. I also agree with Oleg that the design and implementation we need is one that does not require different manual configuration changes for different environments. Manual tuning is error-prone and leads to less stable systems than automatic tuning. I think auto-tuning is feasible, and we need to work through ways to improve the design and implementation of the consensus node to permit it. We want a system that requires minimal tuning.
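One hedged sketch of what such auto-tuning could look like: adjust a backlog limit with an AIMD (additive-increase, multiplicative-decrease) rule, so the limit adapts to the environment instead of being configured per deployment. All names and constants here are illustrative, not from the codebase:

```java
/** Hypothetical AIMD-style auto-tuner for a backlog limit: grow the
 *  allowed backlog additively while the handler keeps up, cut it
 *  multiplicatively when it falls behind. */
public final class AdaptiveLimit {
    private double limit;
    private final double min;
    private final double max;

    public AdaptiveLimit(final double initial, final double min, final double max) {
        this.limit = initial;
        this.min = min;
        this.max = max;
    }

    /** Call once per monitoring interval with the observed backlog. */
    public void update(final int backlog) {
        if (backlog > limit) {
            limit = Math.max(min, limit * 0.5); // multiplicative decrease
        } else {
            limit = Math.min(max, limit + 1.0); // additive increase
        }
    }

    public double limit() {
        return limit;
    }
}
```

The same feedback loop converges to different operating points on a laptop and on a 30-node network without per-environment configuration, which is the property being asked for.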

cody-littley commented 1 week ago

Looks like the health monitor change finally went live.

[Screenshot: 2024-08-27 at 7:49 AM]

@oleg, perhaps the health monitor is a little more efficient than you choose to believe? Personally, I'm not all that surprised. This is exactly the type of behavior I predicted ~10 months ago, and then subsequently observed during multiple rounds of testing.

OlegMazurov commented 1 week ago

The health monitor is significantly more efficient than the previous implementation, which was prone to throttling the platform fork-join pool all the way down to a deadlock. It's no surprise that things look better after that source of instability was removed. The claim that the health monitor is inefficient still stands, however, and there are many manifestations of it. The mere fact that the platform fork-join pool had to be shrunk to 8 threads, and that we can't restore it to 48 without causing stability issues, is just one more piece of evidence.
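For context, the pool size under discussion is the parallelism of a Java fork-join pool, which in plain Java is fixed at construction (the platform's actual pool wiring may differ; this is only an illustration of the knob being described):

```java
import java.util.concurrent.ForkJoinPool;

/** Illustrative only: how fork-join parallelism is set in plain Java.
 *  Restoring the pool from 8 to 48 threads means constructing it with
 *  a larger parallelism value. */
public final class PoolConfig {
    public static ForkJoinPool withParallelism(final int n) {
        return new ForkJoinPool(n);
    }
}
```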

cody-littley commented 5 days ago

Getting the platform thread pool back up to a larger size should actually be pretty simple. The transaction-handling thread just needs a dedicated core so it doesn't have to compete for CPU resources with lower-priority tasks. Limiting the number of threads in the default platform thread pool was intended as a band-aid fix until something like this could be done. I'm surprised nobody has done it yet.
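In plain Java, the closest approximation without OS-level tools is a dedicated single-thread executor running at maximum priority; true core pinning would need something like `taskset` or a thread-affinity library. A hypothetical sketch (all names are illustrative, not from the codebase):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

/** Hypothetical sketch: give transaction handling its own max-priority
 *  thread instead of a slot in the shared fork-join pool. Priority is
 *  a hint to the OS scheduler, not a guarantee of a dedicated core. */
public final class DedicatedHandler {
    public static ExecutorService create() {
        final ThreadFactory factory = r -> {
            final Thread t = new Thread(r, "transaction-handler");
            t.setPriority(Thread.MAX_PRIORITY);
            t.setDaemon(true);
            return t;
        };
        return Executors.newSingleThreadExecutor(factory);
    }

    /** Convenience for demos: run one task on the dedicated thread and
     *  wait for its result. */
    public static int runBlocking(final Callable<Integer> task) {
        final ExecutorService ex = create();
        try {
            return ex.submit(task).get();
        } catch (final Exception e) {
            throw new RuntimeException(e);
        } finally {
            ex.shutdown();
        }
    }
}
```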

OlegMazurov commented 4 days ago

This issue is about the health monitor's interference with operations, to the extent of keeping the transaction-handling thread underutilized: transaction handling is no longer the bottleneck, as it should be. Providing more threads to the platform fork-join pool makes other components even faster (pre-handle being the most important one) and makes tasks pile up in front of transaction handling more frequently and more severely. Transaction-handling idling has been confirmed in performance networks as well, so it's no longer a single-node artifact. With that idling in mind, the hope that binding the transaction-handling thread to a CPU would help is not well founded. I ran manual binding experiments back when transaction handling was a true bottleneck at 100% utilization. The results were negative (no measurable effect), which is another argument that the proposed performance model is not adequate.

cody-littley commented 3 days ago

Have you combined the pinning of the transaction handling CPU with increasing the size of the platform thread pool?

OlegMazurov commented 2 days ago

I performed the pinning experiment when the platform thread pool was at full size (48 threads) and the transaction-handling thread was 100% busy (i.e., before the health monitor). Again, there was no measurable effect then, and I don't expect one now, when the thread occasionally goes idle.