vesoft-inc / nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability
https://nebula-graph.io
Apache License 2.0

Refactor thread pool in storaged for better performance. #3941

Open cangfengzhs opened 2 years ago

cangfengzhs commented 2 years ago

Introduction

While testing performance, I found a rather counterintuitive phenomenon: storage latency grows far faster than qps does; the two growth rates are not even on the same order of magnitude.

E.g. for GetProp requests at 10 qps, each request's latency is about 20ms. On a 50-core CPU, assuming the best throughput is reached at about 60% CPU utilization and that the whole 20ms is computation with no disk IO, the machine should sustain roughly 50 × 0.6 / 0.02s = 1500 qps, so there should be no noticeable change in latency until qps approaches 1500. In reality, by the time qps reaches 100, latency has already risen significantly (avg up by about 15ms).

After adding metrics, the extra 15ms turned out to be time requests spent queued in the thread pool. The most obvious explanation is that the pool has too few threads, but after I increased the thread count, latency did not drop; it rose even further.

Further analysis showed that as the thread count grew, the system (sys) CPU usage grew with it, with a large amount of CPU burned in spinlocks inside the process.

Therefore, my preliminary conclusion: the thread pool provided by folly is not suitable for our IO-heavy scenario. RocksDB does only blocking IO, so if the number of threads equals the number of CPU cores, the cores inevitably sit idle during IO waits; but if the number of threads is much larger, contention inside the thread pool becomes fierce and CPU is wasted on meaningless contention, which degrades performance.
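A minimal sketch of the mismatch, assuming folly's CPUThreadPoolExecutor and the RocksDB C++ API (the `serve` function and its surroundings are hypothetical): every request that blocks inside RocksDB parks one pool thread, so a core-sized pool idles the CPU during IO, while a much larger pool pays for it in contention on the pool's shared task queue.

```cpp
#include <string>
#include <folly/executors/CPUThreadPoolExecutor.h>
#include <rocksdb/db.h>

// Sketch only: `db` is an open RocksDB instance; the handler body is
// hypothetical. Every task that blocks in Get() parks a pool thread.
void serve(rocksdb::DB* db, folly::CPUThreadPoolExecutor& pool,
           std::string key) {
  pool.add([db, key = std::move(key)] {
    std::string value;
    // Blocking read: this pool thread sleeps inside RocksDB while the
    // CPU core it was meant to saturate goes idle. Oversizing the pool
    // instead shifts the cost to contention inside the executor.
    rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key, &value);
    if (s.ok()) {
      // ... decode `value` and run the pushed-down computation ...
    }
  });
}
```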

cangfengzhs commented 2 years ago

Here is the idea for a simple thread pool (a sketch follows below).

First, we implement a thread pool with hundreds of threads.

The pool is a plain producer-consumer model: new requests keep arriving into the pool, and the pool's threads keep consuming them.

If we store requests in a single queue in the naive way, contention is severe (even a lock-free queue only guarantees lock-freedom, not wait-freedom, and the contended queue head still causes cache misses).

So instead we provide many queues, and producers and consumers each pick a queue at random. This drastically reduces contention between consumer threads.

However, this brings another problem: uneven randomness can leave one queue too long. Therefore, when a consumer thread selects a queue, it randomly picks two queues, compares their lengths, and consumes from the longer one (the classic power-of-two-choices trick). This keeps the p99 latency of requests under control.
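A minimal sketch of that design (all names hypothetical, synchronization deliberately simple): many locked queues, producers enqueue into one random queue, and consumers drain the longer of two randomly chosen queues.

```cpp
#include <atomic>
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

class MultiQueuePool {
 public:
  MultiQueuePool(size_t nThreads, size_t nQueues) : queues_(nQueues) {
    for (size_t i = 0; i < nThreads; ++i) {
      workers_.emplace_back([this] { consume(); });
    }
  }

  ~MultiQueuePool() {
    stop_.store(true);
    for (auto& t : workers_) t.join();  // pending tasks are dropped (sketch)
  }

  void add(std::function<void()> task) {
    // Producer: enqueue into one randomly chosen queue.
    Queue& q = pick();
    std::lock_guard<std::mutex> g(q.mtx);
    q.tasks.push_back(std::move(task));
  }

 private:
  struct Queue {
    std::mutex mtx;
    std::deque<std::function<void()>> tasks;
  };

  Queue& pick() {
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<size_t> d(0, queues_.size() - 1);
    return queues_[d(rng)];
  }

  void consume() {
    while (!stop_.load(std::memory_order_relaxed)) {
      Queue& a = pick();
      Queue& b = pick();
      // Power of two choices: take from the longer of two random queues.
      // The unlocked size peek is racy, but it is only a balancing hint.
      Queue& q = a.tasks.size() >= b.tasks.size() ? a : b;
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> g(q.mtx);
        if (!q.tasks.empty()) {
          task = std::move(q.tasks.front());
          q.tasks.pop_front();
        }
      }
      if (task) {
        task();
      } else {
        // Both picks were empty; back off briefly instead of spinning hot.
        std::this_thread::sleep_for(std::chrono::microseconds(50));
      }
    }
  }

  std::vector<Queue> queues_;
  std::vector<std::thread> workers_;
  std::atomic<bool> stop_{false};
};
```

Because each consumer only ever locks one of many queues, the lock any given thread touches is rarely contended, while the two-choice pick keeps queue lengths (and therefore tail latency) from drifting apart.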

wenhaocs commented 2 years ago

Good analysis. We may need to discuss the solutions. Can you paste the draft PR for the experiments, and the related latency graphs, here as well? Thx.

Shylock-Hg commented 2 years ago

> Therefore, my preliminary conclusion: the thread pool provided by folly is not suitable for our IO-heavy scenario.

As far as I know, the storage processor currently chooses a cpu- or io-bound Executor as below:

storage/StorageFlags.h
29:DECLARE_string(reader_handlers_type);

storage/StorageFlags.cpp
39:DEFINE_string(reader_handlers_type, "cpu", "Type of reader handlers, options: cpu,io");

storage/GraphStorageServiceHandler.cpp
36:  if (FLAGS_reader_handlers_type == "io") {
41:    if (FLAGS_reader_handlers_type != "cpu") {
42:      LOG(WARNING) << "Unknown value for --reader_handlers_type, using `cpu'";
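For reference, per the definition above the executor type is selected by that gflag, so switching storaged to the io executor is just:

```
--reader_handlers_type=io
```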

But we can use only one type of executor in the whole storage processor, while its workload is a mix of cpu work (computation, including pushed-down computation) and io work (reading data). So maybe we should separate these two types of workload and hand each to the proper Executor; a sketch follows below. P.S. Separating cpu/io workloads would also benefit the scheduler in the graph layer.
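A rough sketch of that separation, assuming folly futures and the two executor types (`readValue` and `compute` are hypothetical stand-ins): the blocking read runs on a large io pool, and the result then hops to a core-sized cpu pool for the computation.

```cpp
#include <string>
#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/executors/IOThreadPoolExecutor.h>
#include <folly/futures/Future.h>

// Hypothetical stand-ins for a blocking RocksDB read and a
// pushed-down computation.
std::string readValue(const std::string& key);
int compute(const std::string& raw);

folly::Future<int> process(folly::IOThreadPoolExecutor& ioPool,
                           folly::CPUThreadPoolExecutor& cpuPool,
                           std::string key) {
  // Blocking read on the (large) io pool, then hop to the
  // (core-sized) cpu pool for the computation.
  return folly::via(&ioPool,
                    [key = std::move(key)] { return readValue(key); })
      .via(&cpuPool)
      .thenValue([](std::string raw) { return compute(raw); });
}
```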

cangfengzhs commented 2 years ago

> But we can use only one type of executor in the whole storage processor, while its workload is a mix of cpu work (computation, including pushed-down computation) and io work (reading data). So maybe we should separate these two types of workload and hand each to the proper Executor. P.S. Separating cpu/io workloads would also benefit the scheduler in the graph layer.

There is another problem. In graphd, network IO can be handled by an EventLoop. But in storaged, access to RocksDB is blocking, so many threads have to be dedicated to IO; either way, we need a thread pool that can hold hundreds of threads. If we use separate threads for pushed-down computation and for IO, it increases the complexity of the system, and the implementation fills up with callback functions. It also cannot guarantee that the computing threads process the output of the IO threads in time, which makes p99 latency uncontrollable.