kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Busy loop in cpu_backend/task_queue.cpp keeps 1 thread at 100% CPU when queue is empty #80

Closed · sayap closed this issue 1 month ago

sayap commented 2 months ago

Hi, I notice that the busy loop in cpu_backend/task_queue.cpp will keep 1 thread at 100% CPU when the queue is empty:

    while (true) {
        mutex.lock();
        if (tasks.empty()) {
            if (exit_flag.load(std::memory_order_seq_cst)) {
                return;
            }
            mutex.unlock();
            continue;  // busy-waits at full speed until a task arrives
        }
        ...
    }

On a laptop, this busy thread will keep the fans on all the time.

If I change the code to use a std::condition_variable, then all the threads will be at near 0% CPU when the queue is empty.

I suppose std::condition_variable also works on Windows, but I don't have a machine ready to test that. If it doesn't work on Windows, we could add a short sleep before the continue as a workaround. From my testing, a short sleep of 0.1 ms keeps the threads at near 0% CPU without impacting the queue processing speed. However, this feels much less elegant compared to using a condition variable :sweat_smile:
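
For reference, here is a minimal, self-contained sketch of the condition-variable variant I have in mind. The member names (mutex, tasks, exit_flag) mirror the snippet above, but the task type, class layout, and enqueue path are simplified for illustration and are not the actual ktransformers code:

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Simplified illustration only; not the real TaskQueue definition.
class TaskQueue {
public:
    TaskQueue() : worker([this] { processTasks(); }) {}

    ~TaskQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            exit_flag.store(true, std::memory_order_seq_cst);
        }
        cv.notify_all();
        worker.join();
    }

    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex);
            tasks.push(std::move(task));
        }
        cv.notify_one();  // wake the worker instead of letting it spin
    }

private:
    void processTasks() {
        while (true) {
            std::unique_lock<std::mutex> lock(mutex);
            // Blocks at ~0% CPU until there is work or we are asked to exit.
            cv.wait(lock, [this] {
                return !tasks.empty() || exit_flag.load(std::memory_order_seq_cst);
            });
            if (tasks.empty()) {
                return;  // exit_flag is set and nothing is left to do
            }
            auto task = std::move(tasks.front());
            tasks.pop();
            lock.unlock();
            task();
        }
    }

    std::mutex mutex;
    std::condition_variable cv;
    std::queue<std::function<void()>> tasks;
    std::atomic<bool> exit_flag{false};
    std::thread worker;
};

The key point is that enqueue() notifies the worker, so the worker can sleep inside cv.wait() instead of spinning on an empty queue.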

chenht2022 commented 2 months ago

You are right; this problem should be fixed.

I tried using std::condition_variable, but the delay in waking up a thread is not negligible. When individual tasks are relatively small, such as computing a MoE layer for a single token, std::condition_variable can lead to significant performance degradation.

A better approach, as you mentioned, is a short sleep to yield the core when idle waiting is detected. The modified code is as follows; it will be integrated into the repository.


void TaskQueue::processTasks() {
    auto start = std::chrono::steady_clock::now();
    while (true) {
        mutex.lock();
        if (tasks.empty()) {
            if (exit_flag.load(std::memory_order_seq_cst)) {
                return;
            }
            mutex.unlock();
            // Keep spinning at full speed for up to 50 ms after the last
            // task, so back-to-back small tasks see no extra latency.
            auto now = std::chrono::steady_clock::now();
            auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
            if (duration > 50) {
                // Genuinely idle: yield the core in 1 ms slices.
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            continue;
        }
        ...
        // A task was just processed; reset the idle timer.
        start = std::chrono::steady_clock::now();
    }
}
sayap commented 2 months ago

Actually I have been running a build with cond var for a few days (see https://github.com/sayap/ktransformers/commit/9c81098), and there is no performance degradation during prefill/decode. This is on kernel 6.10.

chenht2022 commented 2 months ago

I tried your code, and indeed there is no performance degradation; there might have been a mistake in my previous testing method.

It would be great if you could submit a Pull Request to our project.

By the way, in cpu_backend/backend.cpp, there is a group of threads (Backend::worker_thread) that also uses a "busy loop + short sleep" synchronization method, which likewise creates unnecessary background overhead on the CPU. I'm not sure whether changing them all to condition variables would cause performance degradation, because there are at least dozens of these threads in a server environment. Maybe a "busy loop + condition variable" hybrid could achieve both high performance and resource savings. I will study it later.
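
For concreteness, one possible shape of such a hybrid (the names below are hypothetical, not the actual Backend::worker_thread members): each worker spins for a short budget to keep wake-up latency low, then falls back to blocking on a condition variable so idle workers drop to near 0% CPU.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative "busy loop + condition variable" hybrid, assuming one
// WorkSignal per waiting worker; not the real backend code.
struct WorkSignal {
    std::mutex mutex;
    std::condition_variable cv;
    std::atomic<bool> ready{false};

    void notify() {
        {
            // Set the flag under the mutex so a waiter that is about to
            // block cannot miss the wake-up.
            std::lock_guard<std::mutex> lock(mutex);
            ready.store(true, std::memory_order_release);
        }
        cv.notify_one();
    }

    void wait(std::chrono::microseconds spin_budget = std::chrono::microseconds(50)) {
        // Phase 1: bounded busy wait, so work that arrives almost
        // immediately pays no wake-up latency.
        auto deadline = std::chrono::steady_clock::now() + spin_budget;
        while (std::chrono::steady_clock::now() < deadline) {
            if (ready.exchange(false, std::memory_order_acq_rel)) {
                return;
            }
        }
        // Phase 2: genuinely idle, block and release the core.
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [this] { return ready.load(std::memory_order_acquire); });
        ready.store(false, std::memory_order_relaxed);
    }
};

The spin budget would need tuning against typical task sizes; this is only a sketch of the general idea, not a measured implementation.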

sayap commented 2 months ago

Cool, I created pull request #83. Haven't tested it on Windows though.

sayap commented 2 months ago

> By the way, in cpu_backend/backend.cpp, there is a group of threads (Backend::worker_thread) that also uses a "busy loop + short sleep" synchronization method, which likewise creates unnecessary background overhead on the CPU.

Ah, I see. When I was tracking down what was causing the 100% CPU, I did see a lot of 1 ms sleeps in the strace output; that is probably where they come from. I don't think these 1 ms sleeps occupy too much CPU time, but yeah, it is probably better to replace them with a condition variable as well :grin: