Closed: sayap closed this issue 1 month ago.
You are right, this problem should be fixed.

I tried using `std::condition_variable`, but the delay in waking up a thread is not negligible. When individual tasks are relatively small, such as computing a MoE layer for a single token, `std::condition_variable` can lead to significant performance degradation.

A better practice, as you mentioned, is to sleep briefly to yield the core when idle waiting is detected. The modified code is as follows; it will be integrated into the repository.
```cpp
void TaskQueue::processTasks() {
    auto start = std::chrono::steady_clock::now();
    while (true) {
        mutex.lock();
        if (tasks.empty()) {
            if (exit_flag.load(std::memory_order_seq_cst)) {
                mutex.unlock();  // release the lock before exiting the worker
                return;
            }
            mutex.unlock();
            auto now = std::chrono::steady_clock::now();
            auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(now - start).count();
            // After ~50 ms without work, start yielding the core with 1 ms sleeps.
            if (duration > 50) {
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            continue;
        }
        ...
        start = std::chrono::steady_clock::now();
    }
}
```
Actually, I have been running a build with a condition variable for a few days (see https://github.com/sayap/ktransformers/commit/9c81098), and there is no performance degradation during prefill/decode. This is on kernel 6.10.
I tried your code, and indeed there is no performance degradation; there might have been a mistake in my previous testing method.
It would be great if you could submit a Pull Request to our project.
By the way, in cpu_backend/backend.cpp there is a group of threads (`Backend::worker_thread`) that also uses a "busy loop + short sleep" synchronization method, which likewise creates unnecessary background CPU overhead. I'm not sure whether switching them all to condition variables would cause performance degradation, since there are at least dozens of these threads in a server environment. Maybe a "busy loop + condition variable" hybrid can achieve both high performance and resource savings. I will study it later.
Cool, I created pull request #83. Haven't tested it on Windows though.
> By the way, in cpu_backend/backend.cpp, there is a group of threads (`Backend::worker_thread`) that also uses a "busy loop + short sleep" synchronization method, which also creates unnecessary background overhead for the CPU.
Ah, I see. When I was searching for what was causing the 100% CPU, I did see a lot of 1 ms sleeps in the `strace` output. This is probably the reason. I don't think these 1 ms sleeps occupy too much CPU time, but yeah, it is probably better to replace them with condition variables as well :grin:
Hi, I noticed that the busy loop in cpu_backend/task_queue.cpp keeps one thread at 100% CPU when the queue is empty. On a laptop, this busy thread keeps the fans on all the time.
If I change the code to use a `std::condition_variable`, then all the threads stay at near 0% CPU when the queue is empty. I suppose `std::condition_variable` also works on Windows, but I don't have a machine ready to test that. If it doesn't work on Windows, then we can add a short sleep before `continue` as a workaround. From my testing, a short sleep of 0.1 ms keeps the threads at near 0% CPU without impacting the queue processing speed. However, this feels much less elegant compared to using a condition variable :sweat_smile: