libcpr / cpr

C++ Requests: Curl for People, a spiritual port of Python Requests.
https://docs.libcpr.org/
Other
6.29k stars 903 forks source link

Thread Pool Deadlock While Task Wait #1040

Open COM8 opened 2 months ago

COM8 commented 2 months ago

Description

There are potential cases where a cpr::ThreadPool can get stuck in a potential deadlock. They are both caused by a race condition involving task_cond.

It can happen that a thread gets stuck waiting for task_cond to get notified but during this there is no one who can notify him like if the caller invokes cpr::ThreadPool::Wait().

Example/How to Reproduce

Run the currently disabled ThreadPoolTests from within https://github.com/libcpr/cpr/issues/1035 over and over again. From time to time they will get stuck.

Possible Fix

No response

Where did you get it from?

GitHub (branch e.g. master)

Additional Context/Your Environment

Fedora 38, https://github.com/libcpr/cpr/issues/1035

baderouaich commented 2 months ago

The ThreadPool being stuck I encountered, It was never stopping, it seems like the issue has to do with the scope of the status_locks, if you notice the Stop() method:

int ThreadPool::Stop() {
    std::unique_lock status_lock(status_wait_mutex); // <- this will lock the entire scope of Stop(), after notifying, 
// the mutex is still held during the notification which will not give the threads a chance to acquire it and get the notification (in the thread loop)

    if (STOP == status) {
      return -1;
    }
    status = STOP;
    status_wait_cond.notify_all();
    task_cond.notify_all();

    for (auto& i : threads) {
        if (i.thread->joinable()) {
            i.thread->join();
        }
    }

    threads.clear();
    cur_thread_num = 0;
    idle_thread_num = 0;

    return 0;
}

I think it should be:

int ThreadPool::Stop() {
     { 
         // lock only this scope to ensure that the mutex is not held during the notification process, allowing the waiting threads to acquire it immediately after being awakened
          std::unique_lock status_lock(status_wait_mutex);
          if (STOP == status) {
            return -1;
          }
          status = STOP;
     }
    status_wait_cond.notify_all();
    task_cond.notify_all();

 ...
}

I tried that locally and it seems the thread pool is waiting and stopping normally so far, I will test with more use cases and get back to you if there is another issue.