progschj / ThreadPool

A simple C++11 Thread Pool implementation
zlib License
7.64k stars 2.21k forks source link

Deadlock spotted! #11

Closed Darelbi closed 9 years ago

Darelbi commented 10 years ago

basically I was just profiling to see how many empty jobs can be scheduled every frame to get idea of average overhead.. and it happened that the queue incurs in deadlocks! The following code reproduce the error on my machine (you just need to play with number of threads in pool and initial recursive depth, the deadlock will necessarily occurs first or later.)

In the few cases in wich following code didn't deadlocked it required ~200 ms (2x2Ghz cores) so 5000 jobs per second is actual throughput if you remove the deadlock.

void recursiveWork(ThreadPool & pool, int depth){ if (depth<=0) return; auto result = pool.enqueue( [ ]() {return 1;}); //empty job.. apparently the formatting remove some parentheses //here, I'm just passing a lambda that returns 1 by the way recursiveWork(pool, depth-1); result.get(); }

int main(){ ThreadPool pool(4); for (int i= 0; i<1000; i++){ recursiveWork(pool,4); } return 0; }

progschj commented 10 years ago

I have issues reproducing this. What compiler/flags/operating system are you using? I tested on Linux with gcc-4.8, clang++-3.5 and icpc 14.

progschj commented 10 years ago

I tried the following with all combinations of 1-10 for threads and recursion. Which usually gives me about 10'000 nanoseconds per job.

#include <iostream>
#include <chrono>
#include <cstdlib>
#include "ThreadPool.h"

void recursiveWork(ThreadPool & pool, int depth){
    if (depth<=0)
        return;
    auto result = pool.enqueue( []{return 1;});
    recursiveWork(pool, depth-1);
    result.get();
}

int main(int argc, char *argv[]){
    int threads = strtol(argv[1], nullptr, 10);
    int recursion = strtol(argv[2], nullptr, 10);
    int initial_jobs = 1000;
    auto t0 = std::chrono::high_resolution_clock::now();
    {
        ThreadPool pool(threads);
        for (int i= 0; i<initial_jobs; i++){
            recursiveWork(pool,recursion);
        }
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << "threads: " << threads << " depth: " << recursion << std::endl;
    std::cout << "ns/job: " << std::chrono::duration_cast<std::chrono::nanoseconds>(t1-t0).count()/((recursion+1)*initial_jobs) << std::endl;
    return 0;
}
Darelbi commented 10 years ago

Compiler version: GCC 4.8.0 posix sjlj Compiler flags: -Wall -Wextra -std=c++11 -s -O3 -msse2 Platform: Windows Vista

If I run the above code with threads and recursion = 10 it never ends (well 2 minutes passed I assume it's deadlockd)

Darelbi commented 10 years ago

I believe is a race condition on stop and empty(). need more time to test it seriously :/

edit: nope it seems correct: using "condition_variable_any" seems to fix the problem, so I think the real problem is inside the "condition_variable" implementation. (by the way. I don't understand why std have duplicated versions if "any" works for all. theorically condition_variable is fine, but seems it is a GCC problem, i search if someone fixed it or no in that case I try to make a ticket..)

edit2: lot of bugs still unresolved for condition_variable.: this one in particular seems the same (apparently they forgot to fix it): https://svn.boost.org/trac/boost/ticket/4978

Have you been able to reproduce it at least?

yxbh commented 9 years ago

I can reproduce the issue, albeit not consistently, though it does happen from time to time. Have not been able to reproduce it with std::condition_variable_any yet.

This is on Windows 7 using MinGW32 (GCC4.8.2), with test code variables threads, recusion = 10 and initial_jobs = 10,000.

by the way. I don't understand why std have duplicated versions if "any" works for all.

std::condition_variable is specialised for std::unique_lock, so it could be faster than the _any version depending on implementation. Indeed, std::condition_variable is quite a bit faster than std::condition_variable_any on my machine.

rubixcubin commented 9 years ago

Hey I think this is a windows issue. I saw this too, condition_variable_any fixed it for me. I think there was a bug somewhere in that implementation. I'm not too sure why

progschj commented 9 years ago

Given that this is apparently a stlib issue/bug and hasn't been active for a while I'll close this for now.