progschj / ThreadPool

A simple C++11 Thread Pool implementation
zlib License

condition.wait deadlock #42

Closed: A-Debord closed this issue 6 years ago

A-Debord commented 7 years ago

Hello,

Sometimes, after a worker has begun waiting, the other threads cannot lock the mutex anymore. I don't get why... This occurs very rarely, roughly once every 30,000 calls to enqueue(). It is, however, a serious issue for my use case.

To track the issue, I have added some prints in enqueue() and in the worker:

Prints in enqueue():

std::future<return_type> res = task->get_future();
cout << "enqueue waiting for lock" << endl;
{
    cout << "enqueue locked" << endl;
    std::unique_lock<std::mutex> lock(queue_mutex);

    // don't allow enqueueing after stopping the pool
    if(stop)
        throw std::runtime_error("enqueue on stopped ThreadPool");

    tasks.emplace([task](){ (*task)(); });
    cout << "enqueue will unlock" << endl;
}
cout << "enqueue will notify" << endl;
condition.notify_one();
cout << "enqueue notified" << endl;

Prints in the worker lambda:

                std::function<void()> task;
                cout << "worker "<<i<<" waiting for lock" <<endl;
                {
                    std::unique_lock<std::mutex> lock(this->queue_mutex);
                    cout << "worker "<<i<<" waiting" << endl;
                    this->condition.wait(lock,
                        [this]{ return this->stop || !this->tasks.empty(); });
                    cout << "worker "<<i<<" awoken" << endl;
                    if(this->stop && this->tasks.empty())
                        return;
                    task = std::move(this->tasks.front());
                    this->tasks.pop();
                }
                cout << "worker "<<i<<" released"<<endl;
                task();

When the deadlock happens, I get:

enqueue waiting for lock
enqueue locked
enqueue will unlock
enqueue will notify
worker 5 awoken
worker 5 released
worker 3 waiting for lock
worker 3 waiting
enqueue notified
enqueue waiting for lock
enqueue locked
enqueue will unlock
enqueue will notify
worker 6 waiting for lock
worker 6 waiting
worker 6 awoken
worker 6 released
worker 4 waiting for lock
worker 4 waiting
worker 5 waiting for lock
worker 6 waiting for lock

There are two things here that I don't get.

I hope I don't give you too much of a headache. Cheers

wilx commented 7 years ago

Try moving condition.notify_one(); under the lock.

Persixty commented 7 years ago

@wilx Moving notify_one() inside the locked scope shouldn't fix anything.

It's entirely valid, and should probably be assumed more efficient, to notify without holding the lock. Notifying while holding the lock may put the notified thread into contention for the lock. That contention is usually resolved very quickly, but it risks waking a thread, immediately putting it back to sleep, and waking it again moments later.

Also, this is based on an old version of this thread pool, and there are a few better versions after it.

It's hard to say what is really causing the issue here without seeing a complete, verifiable program. Fragments don't tell the full story.

Also note that output to std::cout isn't synchronized between threads, so you should put a lock around it. I'm not saying that is the problem, but you might be losing or garbling logged messages. It's none too pretty to introduce a std::mutex over std::cout, though: all the locking and unlocking makes the threads contend more (and hence run more sequentially), and the implicit memory barriers can hide issues in the 'non-debug' code.
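
For example, a minimal sketch of that kind of locked logging helper (the log_mutex / log_line names are just for illustration); note that the extra mutex is itself exactly the kind of added synchronization that can mask the original race:

#include <iostream>
#include <mutex>
#include <sstream>
#include <string>

std::mutex log_mutex;                         // serializes access to std::cout

void log_line(const std::string& msg)
{
    std::ostringstream oss;
    oss << msg << '\n';                       // format the message outside the lock
    std::lock_guard<std::mutex> guard(log_mutex);
    std::cout << oss.str();                   // one locked write per message
}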

wilx commented 7 years ago

While I have personally said that I think the wakeups cannot be missed (https://github.com/progschj/ThreadPool/issues/34#issuecomment-266973470), I am not sure any more. In fact, I did make this change in my modified version of the thread pool (https://github.com/log4cplus/ThreadPool/commit/f690bf8dd9409aadcc3f35e6d5b50ef41eb28143) because I had similar issues, IIRC.

Persixty commented 7 years ago

@wilx All the documentation and all the advice say that you do not need to be (and should probably avoid) holding the lock to notify, but that you must be holding the lock to update the shared state the condition depends on (i.e. whatever the predicate reads).

http://en.cppreference.com/w/cpp/thread/condition_variable

I don't know what this problem is, but barring an appalling bug in the library, that really, really should not be the problem.
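
For reference, the canonical pattern from that page looks roughly like this (a sketch with made-up names m, cv and q, not the pool's code): modify the shared state under the lock, notify after releasing it, and let the waiter re-check the predicate under the lock.

#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex              m;
std::condition_variable cv;
std::queue<int>         q;   // the state the wait predicate depends on

void produce(int value)
{
    {
        std::lock_guard<std::mutex> lock(m);   // hold the lock to change the state
        q.push(value);
    }                                          // release it...
    cv.notify_one();                           // ...then notify, without holding the lock
}

int consume()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, []{ return !q.empty(); });   // predicate is evaluated under the lock
    int value = q.front();
    q.pop();
    return value;
}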

A-Debord commented 7 years ago

I've let my program run all day. It did not freeze. However, yesterday, with the same ThreadPool code, the deadlock occurred multiple times. The only difference I see is that yesterday the program was recompiled between each launch (because I was working on it), whereas today I put it in an infinite loop.

So, to me, it has something to do with the compilation procedure (I'm using the -O3 flag, by the way). I will investigate further in this direction.

Here is the exact code, maybe you'll see something:

threadpool.h

// Includes this header needs to compile:
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <type_traits>
#include <vector>
#include <iostream>   // for the debug prints
using std::cout;      // assumed: the fragments use unqualified cout/endl
using std::endl;

class ThreadPool {
public:
    ThreadPool(size_t);
    template<class F, class... Args>
    auto queue(F&& f, Args&&... args)
        -> std::future<typename std::result_of<F(Args...)>::type>;
    ~ThreadPool();
private:
    void worker(int i);

    bool _stop;
    std::vector< std::thread > _workers;
    std::queue<std::function<void()> > _tasks;
    std::mutex _tasks_mutex;
    std::condition_variable _messenger;
};

template<class F, class... Args>
auto ThreadPool::queue(F&& f, Args&&... args)
    -> std::future<typename std::result_of<F(Args...)>::type>
{
    using return_type = typename std::result_of<F(Args...)>::type;

    auto task = std::make_shared< std::packaged_task<return_type()> >(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...)
        );

    std::future<return_type> res = task->get_future();
    cout << "enqueue waiting for lock" << endl;
    {
        cout << "enqueue locked" << endl;
        std::unique_lock<std::mutex> lock(_tasks_mutex);

        _tasks.emplace([task](){ (*task)(); });
        cout << "enqueue will unlock" << endl;
    }
    cout << "enqueue will notify" << endl;
    _messenger.notify_one();
    cout << "enqueue notified" << endl;
    return res;
}

threadpool.cpp

#include "threadpool.h"

ThreadPool::ThreadPool(size_t threads):_stop(false)
{
    for(size_t i = 0;i<threads;++i){
        _workers.push_back(std::thread(&ThreadPool::worker,this,i));
    }
}

ThreadPool::~ThreadPool()
{
    {
        std::unique_lock<std::mutex> lock(_tasks_mutex);
        _stop = true;
    }
    _messenger.notify_all();
    for(size_t i = 0;i<_workers.size();++i)
        _workers[i].join();
}

void ThreadPool::worker(int i)
{
    while(true)
    {
        std::function<void()> task;
        cout << "worker "<<i<<" waiting for lock" <<endl;
        {
            std::unique_lock<std::mutex> lock(this->_tasks_mutex);
            cout << "worker "<<i<<" waiting" << endl;
            this->_messenger.wait(lock,
                [this]{ cout << "message received " <<  (this->_stop || !this->_tasks.empty()) << endl;
                return this->_stop || !this->_tasks.empty(); });
            cout << "worker "<<i<<" awoken" << endl;
            if(this->_stop && this->_tasks.empty())
                return;
            task = std::move(this->_tasks.front());
            this->_tasks.pop();
        }
        cout << "worker "<<i<<" released"<<endl;

        task();
    }
}

Persixty commented 7 years ago

I can see nothing wrong here.

std::unique_lock<> is entirely valid, but you might consider the even lighter-weight std::lock_guard<> wherever you aren't implicitly (or explicitly) releasing the lock early, such as in queue() and ~ThreadPool().
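
That is, roughly (a sketch against the code you posted):

// In queue() the lock is only ever released at the end of the scope, so
// std::lock_guard is sufficient:
{
    std::lock_guard<std::mutex> lock(_tasks_mutex);
    _tasks.emplace([task](){ (*task)(); });
}
_messenger.notify_one();

// The worker, by contrast, must keep std::unique_lock, because wait() needs
// to be able to unlock and re-lock it.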

wilx commented 7 years ago

@A-Debord Try to get call stacks of all threads if you ever see this deadlock again.

malamanteau commented 6 years ago

I am having the same issue. Here is what I think is happening:

If you add tasks at a slower rate than the pool is consuming them, everything will be fine.

If at any point, you add a task while all of the workers are busy, then the .notify_one() call will not trigger any .wait()ing workers. You now have asymmetry, and there will be items left in the queue. Adding new tasks will allow the old to be processed, but you'll have a queue that keeps growing every time a .notify_one() call falls on deaf ears.

std::condition_variable does not enqueue notify*() calls.

Perhaps I'm missing something. Correct me if I'm wrong.

wilx commented 6 years ago

> @malamanteau commented on Jan 6, 2018, 9:55 PM GMT+1:
>
> I am having the same issue. Here is what I think is happening:
>
> If you add tasks at a slower rate than the pool is consuming them, everything will be fine.
>
> If at any point, you add a task while all of the workers are busy, then the .notify_one() call will not trigger any .wait()ing workers. You now have asymmetry, and there will be items left in the queue. Adding new tasks will allow the old to be processed, but you'll have a queue that keeps growing every time a .notify_one() call falls on deaf ears.
>
> std::condition_variable does not enqueue notify*() calls.
>
> Perhaps I'm missing something. Correct me if I'm wrong.

This does not seem like a plausible explanation. If the workers are busy and none is wait()ing, then you do indeed miss a notification. But that should not be a problem, because once any worker gets out of its task() call, it will lock the mutex again and re-check the condition, which will make it avoid sleeping on the CV and instead do another iteration with another task.
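
(The predicate overload of wait() is specified to behave like the loop below, which is why a notification that arrives while every worker is busy cannot be permanently lost.)

// Equivalent to this->_messenger.wait(lock, pred) in the posted worker:
while (!(this->_stop || !this->_tasks.empty()))   // predicate checked under the lock
    this->_messenger.wait(lock);                  // only sleeps while the queue is empty
// A worker returning from task() re-locks the mutex, sees the non-empty queue,
// and never goes back to sleep waiting for the notification it "missed".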

wilx commented 6 years ago

I have done some digging and this might be related, if you are using this on Linux/Glibc:

Rather interesting read.

So, assuming this might be related to the problems mentioned here by the OP and @malamanteau, my next question is: what is your Glibc version, assuming you are running this on Linux? It seems some issues were fixed in version 2.25.

wilx commented 6 years ago

Also, I have asked a question about how atomic pthread_cond_wait(), on which the C++11 construct is based, actually is on Linux. It turns out there are corner cases that could affect us, I think: c - atomicity of pthread_cond_wait()'s unlock-and-wait? - Stack Overflow

So, I am inclined to suggest, again, moving all of the notifications into the section where the mutex is held.
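
Concretely, in queue() that would look something like this (with notify_all() moved under the lock in the destructor the same way):

{
    std::unique_lock<std::mutex> lock(_tasks_mutex);
    _tasks.emplace([task](){ (*task)(); });
    _messenger.notify_one();   // notify while the mutex is still held
}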

malamanteau commented 6 years ago

@wilx Indeed, this is running on Linux/Glibc 2.24.

I'm still parsing this, but wow, this is very interesting!

malamanteau commented 6 years ago

*facepalm* My issue was elsewhere. Sorry, guys: I had a spurious lock outside of the thread pool.

fogti commented 6 years ago

I think this issue should be closed as it's (probably) solved.