Closed Davidbrcz closed 1 year ago
I was not able to reproduce the deadlock on Win64 VS2019.
A couple things you might try to test your synchronization theory:
stopThreads()
move the statement _done = true
inside the _queueMutex
lock. This would effectively synchronize _done
to the queue mutex per your suggestion._block.wait
statement to remove the predicate, so it instead reads if(_queue.empty() && !_done) _block.wait(lock);
Let me know if either approach solves the problem.
Hey,
I have two commits with different approaches.
One tries to implement minimal changes that fixes some of the issues found in the original message. But it still deadlocks and I don't know why =(
Another one reworks a bit more the threading code and introduces an abstraction for a thread safe working queue. With that code, the test (and our main application) doesn't deadlock \o/. Regardless of correctness, I think this one is easier to understand as several concepts gathered into a single class and used only over a few lines.
I didn't follow through on the predicateless calls on wait
as it violates some static analyzer rules we have.
I've updated slightly the commit that reworks the threading more deeply to address an issue someone found
Hi,
Just adding my two cents. Currently get the same issue on mingw64 MSYS2 GCC 9.1.0 with osgEarth 3.1
Just the addition of a map (single geotiff) is enough to reproduce the lock when closing.
ie osg::Node* globe = osgDB::readNodeFile("myglobe.earth");
After a version upgrade, I approved David's changes
Unfortunately, the situation is not as good as I expected.
The original commit didn't have a call to notify_all
in SynchronizedStopQueue::stop
. It was an oversight and should have been a bug, but it works without it, and worse; with it it deadlocks in a similar fashion to the original deadlock !
I have been trying to narrow it down, but I'm kind of stuck.
The following demo program (a simple queue with some threads passing integers around) with the fixed SynchronizedStopQueue
works as expected (ie, works with the fixed version, doesn't when the line // A is commented).
#include <iostream>
#include <vector>
#include <atomic>
#include <functional>
#include <mutex>
#include <condition_variable>
#include <vector>
#include <unordered_map>
#include <queue>
#include <thread>
#include <future>
#include <type_traits>
#include <queue>
#include <map>
#include <sstream>
template <class U>
std::string mystr(U u){
std::stringstream ss;
ss << u;
return ss.str();
}
using Mutex =std::mutex;
using QueuedJob = int;
class SynchronizedPriorityQueuedJob {
public:
~SynchronizedPriorityQueuedJob()
{
std::cerr<<"SynchronizedPriorityQueuedJob destructor" <<std::endl;
}
template<class ...Ts>
void emplace(Ts... args) {
std::lock_guard<Mutex> lock(_queueMutex);
_queue.emplace(std::forward<Ts>(args)...);
_block.notify_one();
}
bool interrupt_pop(QueuedJob& next) {
std::unique_lock<Mutex> lk(_queueMutex);
_block.wait(lk,[this]{
auto message = std::string("In predicate ") + mystr(std::this_thread::get_id()) + " done="+std::to_string(_done)+"\n";
std::cerr << message << std::flush;
return !_queue.empty() || _done;
});
if(_done) {
auto message = std::string("done in ") + mystr(std::this_thread::get_id()) + "\n";
std::cerr << message << std::flush;
return false;
}
next = std::move(_queue.top());
_queue.pop();
return true;
}
bool try_pop(QueuedJob& next) {
std::unique_lock<Mutex> lk(_queueMutex);
if(_queue.empty()){
return false;
}
next = std::move(_queue.top());
_queue.pop();
return true;
}
bool empty() const
{
std::lock_guard<Mutex> lk(_queueMutex);
return _queue.empty();
}
int size() const
{
std::lock_guard<Mutex> lk(_queueMutex);
return _queue.size();
}
void stop() {
std::lock_guard<Mutex> lk(_queueMutex);
std::cerr << "STOP ALL THREADS" << std::endl;
_done = true;
_block.notify_all(); // line A
}
bool running() const{
std::lock_guard<Mutex> lk(_queueMutex);
return !_done;
}
private:
std::priority_queue<int> _queue;
mutable Mutex _queueMutex;
std::condition_variable_any _block;
bool _done{false};
};
struct JobArena {
static std::map<std::string, std::shared_ptr<JobArena>> _arenas;
std::vector<std::thread> _threads;
SynchronizedPriorityQueuedJob _queue;
JobArena()
{
for(int i = 0; i < 10; ++i) {
_threads.push_back(
std::thread([this]
{
auto message = std::string("Arena starting thread ") + mystr(std::this_thread::get_id()) + "\n";
std::cerr << message << std::endl;
runJobs();
// exit thread here
message = std::string("Thread ") + mystr(std::this_thread::get_id()) + " end loop \n";
std::cerr << message << std::endl;
}
));
}
}
void runJobs(){
while (_queue.running())
{
int next;
if (_queue.interrupt_pop(next))
{
auto message = std::string("executed") + std::to_string(next) + "\n";
std::cerr << message << std::endl;
}
}
}
void stopThread(){
_queue.stop();
// Clear out the queue
while(!_queue.empty())
{
int job;
if(_queue.try_pop(job))
{
std::cerr << "cleaned" << job << std::endl;
std::this_thread::sleep_for(std::chrono::milliseconds(job));
}
}
// wait for them to exit
for (unsigned i = 0; i < _threads.size(); ++i)
{
if (_threads[i].joinable())
{
_threads[i].join();
}
}
_threads.clear();
}
~JobArena(){
stopThread();
}
static void shutdownAll()
{
std::cerr << "shutdown all" << std::endl;
_arenas.clear();
}
};
std::map<std::string, std::shared_ptr<JobArena>> JobArena::_arenas;
int main(int argc, char *argv[])
{
JobArena::_arenas["foo"] = std::shared_ptr<JobArena>(new JobArena());
std::atexit(JobArena::shutdownAll);
std::cerr << "gonna sleep" << std::endl;
std::this_thread::sleep_for(std::chrono::milliseconds(200));
std::cerr << "slept" << std::endl;
int base = 2666;
for(int i = base; i < base+10; ++i){
JobArena::_arenas["foo"]->_queue.emplace(i);
}
std::cerr << "last main" << std::endl;
return 0;
}
There definitely something else going on with osgEarth, but I can't tell if Makamitsu and I are hitting an MSYS bug, if osgEarth's code is still not compliant with the language (UB for instance), if it's a well defined deadlock or a mixed bag of issues
I would love to be able to modify the example above to reproduce the issue and pinpoint it, but I've been unable so far.
Ok, I've dug deep on this, lost quite a bit of hair but I think I have an explanation.
I think the core issues come from the current design of osgEarth in JobArena.
JobArena (as a class) owns several instances of JobArena in the static member _arenas
that get destroyed when when the shared_ptrs are no longer referenced. In JobArena destructor, there are pieces of code to do thread synchronization.
The issue is that the code is in a DLL, and it interacts badly with static objects (roughly speaking, once main
is finished, threads in the DLL are killed rather abruptly by Windows before JobArena destructor run, and mayhem ensures within the thread synchronization code). This page has been very helpful understanding the issue.
The issue has hit many people over time (LDD issue with this on MSYS , LLVM ThreadPool bug, DLIB issue, SO questions 1 2, MS documentation for DLL good practices ...)
I've been able to reproduce the issue with the following example testdll.zip on MSYS (not tested on MSVC, would gladly have someone test it). I'm not sure if MSVC is immune to this or if no one has hit it by sheer luck, because the SO questions have people using MSVC.
My solution, without doing a full redesign to remove the static objects, is on top of the MR I have open,
shutdownAll
public and create a small RAII helper for it, and instantiate it in main in first thing first.
struct Cleaner {
~Cleaner(){
osgEarth::Threading::JobArena::shutdownAll();
}
};
int main(int argc, char** argv)
{
Cleaner c;
I'll update the PR asap.
Hello
We recently made the upgrade from osgEarth 2.8 to 3.2, and our application (built on MSYS2, G++ 8.2, Windwos 10, 64 bits) is hanging, while it's fine on Linux.
The following code is enough to reproduce the issue:
There is a single thread and the stack trace looks like this:
Details
> (gdb) bt > bt > #0 0x00007ff8f3c6d8c4 in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll > #1 0x00007ff8f17bcb70 in WaitForMultipleObjectsEx () from C:\Windows\System32\KernelBase.dll > #2 0x00007ff8f17bca6e in WaitForMultipleObjects () from C:\Windows\System32\KernelBase.dll > #3 0x0000000064941e5d in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll > #4 0x000000006494216b in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll > #5 0x00000000649421e5 in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll > #6 0x00000000015b9288 in std::_V2::condition_variable_any::~condition_variable_any (this=0x284bff08, __in_chrg=In the logs, I see many
I dig into the threading code, and I'll put the remarks I originally put in there for completeness.
I've been looking lately at Threading and Threading.cpp and I have several concerns I would like to discuss
I believe there is data-race on
_done
. Indeed, the variable is defined here, doesn't have an atomic type, is written to here and read by several other threads here without locking.Even if
_done
were atomic, the variable is still used in_block
condition_variable predicate here . For proper synchronization, that variable should be set while holding the same locked ( the last section of this page has an explanation)Thanks.