gwaldron / osgearth

3D Maps for OpenSceneGraph / C++14
https://www.pelicanmapping.com/home-1/opensource
Other
1.51k stars 779 forks source link

Deadlock on JobArena #1827

Closed Davidbrcz closed 1 year ago

Davidbrcz commented 3 years ago

Hello

We recently made the upgrade from osgEarth 2.8 to 3.2, and our application (built on MSYS2, G++ 8.2, Windwos 10, 64 bits) is hanging, while it's fine on Linux.

The following code is enough to reproduce the issue:

#include <osg/DisplaySettings>
#include <osg/PositionAttitudeTransform>
#include <osg/LOD>
#include <osgDB/Registry>

#include <osgEarthDrivers/cache_filesystem/FileSystemCache>
#include <osgViewer/Viewer>
#include <osgViewer/ViewerBase>
#include <osgEarth/Capabilities>
#include <osgEarth/Registry>
#include <osgEarth/Sky>
#include <osgEarth/TerrainEngineNode>
#include <osgEarth/Registry>
#include <osgEarth/Notify>
#include <osgEarth/MapNode>
#include <osgEarth/TerrainEngineNode>
#include <osgEarth/Sky>

// This header has to be last. It defines conditionally GLintptr
// and GLsizeiptr. If included first, it clashes with the typedef
// that Qt have in qopenglext.h
#include <osgEarth/GLUtils>

#include <osg/ref_ptr>
#include <osg/Referenced>

#include <iostream>

namespace {
struct OsgLogBridge : public osg::NotifyHandler
{
    void notify(osg::NotifySeverity severity, const char* message) override
    {
        std::string simplifiedMessage = message;
        std::cerr << simplifiedMessage;
    }
};

class GlobeTest : public osg::Referenced
{
public:
    explicit GlobeTest();
protected:
    osg::ref_ptr<osg::Group> _globeNode;
    osg::ref_ptr<osgEarth::Map> _map;
    osg::ref_ptr<osgEarth::MapNode> _mapNode;
};

void buildGlobeOptions(
               osgEarth::MapNode::Options& mapNodeOptions,
               osgEarth::TerrainOptions& terrainOptions,
               osgEarth::Util::SkyOptions& skyOptions) {

        terrainOptions.minTileRangeFactor() = 8;

        mapNodeOptions.overlayResolutionRatio() = 20.0f;
        mapNodeOptions.overlayTextureSize() = osgEarth::Registry::instance()->capabilities().getMaxTextureSize() / 8;
        mapNodeOptions.enableLighting() = false;

        skyOptions.ambient() = 0.8f;
        skyOptions.hours() = osgEarth::DateTime().hours();
}

GlobeTest::GlobeTest() :
    _globeNode(new osg::Group())
{
    osgEarth::MapNode::Options mapNodeOptions;
    osgEarth::TerrainOptions terrainOptions;
    osgEarth::Util::SkyOptions skyOptions;

    buildGlobeOptions(mapNodeOptions, terrainOptions, skyOptions);

    _globeNode = new osg::Group();
    osgEarth::Map::Options mapOptions;

    // create the map from the options
    _map = new osgEarth::Map(mapOptions, nullptr);
    mapNodeOptions.terrain() = terrainOptions;

    // create the mapNode from the options
    _mapNode = new osgEarth::MapNode(_map, mapNodeOptions);
    _globeNode->addChild(_mapNode);

    if (!_mapNode->open())
    {
        std::cerr << "Error opening the map node";
    }

}

} // end namespace

int main(void) {
    // allocate pager threads based on CPU configuration.
    int cores = 6;
    std::cerr << "Number of cores:" << cores;
    osg::DisplaySettings::instance()->setNumOfDatabaseThreadsHint(osg::clampAbove(cores, 2));
    osg::DisplaySettings::instance()->setNumOfHttpDatabaseThreadsHint(osg::clampAbove(cores / 2, 1));

    OsgLogBridge* osgLogBridge = new OsgLogBridge();
    osg::setNotifyHandler(osgLogBridge);
    osgEarth::setNotifyHandler(osgLogBridge);
    osg::setNotifyLevel(osg::INFO);
    osgEarth::setNotifyLevel(osg::INFO);

    // create a globe
    GlobeTest _globe;
}

There is a single thread and the stack trace looks like this:

Details > (gdb) bt > bt > #0 0x00007ff8f3c6d8c4 in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll > #1 0x00007ff8f17bcb70 in WaitForMultipleObjectsEx () from C:\Windows\System32\KernelBase.dll > #2 0x00007ff8f17bca6e in WaitForMultipleObjects () from C:\Windows\System32\KernelBase.dll > #3 0x0000000064941e5d in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll > #4 0x000000006494216b in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll > #5 0x00000000649421e5 in ?? () from C:\msys64\mingw64\bin\libwinpthread-1.dll > #6 0x00000000015b9288 in std::_V2::condition_variable_any::~condition_variable_any (this=0x284bff08, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/condition_variable:234 > #7 0x000000000119597b in osgEarth::Threading::JobArena::~JobArena (this=0x284bfe50, __in_chrg=) at C:/.conan/4c5d3f/1/source_subfolder/src/osgEarth/Threading.cpp:578 > #8 0x000000000147232c in __gnu_cxx::new_allocator::destroy (this=0x284bfe50, __p=0x284bfe50) at C:/msys64/mingw64/include/c++/8.2.0/ext/new_allocator.h:140 > #9 0x0000000001593570 in std::allocator_traits >::destroy (__a=..., __p=0x284bfe50) at C:/msys64/mingw64/include/c++/8.2.0/bits/alloc_traits.h:487 > #10 0x00000000015b7479 in std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x284bfe40) > at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:552 > #11 0x0000000001589a6f in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x284bfe40) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:155 > #12 0x0000000001583fe7 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x284c3e70, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:706 > #13 0x000000000157ab1c in std::__shared_ptr::~__shared_ptr (this=0x284c3e68, __in_chrg=) > at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr_base.h:1145 > #14 0x0000000001551358 in std::shared_ptr::~shared_ptr (this=0x284c3e68, __in_chrg=) at C:/msys64/mingw64/include/c++/8.2.0/bits/shared_ptr.h:103 > #15 0x00000000015c478c in std::pair, std::allocator > const, std::shared_ptr >::~pair (this=0x284c3e48, __in_chrg=) > at C:/msys64/mingw64/include/c++/8.2.0/bits/stl_pair.h:198 > #16 0x0000000001474aac in __gnu_cxx::new_allocator, std::allocator > const, std::shared_ptr >, true> >::destroy, std::allocator > const, std::shared_ptr > > ( > this=0x19a5b40 , __p=0x284c3e48) at C:/msys64/mingw64/include/c++/8.2.0/ext/new_allocator.h:140 > #17 0x0000000001595110 in std::allocator_traits, std::allocator > const, std::shared_ptr >, true> > >::destroy, std::allocator > const, std::shared_ptr > > (__a=..., __p=0x284c3e48) > at C:/msys64/mingw64/include/c++/8.2.0/bits/alloc_traits.h:487 > #18 0x00000000016998e9 in std::__detail::_Hashtable_alloc, std::allocator > const, std::shared_ptr >, true> > >::_M_deallocate_node (this=0x19a5b40 , __n=0x284c3e40) at C:/msys64/mingw64/include/c++/8.2.0/bits/hashtable_policy.h:2100 > #19 0x00000000016999ff in std::__detail::_Hashtable_alloc, std::allocator > const, std::shared_ptr >, true> > >::_M_deallocate_nodes (this=0x19a5b40 , __n=0x0) at C:/msys64/mingw64/include/c++/8.2.0/bits/hashtable_policy.h:2113 > #20 0x0000000001541081 in std::_Hashtable, std::allocator >, std::pair, std::allocator > const, std::shared_ptr >, std::allocator, std::allocator > const, std::shared_ptr > >, std::__detail::_Select1st, std::equal_to, std::allocator > >, std::hash, std::allocator > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::clear (this=0x19a5b40 ) > at C:/msys64/mingw64/include/c++/8.2.0/bits/hashtable.h:2047 > #21 0x000000000157e508 in std::unordered_map, std::allocator >, std::shared_ptr, std::hash, std::allocator > >, std::equal_to, std::allocator > >, std::allocator, std::allocator > const, std::shared_ptr > > >::clear (this=0x19a5b40 ) at C:/msys64/mingw64/include/c++/8.2.0/bits/unordered_map.h:846 > #22 0x0000000001195a4d in osgEarth::Threading::JobArena::shutdownAll () at C:/.conan/4c5d3f/1/source_subfolder/src/osgEarth/Threading.cpp:599 > #23 0x00000000013794d4 in _execute_onexit_table (table=0x199e000 ) from C:\msys64\home\d.come\tmp\map3d\tests\build\05e94a83e88b84b752457d5436a46ca909bdbbb2\bin\libosgEarthd.dll > #24 0x0000000000ff116e in _CRT_INIT (hDllHandle=, dwReason=, lpreserved=0x1) at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtdll.c:140 > #25 0x0000000000ff1249 in __DllMainCRTStartup (hDllHandle=0xff0000, dwReason=0, lpreserved=0x1) at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtdll.c:204 > #26 0x00007ff8f3be9a1d in ntdll!RtlActivateActivationContextUnsafeFast () from C:\Windows\SYSTEM32\ntdll.dll > #27 0x00007ff8f3c2db91 in ntdll!LdrShutdownProcess () from C:\Windows\SYSTEM32\ntdll.dll > #28 0x00007ff8f3c2da2d in ntdll!RtlExitUserProcess () from C:\Windows\SYSTEM32\ntdll.dll > #29 0x00007ff8f2c0e0ab in KERNEL32!FatalExit () from C:\Windows\System32\kernel32.dll > #30 0x00007ff8f39ea155 in msvcrt!_exit () from C:\Windows\System32\msvcrt.dll > #31 0x00007ff8f39ea7c5 in msvcrt!_initterm_e () from C:\Windows\System32\msvcrt.dll > #32 0x00000000004014b5 in __tmainCRTStartup () at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:342 > #33 0x000000000040150b in mainCRTStartup () at C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:223

In the logs, I see many

[Thread 79716.0x137c8 exited with code 0]
[Thread 79716.0x137c4 exited with code 0]
// ...

I dig into the threading code, and I'll put the remarks I originally put in there for completeness.

I've been looking lately at Threading and Threading.cpp and I have several concerns I would like to discuss

Thanks.

gwaldron commented 3 years ago

I was not able to reproduce the deadlock on Win64 VS2019.

A couple things you might try to test your synchronization theory:

Let me know if either approach solves the problem.

Davidbrcz commented 3 years ago

Hey,

I have two commits with different approaches.

One tries to implement minimal changes that fixes some of the issues found in the original message. But it still deadlocks and I don't know why =(

Another one reworks a bit more the threading code and introduces an abstraction for a thread safe working queue. With that code, the test (and our main application) doesn't deadlock \o/. Regardless of correctness, I think this one is easier to understand as several concepts gathered into a single class and used only over a few lines.

I didn't follow through on the predicateless calls on wait as it violates some static analyzer rules we have.

Davidbrcz commented 3 years ago

I've updated slightly the commit that reworks the threading more deeply to address an issue someone found

pixelsoba commented 2 years ago

Hi,

Just adding my two cents. Currently get the same issue on mingw64 MSYS2 GCC 9.1.0 with osgEarth 3.1

Just the addition of a map (single geotiff) is enough to reproduce the lock when closing. ie osg::Node* globe = osgDB::readNodeFile("myglobe.earth");

After a version upgrade, I approved David's changes

Davidbrcz commented 2 years ago

Unfortunately, the situation is not as good as I expected.

The original commit didn't have a call to notify_all in SynchronizedStopQueue::stop. It was an oversight and should have been a bug, but it works without it, and worse; with it it deadlocks in a similar fashion to the original deadlock !

I have been trying to narrow it down, but I'm kind of stuck. The following demo program (a simple queue with some threads passing integers around) with the fixed SynchronizedStopQueue works as expected (ie, works with the fixed version, doesn't when the line // A is commented).

#include <iostream>
#include <vector>
#include <atomic>
#include <functional>
#include <mutex>
#include <condition_variable>
#include <vector>
#include <unordered_map>
#include <queue>
#include <thread>
#include <future>
#include <type_traits>
#include <queue>
#include <map>
#include <sstream>

template <class U>
std::string mystr(U u){
        std::stringstream ss;
        ss << u;
        return ss.str();
}

using Mutex =std::mutex;
using QueuedJob = int;
    class SynchronizedPriorityQueuedJob {
    public:
        ~SynchronizedPriorityQueuedJob()
        {
          std::cerr<<"SynchronizedPriorityQueuedJob destructor" <<std::endl;
        }
        template<class ...Ts>
        void emplace(Ts... args) {
          std::lock_guard<Mutex> lock(_queueMutex);
          _queue.emplace(std::forward<Ts>(args)...);
          _block.notify_one();
        }

        bool interrupt_pop(QueuedJob& next) {
          std::unique_lock<Mutex> lk(_queueMutex);
          _block.wait(lk,[this]{
                                        auto message = std::string("In predicate ") +  mystr(std::this_thread::get_id()) + " done="+std::to_string(_done)+"\n";
                std::cerr << message   << std::flush;
                return !_queue.empty() || _done;
        });
        if(_done) {
                                        auto message = std::string("done in  ") +  mystr(std::this_thread::get_id()) + "\n";
                std::cerr << message   << std::flush;
            return false;
        }
        next = std::move(_queue.top());
        _queue.pop();
        return true;
        }

        bool try_pop(QueuedJob& next) {
          std::unique_lock<Mutex> lk(_queueMutex);
          if(_queue.empty()){
              return false;
          }

          next = std::move(_queue.top());
          _queue.pop();
          return true;
        }

        bool empty() const
        {
          std::lock_guard<Mutex> lk(_queueMutex);
          return _queue.empty();
        }
        int size() const
        {
          std::lock_guard<Mutex> lk(_queueMutex);
          return _queue.size();
        }

        void stop() {
          std::lock_guard<Mutex> lk(_queueMutex);
          std::cerr << "STOP ALL THREADS" << std::endl;
          _done = true;
          _block.notify_all(); // line A
        }
        bool running() const{
        std::lock_guard<Mutex> lk(_queueMutex);
        return !_done;
        }
    private:
        std::priority_queue<int> _queue;
        mutable Mutex _queueMutex;
        std::condition_variable_any _block;
        bool _done{false};
    };

struct JobArena {
    static std::map<std::string, std::shared_ptr<JobArena>> _arenas;
    std::vector<std::thread> _threads;
        SynchronizedPriorityQueuedJob _queue;
    JobArena()
    {
        for(int i = 0; i < 10; ++i) {
            _threads.push_back(
                std::thread([this]
                    {
                            auto message = std::string("Arena starting thread ") +  mystr(std::this_thread::get_id()) + "\n";
                            std::cerr << message  << std::endl;
                            runJobs();
                            // exit thread here
                            message = std::string("Thread ") + mystr(std::this_thread::get_id()) + " end loop \n";
                            std::cerr << message << std::endl;
                    }
                    ));
        }
    }
    void runJobs(){
            while (_queue.running())
            {

                    int next;
                    if (_queue.interrupt_pop(next))
                    {
                                                    auto message = std::string("executed") + std::to_string(next) + "\n";
                            std::cerr << message  << std::endl;
                    }
            }
    }
        void stopThread(){
                _queue.stop();

                // Clear out the queue
                while(!_queue.empty())
                {
                        int job;
                        if(_queue.try_pop(job))
                        {
                                std::cerr << "cleaned" << job << std::endl;
                                std::this_thread::sleep_for(std::chrono::milliseconds(job));
                        }
                }

                // wait for them to exit
                for (unsigned i = 0; i < _threads.size(); ++i)
                {
                        if (_threads[i].joinable())
                        {
                                _threads[i].join();
                        }
                }

                _threads.clear();
        }
    ~JobArena(){
        stopThread();
    }
        static void shutdownAll()
                {
                        std::cerr << "shutdown all" << std::endl;
                        _arenas.clear();
                }
};

std::map<std::string, std::shared_ptr<JobArena>> JobArena::_arenas;
int main(int argc, char *argv[])
{
    JobArena::_arenas["foo"] = std::shared_ptr<JobArena>(new JobArena());
    std::atexit(JobArena::shutdownAll);
    std::cerr << "gonna sleep" << std::endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    std::cerr << "slept" << std::endl;
    int base = 2666;
    for(int i = base; i < base+10; ++i){
             JobArena::_arenas["foo"]->_queue.emplace(i);
    }
    std::cerr << "last main" << std::endl;
    return 0;
}

There definitely something else going on with osgEarth, but I can't tell if Makamitsu and I are hitting an MSYS bug, if osgEarth's code is still not compliant with the language (UB for instance), if it's a well defined deadlock or a mixed bag of issues

I would love to be able to modify the example above to reproduce the issue and pinpoint it, but I've been unable so far.

Davidbrcz commented 2 years ago

Ok, I've dug deep on this, lost quite a bit of hair but I think I have an explanation.

I think the core issues come from the current design of osgEarth in JobArena.

JobArena (as a class) owns several instances of JobArena in the static member _arenas that get destroyed when when the shared_ptrs are no longer referenced. In JobArena destructor, there are pieces of code to do thread synchronization.

The issue is that the code is in a DLL, and it interacts badly with static objects (roughly speaking, once main is finished, threads in the DLL are killed rather abruptly by Windows before JobArena destructor run, and mayhem ensures within the thread synchronization code). This page has been very helpful understanding the issue.

The issue has hit many people over time (LDD issue with this on MSYS , LLVM ThreadPool bug, DLIB issue, SO questions 1 2, MS documentation for DLL good practices ...)

I've been able to reproduce the issue with the following example testdll.zip on MSYS (not tested on MSVC, would gladly have someone test it). I'm not sure if MSVC is immune to this or if no one has hit it by sheer luck, because the SO questions have people using MSVC.

My solution, without doing a full redesign to remove the static objects, is on top of the MR I have open,

I'll update the PR asap.