vision-dbms / vision

The master repository for the Vision database system.
https://vision-dbms.com
BSD 3-Clause "New" or "Revised" License

libV.so dynamic library unload (on exit) stack overflow #17

Closed VCommitter closed 7 years ago

VCommitter commented 7 years ago

In shared libraries built with the GNU/Linux ABI, unloading libV.so on process exit triggers memory reclamation of VReferenceable heap objects owned by static instances within libV.so, and that reclamation overflows the stack.

So far this only affects shared libraries built using the GNU/Linux ABI using gcc.

Prior Experience

Reproducer

VApplicationLog has three VString members and is itself a member of the static VTransientServices instance. If any one of those strings is set, the executable that sets it overflows the stack and segfaults on library unload:

% mkdir -p sockets logs
% setenv LogFilePath logs/batchvisionVca.log
% vpool -serverFile=serverFile.txt ./sockets/ epool -logFilePath=./logs/poolLog.txt &
% sleep 1
% vpooladmin -serverFile=serverFile.txt -clientcount
1
Segmentation fault

Backtrace

(gdb) bt 15
#0  0x000000000049c36d in V::VAtomicMemoryOperations_<8u>::interlockedSetIf (pMemory=0x73dbf0 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data+208>, iExpected=0x911788,
    iNew=<error reading variable: Cannot access memory at address 0x7fffff7feff8>) at ../kernel/V_VAtomicMemoryOperations_.h:484
#1  0x000000000049e189 in V::VPointer<V::VAllocatorGranule::MultiThreaded>::interlockedSetIf (this=0x73dbf0 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data+208>, pNew=0x911818, pOld=0x911788)
    at ../kernel/V_VPointer_NRK.h:119
#2  0x000000000049e1d3 in V::VAggregatePointer<V::VAllocatorGranule::MultiThreaded>::interlockedPop (this=0x73dbf0 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data+208>,
    rpFirst=@0x7fffff7ff0f0: 0x0, pLinkMember=&V::VAllocatorGranule::MultiThreaded::m_pNextFree) at ../kernel/V_VAggregatePointer_NRK.h:79
#3  0x000000000049d4e7 in V::VAggregatePointer<V::VAllocatorGranule::MultiThreaded>::interlockedPop (this=0x73dbf0 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data+208>, rpFirst=...,
    pLinkMember=&V::VAllocatorGranule::MultiThreaded::m_pNextFree) at ../kernel/V_VAggregatePointer_NRK.h:90
#4  0x000000000049c534 in V::VAllocatorGranule::MultiThreaded::pop (this=0x73dbf0 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data+208>, rpFirst=...) at ../kernel/V_VAllocator.h:113
#5  0x000000000049c566 in V::VAllocatorGranule::MultiThreaded::allocate (this=0x73dbf0 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data+208>, sCell=72) at ../kernel/V_VAllocator.h:119
#6  0x000000000049e339 in V::VAllocatorGranule_<V::VAllocatorGranule::MultiThreaded>::allocate (this=0x73dbe8 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data+200>, sObject=72)
    at ../kernel/V_VAllocator.h:180
#7  0x000000000049d752 in V::VAllocator<64u, 8ul, 0ul, V::VAllocatorGranule::MultiThreaded>::allocate (this=0x73db20 <V::VAllocatorInstance_<V::VAllocatorGranule::MultiThreaded>::Data>, sObject=64)
    at ../kernel/V_VAllocator.h:334
#8  0x000000000049c79d in V::ThreadModel::Multi::allocate (sObject=64) at ../kernel/VReferenceable.h:328
#9  0x000000000049dc9e in V::VReferenceableImplementation_<V::ThreadModel::Multi>::operator new (sObject=64) at ../kernel/VReferenceable.h:416
#10 0x00007ffff4dbd84f in V::VThread::Here () at ../kernel/V_VThread.cpp:134
#11 0x00007ffff4da83a1 in V::VThread::ReclaimObject (pObject=0x9117d0) at ../kernel/V_VThread.h:170
#12 0x00007ffff4da78d4 in V::ThreadModel::Multi::reclaim (pObject=0x9117d0) at ../kernel/VReferenceable.cpp:46
#13 0x000000000049efd7 in V::VReferenceableImplementation_<V::ThreadModel::Multi>::reclaimThis (this=0x9117d0) at ../kernel/VReferenceable.h:482
#14 0x000000000049eb7a in V::VReferenceableImplementation_<V::ThreadModel::Multi>::release (this=0x9117d0) at ../kernel/VReferenceable.h:512
(More stack frames follow...)
(gdb) bt -15
#182187 0x00007ffff4da83bc in V::VThread::ReclaimObject (pObject=0x76b3d0) at ../kernel/V_VThread.h:170
#182188 0x00007ffff4da78d4 in V::ThreadModel::Multi::reclaim (pObject=0x76b3d0) at ../kernel/VReferenceable.cpp:46
#182189 0x000000000049efd7 in V::VReferenceableImplementation_<V::ThreadModel::Multi>::reclaimThis (this=0x76b3d0) at ../kernel/VReferenceable.h:482
#182190 0x000000000049eb7a in V::VReferenceableImplementation_<V::ThreadModel::Multi>::release (this=0x76b3d0) at ../kernel/VReferenceable.h:512
#182191 0x000000000049e3b0 in VReference<V::VCOS::OwnershipToken>::releaseReferent (this=0x7ffff4fd8ce8 <g_iDefaultTSP+40>) at ../kernel/VReference.h:60
#182192 0x00007ffff4db6fc8 in VReference<V::VCOS::OwnershipToken>::clear (this=0x7ffff4fd8ce8 <g_iDefaultTSP+40>) at ../kernel/VReference.h:175
#182193 0x00007ffff4db688a in V::VCOS::deallocateStorage (this=0x7ffff4fd8ce0 <g_iDefaultTSP+32>) at ../kernel/V_VCOS.cpp:60
#182194 0x000000000049c82d in V::VCOS::~VCOS (this=0x7ffff4fd8ce0 <g_iDefaultTSP+32>, __in_chrg=<optimized out>) at ../kernel/V_VCOS.h:59
#182195 0x000000000049c94c in VString::~VString (this=0x7ffff4fd8cd8 <g_iDefaultTSP+24>, __in_chrg=<optimized out>) at ../kernel/V_VString.h:36
#182196 0x00007ffff4daf197 in V::VApplicationLog::~VApplicationLog (this=0x7ffff4fd8cd0 <g_iDefaultTSP+16>, __in_chrg=<optimized out>) at ../kernel/V_VApplicationLog.h:34
#182197 0x00007ffff4dacb9c in VTransientServices::~VTransientServices (this=0x7ffff4fd8cc0 <g_iDefaultTSP>, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at ../kernel/VTransientServices.cpp:70
#182198 0x00007ffff3d9bdba in __cxa_finalize () from /lib64/libc.so.6
#182199 0x00007ffff4d9eeb3 in __do_global_dtors_aux () from /openfds/home/osvadmin/vision-open-source/software/src/master/src/vpooladmin/../lib/dbg/libV.so
#182200 0x00007fffffffd3c0 in ?? ()
#182201 0x00007ffff7dec85a in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC

VCommitter commented 7 years ago

Yesterday I confirmed that this unload segfault occurs on both the 8.0 and 8.1 versions of libV.

VCommitter commented 7 years ago

Fascinating. The library load order matches the order in which the libraries are listed in the make.llist file (and specified on the gcc linking command line). strace will gladly show the runtime load order:

strace -e trace=open,close vpooladmin -serverFile=serverFile.txt -clientcount
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/tls/x86_64/libVCore.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/tls/libVCore.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/x86_64/libVCore.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVCore.so", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libV.so", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVca.so", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVsa.so", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVcaMain.so", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0

VCommitter commented 7 years ago

gdb can also help examine shared library loads:

(gdb) set stop-on-solib-events 1
(gdb) run -serverFile=serverFile.txt -clientcount
Starting program: /home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/vpooladmin -serverFile=serverFile.txt -clientcount
Stopped due to shared library event (no libraries added or removed)
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.2.x86_64
(gdb) continue
Continuing.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Stopped due to shared library event:
  Inferior loaded /home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVCore.so
    /home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVsa.so
    /home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVcaMain.so
    /home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libVca.so
    /home/osvadmin/vision-open-source/software/builds/8.0.0/Linux_x86_64/bin/../lib/libV.so
    /lib64/libpthread.so.0
    /lib64/libuuid.so.1
    /lib64/libstdc++.so.6
    /lib64/libm.so.6
    /lib64/libgcc_s.so.1
    /lib64/libc.so.6

MichaelJCaruso commented 7 years ago

I'm a bit puzzled. It looks like you only moved 'VThread' to VCore even though VThread depends on a number of other components that remain in V. I trust that it makes the crash go away. I also recognize that the Linux linker is willing to resolve upward and downward, especially given the default 'nux symbol visibility policy of exposing everything. I wonder if this works when compiler and linker options are set to control symbol visibility more tightly (e.g., __declspec). If memory serves, I think your 'online' Linux builds do that. Is that an issue, and have you tried this there?
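
For readers unfamiliar with the visibility question raised here, the following is a minimal sketch of how a library can restrict which symbols it exposes. The macro name V_API and both functions are hypothetical and do not come from the Vision sources. The sketch assumes a GCC-compatible compiler invoked with -fvisibility=hidden on Linux, or MSVC-style __declspec on Windows.

```cpp
#include <cassert>

// Hypothetical export macro; V_API is an illustrative name only.
#if defined(_WIN32)
#  define V_API __declspec(dllexport)
#elif defined(__GNUC__)
#  define V_API __attribute__((visibility("default")))
#else
#  define V_API
#endif

namespace sketch {
    // Annotated: stays visible to other shared objects even when the
    // library is compiled with -fvisibility=hidden.
    V_API int exportedAnswer() { return 42; }

    // Not annotated: hidden under -fvisibility=hidden, so code in another
    // shared object could not resolve this symbol at load time.
    int internalHelper() { return 1; }
}
```

Under a policy like this, a component moved into a lower-level library can only call back into a higher-level library through symbols that the higher-level library explicitly exports, which is the situation the question above is probing.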

VCommitter commented 7 years ago

I only moved VThread to libVCore.so because I tracked down the exact static that was getting recreated in an infinite loop and determined that it was all that needed to move. I almost just put a guard in the code so that the static could never get recreated in a given thread, but I thought a minimal libVCore.so might have utility for controlling static destructor order in the future. And the smallest possible libVCore.so seems like the most useful libVCore.so, so I didn't move anything else.
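
The guard alternative mentioned in this comment was not implemented, so the following is only a sketch of one way such a guard could look. It assumes the guard's job is to stop the fallback path from re-entering itself on the same thread. Every name here (t_inFallback, FallbackScope, reclaimObject, and the counters) is hypothetical, and the counters stand in for real allocation and deallocation.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {
    // Hypothetical per-thread state; none of this is from the Vision sources.
    thread_local bool t_inFallback = false;           // true while a fallback context is live
    thread_local std::size_t t_fallbacksCreated = 0;  // stands in for "new VUnmanagedThread ()"
    thread_local std::size_t t_directReclaims = 0;    // objects freed without a thread context

    // RAII helper that marks the fallback path as active for the current scope.
    struct FallbackScope {
        FallbackScope()  { t_inFallback = true; }
        ~FallbackScope() { t_inFallback = false; }
    };

    // keyAlive models whether the thread-specific key still yields a context.
    void reclaimObject(bool keyAlive) {
        if (keyAlive) return;          // normal path: the per-thread context was found
        if (t_inFallback) {            // re-entered while releasing the fallback context
            ++t_directReclaims;        // free directly instead of creating another context
            return;
        }
        assert(!t_inFallback);
        FallbackScope scope;
        ++t_fallbacksCreated;
        reclaimObject(keyAlive);       // releasing the fallback re-enters reclamation once
    }
}
```

With the key gone, a call to sketch::reclaimObject(false) creates exactly one fallback context and reclaims it directly, rather than recursing without bound.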

Are you suggesting that adding libVCore to the online system would cause it not to build? I haven't tried that because the online system is not on a version of gcc or linux that has the new linux ABI. It would be a fair bit of work to bring OSV back into online just to run that test.

MichaelJCaruso commented 7 years ago

Regarding OSV and online, if it's not an issue, it's not something that needs to be done (at least for now).

Given that your fix solves a problem, that's probably good enough. Still, given what you're telling me about its possible dependence on ABI version, I wonder how stable the fix will be in the OSV world. Sounds like a lot of testing and finger crossing ahead. For example, I'd want to test this on at least Solaris (x86 and sparc) and a few more Linux variants (I'm away from my lab for the weekend, but when I'm back, I can set up some of that).

Beyond these questions, it increasingly seems that the real problem is that there's a lot of stuff attached to VTransientServices that probably doesn't belong there. I realize there was no way to know this, but in its original incarnation, VTransientServices was kind-of/sort-of supposed to abstract some low-level operating system services (not very well, but it's one of the oldest C++ classes in our codebase). The logging and related stuff that's in there now is definitely mission creep. Its fragility testifies to that. Out of curiosity, if you were to comment out the routines and state in transient services having to do with 'VString', how far up the food chain would you have to go before things stop compiling? I'd bet it's Vsa (maybe Vca). I can't help but wonder if we can't move the required functionality up to that level (maybe even a static instance of a VTransientServices subclass could be added there).

Hmmm...

Wonder if it would work?

VCommitter commented 7 years ago

A bit more on what I think is happening here. A VReferenceable is being reclaimed at shared object unload time. In the test case here it's a VString that lives in VApplicationLog that kicks it off. So it gets reclaimed:

void V::ThreadModel::Multi::reclaim (VReferenceableBase *pObject) {
    VThread::ReclaimObject (pObject);
}

VThread is trying to do the reclamation:

        static void ReclaimObject (VReferenceableBase *pObject) {
            Here ()->reclaimObject (pObject);
        }

But the Here member function is causing a problem:

V::VThread::Reference V::VThread::Here () {
    BaseClass::Reference pSpecific; Reference pThisInstance;
    if (g_iTLSKey.getSpecific (pSpecific) && pSpecific.isntNil ())
        pThisInstance.setTo (static_cast<ThisClass*>(pSpecific.referent ()));
    else
        pThisInstance.setTo (new VUnmanagedThread ());

    return pThisInstance;
}

I believe (and this is the theory bit) the g_iTLSKey.getSpecific() is always Nil because some of the VThread statics, specifically V::VThreadSpecific::Key const V::VThread::g_iTLSKey;, have already been destroyed.

So in order to delete the VString you create a new VUnmanagedThread. This would be fine, and the VString would be properly destroyed, except that the VUnmanagedThread is also a VReferenceable that must be destroyed when VThread::ReclaimObject exits. Because VUnmanagedThread is a VReferenceable, it needs another VUnmanagedThread in order to be destroyed (g_iTLSKey is still gone). Now you have an infinite recursion.

Destroying the VUnmanagedThread you just created involves destroying its inherited VReferenceable, which creates another VUnmanagedThread. You continue to allocate another VUnmanagedThread to delete the prior VUnmanagedThread until you overflow the stack and die with a segfault.
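
The recursion described above can be modelled in a few lines. This is a toy model of the theory, not the Vision code: State, keyAlive, contextsCreated, and depthLimit are hypothetical names, and the depth limit exists only so the model stops instead of overflowing a real stack.

```cpp
#include <cassert>
#include <cstddef>

namespace model {
    // Hypothetical state; none of these names come from the Vision sources.
    struct State {
        bool keyAlive;                // models whether g_iTLSKey still yields a context
        std::size_t contextsCreated;  // models calls to "new VUnmanagedThread ()"
        std::size_t depthLimit;       // artificial stop; the real code has no such check
    };

    // Models VThread::ReclaimObject: look up a context and use it to reclaim.
    // Returns false if the artificial depth limit is hit, which corresponds
    // to the stack overflow in the real process.
    bool reclaimObject(State& s, std::size_t depth = 0) {
        assert(depth <= s.depthLimit);
        if (depth == s.depthLimit) return false;
        if (s.keyAlive) return true;  // existing context found, nothing new allocated
        ++s.contextsCreated;          // the fallback context is created here
        // The fallback context is itself reference counted, so releasing it
        // goes through reclamation again, one level deeper.
        return reclaimObject(s, depth + 1);
    }
}
```

With keyAlive set to true the model returns immediately and creates no contexts. With keyAlive set to false it creates one context per level until the limit is reached, which matches the shape of the 182,000-frame backtrace.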

Putting g_iTLSKey into a lower level shared object, VCore, causes it to be kept around while all other statics that inherit from VReferenceable are deleted and the executable exits without an error.
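
As a rough illustration of why the move helps, the sketch below models teardown as running in the reverse of construction order, which is the rule C++ applies to objects with static storage duration within one program. It assumes that libVCore.so, being loaded first, has its statics constructed first and therefore destroyed last. ExitRegistry and the labels passed to it are hypothetical and only name the objects discussed in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace model {
    // Toy registry, not the runtime's real mechanism: records construction
    // order and reports the order in which destructors would run.
    class ExitRegistry {
    public:
        // Record that a static object finished construction.
        void constructed(const std::string& name) {
            assert(!name.empty());
            m_names.push_back(name);
        }

        // Return the destruction order (reverse of construction) and reset.
        std::vector<std::string> runDestructors() {
            std::vector<std::string> order(m_names.rbegin(), m_names.rend());
            m_names.clear();
            return order;
        }

    private:
        std::vector<std::string> m_names;
    };
}
```

Registering "VCore:g_iTLSKey" before "V:g_iDefaultTSP" yields a destruction order in which g_iDefaultTSP goes first, so in this model the key is still alive while the VString members of the transient services object are being reclaimed.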

VCommitter commented 7 years ago

23 fixes this problem in all branches.