swiss-seismological-service / scrtdd

Double Difference Relocator for SeisComP
27 stars 10 forks source link

segfault while writing output picks xml #56

Closed filefolder closed 2 years ago

filefolder commented 2 years ago

(excuse typo in title *while not whole)

Hi, recently I have found that scrtdd segfaults near the very end of its process, while writing the output xml

code is ran like this: scrtdd --reloc-catalog $csv --ep $xml --profile $profile --author="scrtdd" --verbosity=4 --console=1 -d $conn --xmlout > relocated_picks.$profile.scxml

here's an example of the final xml file, abruptly stopped

      <arrival>
        <pickID>smi:local/pick/220123Kps1YrtU</pickID>
        <phase>S</phase>
        <azimuth>331.3775

It was working fine until somewhat recently, after a few system and seiscomp upgrades. Currently on the latest seiscomp master branch, ubuntu 20.04.

luca-s commented 2 years ago

Let me ask you few details:

I wonder if the issue is with the master branch of seiscomp. That normally works but it is not as stable as the releases branch (currently v4).

However, we can see where the code crashes using the debugger gdb and reporting the output here.

How to run the debugger gdb?

Important: before running the debugger, you need to make sure the debugging information are present in the compiled code, which means you have to set CMAKE_BUILD_TYPE to RelWithDebInfo in ccmake.

filefolder commented 2 years ago

I do suspect that maybe it is an issue with the current seiscomp as scrtdd was working very recently and I just went over a somewhat similar crash issue with boost: https://github.com/SeisComP/main/issues/34

Otherwise I am using the most recent scrtdd here.

I will rebuild with the new CMAKE_BUILD_TYPE set and try again tomorrow.

gempa-jabe commented 2 years ago

Have you recompiled scrtdd after you upgraded SeisComP?

filefolder commented 2 years ago

Have you recompiled scrtdd after you upgraded SeisComP?

I believe so, assuming it compiles with the rest of SC. The scrtdd source is kept in build/extras and the age of the scrtdd binary is the same as all the ones I recompiled yesterday.

edit: here is the relevant output

[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/__/3rd-party/lsqr-lsmr/lsmr.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/__/3rd-party/lsqr-lsmr/lsqr.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/utils.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/solver.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/csvreader.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/catalog.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/sccatalog.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/waveform.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/clustering.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/ttt.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/scttt.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/nllttt.cpp.o
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/CMakeFiles/seiscomp_hdd.dir/hypodd.cpp.o
[ 99%] Linking CXX shared library ../../../../../lib/libseiscomp_hdd.so
[ 99%] Built target seiscomp_hdd
[ 99%] Building CXX object src/extras/scrtdd/libs/hdd/test/CMakeFiles/test_hdd_waveform.dir/waveform.cpp.o
[ 99%] Linking CXX executable ../../../../../../bin/test_hdd_waveform
[ 99%] Built target test_hdd_waveform
[100%] Building CXX object src/extras/scrtdd/libs/hdd/test/CMakeFiles/test_hdd_ellipsoid.dir/ellipsoid.cpp.o
[100%] Linking CXX executable ../../../../../../bin/test_hdd_ellipsoid
[100%] Built target test_hdd_ellipsoid
[100%] Building CXX object src/extras/scrtdd/libs/hdd/test/CMakeFiles/test_hdd_ttt.dir/ttt.cpp.o
[100%] Linking CXX executable ../../../../../../bin/test_hdd_ttt
[100%] Built target test_hdd_ttt
[100%] Building CXX object src/extras/scrtdd/libs/hdd/test/CMakeFiles/test_hdd_clustering.dir/clustering.cpp.o
In file included from /data/seiscomp/SRC/src/extras/scrtdd/libs/hdd/test/clustering.cpp:8:
/data/seiscomp/SRC/src/extras/scrtdd/../../base/common/libs/seiscomp/logging/log.h:26:6: warning: #warning SEISCOMP_COMPONENT not defined -> setting to "Seiscomp" [-Wcpp]
   26 | #    warning SEISCOMP_COMPONENT not defined -> setting to "Seiscomp"
      |      ^~~~~~~
[100%] Linking CXX executable ../../../../../../bin/test_hdd_clustering
[100%] Built target test_hdd_clustering
[100%] Building CXX object src/extras/scrtdd/libs/hdd/test/CMakeFiles/test_hdd_dd.dir/dd.cpp.o
In file included from /data/seiscomp/SRC/src/extras/scrtdd/libs/hdd/test/dd.cpp:10:
/data/seiscomp/SRC/src/extras/scrtdd/../../base/common/libs/seiscomp/logging/log.h:26:6: warning: #warning SEISCOMP_COMPONENT not defined -> setting to "Seiscomp" [-Wcpp]
   26 | #    warning SEISCOMP_COMPONENT not defined -> setting to "Seiscomp"
      |      ^~~~~~~
[100%] Linking CXX executable ../../../../../../bin/test_hdd_dd
[100%] Built target test_hdd_dd
[100%] Building CXX object src/extras/scrtdd/libs/rtddmsg/CMakeFiles/seiscomp_rtddmsg.dir/rtddmsg.cpp.o
[100%] Linking CXX shared library ../../../../../lib/libseiscomp_rtddmsg.so
[100%] Built target seiscomp_rtddmsg
[100%] Building CXX object src/extras/scrtdd/apps/scrtdd/CMakeFiles/scrtdd.dir/app.cpp.o
[100%] Building CXX object src/extras/scrtdd/apps/scrtdd/CMakeFiles/scrtdd.dir/rtdd.cpp.o
[100%] Building CXX object src/extras/scrtdd/apps/scrtdd/CMakeFiles/scrtdd.dir/main.cpp.o
[100%] Linking CXX executable ../../../../../bin/scrtdd
[100%] Built target scrtdd
[100%] Building CXX object src/extras/scrtdd/plugins/locator/rtddloc/CMakeFiles/rtddloc.dir/rtddloc.cpp.o
[100%] Linking CXX shared module ../../../../../../lib/rtddloc.so
[100%] Built target rtddloc
filefolder commented 2 years ago

and here is the gdb output

Thread 1 "scrtdd" received signal SIGSEGV, Segmentation fault.
Seiscomp::DataModel::DatabaseArchive::close (this=this@entry=0x555555682dd0) at /data/seiscomp/SRC/src/base/common/libs/seiscomp/datamodel/databasearchive.cpp:645
645     _db = nullptr;

and backtrace

(gdb) bt
#0  Seiscomp::DataModel::DatabaseArchive::close (this=this@entry=0x555555682dd0) at /data/seiscomp/SRC/src/base/common/libs/seiscomp/datamodel/databasearchive.cpp:645
#1  0x00007ffff769c725 in Seiscomp::DataModel::DatabaseReader::~DatabaseReader (this=0x555555682dd0, __in_chrg=<optimized out>) at /data/seiscomp/SRC/src/base/common/libs/seiscomp/datamodel/databasereader.cpp:52
#2  0x00007ffff76a703d in Seiscomp::DataModel::DatabaseQuery::~DatabaseQuery (this=0x555555682dd0, __in_chrg=<optimized out>) at /data/seiscomp/SRC/src/base/common/libs/seiscomp/datamodel/databasequery.cpp:48
#3  0x00007ffff76dd37a in Seiscomp::Core::BaseObject::decrementReferenceCount (this=<optimized out>) at /data/seiscomp/SRC/src/base/common/libs/seiscomp/core/baseobject.inl:51
#4  Seiscomp::Core::intrusive_ptr_release (p=<optimized out>) at /data/seiscomp/SRC/src/base/common/libs/seiscomp/core/baseobject.inl:32
#5  boost::intrusive_ptr<Seiscomp::DataModel::DatabaseArchive>::~intrusive_ptr (this=0x7fffffffddc0, __in_chrg=<optimized out>) at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:98
#6  Seiscomp::DataModel::PublicObjectCache::~PublicObjectCache (this=0x7fffffffddb0, __in_chrg=<optimized out>) at /data/seiscomp/SRC/src/base/common/libs/seiscomp/datamodel/publicobjectcache.cpp:148
#7  0x000055555559be62 in Seiscomp::DataModel::PublicObjectTimeSpanBuffer::~PublicObjectTimeSpanBuffer (this=0x7fffffffddb0, __in_chrg=<optimized out>)
    at /data/seiscomp/SRC/src/extras/scrtdd/../../base/common/libs/seiscomp3/datamodel/publicobjectcache.h:232
#8  Seiscomp::RTDD::~RTDD (this=0x7fffffffd3b0, __in_chrg=<optimized out>) at /data/seiscomp/SRC/src/extras/scrtdd/apps/scrtdd/rtdd.cpp:365
#9  0x00005555555881a4 in main (argc=<optimized out>, argv=<optimized out>) at /data/seiscomp/SRC/src/extras/scrtdd/apps/scrtdd/main.cpp:28

I hope this helps? Somewhat of interest, but when running gdb i did not pipe the output xml to a file and just let it print, in doing so the xml output seemed complete without cutting off.

edit: here is the final output without piping the xml print to file

.....
  </EventParameters>
</seiscomp>
22:36:18 [info] Disable timer
22:36:18 [debug] Unload plugin 'MySQL database driver'
22:36:18 [debug] Unload plugin 'Locator implementation using NonLinLoc by Anthony Lomax (http://alomax.free.fr/nlloc/)'
22:36:18 [debug] Unload plugin 'MLa magnitude. Calculates magnitude based on universal formulae MLa=c0_log10(Amp)+c1*log10(delta*c3+c4)+c5*(delta+c6), where coefficients c1...6 vary based on epicentral location.'
22:36:18 [debug] Leaving ::done
22:36:18 [info] Unloading profile my_profile
Segmentation fault (core dumped)
$
luca-s commented 2 years ago

So the segfault is happening within seiscomp lib. That doesn't mean scrtdd is not responsible for the bug, though. To restrict my search for the segfault cause, I have to ask you to switch seiscomp to branch v4, recompile and run scrtdd again. If the problem doesn't show up anymore I know where to look for the issue.

Somewhat of interest, but when running gdb i did not pipe the output xml to a file and just let it print, in doing so the xml output seemed complete without cutting off.

From the debugger I can see the segfault happens at the end of everything, after the processing, after the xml file is generated. In the pipe case, the xml file is truncated probably because the application crashes before the xml content is totally written to file, although it was already generated.

filefolder commented 2 years ago

It may be a while before I can get to that, but I can confirm that the code worked in early December and I tend to refresh SC on this machine fairly often. It could not have been older than 4.7.2.

gempa-jabe commented 2 years ago

This might be related to https://github.com/SeisComP/common/pull/49 which now resets everything in Application::done. According to the output posted there is still something unloading after done was called. Can you check your application code? After done all interfaces and plugins are wiped. Accessing pointers still pointing to e.g. the MYSQL interface will result in a segmentation fault.

gempa-jabe commented 2 years ago

@filefolder, you don't need necessarily build against V4. There were some deep changes necessary in how an application is cleaned up and that might break something now with scrtdd. That is not desirable but maybe a good opportunity to fix the clean up procedure.

luca-s commented 2 years ago

@gempa-jabe I will have a look at the scrtdd Application code and let you know

luca-s commented 2 years ago

Looking at the changes of Application::done from branch v4 to master I can see that now a call to System::Application::done() is performed, which in turn calls PluginRegistry::Reset().

Since the database is a plugin based interface, I suspect the new call to System::Application::done() makes the DatabaseArchivePtr, held by PublicObjectTimeSpanBuffer, invalid. If this is true I can simply reset RTDD::_cache member before calling the super class Application::done.

However, when I wrote that code I was not very proficient with seiscomp and I copied the approach from existing code (scwfparam) so I suspect there are other such cases around the code base.

@gempa-jabe thoughts?

gempa-jabe commented 2 years ago

That is true. So the best would be to call PluginRegistry::Reset() in the application's destructor to avoid further pitfalls with other modules. Can you reproduce the crash?

luca-s commented 2 years ago

Can you reproduce the crash?

not yet, I'll let you know soon

gempa-jabe commented 2 years ago

I pushed https://github.com/SeisComP/common/commit/a850e6c95e2e3af15cc9bbb04c4c7dedbb01f9f1 which should fix the issue.

gempa-jabe commented 2 years ago

@filefolder please give it another try.

luca-s commented 2 years ago

@gempa-jabe I cannot replicate the segfault on my side, so I cannot verify my hypothesis

filefolder commented 2 years ago

seems to be working! thanks both for getting to this so quickly. i leave it to you to close