iLCSoft / Marlin

Modular Analysis and Reconstruction for the LINear Collider
GNU General Public License v3.0

Segmentation fault at the end of Marlin processor running #42

Open yradkhorrami opened 3 years ago

yradkhorrami commented 3 years ago

........
[ MESSAGE "Marlin"] ---------------------------------------------------------
[ MESSAGE "Marlin"] Events skipped by processors :
[ MESSAGE "Marlin"] Total: 0
[ MESSAGE "Marlin"] ---------------------------------------------------------
[ MESSAGE "Marlin"]
[ MESSAGE "Marlin"] ---------------------------------------------------------
[ MESSAGE "Marlin"] Time used by processors ( in processEvent() ) :
[ MESSAGE "Marlin"]
[ MESSAGE "Marlin"] MyIsolatedLeptonTaggingProcess 8.300000e-01 s in 998 events ==> 8.316633e-04 [ s/evt.]
[ MESSAGE "Marlin"] Total: 8.300000e-01 s in 998 events ==> 8.316633e-04 [ s/evt.]
[ MESSAGE "Marlin"] ---------------------------------------------------------
Segmentation fault

dudarboh commented 3 years ago

I have tried it with my local analysis processor and with the MyRefitProcessorProton processor from the ILDConfig production, and I couldn't reproduce the segmentation fault.

Could you share the processor code that reproduces it?

I think I have seen this behavior before, although I don't remember exactly how I fixed it... My guess would be that it has something to do with the Processor destructor. Seeing the code would help.

yradkhorrami commented 3 years ago

I'm just using the IsolatedLeptonTaggingProcessor that is centrally installed on cvmfs. The steering file is attached (just rename .xml.txt -> .xml): SLDCorrection.xml.txt

dudarboh commented 3 years ago

Very interesting..

I have tried on the NAF with source /cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/init_ilcsoft.sh and then Marlin ./SLDCorrection.xml. The output shows no segmentation fault at the end:

[ MESSAGE "MyIsolatedLeptonTaggingProcessor"] -------------------------------------------------
[ MESSAGE "Marlin"]  --------------------------------------------------------- 
[ MESSAGE "Marlin"]   Events skipped by processors : 
[ MESSAGE "Marlin"]   Total: 0
[ MESSAGE "Marlin"]  --------------------------------------------------------- 
[ MESSAGE "Marlin"] 
[ MESSAGE "Marlin"]  --------------------------------------------------------- 
[ MESSAGE "Marlin"]       Time used by processors ( in processEvent() ) :      
[ MESSAGE "Marlin"] 
[ MESSAGE "Marlin"] MyIsolatedLeptonTaggingProcess       7.000000e-01 s in          998 events  ==> 7.014028e-04 [ s/evt.] 
[ MESSAGE "Marlin"]             Total:                   7.000000e-01 s in          998 events  ==> 7.014028e-04 [ s/evt.] 
[ MESSAGE "Marlin"]  --------------------------------------------------------- 

dudarboh commented 3 years ago

I could reproduce the problem by adding my custom /afs/desy.de/user/d/dudarboh/iLCSoft/MarlinUtil/lib/libMarlinUtilNew.so to $MARLIN_DLL. The segmentation fault then appears at the end, as described above.

@yradkhorrami could you share your output of echo $MARLIN_DLL to check if it has any potential processor/library duplicates?
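A quick way to look for such duplicates is sketched below. It is only an illustration: the helper name find_duplicate_libs is made up here, and it assumes MARLIN_DLL is a colon-separated list of library paths, as in the output of echo $MARLIN_DLL. It compares file names only, so a library that was copied and renamed (like libMarlinUtilNew.so above) would not be flagged.

```python
import os
from collections import defaultdict


def find_duplicate_libs(marlin_dll: str) -> dict:
    """Map each library file name that occurs more than once to all its paths.

    Hypothetical helper: it only compares file names, not library contents.
    """
    by_name = defaultdict(list)
    for path in marlin_dll.split(":"):
        if path:  # skip empty entries caused by stray colons
            by_name[os.path.basename(path)].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    dupes = find_duplicate_libs(os.environ.get("MARLIN_DLL", ""))
    for name, paths in dupes.items():
        print(name, "->", ", ".join(paths))
```

If nothing is printed, no file name appears twice in MARLIN_DLL.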

My guess would be that this happens when marlin::Processor::~Processor() tries to clean up the Processor parameters.

Although I am a bit puzzled, as my libMarlinUtilNew.so is not really a processor at all and I renamed the library...

Here is relevant part of valgrind output:

. . .
==16765== Invalid read of size 8
==16765==    at 0x4E8D8F0: marlin::Processor::~Processor() (in /cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Marlin/v01-17-01/lib/libMarlin.so.1.17.1)
==16765==    by 0x7216CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==16765==    by 0x7216D36: exit (in /usr/lib64/libc-2.17.so)
==16765==    by 0x71FF55B: (below main) (in /usr/lib64/libc-2.17.so)
==16765==  Address 0x2bb250d0 is 1,680 bytes inside an unallocated block of size 1,696 in arena "client"
==16765== 
==16765== Invalid read of size 8
==16765==    at 0x4E8D8F9: marlin::Processor::~Processor() (in /cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Marlin/v01-17-01/lib/libMarlin.so.1.17.1)
==16765==    by 0x7216CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==16765==    by 0x7216D36: exit (in /usr/lib64/libc-2.17.so)
==16765==    by 0x71FF55B: (below main) (in /usr/lib64/libc-2.17.so)
==16765==  Address 0x2bb24fd0 is 1,424 bytes inside an unallocated block of size 1,696 in arena "client"
==16765== 
==16765== Jump to the invalid address stated on the next line
==16765==    at 0x0: ???
==16765==    by 0x4E8D8FE: marlin::Processor::~Processor() (in /cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Marlin/v01-17-01/lib/libMarlin.so.1.17.1)
==16765==    by 0x7216CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==16765==    by 0x7216D36: exit (in /usr/lib64/libc-2.17.so)
==16765==    by 0x71FF55B: (below main) (in /usr/lib64/libc-2.17.so)
==16765==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==16765== 
==16765== 
==16765== Process terminating with default action of signal 11 (SIGSEGV)
==16765==  Bad permissions for mapped region at address 0x0
==16765==    at 0x0: ???
==16765==    by 0x4E8D8FE: marlin::Processor::~Processor() (in /cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Marlin/v01-17-01/lib/libMarlin.so.1.17.1)
==16765==    by 0x7216CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==16765==    by 0x7216D36: exit (in /usr/lib64/libc-2.17.so)
==16765==    by 0x71FF55B: (below main) (in /usr/lib64/libc-2.17.so)
==16765== 
. . . 

Maybe running it with debug symbols can give more info, although I would need to manually rebuild Marlin from scratch then..

Maybe @tmadlener, @gaede have a better explanation and potential fix in mind?

yradkhorrami commented 3 years ago

@dudarboh, I just looked at my Marlin libraries and found an interesting point: before including a local Marlin library there is no problem, and the Marlin job finishes without any segmentation fault. As soon as I add some of my local Marlin libraries, the segmentation fault appears at the end. I checked which libraries cause the issue and found that they are the ones that had been compiled using previous versions of iLCSoft (gcc, ...). After recompiling the same processor with the latest version, the segmentation fault does not appear. The output of echo $MARLIN_DLL is:

/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/MarlinDD4hep/v00-06/lib/libMarlinDD4hep.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/DDMarlinPandora/v00-11/lib/libDDMarlinPandora.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/MarlinReco/v01-31/lib/libMarlinReco.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/PandoraAnalysis/v02-00-01/lib/libPandoraAnalysis.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/LCFIVertex/v00-08/lib/libLCFIVertexProcessors.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/CEDViewer/v01-17-01/lib/libCEDViewer.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Overlay/v00-22-02/lib/libOverlay.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/MarlinFastJet/v00-05-02/lib/libMarlinFastJet.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/LCTuple/v01-12/lib/libLCTuple.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/MarlinKinfit/v00-06/lib/libMarlinKinfit.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/MarlinTrkProcessors/v02-11/lib/libMarlinTrkProcessors.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/MarlinKinfitProcessors/v00-04-02/lib/libMarlinKinfitProcessors.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/ILDPerformance/v01-10/lib/libILDPerformance.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Clupatra/v01-03/lib/libClupatra.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Physsim/v00-04-01/lib/libPhyssim.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/LCFIPlus/v00-09/lib/libLCFIPlus.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/FCalClusterer/v01-00-01/lib/libFCalClusterer.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/ForwardTracking/v01-14/lib/libForwardTracking.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/ConformalTracking/v01-10/lib/libConformalTracking.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/LICH/v00-01/lib/libLICH.so:/cvmfs/ilc.desy.de/sw/x86_64_gcc82_centos7/v02-02-02/Garlic/v03-01/lib/libGarlic.so:/afs/desy.de/group/flc/pool/radkhory/HdecayMode/lib/libHdecayMode.so:/afs/desy.de/group/flc/pool/radkhory/SLDecayCorrection/lib/libSLDecayCorrection.so

Of these, HdecayMode is the one that caused the issue.

dudarboh commented 2 years ago

Julie @Torndal recently also encountered this problem, so I want to throw in my 5 cents again.

Basically, I want to confirm @yradkhorrami's observations from the previous post. I encountered this segmentation fault at the end only when MARLIN_DLL contained libraries that were compiled with a previous version of iLCSoft.

I tried to debug it a bit with gdb, thanks to @tmadlener, but execution went far beyond return 0; in main() and crashed somewhere in a std::string destructor...

I think recompiling the processor against the same iLCSoft version as all the other libraries should fix it.
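To find candidates for recompilation, one option is to compare the compiler identification strings embedded in each library. The sketch below is hypothetical (compiler_tags and group_by_compiler are names invented for this example) and assumes that GCC left a string such as "GCC: (GNU) 8.2.0" in the .comment section of each shared object, which is the usual default but can be stripped. A library may carry more than one tag, and a matching compiler does not prove that the rest of the iLCSoft release matches, so treat the result as a hint only.

```python
import os
import re

# GCC normally records a string like "GCC: (GNU) 8.2.0" in the ELF .comment section.
GCC_TAG = re.compile(rb"GCC: \([^)]*\) [0-9][0-9.]*")


def compiler_tags(lib_path: str) -> set:
    """Return the distinct GCC identification strings found in a binary file."""
    with open(lib_path, "rb") as f:
        data = f.read()
    return {m.decode("ascii", "replace") for m in GCC_TAG.findall(data)}


def group_by_compiler(marlin_dll: str) -> dict:
    """Group the libraries listed in a MARLIN_DLL-style string by their GCC tags."""
    groups = {}
    for path in filter(None, marlin_dll.split(":")):
        try:
            tags = compiler_tags(path)
        except OSError:
            tags = {"<unreadable>"}
        key = tuple(sorted(tags)) or ("<no GCC tag found>",)
        groups.setdefault(key, []).append(path)
    return groups


if __name__ == "__main__":
    for tags, libs in group_by_compiler(os.environ.get("MARLIN_DLL", "")).items():
        print(", ".join(tags))
        for lib in libs:
            print("   ", lib)
```

Libraries that end up in a different group from the centrally installed ones are the first ones worth rebuilding.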