lcfiplus / LCFIPlus

Flavor tagging code for ILC detectors
https://confluence.slac.stanford.edu/display/ilc/LCFIPlus
GNU General Public License v3.0

JetVertexRefiner #19

Closed bogdanmishchenko closed 7 years ago

bogdanmishchenko commented 7 years ago

Dear LCFIPlus developers,

I have encountered a memory allocation (malloc) error when running JetVertexRefiner (I used the ilcsoft release /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/v01-19-01/). The steering file I used and the log file are attached in a zip archive: Files.zip

jstrube commented 7 years ago

Thank you for reporting this, and for attaching the steering file. I need a bit more info, though. What are the files you are running over? What's the physics process, what's the detector model that was used?

protopopescu commented 7 years ago

The input file was simulated and reconstructed using SiD_o2_v02, from single_b_jets_200GeV.slcio.

bogdanmishchenko commented 7 years ago

We were able to run two steps:

1. Vertexing (I used vertex.xml for DST production); vertexing works fine (steps 1-2).

2. Jet clustering (I used jetclustering.xml, for non-flavor-tag applications); jet clustering works fine (steps 3-4).

The malloc error occurred only after running JetVertexRefiner.

jstrube commented 7 years ago

Sorry, still stuck at the simulation stage: Is this what you're doing?

source /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/v01-19-02/init_ilcsoft.sh
source ../lcgeo/bin/thislcgeo.sh
ddsim -N=5 --compactFile=../lcgeo/SiD/compact/SiD_o2_v02/SiD_o2_v02.xml --runType=batch --inputFile=E250-TDR_ws.Pn2n2h_bb.Gwhizard-1_95.eL.pR.I109040.0001.slcio --outputFile=bb_sim_5_events.slcio

I am getting the error message

cling::DynamicLibraryManager::loadLibrary(): dlopen: cannot load any more object with static TLS
cling::DynamicLibraryManager::loadLibrary(): dlopen: cannot load any more object with static TLS
+--------------------------------------------------------------------------------------------------------+
|  Failed to load DDG4 library:                                                                          |
|  DDG4.py: Failed to load the DDG4 library libDDG4Plugins: No such file or directory                    |
+--------------------------------------------------------------------------------------------------------+

I've checked that LD_LIBRARY_PATH contains a path with the file libDDG4Plugins.so, so I'm not sure what the problem is here.

bogdanmishchenko commented 7 years ago

Sourcing ilcsoft and thislcgeo seems fine. However, I am not sure that the ddsim command works with the options in that order (I usually use: ddsim --compactFile=<.xml file> --runType= --inputFile= -N=<number of events> --outputFile=).

jstrube commented 7 years ago

Thanks for the quick reply. I tried that, but I get the same error message. Are there other environment variables that I am missing? I've also tried loading the library from within ROOT, and that works:

$ root
   ------------------------------------------------------------
  | Welcome to ROOT 6.08/02                http://root.cern.ch |
  |                               (c) 1995-2016, The ROOT Team |
  | Built for linuxx8664gcc                                    |
  | From tag v6-08-02, 2 December 2016                         |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q' |
   ------------------------------------------------------------

root [0] gSystem->Load("libDDG4Plugins")
(int) 0

However, from python:

$ python
Python 2.7.10 (default, Mar 10 2016, 14:55:16)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> ROOT.gSystem.Load("libDDG4Plugins")
cling::DynamicLibraryManager::loadLibrary(): dlopen: cannot load any more object with static TLS
-1

So it looks like this is a ROOT issue. Not sure why that's not a problem at CERN. Maybe they use a different dlopen?

bogdanmishchenko commented 7 years ago

That might not be the case. However, I used the following cmake command:

cmake -DCMAKE_CXX_COMPILER=$(which g++) -DCMAKE_C_COMPILER=$(which gcc) -DILCUTIL_DIR=/cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/v01-17-19/ilcutil/v01-03/ -C $ILCSOFT/ILCSoft.cmake ..

It might also be due to the recent changes to lcgeo on GitHub.

jstrube commented 7 years ago

Well, the library libDDG4Plugins.so lives in /cvmfs, so I didn't compile it. dlopen comes from a system library, so I'll try on a different machine...

jstrube commented 7 years ago

OK, I got further on a KEK machine, but ddsim expects an MCParticle list name "MCParticle". How do I tell it that my list has a different name?

aidanrobson commented 7 years ago

Hi Jan, not sure offhand, but if you take, for example, the file you sent us (single_b_jets_200GeV.stdhep) and convert it with stdhepjob, then the collection should be named MCParticle anyway:

stdhepjob single_b_jets_200GeV.stdhep single_b_jets_200GeV.slcio -1

ddsim --compactFile=./lcgeo/SiD/compact/SiD_o2_v02/SiD_o2_v02.xml --runType=batch --inputFile=single_b_jets_50GeV.slcio -N=10 --outputFile=single_b_jets_50GeV_sim.slcio

protopopescu commented 7 years ago

Jan, here's my LCFIPlus testing sequence https://www.evernote.com/l/AJ0XEvoXDC9F45SB-bRI2pYDFKvdHcqDqVU

jstrube commented 7 years ago

Thank you. The instructions from @protopopescu are very helpful. I am now able to reproduce the crash. Looking into it...

jstrube commented 7 years ago

I re-compiled LCFIPlus with -g and ran gdb.

(gdb) where
#0  0x00007ffff53ed625 in raise () from /lib64/libc.so.6
#1  0x00007ffff53eee05 in abort () from /lib64/libc.so.6
#2  0x00007ffff542b537 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff5430f4e in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff5435528 in _int_malloc () from /lib64/libc.so.6
#5  0x00007ffff5435b1c in malloc () from /lib64/libc.so.6
#6  0x00007fffd5f8dbe1 in ROOT::Minuit2::Numerical2PGradientCalculator::operator()(ROOT::Minuit2::MinimumParameters const&, ROOT::Minuit2::FunctionGradient const&) const ()
    at /scratch/cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/build-6.08.02/include/Minuit2/StackAllocator.h:97
#7  0x00007fffd5f99977 in ROOT::Minuit2::VariableMetricBuilder::Minimum(ROOT::Minuit2::MnFcn const&, ROOT::Minuit2::GradientCalculator const&, ROOT::Minuit2::MinimumSeed const&, std::vector<ROOT::Minuit2::MinimumState, std::allocator<ROOT::Minuit2::MinimumState> >&, unsigned int, double) const () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/VariableMetricBuilder.cxx:350
#8  0x00007fffd5f9c402 in ROOT::Minuit2::VariableMetricBuilder::Minimum(ROOT::Minuit2::MnFcn const&, ROOT::Minuit2::GradientCalculator const&, ROOT::Minuit2::MinimumSeed const&, ROOT::Minuit2::MnStrategy const&, unsigned int, double) const () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/VariableMetricBuilder.cxx:124
#9  0x00007fffd5f8ac5c in ROOT::Minuit2::ModularFunctionMinimizer::Minimize(ROOT::Minuit2::MnFcn const&, ROOT::Minuit2::GradientCalculator const&, ROOT::Minuit2::MinimumSeed const&, ROOT::Minuit2::MnStrategy const&, unsigned int, double) const () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/ModularFunctionMinimizer.cxx:166
#10 0x00007fffd5f89360 in ROOT::Minuit2::ModularFunctionMinimizer::Minimize(ROOT::Minuit2::FCNBase const&, ROOT::Minuit2::MnUserParameterState const&, ROOT::Minuit2::MnStrategy const&, unsigned int, double) const ()
    at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/ModularFunctionMinimizer.cxx:120
#11 0x00007fffd5f4aecc in ROOT::Minuit2::Minuit2Minimizer::Minimize() () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/Minuit2Minimizer.cxx:504
#12 0x00007fffd628caf7 in lcfiplus::Helix::LogLikelihood(TVector3 const&, double&) const () at /home/ilc/jstrube/ILC/work/LCFIPlus/src/geometry.cc:328
#13 0x00007fffd629323e in lcfiplus::Helix::LogLikelihood(TVector3 const&) const () at /home/ilc/jstrube/ILC/work/LCFIPlus/./include/geometry.h:120
#14 0x00007fffd628ffbc in ROOT::Math::FunctorHandler<ROOT::Math::Functor, lcfiplus::GeometryHandler::PointFitFunctor>::DoEval(double const*) const () at /home/ilc/jstrube/ILC/work/LCFIPlus/./include/geometry.h:234
#15 0x00007fffd5f81e94 in ROOT::Minuit2::MnUserFcn::operator()(ROOT::Minuit2::LAVector const&) const () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/MnUserFcn.cxx:42
#16 0x00007fffd5f7a31b in ROOT::Minuit2::MnSeedGenerator::operator()(ROOT::Minuit2::MnFcn const&, ROOT::Minuit2::GradientCalculator const&, ROOT::Minuit2::MnUserParameterState const&, ROOT::Minuit2::MnStrategy const&) const
    () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/MnSeedGenerator.cxx:66
#17 0x00007fffd5f89332 in ROOT::Minuit2::ModularFunctionMinimizer::Minimize(ROOT::Minuit2::FCNBase const&, ROOT::Minuit2::MnUserParameterState const&, ROOT::Minuit2::MnStrategy const&, unsigned int, double) const ()
    at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/ModularFunctionMinimizer.cxx:118
#18 0x00007fffd5f88ceb in ROOT::Minuit2::ModularFunctionMinimizer::Minimize(ROOT::Minuit2::FCNBase const&, ROOT::Minuit2::MnUserParameters const&, ROOT::Minuit2::MnStrategy const&, unsigned int, double) const ()
    at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/root/6.08.02/math/minuit2/src/ModularFunctionMinimizer.cxx:80
#19 0x00007fffd62898d7 in lcfiplus::GeometryHandler::PointFit(std::vector<lcfiplus::PointBase*, std::allocator<lcfiplus::PointBase*> > const&, TVector3 const&, lcfiplus::Point*) ()
    at /home/ilc/jstrube/ILC/work/LCFIPlus/src/geometry.cc:1316
#20 0x00007fffd6307e24 in lcfiplus::VertexFitterSimple<std::_List_iterator<lcfiplus::Track const*> >::operator()(std::_List_iterator<lcfiplus::Track const*>, std::_List_iterator<lcfiplus::Track const*>, lcfiplus::Vertex*, bool) () at /home/ilc/jstrube/ILC/work/LCFIPlus/./include/VertexFitterSimple.h:35
#21 0x00007fffd6306c88 in lcfiplus::findPrimaryVertex(std::vector<lcfiplus::Track const*, std::allocator<lcfiplus::Track const*> > const&, double, bool, bool) ()
    at /home/ilc/jstrube/ILC/work/LCFIPlus/./include/VertexFinderTearDown.h:49
#22 0x00007fffd625acca in lcfiplus::PrimaryVertexFinder::process() () at /home/ilc/jstrube/ILC/work/LCFIPlus/src/process.cc:81
#23 0x00007fffd62729f2 in LcfiplusProcessor::processEvent(EVENT::LCEvent*) () at /home/ilc/jstrube/ILC/work/LCFIPlus/src/LcfiplusProcessor.cc:234
#24 0x00007ffff7b9705c in marlin::ProcessorMgr::processEvent(EVENT::LCEvent*) () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/v01-19-02/Marlin/v01-11/source/src/ProcessorMgr.cc:468
#25 0x00007ffff75133ed in SIO::SIOReader::readStream(int) () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/v01-19-02/lcio/v02-08/src/cpp/src/SIO/SIOReader.cc:732
#26 0x0000000000412a57 in main () at /cvmfs/ilc.desy.de/sw/x86_64_gcc49_sl6/v01-19-02/Marlin/v01-11/source/src/Marlin.cc:499

Not quite sure yet what's going on, but I'll keep digging.

jstrube commented 7 years ago

@suehara Have you seen something like this before? If you don't have time to look at this yourself right now, could you point us in the right direction?

andresailer commented 7 years ago

Could you please try

export MALLOC_CHECK_=3

And then rerun Marlin and post the error message and stacktrace if there is one? Thanks

protopopescu commented 7 years ago

OK, here's the error with MALLOC_CHECK_=3; it now says: Marlin: free(): invalid pointer: 0x0000000003c0a710 (stack.txt)

andresailer commented 7 years ago

Thanks! That is the same error that we see.

protopopescu commented 7 years ago

I've narrowed it down to a crash in Minimize() in geometry.cc (both options). I am trying to understand whether it crashes because there's nothing to minimize, or because of an intrinsic ROOT Minimize() issue. With MALLOC_CHECK_=1 the code sometimes runs without crashing. Still digging ...

protopopescu commented 7 years ago

It seems that the crash in PointFit() is caused by the fact that points[i], where i>0, are unusable.

So, to summarise: the algorithms work fine for the first event, but then the VertexRefiner somehow deletes or overwrites something, so that in the second event the points[i] passed from VertexFitterSimple to PointFit are junk or unusable for i>0.

jstrube commented 7 years ago

So that means all points are junk? And in that case the size of the array should have been set to 0?

andresailer commented 7 years ago

@protopopescu, Nacho, could you please test the changes from #21?

jstrube commented 7 years ago

@sailer Many thanks for the pull request. I tested it and it looks good. I'd like an OK from another developer before merging, since I haven't used LCFIPlus personally in a while. @protopopescu @bogdanmishchenko You can check it like this (from an existing LCFIPlus clone):

git checkout -b andresailer-fixParameters master
git pull https://github.com/andresailer/LCFIPlus.git fixParameters

protopopescu commented 7 years ago

I can confirm that replacing _map(ref._map) with _map() fixes the crash. Thanks for the fix, Andre!
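
For readers wondering why an initializer change like this can matter, here is a minimal sketch of one plausible mechanism. It assumes a container that owns heap-allocated values through raw pointers; the thread does not show the actual LCFIPlus class, so `ParamStore` and its members are hypothetical and are not the real code. In such a class, initializing the member with `_map(ref._map)` copies only the pointers, so two objects end up deleting the same allocations, which would be consistent with the "free(): invalid pointer" message reported above. Starting from an empty `_map()` and copying each value avoids that.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a parameter container that owns its values
// through raw pointers. This is NOT the actual LCFIPlus class.
class ParamStore {
 public:
  ParamStore() : _map() {}

  // Buggy variant, shown only as a comment:
  //   ParamStore(const ParamStore& ref) : _map(ref._map) {}
  // It would copy the raw pointers, so both objects would delete the
  // same allocations in their destructors.
  ParamStore(const ParamStore& ref) : _map() {
    for (const auto& kv : ref._map) {
      _map[kv.first] = new double(*kv.second);  // deep copy of each value
    }
  }

  ParamStore& operator=(const ParamStore& ref) {
    if (this != &ref) {
      ParamStore tmp(ref);   // deep copy first
      _map.swap(tmp._map);   // tmp's destructor then frees the old values
    }
    return *this;
  }

  ~ParamStore() {
    for (auto& kv : _map) delete kv.second;
  }

  // Store a value, reusing the existing allocation if the key is present.
  void set(const std::string& key, double value) {
    auto it = _map.find(key);
    if (it == _map.end()) {
      _map[key] = new double(value);
    } else {
      *it->second = value;
    }
  }

  // Throws std::out_of_range if the key is missing.
  double get(const std::string& key) const { return *_map.at(key); }

  // Exposed only so that a caller can check that copies do not share storage.
  const double* address(const std::string& key) const { return _map.at(key); }

 private:
  std::map<std::string, double*> _map;
};
```

With this version, a copy made via `ParamStore b(a);` holds its own allocations, so destroying `a` and `b` in either order frees each value exactly once.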

nachogargar commented 7 years ago

@jstrube I can also confirm that the fix from André solves the issue. LCFIPlus is working properly both locally and on the Grid. Thanks André!