root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern
Other
2.68k stars 1.27k forks source link

`TROOT::EndOfProcessCleanups` fails when using TCMalloc on different destructors #13429

Closed aandvalenzuela closed 1 year ago

aandvalenzuela commented 1 year ago

Check duplicate issues.

Description

Hello,

Since we moved to TCMalloc (instead of JeMalloc) for cmsRun we have seen some failures in TROOT::EndOfProcessCleanups upon destruction of TString (and some other destructors) affecting CMSSW ROOT6 (version 6.29.01) and ROOT628 (version 6.28.05) IBs (el8_amd64_gcc11).

We reported the issue in the cmssw repo at #42468.

Reproducer

The errors can be reproduced on lxplus as follows:

lxplus816:~> cmssw-el8
Singularity> cd /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-07-2300/src/
Singularity> cmsenv
Singularity> cd /tmp/avalenzu/
Singularity> cmsRun /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-07-2300/src/Alignment/OfflineValidation/test/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T

And for gdb:

Singularity> gdb cmsRun
(gdb) run /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-07-2300/src/Alignment/OfflineValidation/test/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T

A sample stacktrace upon destruction of the TString:

Thread 1 (Thread 0x7ffff413fc80 (LWP 3407688) "cmsRun"):
#0  tcmalloc::SLL_PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=32, head=0x45a0a0) at src/linked_list.h:88
#1  tcmalloc::SLL_PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=32, head=0x45a0a0) at src/linked_list.h:79
#2  tcmalloc::ThreadCache::FreeList::PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=32, this=0x45a0a0) at src/thread_cache.h:238
#3  tcmalloc::ThreadCache::ReleaseToCentralCache (this=this@entry=0x45a040, src=src@entry=0x45a0a0, cl=<optimized out>, N=N@entry=32) at src/thread_cache.cc:206
#4  0x00007ffff57dff2c in tcmalloc::ThreadCache::ListTooLong (this=0x45a040, list=0x45a0a0, cl=<optimized out>) at src/thread_cache.cc:164
#5  0x00007ffff6dc2465 in TString::UnLink (this=0xcbb820) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/base/inc/TString.h:265
--Type <RET> for more, q to quit, c to continue without paging--c
#6  0x00007ffff68a2266 in TString::~TString (this=0xcbb820, __in_chrg=<optimized out>) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/base/src/TString.cxx:249
#7  0x00007ffff68ac37f in TStyle::~TStyle (this=0xcbb500, __in_chrg=<optimized out>) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/base/src/TStyle.cxx:478
#8  0x00007ffff68ac46c in TStyle::~TStyle (this=0xcbb500, __in_chrg=<optimized out>) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/base/src/TStyle.cxx:483
#9  0x00007ffff68fa74e in TCollection::GarbageCollect (obj=0xcbb500) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/cont/src/TCollection.cxx:736
#10 0x00007ffff6902c6f in TList::Delete (this=0xcb8070, option=0x7ffff69efd24 "") at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/cont/src/TList.cxx:537
#11 0x00007ffff689635c in TROOT::EndOfProcessCleanups (this=0x7ffff6b34040 <ROOT::Internal::GetROOT1()::alloc>) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/base/src/TROOT.cxx:1235
#12 0x00007ffff682ef46 in CallEndOfProcessCleanups () at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.29.01-192f857c68d2f0a1a8cb821f03f5a854/root-6.29.01/core/base/src/TApplication.cxx:90
#13 0x00007ffff4c7026c in __run_exit_handlers () from /lib64/libc.so.6
#14 0x00007ffff4c703a0 in exit () from /lib64/libc.so.6
#15 0x00007ffff4c59d8c in __libc_start_main () from /lib64/libc.so.6
#16 0x000000000040803e in _start ()

See https://github.com/cms-sw/cmssw/issues/42468#issuecomment-1671897314 for other destructors failing on cleanup.

It runs fine when using cmsRunJE instead (for JeMalloc allocation type): cmsRunJE /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-07-2300/src/AlignmentunitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T.

ROOT version

CMSSW ROOT6 (version 6.29.01) and ROOT628 (version 6.28.05) IBs.

Installation method

from source

Operating system

el8

Additional context

FYI, @smuzaffar @makortel Reported to ROOT as discussed in Core Software meeting

hahnjo commented 1 year ago

Hi @aandvalenzuela, unfortunately I am not allowed to run the reproducer:

Singularity> cmsRun /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-07-2300/src/Alignment/OfflineValidation/test/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T
adding the max pT filter
16-Aug-2023 08:33:46 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
16-Aug-2023 08:33:47 CEST  Initiating request to open file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
----- Begin Fatal Exception 16-Aug-2023 08:33:48 CEST-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root'
   Additional Info:
      [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
      [b] Input file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root could not be opened.
      [c] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3010] Unable to give access - user access restricted - unauthorized identity used ; Permission denied
' (errno=3010, code=400). No additional data servers were found.
      [d] Last URL tried: root://eoscms.cern.ch:1094//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root?tried=&xrdcl.requuid=25fc2286-8a38-4992-9487-e8d7cdee3a8c
      [e] Problematic data server: eoscms.cern.ch:1094
      [f] Disabled source: eoscms.cern.ch:1094
----- End Fatal Exception -------------------------------------------------

Any chance there can be "public" reproducer?

aandvalenzuela commented 1 year ago

Hi @hahnjo, I think you need to get a grid proxy with voms-proxy-init to access the file.

Let me know if that works for you. Sorry for the inconvenience, Andrea

hahnjo commented 1 year ago

That command gives me errors:

$ voms-proxy-init
No credentials found!

outside of Singularity and inside:

Singularity> voms-proxy-init
Unable to find user certificate or key: /afs/cern.ch/user/j/jhahnfel/.globus/usercred.p12

Note that I'm not a CMS member, so maybe I'm missing some permissions?

iarspider commented 1 year ago

@smuzaffar is there an equivalent of doing scram b use-ibeos when running cmsRun directly and not via scram?

makortel commented 1 year ago

is there an equivalent of doing scram b use-ibeos when running cmsRun directly and not via scram?

SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local cmsRun ...

should do the job

hahnjo commented 1 year ago

Unfortunately it still doesn't work for me, neither setting SITECONFIG_PATH nor installing a grid certificate (as explained in comment that was meanwhile deleted?) and then running voms-proxy-init :cry: is there a way to reproduce the problem without requiring that file?

smuzaffar commented 1 year ago

@hahnjo , what do you get when run the following cmmands on lxplus8

> ssh lxplus8
> export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
> cd /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-20-2300
> eval `scram run -sh`
> cd /tmp/`whoami`
> cmsRun $CMSSW_BASE/src/Alignment/OfflineValidation/test/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T

normally it should read the input file from root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/

hahnjo commented 1 year ago

@smuzaffar the output of the cmsRun command is:

adding the max pT filter
21-Aug-2023 12:36:36 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
21-Aug-2023 12:36:37 CEST  Initiating request to open file root://cms-xrd-global.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
21-Aug-2023 12:36:39 CEST  Initiating request to open file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
----- Begin Fatal Exception 21-Aug-2023 12:36:40 CEST-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root'
   Additional Info:
      [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://cms-xrd-global.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
      [b] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
      [c] Input file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root could not be opened.
      [d] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3010] Unable to give access - user access restricted - unauthorized identity used ; Permission denied
' (errno=3010, code=400). No additional data servers were found.
      [e] Last URL tried: root://eoscms.cern.ch:1094//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root?tried=&xrdcl.requuid=77efdb8c-ca84-408a-a3eb-5044434acba7
      [f] Problematic data server: eoscms.cern.ch:1094
      [g] Disabled source: eoscms.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
makortel commented 1 year ago

@hahnjo What happens if you open the root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root file directly in e.g. ROOT prompt? (unfortunately the details of that file open error are currently lost, maybe time for us to finally do something about it...)

smuzaffar commented 1 year ago

I think as @hahnjo is running as non-cms user so he might not have rights to access this file. I am looking how I can make /eos/cms/store/user/cmsbuild readable by all CERN users

smuzaffar commented 1 year ago

@hahnjo , I have copied the required root file on public afs. Please try this

> ssh lxplus8
> export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
> cd /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-20-2300
> eval `scram run -sh`
> cd /tmp/`whoami`
> cmsRun ~cmsbuild/public/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T
hahnjo commented 1 year ago

Thanks @smuzaffar, that worked. At least from a first look, ROOT doesn't seem to be doing anything wrong here: The TStyle is the global "Classic" style, the TString describes a line style with seemingly correct data that has never been modified since initialization and hasn't been freed before...

hahnjo commented 1 year ago

Ok, but there is a double delete, and with a bit of CLING_DEBUG=1 I think I was able to guess what is going on:

cmssw/DataFormats/TrackCandidate/interface/TrajectoryStopReasons.h has the following static const std::string array: https://github.com/cms-sw/cmssw/blob/86e051b0c241cba35601a6095391324b1e4bfd7b/DataFormats/TrackCandidate/interface/TrajectoryStopReasons.h#L24-L40

namespace StopReasonName {
  static const std::string StopReasonName[] = {
      "UNINITIALIZED",                 //  0
      "MAX_HITS",                      //  1
      "MAX_LOST_HITS",                 //  2
      "MAX_CONSECUTIVE_LOST_HITS",     //  3
      "LOST_HIT_FRACTION",             //  4
      "MIN_PT",                        //  5
      "CHARGE_SIGNIFICANCE",           //  6
      "LOOPER",                        //  7
      "MAX_CCC_LOST_HITS",             //  8
      "NO_SEGMENTS_FOR_VALID_LAYERS",  //  9
      "SEED_EXTENSION",                // 10
      "NOT_STOPPED"                    // 11 (be careful, NOT_STOPPED needs to be the last,
                                       //     its index differs from the enumeration value)
  };
};

That header + static const std::string array are #included into multiple dictionaries / Cling transactions and eventually double-deleted. Standalone reproducer in pure ROOT:

 $ gdb --args root.exe -q -e 'namespace NS { static const std::string Reason[] = { "stopped" }; }' -e 'NS::Reason[0]' -e 'NS::Reason[0]'
[...]
(gdb) b cling::IncrementalExecutor::runAndRemoveStaticDestructors
Function "cling::IncrementalExecutor::runAndRemoveStaticDestructors" not defined.                                                                                                                       
Make breakpoint pending on future shared library load? (y or [n]) y                                                                                                                                     
Breakpoint 1 (cling::IncrementalExecutor::runAndRemoveStaticDestructors) pending.
(gdb) r
[...]
Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1c44750)
180       assert(T && "Must be set");
(gdb) b free
Breakpoint 2 at 0x7ffff4497e0c: free. (6 locations)                                                                                                                                                     
(gdb) commands                                                                                                                                                                                          
Type commands for breakpoint(s) 2, one per line.                                                                                                                                                        
End with a line saying just "end".                                                                                                                                                                      
>printf "free(%p)\n", $rdi                                                                                                                                                                              
>continue                                                                                                                                                                                               
>end                                                                                                                                                                                                    
(gdb) c

This now jumps from one runAndRemoveStaticDestructors to the next and collects the pointers that are passed to free ($rdi is the first argument register in the x86_64 calling convention):

Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1c800e0)
180       assert(T && "Must be set");                                                                                                                                                                   
(gdb)                                                                                                                                                                                                   
Continuing.

Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0xec0500)

Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1cc45b0)

Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=this@entry=0x46d3f0, T=T@entry=0x1c800e0)
(gdb)
Continuing.

Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1c08100)

Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1cef7d0) at /home/jhahnfel/ROOT/src/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:180          
180       assert(T && "Must be set");
(gdb)
Continuing.

Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1c58770) at /home/jhahnfel/ROOT/src/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:180          
180       assert(T && "Must be set");
(gdb)
Continuing.

Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1b2e5d0)

Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1ca9130)

Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=this@entry=0x46d3f0, T=T@entry=0x1c58770)                                                                                
    at /home/jhahnfel/ROOT/src/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:180
180       assert(T && "Must be set");
(gdb)
Continuing.

Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1c08100)

eventually freeing 0x1c08100 twice. Still unclear why we double delete the same object (I have a suspicion), but before I investigate let me first try to work around the problem to confirm that I'm on the right track.

hahnjo commented 1 year ago

Even easier way to reproduce:

$ ./bin/root.exe -q -l -e 'struct A { int val; A(int v) : val(v) { printf("A(%d), this = %p\n", val, this); } ~A() { printf("~A(%d), this = %p\n", val, this); } };' -e 'const A a(1);' -e 'a.val' -e 'a.val'

A(1), this = 0x7f5349439000
A(1), this = 0x7f5349439000
(const int) 1
A(1), this = 0x7f5349439000
(const int) 1
~A(1), this = 0x7f5349439000
~A(1), this = 0x7f5349439000
~A(1), this = 0x7f5349439000

With that, confirmed that this affects master and v6.28 while v6.26 and older seem fine. I guess this is good news for CMS and their production version...

hahnjo commented 1 year ago

Results of today's investigations: The reason is that LinkageComputer::getLVForNamespaceScopeDecl assigns internal linkage to const qualified variables without external storage class, which then our KeepLocalGVPass changes to a weak symbol. If that happens in multiple transactions, the JIT infrastructure (?) merges the symbols into one memory location but constructors and destructors run multiple times on the same object (see above).

We cannot disable KeepLocalGVPass because it also takes care of static variables in functions that are lazily initialized. Additionally, not merging the constant variables would mean that they have different addresses which could also cause problems (FWIW this is what you get upstream in clang-repl). Fun fact: it works by accident in the upgrade to LLVM 16 because I backported https://reviews.llvm.org/D156537, but not https://reviews.llvm.org/D156897 so if a variable MustBeEmitted it is actually not re-emitted and KeepLocalGVPass works correctly.

Conclusion: What I think we should really do here is patch LinkageComputer::getLVForNamespaceScopeDecl to not assign internal linkage to const qualified variables under IncrementalExtensions. This seems to work in a few quick tests, but I need to investigate and think more on this...