Closed aandvalenzuela closed 1 year ago
Hi @aandvalenzuela, unfortunately I am not allowed to run the reproducer:
Singularity> cmsRun /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-07-2300/src/Alignment/OfflineValidation/test/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T
adding the max pT filter
16-Aug-2023 08:33:46 CEST Initiating request to open file root://eoscms.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
16-Aug-2023 08:33:47 CEST Initiating request to open file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
----- Begin Fatal Exception 16-Aug-2023 08:33:48 CEST-----------------------
An exception of category 'FallbackFileOpenError' occurred while
[0] Constructing the EventProcessor
[1] Constructing input source of type PoolSource
[2] Calling RootInputFileSequence::initTheFile()
[3] Calling StorageFactory::open()
[4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root'
Additional Info:
[a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
[b] Input file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root could not be opened.
[c] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3010] Unable to give access - user access restricted - unauthorized identity used ; Permission denied
' (errno=3010, code=400). No additional data servers were found.
[d] Last URL tried: root://eoscms.cern.ch:1094//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root?tried=&xrdcl.requuid=25fc2286-8a38-4992-9487-e8d7cdee3a8c
[e] Problematic data server: eoscms.cern.ch:1094
[f] Disabled source: eoscms.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
Any chance there can be "public" reproducer?
Hi @hahnjo,
I think you need to get a grid proxy with voms-proxy-init
to access the file.
Let me know if that works for you. Sorry for the inconvenience, Andrea
That command gives me errors:
$ voms-proxy-init
No credentials found!
outside of Singularity and inside:
Singularity> voms-proxy-init
Unable to find user certificate or key: /afs/cern.ch/user/j/jhahnfel/.globus/usercred.p12
Note that I'm not a CMS member, so maybe I'm missing some permissions?
@smuzaffar is there an equivalent of doing scram b use-ibeos
when running cmsRun
directly and not via scram?
is there an equivalent of doing
scram b use-ibeos
when runningcmsRun
directly and not via scram?
SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local cmsRun ...
should do the job
Unfortunately it still doesn't work for me, neither setting SITECONFIG_PATH
nor installing a grid certificate (as explained in comment that was meanwhile deleted?) and then running voms-proxy-init
:cry: is there a way to reproduce the problem without requiring that file?
@hahnjo , what do you get when run the following cmmands on lxplus8
> ssh lxplus8
> export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
> cd /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-20-2300
> eval `scram run -sh`
> cd /tmp/`whoami`
> cmsRun $CMSSW_BASE/src/Alignment/OfflineValidation/test/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T
normally it should read the input file from root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/
@smuzaffar the output of the cmsRun
command is:
adding the max pT filter
21-Aug-2023 12:36:36 CEST Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
21-Aug-2023 12:36:37 CEST Initiating request to open file root://cms-xrd-global.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
21-Aug-2023 12:36:39 CEST Initiating request to open file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
----- Begin Fatal Exception 21-Aug-2023 12:36:40 CEST-----------------------
An exception of category 'FallbackFileOpenError' occurred while
[0] Constructing the EventProcessor
[1] Constructing input source of type PoolSource
[2] Calling RootInputFileSequence::initTheFile()
[3] Calling StorageFactory::open()
[4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root'
Additional Info:
[a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://cms-xrd-global.cern.ch//eos/cms/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
[b] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
[c] Input file root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root could not be opened.
[d] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3010] Unable to give access - user access restricted - unauthorized identity used ; Permission denied
' (errno=3010, code=400). No additional data servers were found.
[e] Last URL tried: root://eoscms.cern.ch:1094//store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root?tried=&xrdcl.requuid=77efdb8c-ca84-408a-a3eb-5044434acba7
[f] Problematic data server: eoscms.cern.ch:1094
[g] Disabled source: eoscms.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
@hahnjo What happens if you open the root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/data/Commissioning2022/Cosmics/ALCARECO/TkAlCosmics0T-PromptReco-v1/000/348/776/00000/96538f53-2088-422c-91a5-841d735a81a8.root
file directly in e.g. ROOT prompt?
(unfortunately the details of that file open error are currently lost, maybe time for us to finally do something about it...)
I think as @hahnjo is running as non-cms user so he might not have rights to access this file. I am looking how I can make /eos/cms/store/user/cmsbuild readable by all CERN users
@hahnjo , I have copied the required root file on public afs. Please try this
> ssh lxplus8
> export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
> cd /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-20-2300
> eval `scram run -sh`
> cd /tmp/`whoami`
> cmsRun ~cmsbuild/public/inspectData_cfg.py unitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T
Thanks @smuzaffar, that worked. At least from a first look, ROOT doesn't seem to be doing anything wrong here: The TStyle
is the global "Classic" style, the TString
describes a line style with seemingly correct data that has never been modified since initialization and hasn't been freed before...
Ok, but there is a double delete, and with a bit of CLING_DEBUG=1
I think I was able to guess what is going on:
cmssw/DataFormats/TrackCandidate/interface/TrajectoryStopReasons.h
has the following static const std::string
array: https://github.com/cms-sw/cmssw/blob/86e051b0c241cba35601a6095391324b1e4bfd7b/DataFormats/TrackCandidate/interface/TrajectoryStopReasons.h#L24-L40
namespace StopReasonName {
static const std::string StopReasonName[] = {
"UNINITIALIZED", // 0
"MAX_HITS", // 1
"MAX_LOST_HITS", // 2
"MAX_CONSECUTIVE_LOST_HITS", // 3
"LOST_HIT_FRACTION", // 4
"MIN_PT", // 5
"CHARGE_SIGNIFICANCE", // 6
"LOOPER", // 7
"MAX_CCC_LOST_HITS", // 8
"NO_SEGMENTS_FOR_VALID_LAYERS", // 9
"SEED_EXTENSION", // 10
"NOT_STOPPED" // 11 (be careful, NOT_STOPPED needs to be the last,
// its index differs from the enumeration value)
};
};
That header + static const std::string
array are #include
d into multiple dictionaries / Cling transactions and eventually double-deleted. Standalone reproducer in pure ROOT:
$ gdb --args root.exe -q -e 'namespace NS { static const std::string Reason[] = { "stopped" }; }' -e 'NS::Reason[0]' -e 'NS::Reason[0]'
[...]
(gdb) b cling::IncrementalExecutor::runAndRemoveStaticDestructors
Function "cling::IncrementalExecutor::runAndRemoveStaticDestructors" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (cling::IncrementalExecutor::runAndRemoveStaticDestructors) pending.
(gdb) r
[...]
Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1c44750)
180 assert(T && "Must be set");
(gdb) b free
Breakpoint 2 at 0x7ffff4497e0c: free. (6 locations)
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>printf "free(%p)\n", $rdi
>continue
>end
(gdb) c
This now jumps from one runAndRemoveStaticDestructors
to the next and collects the pointers that are passed to free
($rdi
is the first argument register in the x86_64 calling convention):
Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1c800e0)
180 assert(T && "Must be set");
(gdb)
Continuing.
Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0xec0500)
Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1cc45b0)
Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=this@entry=0x46d3f0, T=T@entry=0x1c800e0)
(gdb)
Continuing.
Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1c08100)
Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1cef7d0) at /home/jhahnfel/ROOT/src/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:180
180 assert(T && "Must be set");
(gdb)
Continuing.
Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=0x46d3f0, T=0x1c58770) at /home/jhahnfel/ROOT/src/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:180
180 assert(T && "Must be set");
(gdb)
Continuing.
Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1b2e5d0)
Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1ca9130)
Breakpoint 1, cling::IncrementalExecutor::runAndRemoveStaticDestructors (this=this@entry=0x46d3f0, T=T@entry=0x1c58770)
at /home/jhahnfel/ROOT/src/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:180
180 assert(T && "Must be set");
(gdb)
Continuing.
Breakpoint 2, 0x00007ffff6b13e20 in free () from /lib64/libc.so.6
free(0x1c08100)
eventually freeing 0x1c08100
twice. Still unclear why we double delete the same object (I have a suspicion), but before I investigate let me first try to work around the problem to confirm that I'm on the right track.
Even easier way to reproduce:
$ ./bin/root.exe -q -l -e 'struct A { int val; A(int v) : val(v) { printf("A(%d), this = %p\n", val, this); } ~A() { printf("~A(%d), this = %p\n", val, this); } };' -e 'const A a(1);' -e 'a.val' -e 'a.val'
A(1), this = 0x7f5349439000
A(1), this = 0x7f5349439000
(const int) 1
A(1), this = 0x7f5349439000
(const int) 1
~A(1), this = 0x7f5349439000
~A(1), this = 0x7f5349439000
~A(1), this = 0x7f5349439000
With that, confirmed that this affects master
and v6.28 while v6.26 and older seem fine. I guess this is good news for CMS and their production version...
Results of today's investigations: The reason is that LinkageComputer::getLVForNamespaceScopeDecl
assigns internal linkage to const
qualified variables without external storage class, which then our KeepLocalGVPass
changes to a weak symbol. If that happens in multiple transactions, the JIT infrastructure (?) merges the symbols into one memory location but constructors and destructors run multiple times on the same object (see above).
We cannot disable KeepLocalGVPass
because it also takes care of static
variables in functions that are lazily initialized. Additionally, not merging the constant variables would mean that they have different addresses which could also cause problems (FWIW this is what you get upstream in clang-repl
). Fun fact: it works by accident in the upgrade to LLVM 16 because I backported https://reviews.llvm.org/D156537, but not https://reviews.llvm.org/D156897 so if a variable MustBeEmitted
it is actually not re-emitted and KeepLocalGVPass
works correctly.
Conclusion: What I think we should really do here is patch LinkageComputer::getLVForNamespaceScopeDecl
to not assign internal linkage to const
qualified variables under IncrementalExtensions
. This seems to work in a few quick tests, but I need to investigate and think more on this...
Check duplicate issues.
Description
Hello,
Since we moved to TCMalloc (instead of JeMalloc) for
cmsRun
we have seen some failures inTROOT::EndOfProcessCleanups
upon destruction ofTString
(and some other destructors) affecting CMSSW ROOT6 (version6.29.01
) and ROOT628 (version6.28.05
) IBs (el8_amd64_gcc11
).We reported the issue in the cmssw repo at #42468.
Reproducer
The errors can be reproduced on lxplus as follows:
And for gdb:
A sample stacktrace upon destruction of the
TString
:See https://github.com/cms-sw/cmssw/issues/42468#issuecomment-1671897314 for other destructors failing on cleanup.
It runs fine when using
cmsRunJE
instead (for JeMalloc allocation type):cmsRunJE /cvmfs/cms-ib.cern.ch/week1/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_ROOT6_X_2023-08-07-2300/src/AlignmentunitTest=True trackCollection=ALCARECOTkAlCosmicsCTF0T
.ROOT version
CMSSW ROOT6 (version
6.29.01
) and ROOT628 (version6.28.05
) IBs.Installation method
from source
Operating system
el8
Additional context
FYI, @smuzaffar @makortel Reported to ROOT as discussed in Core Software meeting