Closed by veprbl 1 month ago
Somehow the issue seems to affect TTable libraries more than others. Here is an example backtrace from JetFinder production code:
#0 0x55638430 in __kernel_vsyscall ()
#1 0x58bd4eb2 in _xstat () from /lib/libc.so.6
#2 0x5610637b in stat () from /afs/rhic.bnl.gov/star/ROOT/5.34.38/.sl73_gcc485/rootdeb/lib/libCint.so
#3 0x55f301dd in G__matchfilename (i1=14, filename=0xf2ed368 "/afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libStDb_Tables.so")
at cint/cint/src/loadfile.cxx:586
#4 0x55f3615c in G__unregister_sharedlib (libname=0xf2ed368 "/afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libStDb_Tables.so")
at cint/cint/src/loadfile.cxx:3166
#5 0x55ff37df in G__UnregisterLibrary (func=0x5a77c103 <G__cpp_setupSt_ssdPedStrip_TableCint>) at cint/cint/src/shl.cxx:2147
#6 0x55f261c6 in G__remove_setup_func (libname=0x5aaa924b "St_ssdPedStrip_TableCint") at cint/cint/src/init.cxx:109
#7 0x5a77c714 in G__cpp_setup_initSt_ssdPedStrip_TableCint::~G__cpp_setup_initSt_ssdPedStrip_TableCint() ()
from /afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libStDb_Tables.so
#8 0x58b22669 in __run_exit_handlers () from /lib/libc.so.6
#9 0x58b226cd in exit () from /lib/libc.so.6
#10 0x558cacc9 in TUnixSystem::Exit (this=0xec011d0, code=0, mode=true)
at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/unix/src/TUnixSystem.cxx:2213
#11 0x5585785f in TApplication::Terminate (this=0xee4c5e8, status=0)
at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/base/src/TApplication.cxx:1205
#12 0x5773db39 in TRint::Terminate (this=0xee4c5e8, status=0)
at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/rint/src/TRint.cxx:656
#13 0x5773cf8d in TRint::Run (this=0xee4c5e8, retrn=false)
at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/rint/src/TRint.cxx:436
#14 0x08099cb3 in main (argcp=5, argv=0xfff54764) at .sl73_gcc485/obj/asps/rexe/MAIN_rmain.cxx:35
Similar backtraces can be obtained from the example using:
# cat bug.gdb
break main
run
break G__matchfilename
command
silent
bt
echo ------------------------\n
continue
end
continue
# csh -c "starver SL21b; exec gdb --command bug.gdb --args root4star -l -b -q bug.C"
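If one only wants the total number of calls rather than a backtrace per call, the same gdb approach can count hits instead. The sketch below uses gdb's `ignore` command, which lets the program run past the breakpoint while still incrementing its hit counter; the hard-coded breakpoint number 2 is an assumption (it holds if `break main` was the first breakpoint defined), and the hit count appears in the `info breakpoints` output after the program exits:

```
# count.gdb -- count G__matchfilename calls instead of printing backtraces
break main
run
break G__matchfilename
ignore 2 1000000
continue
info breakpoints
```

Invoked the same way as the script above, e.g. `csh -c "starver SL21b; exec gdb --command count.gdb --args root4star -l -b -q bug.C"`.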
That is weird:
[rcas6010] ~/> csh -c 'starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
[rcas6010] ~/> csh -c 'starver SL21b; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
[rcas6010] ~/> csh -c 'starver SL21a; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
Somehow it works for me:
[rcas6009] ~/> csh -c 'starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21c
[rcas6009] ~/> csh -c 'starver SL21b; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21b
[rcas6009] ~/> csh -c 'starver SL21a; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
Which login script to source?
[smirnovd@rcas6009 ~]$ echo $SHELL
/bin/tcsh
[smirnovd@rcas6009 ~]$ csh -f -c 'source /afs/rhic.bnl.gov/rhstar/group/star_cshrc.csh; starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
csh -f -c 'source /afs/rhic.bnl.gov/rhstar/group/star_cshrc.csh; starver SL21c; env | grep ^STAR='
This can't quite work: it needs GROUP_DIR to be defined, since starver is an alias for source ${GROUP_DIR}/.starver
# csh -f -c 'alias foo "echo bar"; foo'
foo: Command not found.
# csh -f <(echo 'alias foo "echo bar"; foo')
foo: Command not found.
# csh -f <(echo 'alias foo "echo bar"\n foo')
bar
For these reasons, in an isolated environment, I would use
# csh -f -c 'setenv SILENT 1; setenv ECHO 0; setenv GROUP_DIR /afs/rhic.bnl.gov/rhstar/group; source /afs/rhic.bnl.gov/rhstar/group/star_cshrc.csh; source $GROUP_DIR/.starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21c
I don't see anything special in my ~veprbl/.cshrc or ~veprbl/.tcshrc, feel free to take a look, they are world-readable.
Filed a fix upstream
That's merged now. Any chance we apply https://github.com/root-project/root/commit/7231e038b2519eee62446aea7be2becb8671fc57 at RCF? @jlauret
Can they create a new release 5.34.39? We could replace the current 5.34.38 in our spack
We could replace the current 5.34.38 in our spack
It won't help unless the STAR libraries are compiled against spack. A new ROOT release will mean that the version compatibility check will fail.
Can they create a new release 5.34.39?
I will ask
It won't help unless STAR libraries are compiled against spack.
Cons does not do anything about RPATH AFAIK, so it might work if LD_LIBRARY_PATH points to the new release.
@plexoos I meant this guy https://github.com/root-project/root/blob/1c1347b551a8ce9a51d6938c15587ccc4be6f4c0/core/base/src/TSystem.cxx#L1891-L1892
# csh -c "starver SL21a; source ~veprbl/afs_hammering_bug/root/bin/thisroot.csh; echo 'gSystem->Load(\"libsim_Tables.so\");' | root4star -l -b"
Welcome to cshell
Error in <TUnixSystem::Load>: version mismatch, /afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libsim_Tables.so = 53438, ROOT = 53439
Can they create a new release 5.34.39? We could replace the current 5.34.38 in our spack
Typically we would not do exactly that but rather create a 5.34.38_2, apply specific minor patches, and, once built and verified, switch the link [this circumvents having to wait for the ROOT team to release a newer version and still lets us move on with improvements].
Cons does not do anything about RPATH AFAIK so, it might work if LD_LIBRARY_PATH points to the new release.
It will work and has always worked using that soft-link change trick (no jobs will ever die for a minor internal change and patch).
Hi, I am reviewing old issues on a rainy Sunday afternoon ... this particular issue has not been followed up in ~3 years. Is there still a need to leave it open? Please advise.
@fgeurts Thanks for doing that! The issue is not present in the new libraries, which can rely on ROOT 6, but old releases remain unpatched. I'm not running any simulations, and it seems like nobody else noticed, so let's close.
I was checking on jobs that would take hours to complete. I've discovered that they were making an excessive number of stat64 system calls on AFS, and, on some of the nodes, each call would take a fraction of a second to complete.
Basic steps to reproduce
This demonstrates that ROOT made 1944 (and actually twice that many) stat64() calls just to unload a single library. The issue comes from the G__matchfilename() function
https://github.com/root-project/root/blob/250a360349e60dc2dc187ecaa63fc9f5f81a9d74/cint/cint/src/loadfile.cxx#L585-L586
when it is called in a loop from G__unregister_sharedlib()
https://github.com/root-project/root/blob/250a360349e60dc2dc187ecaa63fc9f5f81a9d74/cint/cint/src/loadfile.cxx#L3164-L3170
The idea behind the complication in G__matchfilename() is to check whether two different paths could point to the same file (in case there are symlinks or bind mounts?). Based on that information, G__unregister_sharedlib() may choose to free some memory, which does not matter at all at the end of the program's lifecycle.
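To make the cost concrete, here is an illustrative shell sketch (not CINT code; all file names are made up) of what this same-file check amounts to: deciding whether two paths name the same file costs a stat() per path, and unregistering one library performs the check against every registered file, so N registered files cost on the order of 2*N stat() calls.

```shell
# Sketch: emulate the "do these two paths name the same file?" test that
# G__matchfilename implements via stat().  The shell '-ef' operator also
# stats both operands and compares their device/inode numbers.
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5; do touch "$tmpdir/lib$i.so"; done
ln -s "$tmpdir/lib3.so" "$tmpdir/target.so"    # a second path to lib3.so

matches=0
for f in "$tmpdir"/lib*.so; do                 # scan every "registered" file,
  if [ "$f" -ef "$tmpdir/target.so" ]; then    # two stat()s per iteration
    matches=$((matches + 1))
    echo "match: $f"
  fi
done
echo "matches=$matches"
rm -rf "$tmpdir"
```

The symlinked path is recognized as the same file even though the strings differ, at the price of two stat() calls for each registered file scanned.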