star-bnl / star-sw

Core software for STAR experiment
26 stars 63 forks source link

root4star/CINT spends hours hammering AFS on exit #115

Closed veprbl closed 1 month ago

veprbl commented 2 years ago

I was checking on jobs that would take hours to complete. I've discovered that they were making an excessive amount stat64 system calls on AFS, and, on some of the nodes, each would take a fraction of a second to complete.

Basic steps to reproduce

# cat bug.C
void bug() {
   gSystem->Load("libsim_Tables.so");
}
# csh -c "starver SL21b; strace root4star -l -b -q bug.C" 2>&1 | grep stat64 | grep libsim | wc -l
1944

This demonstrates that root has made 1944 (and actually twice of that) stat64() calls just to unload a single library. The issue comes from a G__matchfilename() function https://github.com/root-project/root/blob/250a360349e60dc2dc187ecaa63fc9f5f81a9d74/cint/cint/src/loadfile.cxx#L585-L586 when called in a loop from G__unregister_sharedlib() https://github.com/root-project/root/blob/250a360349e60dc2dc187ecaa63fc9f5f81a9d74/cint/cint/src/loadfile.cxx#L3164-L3170

The idea behind the complication in G__matchfilename() is to check if some different paths could point to the same file (in case there are symlinks or bind mounts?). Based on that information, the G__unregister_sharedlib() may choose to free some memory, which does not matter at all in the end of program's lifecycle.

veprbl commented 2 years ago

Somehow the issue seems to affect TTable libraries more than others. Here is an example backtrace from a JetFinder production code:

#0  0x55638430 in __kernel_vsyscall ()
#1  0x58bd4eb2 in _xstat () from /lib/libc.so.6
#2  0x5610637b in stat () from /afs/rhic.bnl.gov/star/ROOT/5.34.38/.sl73_gcc485/rootdeb/lib/libCint.so
#3  0x55f301dd in G__matchfilename (i1=14, filename=0xf2ed368 "/afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libStDb_Tables.so")
    at cint/cint/src/loadfile.cxx:586
#4  0x55f3615c in G__unregister_sharedlib (libname=0xf2ed368 "/afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libStDb_Tables.so")
    at cint/cint/src/loadfile.cxx:3166
#5  0x55ff37df in G__UnregisterLibrary (func=0x5a77c103 <G__cpp_setupSt_ssdPedStrip_TableCint>) at cint/cint/src/shl.cxx:2147
#6  0x55f261c6 in G__remove_setup_func (libname=0x5aaa924b "St_ssdPedStrip_TableCint") at cint/cint/src/init.cxx:109
#7  0x5a77c714 in G__cpp_setup_initSt_ssdPedStrip_TableCint::~G__cpp_setup_initSt_ssdPedStrip_TableCint() ()
   from /afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libStDb_Tables.so
#8  0x58b22669 in __run_exit_handlers () from /lib/libc.so.6
#9  0x58b226cd in exit () from /lib/libc.so.6
#10 0x558cacc9 in TUnixSystem::Exit (this=0xec011d0, code=0, mode=true)
    at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/unix/src/TUnixSystem.cxx:2213
#11 0x5585785f in TApplication::Terminate (this=0xee4c5e8, status=0)
    at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/base/src/TApplication.cxx:1205
#12 0x5773db39 in TRint::Terminate (this=0xee4c5e8, status=0)
    at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/rint/src/TRint.cxx:656
#13 0x5773cf8d in TRint::Run (this=0xee4c5e8, retrn=false)
    at /afs/rhic.bnl.gov/star/ROOT/36/5.34.38/.sl73_gcc485/rootdeb/core/rint/src/TRint.cxx:436
#14 0x08099cb3 in main (argcp=5, argv=0xfff54764) at .sl73_gcc485/obj/asps/rexe/MAIN_rmain.cxx:35

Similar backtraces can be obtained from the example using:

# cat bug.gdb
break main

run

break G__matchfilename

command
silent
bt
echo ------------------------\n
continue
end

continue
# csh -c "starver SL21b; exec gdb --command bug.gdb --args root4star -l -b -q bug.C"
plexoos commented 2 years ago

That is weird:

[rcas6010] ~/> csh -c 'starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
[rcas6010] ~/> csh -c 'starver SL21b; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
[rcas6010] ~/> csh -c 'starver SL21a; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
veprbl commented 2 years ago

Somehow works for me

[rcas6009] ~/> csh -c 'starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21c
[rcas6009] ~/> csh -c 'starver SL21b; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21b
[rcas6009] ~/> csh -c 'starver SL21a; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
plexoos commented 2 years ago

Which login script to source?

[smirnovd@rcas6009 ~]$ echo $SHELL
/bin/tcsh
[smirnovd@rcas6009 ~]$ csh -f -c 'source /afs/rhic.bnl.gov/rhstar/group/star_cshrc.csh; starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21a
veprbl commented 2 years ago

csh -f -c 'source /afs/rhic.bnl.gov/rhstar/group/star_cshrc.csh; starver SL21c; env | grep ^STAR='

This can't quite work.

  1. You need GROUP_DIR to be defined as starver is an alias to source ${GROUP_DIR}/.starver
  2. For some reason csh does not process the aliases inline:
    # csh -f -c 'alias foo "echo bar"; foo'
    foo: Command not found.
    # csh -f <(echo 'alias foo "echo bar"; foo')
    foo: Command not found.
    # csh -f <(echo 'alias foo "echo bar"\n foo')
    bar

For these reasons, in an isolated environment, I would use

# csh -f -c 'setenv SILENT 1; setenv ECHO 0; setenv GROUP_DIR /afs/rhic.bnl.gov/rhstar/group; source /afs/rhic.bnl.gov/rhstar/group/star_cshrc.csh; source $GROUP_DIR/.starver SL21c; env | grep ^STAR='
STAR=/afs/rhic.bnl.gov/star/packages/SL21c
veprbl commented 2 years ago

I don't see anything special in my ~veprbl/.cshrc or ~veprbl/.tcshrc, feel free to take a look, they are world-readable.

veprbl commented 2 years ago

Filed a fix upstream

veprbl commented 2 years ago

That's merged now. Any chance we apply https://github.com/root-project/root/commit/7231e038b2519eee62446aea7be2becb8671fc57 at RCF? @jlauret

plexoos commented 2 years ago

Can they create a new release 5.34.39? We could replace the current 5.34.38 in our spack

veprbl commented 2 years ago

We could replace the current 5.34.38 in our spack

It won't help unless STAR libraries are compiled against spack. A new ROOT release will mean that version compatibility check will fail.

Can they create a new release 5.34.39?

I will ask

plexoos commented 2 years ago

It won't help unless STAR libraries are compiled against spack.

Cons does not do anything about RPATH AFAIK so, it might work if LD_LIBRARY_PATH points to the new release.

veprbl commented 2 years ago

@plexoos I meant this guy https://github.com/root-project/root/blob/1c1347b551a8ce9a51d6938c15587ccc4be6f4c0/core/base/src/TSystem.cxx#L1891-L1892

# csh -c "starver SL21a; source ~veprbl/afs_hammering_bug/root/bin/thisroot.csh; echo 'gSystem->Load(\"libsim_Tables.so\");' | root4star -l -b"
Welcome to cshell
Error in <TUnixSystem::Load>: version mismatch, /afs/rhic.bnl.gov/star/packages/SL21a/.sl73_gcc485/lib/libsim_Tables.so = 53438, ROOT = 53439
jlauret commented 2 years ago

Can they create a new release 5.34.39? We could replace the current 5.34.38 in our spack

Typically we would not do that exactly but rather, create a 5.34.38_2 , apply specific minor patches and once built and verified, switch the link [this circumvent having to wait for the ROOT team to release a newer version and still move on with improvements].

Cons does not do anything about RPATH AFAIK so, it might work if LD_LIBRARY_PATH points to the new release.

It will work and has always worked using that soft-link change trick (no jobs will ever die for a minor internal change and patch).

fgeurts commented 1 month ago

Hi, I am reviewing old issues on rainy Sunday afternoon ... this particular issue has not been followed up in ~3years. Is there still a need to leave it open? Please advise.

veprbl commented 1 month ago

@fgeurts Thanks for doing that! The issue is not present in the new libraries that can rely on root 6, but old releases remain unpatched. I'm not running any simulations, and it seems like nobody else noticed, so let's close.