root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

ROOT loads unneeded PCMs #13000

Closed ktf closed 3 weeks ago

ktf commented 1 year ago


Describe the bug

When doing a simple TFile::Open(), ROOT initialises the interpreter on the first TEnv::GetValue needed by the aforementioned method. This makes it load all the pcms it can find, regardless of whether they are needed to read the file or not (in our case we only have branches with basic types, e.g. floats or ints or vectors thereof). The net result is an extra 70MB of RSS, for no particular reason, in any process which uses ROOT. You can find more details in the attached Instruments screenshot. Moreover, single-stepping with the debugger and printing out the LoadModule argument, it seems that the pcms for Sofie, Proof and 3D graphics are particularly onerous, none of which we need or link against.


What is the expected behaviour?

The best would be for ROOT to be smart enough to lazily load the pcms only when they are actually needed, not upfront. A second-best option would be the ability to start the interpreter from a clean slate and load by hand only the pcms needed to read a file.

How to reproduce?

On macOS using Instruments one can do:

xcrun xctrace record --output root.trace --time-limit 30s --template Allocations --launch -- /Users/ktf/src/sw/osx_arm64/ROOT/alice-v6-28-04-local2/bin/root.exe -q -b http://alimonitor.cern.ch/train-workdir/testdata/LFN/alice/data/2015/LHC15o/000246392/pass2/PWGZZ/Run3_Conversion/320_20220701-0813/AOD/019/AO2D.root
open root.trace

and get the report in question. Using the alice environment should not matter, but in principle you should be able to load it from CVMFS.

ROOT version

6.28.04, but we have been seeing this for a while.

How did you install ROOT?

ALICE build environment

Which operating system are you using?

macOS, Linux

Additional context

No response

Axel-Naumann commented 1 year ago

Info from @ktf : with 100 processes linked against ROOT running on these machines that's an extra 70MB*100=7GB used by (unneeded?) PCMs.

vgvassilev commented 1 year ago

The best would be for ROOT to be smart enough to lazily load the pcms when they are actually needed, not upfront.

Yes, and unfortunately that is as good as it gets with the current infrastructure. Here I've put a bit more detail about what could be improved to further reduce the memory footprint: https://github.com/cms-sw/cmssw/issues/41810#issuecomment-1578982627

Note that the loading of the pcm files is mmap-based and mostly a no-op. The problem is that some sections of the PCMs are not lazy and are eagerly deserialized. Using a PCH won't make things much better, since the PCH still eagerly deserializes some sections. That is, the deficiency is not in ROOT but in Clang itself.

Could you rerun using export ROOTDEBUG=7 and attach the output? Sometimes we look up an identifier (like I) which is not found, and that triggers deserialization of entities. That can sometimes be avoided.

ktf commented 1 year ago

I will provide you the output for ROOTDEBUG=7.

Could you elaborate a bit on the mmap? I would expect that using MAP_SHARED would be enough of a workaround for our (ALICE) use case, because what is really killing us is the fact that the memory is not shared among the many processes we have. Any reason why that cannot be done? Did I misunderstand something?

vgvassilev commented 1 year ago

I will provide you the output for ROOTDEBUG=7.

Could you elaborate a bit on the mmap?

The pcm files are "mmapped" when being opened. That practically reserves some virtual memory with the promise that it won't increase your RSS unless something is actually needed from that file. Here is how this is achieved: https://github.com/root-project/root/blob/6eb7f64bd7d93c492226b1a5ea5048a149814278/interpreter/llvm/src/tools/clang/lib/Serialization/ModuleManager.cpp#L216

I believe your issue is that, due to some identifier lookup, we start loading pcm files which have sections that require eager deserialization, at which point the mmap manifests as a real RSS increase.

I would expect that using MAP_SHARED would be enough of a workaround for our (ALICE) use case, because what is really killing us is the fact the memory is not shared among the many processes we have. Any reason why that cannot be done? Did I misunderstand something?

The reason is that the serialization in Clang has deficiencies and reads from disk when a pcm file is loaded. I've been hunting down these cases and sometimes we could avoid them. That's why I was looking for some output that could help us do that.

That being said, we could make some effort to split the startup phase of ROOT into loading and initialization. Then we could move the registration of pcm files and the setup of ROOT's runtime into the initialization step, but realistically, if you use ROOT for anything you'd probably need both...

ktf commented 1 year ago

Ok, so you are saying that it's not the mmapping that causes the problem, but the subsequent allocations due to some deserialisation. Ok, I checked and indeed the mmap is called with MAP_SHARED...

Please find attached the ROOTDEBUG=7 log. This is only for the component which reads the flat root files (i.e. only basic types).

log.txt

vgvassilev commented 1 year ago

According to the log ROOT is looking up Experimental (which is a namespace) and pulls in all modules that have namespace partitions with the name Experimental. We have tried to improve that with https://github.com/root-project/root/pull/10969 but probably missed some bit. Would applying it help?

ktf commented 1 year ago

The behavior seems to have indeed changed, yet I still see the same amount of RSS being allocated in the same codepath. See attached log.

log.txt.gz

ktf commented 1 year ago

Anything else which I can try here?

vgvassilev commented 1 year ago

@ktf,

cat /Users/vvassilev/Downloads/log\ 2.txt | grep 'on demand'
[89725:internal-dpl-aod-reader]: Loading 'Core' on demand for 'TArrayF'
[89725:internal-dpl-aod-reader]: Loading 'Cling_Runtime' on demand for 'T'
[89725:internal-dpl-aod-reader]: Loading 'Smatrix' on demand for 'MatRepStd'
[89725:internal-dpl-aod-reader]: Loading 'MathCore' on demand for 'R'
[89725:internal-dpl-aod-reader]: Loading 'std' on demand for 'complex'
[89725:internal-dpl-aod-reader]: Loading 'GenVector' on demand for 'Cartesian2D'
[89725:internal-dpl-aod-reader]: Loading 'ROOT_Foundation_Stage1_NoRTTI' on demand for 'TVirtualIsAProxy'
[89725:internal-dpl-aod-reader]: Loading 'Tree' on demand for 'TIOFeatures'
[89725:internal-dpl-aod-reader]: Loading 'MultiProc' on demand for 'ReadBuffer'
[89725:internal-dpl-aod-reader]: Loading 'Hist' on demand for 'TH1'
[89725:internal-dpl-aod-reader]: Loading 'TreePlayer' on demand for 'TTreeFormula'
[89725:internal-dpl-aod-reader]: Loading 'RIO' on demand for 'TMapFile'
[89725:internal-dpl-aod-reader]: Loading 'ROOT_Rtypes' on demand for 'UChar_t'

Looks quite good. However, for some reason we pre-load a lot more modules than I expect. Do you have an up-to-date ROOTSYS/lib/modules.idx file?

ktf commented 1 year ago

I have whatever is generated by building ROOT 6.28.4. I attached it. Is there a way to trim it down for a specific proxy?

modules.idx.zip

ktf commented 1 year ago

Any further idea?

ktf commented 1 year ago

Anything else we can try here? I would really love to get rid of the overhead from ROOT.

vgvassilev commented 1 year ago

Anything else we can try here? I would really love to get rid of the overhead from ROOT.

Apologies for being slow. I have been attending a workshop these past several days... This is what ROOT (without the suggested patch) does for me:

ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep Preloading | wc -l
21
vvassilev@vv-nuc /build/vvassilev/root_build $ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l
35

Essentially it loads 56 pcm files at startup. That's still pretty bad but not 100.

/usr/bin/time -v root.exe -l -b -q shows:

        User time (seconds): 0.08
        System time (seconds): 0.04
        Maximum resident set size (kbytes): 179344

Still probably too much memory being used, but that's what the state of the art in modules technology allows us.

The numbers are not great, but I cannot reproduce the pcm loads that you report. Can you prepare some debug environment for me to check it out? I will need a standard build of ROOT, perhaps with -DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug.

ktf commented 1 year ago

TL;DR: I repeated your tests with our environment; see below for the actual setup. Without the root file being opened, I get more preloaded dictionaries (119), but roughly the same memory footprint. Adding a ROOT file on the command line adds an additional 60MB, and I see SOFIE and PyMVA being loaded on demand (see also #13055).

Long story:

You can set up the same environment by going to lxplus and doing:

→ ALIBUILD_ARCH_PREFIX="Packages" WORK_DIR=/cvmfs/alice.cern.ch/el7-x86_64 . /cvmfs/alice.cern.ch/el7-x86_64/Packages/ROOT/v6-28-04-9/etc/profile.d/init.sh
/cvmfs/alice.cern.ch/el7-x86_64/Packages/AliEn-Runtime/v2-19-le-136/etc/profile.d/init.sh:source:7: no such file or directory: /cvmfs/alice.cern.ch/el7-x86_64/Packages/ApMon-CPP/v2.2.8-alice5-40/etc/profile.d/init.sh

You can safely ignore the ApMon message.

Without the root file, I still see over one hundred modules preloaded, while the number of "on demand" loads and the memory usage are roughly the same.

# eulisse at lxplus707.cern.ch in ~ [9:39:17]
→ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep Preloading | wc -l
119

# eulisse at lxplus707.cern.ch in ~ [9:39:24]
→ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l

34

# eulisse at lxplus707.cern.ch in ~ [9:39:57]
→ /usr/bin/time -v root.exe -l -b -q

        Command being timed: "root.exe -l -b -q"
        User time (seconds): 0.22
        System time (seconds): 0.26
        Percent of CPU this job got: 62%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.78
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 139016
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 56647
        Voluntary context switches: 5479
        Involuntary context switches: 23
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

With the root file being loaded, the preloading stays almost the same, while "on demand" goes to 41. This also results in a 60MB jump in memory:

→ ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep Preloading | wc -l
120

→ ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l
41

→ /usr/bin/time -v root.exe -l -b -q ~/public/AO2D.root

Attaching file /afs/cern.ch/user/e/eulisse/public/AO2D.root as _file0...
(TFile *) 0x3910f20
Command exited with non-zero status 255
        Command being timed: "root.exe -l -b -q /afs/cern.ch/user/e/eulisse/public/AO2D.root"
        User time (seconds): 0.42
        System time (seconds): 0.30
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.01
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 207928
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 76766
        Voluntary context switches: 9921
        Involuntary context switches: 28
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 255

The difference indeed seems to come from loading SOFIE and PyMVA on demand (why?).

→ diff -Naur =(ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep Preloading) =(ROOTDEBUG=7 root.exe -l -b -q  2>&1 | grep Preloading)
--- /tmp/zsh3krahi      2023-06-28 09:47:35.214307973 +0200
+++ /tmp/zshDEGUqN      2023-06-28 09:47:35.843315522 +0200
@@ -117,4 +117,3 @@
 Info in <TCling::__LoadModule>: Preloading module MultiProc.
 Info in <TCling::__LoadModule>: Preloading module Imt.
 Info in <TCling::__LoadModule>: Preloading module MathCore.
-Info in <TCling::__LoadModule>: Preloading module Tree.
→ diff -Naur =(ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq) =(ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq)
--- /tmp/zshD2FXJ0      2023-06-28 09:48:55.414270426 +0200
+++ /tmp/zsh1jCtJt      2023-06-28 09:48:56.293280974 +0200
@@ -1,5 +1,4 @@
 'Cling_Runtime'
-'Cling_Runtime_Extra'
 'Core'
 'Foam'
 'Genetic'
@@ -9,7 +8,6 @@
 'Hist'
 'HistFactory'
 'Imt'
-'libc'
 'MathCore'
 'MathMore'
 'Minuit'
@@ -17,7 +15,6 @@
 'MultiProc'
 'Net'
 'NetxNG'
-'PyMVA'
 'Rint'
 'RIO'
 'RooFitCore'
@@ -26,16 +23,12 @@
 'RooStats'
 'ROOTDataFrame'
 'ROOT_Foundation_Stage1_NoRTTI'
-'ROOT_Rtypes'
-'ROOTTMVASofie'
-'ROOTTMVASofieParser'
 'ROOTTreeViewer'
 'ROOTVecOps'
 'ROOTWebDisplay'
 'Smatrix'
 'std'
 'Thread'
-'TMVA'
 'Tree'
 'TreePlayer'
 'XMLIO'

Full list of modules being preloaded can be found below.

Info in <TCling::__LoadModule>: Preloading module libc.
Info in <TCling::__LoadModule>: Preloading module std.
Info in <TCling::__LoadModule>: Preloading module _Builtin_intrinsics.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_C.
Info in <TCling::__LoadModule>: Preloading module ROOT_Config.
Info in <TCling::__LoadModule>: Preloading module ROOT_Rtypes.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_Stage1_NoRTTI.
Info in <TCling::__LoadModule>: Preloading module Core.
Info in <TCling::__LoadModule>: Preloading module Rint.
Info in <TCling::__LoadModule>: Preloading module RIO.
Info in <TCling::__LoadModule>: Preloading module MathCore.
Info in <TCling::__LoadModule>: Preloading module Hist.
Info in <TCling::__LoadModule>: Preloading module Unfold.
Info in <TCling::__LoadModule>: Preloading module RHTTPSniff.
Info in <TCling::__LoadModule>: Preloading module Imt.
Info in <TCling::__LoadModule>: Preloading module HistPainter.
Info in <TCling::__LoadModule>: Preloading module PyMVA.
Info in <TCling::__LoadModule>: Preloading module RMySQL.
Info in <TCling::__LoadModule>: Preloading module RHTTP.
Info in <TCling::__LoadModule>: Preloading module FitPanel.
Info in <TCling::__LoadModule>: Preloading module ROOTVecOps.
Info in <TCling::__LoadModule>: Preloading module ProofDraw.
Info in <TCling::__LoadModule>: Preloading module Quadp.
Info in <TCling::__LoadModule>: Preloading module GeomPainter.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_Stage1_NoRTTI.
Info in <TCling::__LoadModule>: Preloading module xlocale.h.
Info in <TCling::__LoadModule>: Preloading module Genetic.
Info in <TCling::__LoadModule>: Preloading module Eve.
Info in <TCling::__LoadModule>: Preloading module TreeViewer.
Info in <TCling::__LoadModule>: Preloading module Physics.
Info in <TCling::__LoadModule>: Preloading module ROOTTMVASofieParser.
Info in <TCling::__LoadModule>: Preloading module EG.
Info in <TCling::__LoadModule>: Preloading module Tree.
Info in <TCling::__LoadModule>: Preloading module HistFactory.
Info in <TCling::__LoadModule>: Preloading module Spectrum.
Info in <TCling::__LoadModule>: Preloading module Monalisa.
Info in <TCling::__LoadModule>: Preloading module Matrix.
Info in <TCling::__LoadModule>: Preloading module Html.
Info in <TCling::__LoadModule>: Preloading module Hist.
Info in <TCling::__LoadModule>: Preloading module GuiHtml.
Info in <TCling::__LoadModule>: Preloading module _Builtin_intrinsics.
Info in <TCling::__LoadModule>: Preloading module Gpad.
Info in <TCling::__LoadModule>: Preloading module TMVAGui.
Info in <TCling::__LoadModule>: Preloading module Postscript.
Info in <TCling::__LoadModule>: Preloading module RGL.
Info in <TCling::__LoadModule>: Preloading module Fumili.
Info in <TCling::__LoadModule>: Preloading module Net.
Info in <TCling::__LoadModule>: Preloading module Geom.
Info in <TCling::__LoadModule>: Preloading module RooFitMore.
Info in <TCling::__LoadModule>: Preloading module X3d.
Info in <TCling::__LoadModule>: Preloading module Cling_Runtime.
Info in <TCling::__LoadModule>: Preloading module EGPythia6.
Info in <TCling::__LoadModule>: Preloading module SPlot.
Info in <TCling::__LoadModule>: Preloading module RooFit.
Info in <TCling::__LoadModule>: Preloading module RCsg.
Info in <TCling::__LoadModule>: Preloading module ROOT_Config.
Info in <TCling::__LoadModule>: Preloading module Rint.
Info in <TCling::__LoadModule>: Preloading module XMLParser.
Info in <TCling::__LoadModule>: Preloading module MultiProc.
Info in <TCling::__LoadModule>: Preloading module RooStats.
Info in <TCling::__LoadModule>: Preloading module RooFitRDataFrameHelpers.
Info in <TCling::__LoadModule>: Preloading module GeomBuilder.
Info in <TCling::__LoadModule>: Preloading module Proof.
Info in <TCling::__LoadModule>: Preloading module Ged.
Info in <TCling::__LoadModule>: Preloading module Recorder.
Info in <TCling::__LoadModule>: Preloading module GuiBld.
Info in <TCling::__LoadModule>: Preloading module ROOTWebDisplay.
Info in <TCling::__LoadModule>: Preloading module RooFitCore.
Info in <TCling::__LoadModule>: Preloading module Gui.
Info in <TCling::__LoadModule>: Preloading module SQLIO.
Info in <TCling::__LoadModule>: Preloading module XMLIO.
Info in <TCling::__LoadModule>: Preloading module ROOT_Rtypes.
Info in <TCling::__LoadModule>: Preloading module std.
Info in <TCling::__LoadModule>: Preloading module RIO.
Info in <TCling::__LoadModule>: Preloading module GX11TTF.
Info in <TCling::__LoadModule>: Preloading module ROOTTMVASofie.
Info in <TCling::__LoadModule>: Preloading module ProofPlayer.
Info in <TCling::__LoadModule>: Preloading module ASImage.
Info in <TCling::__LoadModule>: Preloading module libc.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_C.
Info in <TCling::__LoadModule>: Preloading module MathMore.
Info in <TCling::__LoadModule>: Preloading module RooFitHS3.
Info in <TCling::__LoadModule>: Preloading module Foam.
Info in <TCling::__LoadModule>: Preloading module SpectrumPainter.
Info in <TCling::__LoadModule>: Preloading module Minuit2.
Info in <TCling::__LoadModule>: Preloading module Core.
Info in <TCling::__LoadModule>: Preloading module MLP.
Info in <TCling::__LoadModule>: Preloading module ROOTDataFrame.
Info in <TCling::__LoadModule>: Preloading module GenVector.
Info in <TCling::__LoadModule>: Preloading module _Builtin_stddef_max_align_t.
Info in <TCling::__LoadModule>: Preloading module NetxNG.
Info in <TCling::__LoadModule>: Preloading module Minuit.
Info in <TCling::__LoadModule>: Preloading module RootAuth.
Info in <TCling::__LoadModule>: Preloading module TMVA.
Info in <TCling::__LoadModule>: Preloading module Graf3d.
Info in <TCling::__LoadModule>: Preloading module ASImageGui.
Info in <TCling::__LoadModule>: Preloading module Graf.
Info in <TCling::__LoadModule>: Preloading module GX11.
Info in <TCling::__LoadModule>: Preloading module Gdml.
Info in <TCling::__LoadModule>: Preloading module ROOTTreeViewer.
Info in <TCling::__LoadModule>: Preloading module ProofBench.
Info in <TCling::__LoadModule>: Preloading module Cling_Runtime_Extra.
Info in <TCling::__LoadModule>: Preloading module MathCore.
Info in <TCling::__LoadModule>: Preloading module Gviz3d.
Info in <TCling::__LoadModule>: Preloading module WebGui6.
Info in <TCling::__LoadModule>: Preloading module ROOTTPython.
Info in <TCling::__LoadModule>: Preloading module TreePlayer.
Info in <TCling::__LoadModule>: Preloading module RooFitJSONInterface.
Info in <TCling::__LoadModule>: Preloading module Smatrix.
Info in <TCling::__LoadModule>: Preloading module SessionViewer.
Info in <TCling::__LoadModule>: Preloading module Thread.
Info in <TCling::__LoadModule>: Preloading module Core.
Info in <TCling::__LoadModule>: Preloading module Rint.
Info in <TCling::__LoadModule>: Preloading module Thread.
Info in <TCling::__LoadModule>: Preloading module RIO.
Info in <TCling::__LoadModule>: Preloading module Net.
Info in <TCling::__LoadModule>: Preloading module MultiProc.
Info in <TCling::__LoadModule>: Preloading module Imt.
Info in <TCling::__LoadModule>: Preloading module MathCore.

while the full list of on demand modules is:

'Cling_Runtime'
'Cling_Runtime_Extra'
'Core'
'Foam'
'Genetic'
'GenVector'
'Geom'
'Gpad'
'Hist'
'HistFactory'
'Imt'
'libc'
'MathCore'
'MathMore'
'Minuit'
'Minuit2'
'MultiProc'
'Net'
'NetxNG'
'PyMVA'
'Rint'
'RIO'
'RooFitCore'
'RooFitHS3'
'RooFitJSONInterface'
'RooStats'
'ROOTDataFrame'
'ROOT_Foundation_Stage1_NoRTTI'
'ROOT_Rtypes'
'ROOTTMVASofie'
'ROOTTMVASofieParser'
'ROOTTreeViewer'
'ROOTVecOps'
'ROOTWebDisplay'
'Smatrix'
'std'
'Thread'
'TMVA'
'Tree'
'TreePlayer'
'XMLIO'

the file used can be found in /afs/cern.ch/user/e/eulisse/public/AO2D.root.

ktf commented 1 year ago

Ok, indeed SOFIE gets loaded because of the Experimental namespace, I guess. I do not see it with your patch. I will try to do a full build on CVMFS with it applied.

ktf commented 1 year ago

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

vgvassilev commented 1 year ago

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

Yes! Very likely.

vgvassilev commented 1 year ago

@ktf, I was just on a call with @hahnjo and it seems we can re-introduce a memory optimization that we lost over the last llvm upgrade. That is, a version of https://github.com/root-project/root/pull/12062/commits/5280ade7529803007742b004a3afa1f72360d6b3 .

ktf commented 1 year ago

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

Yes! Very likely.

That's actually it. If I remove the Experimental namespace and simply use Internal::TBulkBranchRead, it does not load the other Experimental bits either.

ktf commented 1 year ago

and I just realised that the "big modules" are not necessarily big, they just happen to trigger the resizing of some pool. https://github.com/root-project/root/commit/5280ade7529803007742b004a3afa1f72360d6b3 seems to give back 4-5 MB: nothing dramatic.

vgvassilev commented 1 year ago

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

Yes! Very likely.

That's actually it. If I remove the Experimental namespace and simply use Internal::TBulkBranchRead, it does not load the other Experimental bits either.

However, this in practice shows a more generic design problem in the clang modules system. That is, when we look up a namespace identifier, clang (rightfully) tries to collect all namespace partitions from all reachable modules. This has to do with things like overload resolution. Due to the autoloading system, ROOT essentially considers all modules reachable and thus loads them. The only reliable way to fix this is to make the clang module loads a no-op, which is a bit of a challenge…

vgvassilev commented 1 year ago

and I just realised that the "big modules" are not necessarily big, they just happen to trigger the resizing of some pool. 5280ade seems to give back 4-5 MB: nothing dramatic.

That pool resizing has to do with source locations, right? Can you show the stack trace?

ktf commented 1 year ago

That pool resizing has to do with source locations, right? Can you show the stack trace?

Apparently so (assuming SLoc is source locations):

  12.22 MB     start        
  12.22 MB     main /Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/main/src/rmain.cxx:84   
  12.22 MB     TRint::TRint(char const*, int*, char**, void*, int, bool, bool)  /Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/rint/src/TRint.cxx:175 
  12.22 MB     TApplication::TApplication(char const*, int*, char**, void*, int)    /Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/base/src/TApplication.cxx:153  
  12.22 MB     ROOT::Internal::GetROOT2()   /Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/base/src/TROOT.cxx:385 
  12.22 MB     TROOT::InitInterpreter() /Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/base/src/TROOT.cxx:2072    
  12.22 MB     CreateInterpreter    /Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/metacling/src/TCling.cxx:614   
  12.22 MB     TCling::TCling(char const*, char const*, char const* const*, void*)  /Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/interpreter/cling/include/cling/Interpreter/Interpreter.h:0 
  12.22 MB     cling::Interpreter::Interpreter(int, char const* const*, char const*, std::__1::vector<std::__1::shared_ptr<clang::ModuleFileExtension>, std::__1::allocator<std::__1::shared_ptr<clang::ModuleFileExtension>>> const&, void*, bool, cling::Interpreter const*)      
  12.22 MB     cling::Interpreter::Initialize(bool, bool, llvm::SmallVectorImpl<llvm::StringRef>&)      
  12.21 MB     cling::Interpreter::declare(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, cling::Transaction**)       
  12.21 MB     cling::IncrementalParser::Compile(llvm::StringRef, cling::CompilationOptions const&)     
  12.21 MB     cling::IncrementalParser::ParseInternal(llvm::StringRef)     
  12.21 MB     clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&, bool)       
  12.21 MB     clang::Preprocessor::Lex(clang::Token&)      
  12.21 MB     clang::Lexer::LexTokenInternal(clang::Token&, bool)      
  12.21 MB     clang::Preprocessor::HandleDirective(clang::Token&)      
  12.21 MB     clang::Preprocessor::HandleIncludeDirective(clang::SourceLocation, clang::Token&, clang::DirectoryLookup const*, clang::FileEntry const*)        
  12.21 MB     clang::Preprocessor::HandleHeaderIncludeOrImport(clang::SourceLocation, clang::Token&, clang::Token&, clang::SourceLocation, clang::DirectoryLookup const*, clang::FileEntry const*)     
  12.21 MB     clang::CompilerInstance::loadModule(clang::SourceLocation, llvm::ArrayRef<std::__1::pair<clang::IdentifierInfo*, clang::SourceLocation>>, clang::Module::NameVisibilityKind, bool)       
  12.21 MB     clang::CompilerInstance::findOrCompileModuleAndReadAST(llvm::StringRef, clang::SourceLocation, clang::SourceLocation, bool)      
  12.21 MB     clang::ASTReader::ReadAST(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, unsigned int, llvm::SmallVectorImpl<clang::ASTReader::ImportedSubmodule>*)       
  12.15 MB     clang::ASTReader::ReadASTBlock(clang::serialization::ModuleFile&, unsigned int)      
   8.20 MB     clang::SourceManager::AllocateLoadedSLocEntries(unsigned int, unsigned int)      
   8.20 MB     llvm::SmallVectorBase<unsigned int>::grow_pod(void*, unsigned long, unsigned long)       
   8.20 MB     realloc      

However, I also see other bumps. For example, I see 14MB which seems to be allocated for QualType, whatever that means.

vgvassilev commented 1 year ago

8.20 MB clang::SourceManager::AllocateLoadedSLocEntries(unsigned int, unsigned int)

Yes, that's what I was talking about. That's the relevant line of code that we need to "fix" in clang. And this is the operation we try hard to avoid, by delaying the module loads as long as possible with the global module index (modules.idx) file.

However, the QualType thing is new to me. Do you think we go through here? One can see the module file contents by running in ROOT:

make llvm-bcanalyzer
llvm-bcanalyzer $ROOTSYS/lib/some_file.pcm

ktf commented 1 year ago

If I understand correctly from:

for x in lib/*.pcm; do cat $x | ./llvm-bcanalyzer | grep SPECIAL_TYPES; done | wc -l

every single PCM we have has 137 SPECIAL_TYPES. See here a dump of RooStats.pcm, which seems to be responsible for the bump.

I am now trying to do a debug build to check if we go through the line you suggested.

ktf commented 1 year ago

So I can also confirm we go through ASTReader.cpp:3300 and that indeed ASTReader.cpp:3179 grows to over 1.5M entries over the course of execution. In particular, some pcms like Genetic.pcm, RooStats.pcm, ROOTDataFrame and MathCore seem to be responsible for large jumps (>100K entries).

vgvassilev commented 1 year ago

@ktf, this is very useful information. I am adding @ChuanqiXu9 and @zygoloid who work in the area of modules and that information might be also interesting to them.

Do we copy into each pcm the special types from all of its dependencies? I don't see how we can get to 1.5M entries.

I am wondering if we can do anything about the special types in general…

ktf commented 1 year ago

Glad to help. Notice that the vector which is expanded is not SpecialTypes but TypesLoaded at ASTReader.cpp:3179.

ktf commented 1 year ago

On a side note, could you also elaborate on what the special types are?

ChuanqiXu9 commented 1 year ago

I just took a quick look at this. The problems stated so far are:

  1. Clang will load modules eagerly.
  2. The name lookup is not so cheap.
  3. We will load a lot of source location informations.

Do I understand right? Then we can try to add these as separate issues on the LLVM issues pages, and work on them separately.

vgvassilev commented 1 year ago

I just took a quick look on this. The problems stated now is:

1. Clang will load modules eagerly.

That's what our system does. We want to preload all modules as if we have something like import *.

2. The name lookup is not so cheap.

The name lookup is cheap per se. However, the module load is not a no-op due to the number of things we still eagerly deserialize upon load.

3. We will load a lot of source location informations.

Yes, that is right. The problem comes from just a single line in clang. However, fixing this would be very hard and might require reworking the source location management. That would probably be a good thing for clang, since it is a bit inefficient now. We might be able to use some sort of binary tree to model them.

In fact the other issue might be easier to fix and less intrusive. That is, the way we load types eagerly here.

The third major issue which is solved for us but not yet upstream is https://reviews.llvm.org/D41416 Some polishing is done by @hahnjo in root-project/root#13139.

Do I understand right? Then we can try to add these as separate issues on the LLVM issues pages, and work on them separately.

I am not sure if issues will help but we can add them if you prefer.

ChuanqiXu9 commented 1 year ago
2. The name lookup is not so cheap.

The name lookup is cheap per se. However, the module load is not a no-op due to the number of things we still eagerly deserialize upon load.

Maybe that's because we have different workloads. In our workloads there are many redundant/duplicated named declarations, so while a single name lookup is cheap, it is not cheap to look up a name with many declarations. But it sounds good that you haven't hit that problem.

3. We will load a lot of source location informations.

Yes, that is right. The problem comes from just a single line in clang. However, fixing this would be very hard and might require reworking the source location management. That would probably be a good thing for clang, since it is a bit inefficient now. We might be able to use some sort of binary tree to model them.

I heard you talk about this before. It looks interesting. And, similar to the above, I feel it would be more interesting to merge/re-use the source locations. Did you run into the problem of duplicated declarations?

For example, in C++20 modules, it is common:

// A.cppm
module;
#include <iostream>
export module A;
...

// B.cppm
module;
#include <iostream>
export module B;
...

// C.cpp
import A;
import B;

Did you have similar cases?

In fact, the other issue might be easier to fix and less intrusive: the way we load types eagerly here.

I am not sure how feasible it is to load types lazily, since many types are constructed in ASTContext locally (e.g., some types for CXXRecords). And I am not sure how we can load types before deserializing the decls.

The third major issue, which is solved for us but not yet upstream, is https://reviews.llvm.org/D41416. Some polishing is done by @hahnjo in #13139.

Looking forward to this!

Do I understand right? Then we can try to add these things separately to the LLVM issues page, and we can try to work on them separately.

I am not sure if issues will help but we can add them if you prefer.

I just feel it may be easier to track by putting things together :) There is no hurry, though.

vgvassilev commented 1 year ago
2. The name lookup is not so cheap.

The name lookup is cheap per se. However, the module load is not a no-op due to the number of things we still eagerly deserialize upon load.

Maybe that is because we have different workloads. In our workloads there are many redundant/duplicated named declarations. So while a single name lookup is cheap, it is not cheap to look up a name with many declarations. But it sounds good that you didn't run into these problems.

3. We will load a lot of source location informations.

Yes, that is right. The problem comes from just a single line in clang. However, fixing this would be very hard and might require reworking the source location management. That would probably be a good thing for clang, since it is a bit inefficient now. We might be able to use some sort of binary tree to model them.

I heard you talk about this before. It looks interesting. And, similar to the above, I feel it would be more interesting to merge/re-use the source locations. Did you run into the problem of duplicated declarations?

For example, in C++20 modules, it is common:

// A.cppm
module;
#include <iostream>
export module A;
...

// B.cppm
module;
#include <iostream>
export module B;
...

// C.cpp
import A;
import B;

Did you have similar cases?

If the question is whether we have a lot of duplication due to textual inclusion of header files, we have almost none. We have worked very, very hard to modularize things bottom-up, including libc, libstdc++, libxml, boost, etc.

In fact the other issue might be easier to fix and less intrusive. That is, the way we load types eagerly here.

I am not sure how feasible it is to load types lazily, since many types are constructed in ASTContext locally (e.g., some types for CXXRecords). And I am not sure how we can load types before deserializing the decls.

It feels feasible to me, since none of these types are needed until they are actually used. We could possibly store their hash and offset and deserialize them on demand. Pretty much the same as what @zygoloid suggested long ago for the template specializations.
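The hash-and-offset idea above can be sketched roughly like this — a minimal, hypothetical illustration (the names `LazyTypeTable` and `ReadAt` are invented for the sketch; the real AST reader is far more involved): at module load time only the offsets into the serialized stream are recorded, and the deserialization cost is paid on first access.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: instead of materializing every type at module load,
// keep only its offset in the serialized stream and deserialize on demand.
struct LazyTypeTable {
  // Offset of each type record in the (simulated) AST file.
  std::vector<uint64_t> Offsets;
  // Cache of already-deserialized entries; absence means "not loaded yet".
  std::unordered_map<size_t, std::string> Cache;
  // Stand-in for the real deserializer reading a record at a given offset.
  std::function<std::string(uint64_t)> ReadAt;

  const std::string &get(size_t Index) {
    auto It = Cache.find(Index);
    if (It == Cache.end()) // first access: pay the deserialization cost now
      It = Cache.emplace(Index, ReadAt(Offsets[Index])).first;
    return It->second;
  }
};
```

Loading a module then only fills `Offsets`; no entry is read until `get()` is called for it, which is the on-demand behaviour discussed above.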

I just feel it may be easier to track by putting things together :) There is no hurry, though.

I can open these issue reports. It would take me some time to put sufficient information to make them actionable though.

ChuanqiXu9 commented 1 year ago

If the question is whether we have a lot of duplication due to textual inclusion of header files, we have almost none. We have worked very, very hard to modularize things bottom-up, including libc, libstdc++, libxml, boost, etc.

Yes, this is our problem : (

It feels feasible to me, since none of these types are needed until they are actually used. We could possibly store their hash and offset and deserialize them on demand.

I am still confused about the intent. How would you know ahead of time which types need to be loaded?

I can open these issue reports. It would take me some time to put sufficient information to make them actionable though.

Thank you for your time.

ktf commented 1 year ago

Thank you for the nice investigations. What are the next steps? I tried #13139 but that does not help us. I see a different memory profile, but the extra memory due to ROOT is still around 70 MB (even a bit more with that PR, AFAICT).

Besides actually solving the module loading problem, I was wondering whether there is something pathological in some of the pcms. As I said, Genetic.pcm, RooStats.pcm, ROOTDataFrame and MathCore seem to be particularly responsible for the growth of the TypesLoaded vector. Maybe we can already get some gain via some header cleanups?

vgvassilev commented 1 year ago

Thank you for the nice investigations. What are the next steps? I tried #13139 but that does not help us. I see a different memory profile, but the extra memory due to ROOT is still around 70 MB (even a bit more with that PR, AFAICT).

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss.

Besides actually solving the module loading problem, I was wondering whether there is something pathological in some of the pcms. As I said, Genetic.pcm, RooStats.pcm, ROOTDataFrame and MathCore seem to be particularly responsible for the growth of the TypesLoaded vector. Maybe we can already get some gain via some header cleanups?

That's my feeling as well. Could you apply the patch below and rebuild the dictionaries for the modules which have the most types? Note that it will probably print a lot of text on screen...

diff --git a/interpreter/llvm-project/clang/lib/Serialization/ASTWriter.cpp b/interpreter/llvm-project/clang/lib/Serialization/ASTWriter.cpp
index d0895a76df..21038e4eac 100644
--- a/interpreter/llvm-project/clang/lib/Serialization/ASTWriter.cpp
+++ b/interpreter/llvm-project/clang/lib/Serialization/ASTWriter.cpp
@@ -2866,6 +2866,7 @@ void ASTWriter::WriteType(QualType T) {

   // Record the offset for this type.
   unsigned Index = Idx.getIndex() - FirstTypeID;
+  T.dump();
   if (TypeOffsets.size() == Index)
     TypeOffsets.emplace_back(Offset);
   else if (TypeOffsets.size() < Index) {

Apologies for not being more helpful but currently I am in between some internal deadlines...

ktf commented 1 year ago

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss.

Yes, that's opening the file. I will try and let you know.

That's my feeling as well. Could you apply that patch and rebuild the dictionaries for the modules which have most types?

Doing that now.

Apologies for not being more helpful but currently I am in between some internal deadlines...

No worries and thank you for all the help.

hahnjo commented 1 year ago

My understanding is that one possible optimization is related to auto-loading of namespaces. At the moment, we load all modules, which can be a problem for "widespread" namespaces such as ROOT::Experimental. However, we need this for ROOT::RDF and so on which live in libraries / modules that are not eagerly loaded.

FWIW yesterday I thought about this, and it seems to me that once we know that something is a namespace that we've already seen somewhere, we don't actually need to do anything - contained classes will be loaded on demand from the modules that actually contain them. My idea was to add a check to TClingCallbacks::LookupObject(const DeclContext* DC, DeclarationName Name) and if Name already resolves to a known namespace, just return false. That's a bit tricky though because if we do a name lookup from there, it will recursively call Cling again. We might be able to cut this loop with a boolean flag that we set just before doing the lookup, but I stopped my experiments here. If somebody wants to try it further, be my guest :sweat_smile:

ktf commented 1 year ago

https://cernbox.cern.ch/s/J7w2JH0D0YXdcyA for the 1.6GB log...

vgvassilev commented 1 year ago

My understanding is that one possible optimization is related to auto-loading of namespaces. At the moment, we load all modules, which can be a problem for "widespread" namespaces such as ROOT::Experimental. However, we need this for ROOT::RDF and so on which live in libraries / modules that are not eagerly loaded.

FWIW yesterday I thought about this, and it seems to me that once we know that something is a namespace that we've already seen somewhere, we don't actually need to do anything - contained classes will be loaded on demand from the modules that actually contain them. My idea was to add a check to TClingCallbacks::LookupObject(const DeclContext* DC, DeclarationName Name) and if Name already resolves to a known namespace, just return false. That's a bit tricky though because if we do a name lookup from there, it will recursively call Cling again. We might be able to cut this loop with a boolean flag that we set just before doing the lookup, but I stopped my experiments here. If somebody wants to try it further, be my guest 😅

Do you mean this https://github.com/root-project/root/pull/10969 We tried something along these lines but see the verdict...

hahnjo commented 1 year ago

Do you mean this #10969 We tried something along these lines but see the verdict...

Yes, we need the namespaces in the GMI for things like ROOT::RDF. But if the AST already knows about a namespace, I'm not sure if we actually need to do anything when presented with that DeclarationName...

vgvassilev commented 1 year ago

https://cernbox.cern.ch/s/J7w2JH0D0YXdcyA for the 1.6GB log...

It seems we are hit by code with a lot of template code. Very roughly, there are a lot of *SpecializationType entries and they are quite big in terms of memory representation:

vvassilev@vv-nuc /tmp $ cat log | grep Type | wc -l
10165751
vvassilev@vv-nuc /tmp $ cat log | grep SpecializationType | wc -l
566599

I am not sure what we could do. So far I cannot find a recurring pattern, except for a bunch of instantiations that might be part of libstdc++ (e.g. std.pcm); however, I cannot confirm that is the issue, since the printing does not include where a subtype comes from. I guess I am stuck and I might need to disturb @zygoloid again...

ktf commented 1 year ago

Ok, let me know if you have other tests which you would like me to do.

One more question, for my understanding: are those big vectors actually needed at all when reading a file? If not, why aren't they swapped out to disk, either implicitly via some madvise or by simply caching the results to disk? The latter might even give some startup advantage, assuming they are offsets into the pcms and they do not change (at least not for the basic ROOT ones).

vgvassilev commented 1 year ago

Ok, let me know if you have other tests which you would like me to do.

I was a bit surprised to learn that:

Thank you for the nice investigations. What are the next steps? I tried #13139 but that does not help us. I see a different memory profile, but the extra memory due to ROOT is stil around 70MB (even a bit more with that PR, AFAICT).

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss.

ping.

One more question, for my understanding: are those big vectors actually needed at all when reading a file? If not, why aren't they swapped out to disk, either implicitly via some madvise or by simply caching the results to disk? The latter might even give some startup advantage, assuming they are offsets into the pcms and they do not change (at least not for the basic ROOT ones).

They are not needed upon module load. We tried to implement some sparse vectors to mitigate this issue; however, I feel we did not have the persistence to make it work. That is, we can implement a lazy vector structure that behaves just like a vector, but whose reserve operation does not allocate. Instead, we allocate slabs around the access pattern (we override the subscript operator, etc.). This would help us avoid eager allocations and, hopefully, optimize things depending on whether there is a clear access pattern. Our implementation still seems to be here.
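The lazy-vector structure described above can be sketched in a few lines — a minimal illustration under the stated assumptions (the name `LazyVector` and the slab size are invented here; the actual implementations referenced in the thread differ): `resize()` only records the logical size, and storage is allocated one fixed-size slab at a time, on first touch through `operator[]`.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Minimal sketch of the "lazy vector" idea: logical size is tracked
// separately from storage; a slab is allocated (and value-initialized)
// only when an index inside it is first accessed.
template <typename T, std::size_t SlabSize = 1024>
class LazyVector {
  std::size_t LogicalSize = 0;
  std::vector<std::unique_ptr<T[]>> Slabs; // null entry = slab not allocated

public:
  void resize(std::size_t N) {
    LogicalSize = N;
    Slabs.resize((N + SlabSize - 1) / SlabSize); // null pointers only, no T's
  }
  std::size_t size() const { return LogicalSize; }
  std::size_t allocatedSlabs() const {
    std::size_t C = 0;
    for (const auto &S : Slabs)
      C += (S != nullptr);
    return C;
  }
  T &operator[](std::size_t I) {
    auto &Slab = Slabs[I / SlabSize];
    if (!Slab) // first touch of this slab: allocate and value-initialize it
      Slab = std::make_unique<T[]>(SlabSize);
    return Slab[I % SlabSize];
  }
};
```

With a million-element `resize()` the only upfront cost is the vector of null slab pointers; the per-element storage appears only around the indices the deserializer actually visits, which is the tradeoff between slab count and sparsity discussed later in the thread.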

Another easier thing to try is reviving https://reviews.llvm.org/D89749

ktf commented 1 year ago

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss. ping.

With the file.

ktf commented 1 year ago

To be clear, as I said above, the profile changed; however, the total for cling initialisation is still at 70 MB.

vgvassilev commented 1 year ago

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss. ping.

With the file.

I meant to ping on the stats without a file.

To be clear, as I said above, the profile changed; however, the total for cling initialisation is still at 70 MB.

Can you share the new profile?

ktf commented 1 year ago

So I finally managed to get back to this and try a few things, and as a result, in my tests I managed to go from 82 MB of overhead down to 62 MB. See https://github.com/root-project/root/pull/13641 for the actual code.

While there are still a few things to clean up, I think what I have is a fairly general and quite self-contained change, so I would appreciate feedback and, if people like it, instructions on how to proceed to make sure this ends up upstream.

The solution is based on a newly introduced PagedVector<T>, which allows registering callbacks that get invoked whenever the associated range is accessed for the first time. At that point, the range is actually allocated and each element of the range is passed to the callback.

Initially I designed the mechanism to be general enough to do complex initialisations of the elements; however, I later realised this is not actually needed. To address the issue with the large TypesLoaded and DeclsLoaded vectors, one only needs something that default-constructs their elements for ranges which are small enough to avoid large upfront allocations. The code currently in ROOT / clang is already smart enough to lazily initialise only the elements which are actually needed, so the whole problem is finding a good tradeoff between the number of ranges and keeping them sparse enough.

A few caveats:

* `PagedVector` is a bad name, I agree. Suggestions are welcome.
* At the moment the implementation allows generic callbacks for ranges, however as I said we only need default initialisation. Removing such flexibility should buy another 1 MB from my test.
* In case `DeclID*` and `QualType` can default to all 0, one could probably achieve the same result in a more elegant way using calloc / realloc.
* There are probably other vectors which can be optimised in the same way, in particular IdentifiersLoaded as already mentioned by @vgvassilev.

Comments?

vgvassilev commented 1 year ago

Comments?

This looks awesome!

A few caveats:

* `PagedVector` is a bad name, I agree. Suggestions are welcome.

We did not come up with a better name. Ours was SparseVector. Just to confirm: we do not pre-allocate the entire vector, but we deterministically allocate the slabs. That is, if I hardcoded an offset 101 in my code and the pre-allocation allocated just the slab [0, 100), then when I do PagedVector[101] it would put what I need where I need it, right?

* At the moment the implementation allows generic callbacks for ranges, however as I said we only need default initialisation. Removing such flexibility should buy another 1 MB from my test.

If it is not needed, let's keep it simple. It would be easier to go through the upstream llvm reviews.

* In case `DeclID*` and `QualType` can default to all 0, one could probably achieve the same result in a more elegant way using calloc / realloc.

Can we make the API of the PagedVector such that it becomes a drop-in replacement? I have not looked deeply in the PR but I see some work being done on the users side.
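The calloc idea from the bullet above can be illustrated with a short sketch — hypothetical code (the struct name `ZeroDefaultTable` is invented here), assuming the element type is valid when all bits are zero: `calloc` hands back zero-filled memory, and on typical overcommitting systems the untouched pages are backed by the shared zero page, so RSS only grows for the regions actually written. That last point is an OS-dependent assumption, not a guarantee.

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch of the calloc-based zero-default table: no explicit slab
// bookkeeping is needed because the OS pages in zero-filled memory lazily
// on most systems (an assumption; behaviour is platform-dependent).
struct ZeroDefaultTable {
  uint64_t *Data = nullptr;
  std::size_t Size = 0;

  bool resizeZeroed(std::size_t N) {
    // calloc zero-fills the whole logical range up front; untouched pages
    // typically cost no resident memory until first written.
    void *P = std::calloc(N, sizeof(uint64_t));
    if (!P)
      return false;
    std::free(Data);
    Data = static_cast<uint64_t *>(P);
    Size = N;
    return true;
  }
  ~ZeroDefaultTable() { std::free(Data); }
};
```

Compared to an explicit paged vector, this is simpler but only works when "all zero bits" is a valid default for the element type, which is exactly the condition the bullet raises for `DeclID*` and `QualType`.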

* There are probably other vectors which can be optimised in the same way, in particular IdentifiersLoaded as already mentioned by @vgvassilev.

The source location offsets would be a major source of improvement if this technique flies there.

PS: It seems LLVM has some facility along these lines: https://llvm.org/doxygen/classllvm_1_1SparseBitVector.html

ktf commented 1 year ago

We did not come up with a better name. Ours was SparseVector. Just to confirm: we do not pre-allocate the entire vector, but we deterministically allocate the slabs. That is, if I hardcoded an offset 101 in my code and the pre-allocation allocated just the slab [0, 100), then when I do PagedVector[101] it would put what I need where I need it, right?

No, you would first need to register the range [100, 200) via addEmptyRange(). I could try to modify it for that behaviour, if you prefer. Maybe that would actually allow me to get rid of the logarithmic lookup (basically, one less vector to worry about).

If it is not needed, let's keep it simple. It would be easier to go through the upstream llvm reviews.

Ok, I will clean it up a bit more.

Can we make the API of the PagedVector such that it becomes a drop-in replacement? I have not looked deeply in the PR but I see some work being done on the users side.

Yes, users would need to call addEmptyRange where they now do resize(). I guess I could actually hide addEmptyRange inside resize. Is it allowed to use std::pmr::vector in the llvm codebase?

The source locations offset would be a major source of improvement if this technique flies there.

I couldn't find the source locations vector anymore. Could you point me to it?

PP: It seems llvm has some facility along these lines: https://llvm.org/doxygen/classllvm_1_1SparseBitVector.html

I will have a look.