ROOT loads unneeded PCMs

ktf commented 1 year ago

Check duplicate issues.

[X] Checked for duplicates

Describe the bug

When doing a simple TFile::Open(), ROOT initialises the interpreter on the first TEnv::GetValue needed by the aforementioned method. This makes it load all the pcms which it can find, regardless of weather they are needed to read the file or not (in our case we only have branches with basic types, e.g. floats or ints or vector of thereof). This has the net result of adding 70MB of RSS for no particular reason to any process which uses ROOT. You can find more details in the attached instruments screenshot. Moreover, single stepping with the debugger and printing out the LoadModule argument, it seems that the pcms for Sofie, Proof and 3D graphics are particularly honours, all of which we do not need and we do not link.

What is the expected behaviour?

The best would be for ROOT to be smart enough to lazily load the pcms when they are actually needed, not upfront. A second best option would be to be able to start the interpreter in a clean slate state and load by hand only the pcms which are needed to read a file.

How to reproduce?

On macOS using Instruments one can do:

xcrun xctrace record --output root.trace --time-limit 30s --template Allocations --launch -- /Users/ktf/src/sw/osx_arm64/ROOT/alice-v6-28-04-local2/bin/root.exe -q -b http://alimonitor.cern.ch/train-workdir/testdata/LFN/alice/data/2015/LHC15o/000246392/pass2/PWGZZ/Run3_Conversion/320_20220701-0813/AOD/019/AO2D.root
open root.trace

and get the report in question. Using the alice environment should not matter, but in principle you should be able to load it from CVMFS.

ROOT version

6.28.04, but it's a while we see it.

How did you install ROOT?

ALICE build environment

Which operating system are you using?

macOS, linux

Additional context

No response

vgvassilev commented 1 year ago

Yes, users would need to addEmptyRange where they now do resize(). I guess I could actually hide addEmptyRange inside the resize. Is it allowed to use std::pmr::vector in the llvm codebase?

I don't think so. I could not find uses of it in clang.

The source locations offset would be a major source of improvement if this technique flies there.

I couldn't find the source locations vector anymore. Could you point it to me?

From here you will need to jump to the clang::SourceManager object.

ktf commented 1 year ago

@vgvassilev I have updated the PR to include a similar patch for the SourceManager and at least the trivial test seems to work fine, including a nice 9MB reduction in memory allocations. I am now testing with the ALICE analysis framework.

I have also done a few of the cleanups, and it now only exposes an "expand" interface (basically resize without shrinking). AFAICT, it's not worth to implement the full "resize" functionality, given it's complicated and at least the places I fixed never shrink.

I think also the calloc approach might not be a good idea given realloc does not guarantee that the memory is zeroed and besides that page ranges might be a tad too big for vector of pointers and so on. I would put aside that idea, at least for now.

My current plan:

Verify with larger tests using ALICE analysis and asynchronous reconstruction.
Unify the two PagedVector implementations I currently have (one for ASTReader and one for SourceManager).
Use fixed size ranges, so that I can get rid of the ranges in my internal index and just keep where each page was allocated.

ktf commented 1 year ago

For the record, the new top offender is actually findOrCreateModule.

vgvassilev commented 1 year ago

For the record, the new top offender is actually findOrCreateModule.

That's probably the loading of module std.pcm...

ktf commented 1 year ago

For the record, I have done what I outlined above and updated the PR. I do see some drastic improvements for some of our workflows (250MB out of 1GB) where we have many processes initialising the interpreter. For others, where the usage of ROOT is limited to reading files in a single process, the improvement is not so obvious. Simply opening a file does show improvement as well (I am down to 49MB, when I also patch the FileInfo vector in the HeaderSearch).

The PR somehow seems to die with some old memory corruption which I am pretty sure I fixed and I cannot reproduce anymore. Is there any need to clean some cache?

Notice I have also submitted the patch to llvm itself and it passes their CI (https://github.com/llvm/llvm-project/pull/66430#issuecomment-1720164026)

hahnjo commented 1 year ago

The PR somehow seems to die with some old memory corruption which I am pretty sure I fixed and I cannot reproduce anymore. Is there any need to clean some cache?

I cannot reproduce it either, but a number of Linux builds agree that the assertion can still be triggered. PRs should be built from a clean state (in Jenkins), but even if not you are not changing the on-disk format so incremental builds must also work correctly. This needs investigation.

Notice I have also submitted the patch to llvm itself and it passes their CI (llvm/llvm-project#66430 (comment))

Very likely the pre-merge CI does not test a large scale module setup. You will have to test this yourself and prove that the changes don't regress the case when modules are used (almost) completely. This will be needed to convince the wider community that it is solving an actual problem, or at least not regressing the "normal" compiler use case.

ktf commented 1 year ago

I cannot reproduce it either, but a number of Linux builds agree that the assertion can still be triggered. PRs should be built from a clean state (in Jenkins), but even if not you are not changing the on-disk format so incremental builds must also work correctly. This needs investigation.

What I think is happening is that the broken PR generated some broken file, and the new PR is not able to survive such broken file. IMHO, this is a problem with the build setup (not being able to notice broken files and delete them) more than with my PR (which of course has no dependency on some intermediate state). Shall I just open a new PR and we see how that goes?

ktf commented 1 year ago

Very likely the pre-merge CI does not test a large scale module setup. You will have to test this yourself and prove that the changes don't regress the case when modules are used (almost) completely. This will be needed to convince the wider community that it is solving an actual problem, or at least not regressing the "normal" compiler use case.

I'll do what it takes, thank you for pointing that out.

hahnjo commented 1 year ago

I cannot reproduce it either, but a number of Linux builds agree that the assertion can still be triggered. PRs should be built from a clean state (in Jenkins), but even if not you are not changing the on-disk format so incremental builds must also work correctly. This needs investigation.

What I think is happening is that the broken PR generated some broken file, and the new PR is not able to survive such broken file. IMHO, this is a problem with the build setup (not being able to notice broken files and delete them) more than with my PR (which of course has no dependency on some intermediate state). Shall I just open a new PR and we see how that goes?

I'm telling that this is not the case - there are no left-behind files (in Jenkins) that influence future PR runs. We know this from changing the on-disk format / upgrading LLVM / etc. There is a problem and it needs to be investigated.

dpiparo commented 9 months ago

I think quite some improvements are there in 6.30.04 (released) and master. @ktf is it difficult for you to check if the issue is fixed, and, if yes, close the item? If not I can start from your repro and proceed.

ktf commented 9 months ago

Indeed the patch I provided cuts half of the overhead, however there is still 40 MB per process I cannot really justify at the moment. In our case that translates to 4 GB of RSS. While I appreciate that being completely lazy in the PCM loading is probably complicated, maybe some tactical solution could be employed (like it was done for the PagedVector). For example I am not convinced ReadSLocEntry needs to keep around the buffer. There is moreover a few more places where the PagedVector could be used effectively, I will try to propose a separate PR for that.

The reproducer is as easy as opening a ROOT file, see the main issue. A new profile is:

vgvassilev commented 9 months ago

I am wondering if we are not pushing repetitive pragma diagnostic mappings. We have tons for pragrams coming from the Linkdef files, not sure if they are needed in the module file itself...

vgvassilev commented 9 months ago

In theory, if the software stack (not just ROOT) uses modules the RSS should also go down. Maybe that's worth trying on your end...

ktf commented 9 months ago

I am wondering if we are not pushing repetitive pragma diagnostic mappings. We have tons for pragrams coming from the Linkdef files, not sure if they are needed in the module file itself...

any way i can confirm this?

vgvassilev commented 9 months ago

I am wondering if we are not pushing repetitive pragma diagnostic mappings. We have tons for pragrams coming from the Linkdef files, not sure if they are needed in the module file itself...

any way i can confirm this?

Not directly. Either we can run rootcling with -v4 and count the mappings or we can insert some printing statements on ::WritePragmaDiagnosticMapping... I suspect making the pragma diagnostic mappings lazy would be art...

dpiparo commented 2 months ago

Dear @ktf, I am closing the issue assuming that this is not an issue any more for ALICE when adopting ROOT 6.32.X. Please do not hesitate to re-open in case this is not the case, providing the context necessary for us to fix the problem.

ktf commented 2 months ago

It's not an issue for adoption. The underlying issue is still there, though. ROOT still loads a bunch of unneeded PCM when simply opening a file, it's just the cost is half what it was before and the reproducer is the same as above.

root-project / root