root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

Build performance does not scale to many cores/threads #6432

Open krasznaa opened 4 years ago

krasznaa commented 4 years ago

Explain what you would like to see improved

I know that this is very much a first-world problem, but it has been bugging me for a while. The build of ROOT using its CMake setup does not scale well to many-core systems at all. :frowning:

This is a snapshot of how ROOT 6.20/08 used my system's resources during its build:

[Image: root-6 20 08-build, system resource usage over the course of the build]

The build starts "pretty much" at the left hand side of the timeline, and lasts until "pretty much" the right hand side of it.

As you can see, the build starts out very well. Building LLVM scales perfectly to 64 threads. And I believe it would scale well to even beyond that. But once the LLVM build is done, many bottlenecks show up. First there is a big bottleneck with building libCling and rootcling, but after that the build of libRIO is also taking a surprising amount of time. And the build is stuck waiting for all of these.

Towards the end things improve a bit, as many libraries / source files can again build in parallel. But even then, the build very rarely manages to make use of all of the available cores.

Optional: share how it could be improved

From a quick glance it seems that ROOT's CMake configuration sets up way too many unnecessary dependencies between its build targets. Most of the issues seem to arise from how the dictionary generation is set up as far as I can see.

In ATLAS I use the following code to set up the generation of dictionary source files:

https://gitlab.cern.ch/atlas/atlasexternals/-/blob/master/Build/AtlasCMake/modules/AtlasDictionaryFunctions.cmake

That setup behaves much better, mainly because in ATLAS's setup dictionary generation does not need to wait for anything. Even if the library that a dictionary is being produced for depends on a number of upstream libraries, the dictionary for that library can be generated before those upstream libraries have finished building. In practice this means that the start of any ATLAS software build is dominated by dictionary generation, as GNU Make and Ninja both prefer running those build steps first (since they do not have any dependencies themselves).
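
To make the effect of those extra dependency edges concrete, here is a minimal sketch of a greedy build scheduler run on a toy project. Everything in it is an assumption for illustration only: the three libraries `A`, `B`, `C`, the task names, and the durations (4 units per dictionary, 6 per compile step, 1 per link) are hypothetical and not measured from ROOT or ATLAS. It only compares a graph where each dictionary waits for the upstream library against one where dictionaries have no dependencies.

```python
import heapq


def makespan(durations, deps, workers):
    """Greedy list scheduling: start any ready task as soon as a worker is free.

    Assumes the dependency graph is acyclic (a cycle would leave nothing to run).
    """
    remaining = {task: set(deps.get(task, ())) for task in durations}
    ready = sorted(task for task, d in remaining.items() if not d)
    running = []  # min-heap of (finish_time, task)
    now = 0
    done = 0
    while done < len(durations):
        # Fill free workers with ready tasks.
        while ready and len(running) < workers:
            task = ready.pop(0)
            heapq.heappush(running, (now + durations[task], task))
        # Advance time to the next task completion.
        now, finished = heapq.heappop(running)
        done += 1
        # Release tasks that were only waiting on the finished one.
        for task, d in remaining.items():
            if finished in d:
                d.remove(finished)
                if not d:
                    ready.append(task)
    return now


def build_graph(libs, dict_waits_for_upstream_lib):
    """Toy project: each library has a dictionary, a compile step and a link step.

    Durations are made-up units, chosen only to illustrate the shape of the problem.
    """
    durations, deps = {}, {}
    prev = None
    for name in libs:
        durations[f"dict_{name}"] = 4
        durations[f"obj_{name}"] = 6
        durations[f"lib_{name}"] = 1
        deps[f"obj_{name}"] = [f"dict_{name}"]
        deps[f"lib_{name}"] = [f"obj_{name}"]
        if prev is not None:
            # Linking always needs the upstream library.
            deps[f"lib_{name}"].append(f"lib_{prev}")
            if dict_waits_for_upstream_lib:
                # The extra edge under discussion: dictionary waits for upstream link.
                deps[f"dict_{name}"] = [f"lib_{prev}"]
        prev = name
    return durations, deps


coupled = makespan(*build_graph("ABC", True), workers=8)
decoupled = makespan(*build_graph("ABC", False), workers=8)
print("coupled:", coupled, "decoupled:", decoupled)
```

In this toy model the coupled graph is a single chain, so extra workers cannot help, while the decoupled graph lets all dictionaries and compile steps overlap and leaves only the link steps serialized. With `workers=1` both variants take the same time, which matches the observation that the problem only shows up on machines with many cores.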

The reason I blame the dictionary generation code is that regular C(++) code building with Ninja scales very well to many cores. Even when one has many small libraries in a project, Ninja can start the build of object files before all of the libraries that they depend on would've finished building. (In ATLAS's offline software the very end of a build is taken up purely by library/executable linking steps.)

To Reproduce

Unfortunately you need a pretty powerful machine to reproduce this... But once you have one, just do something similar to what I did:

cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_STANDARD=17 \
   -Dall=ON -Dbuiltin_gsl=ON -Dbuiltin_freetype=ON -Dbuiltin_lzma=ON -Dbuiltin_veccore=ON \
   -DXROOTD_ROOT_DIR=~/software/xrootd/4.12.2/x86_64-ubuntu2004-gcc9-opt \
   -DTBB_ROOT_DIR=~/software/oneTBB/2020.2/x86_64-ubuntu2004-gcc9-opt \
   -DCMAKE_INSTALL_PREFIX=~/software/root/6.20.08/x86_64-ubuntu2004-gcc9-opt ../root-6.20.08/
ninja
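
For those without a system monitor at hand, a rough way to quantify how well such a build used the machine is to look at the `.ninja_log` file that Ninja writes into the build directory. The sketch below assumes the tab-separated log layout where the first two columns are the start and end time of each build step in milliseconds and header lines start with `#`. It also assumes the log comes from a single clean build, since Ninja appends to the file across rebuilds. The function name and the sample log are made up for illustration.

```python
def average_parallelism(log_text):
    """Return (wall_seconds, average number of concurrently running steps)."""
    intervals = []
    for line in log_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# ninja log vN" header
        fields = line.split("\t")
        intervals.append((int(fields[0]), int(fields[1])))
    if not intervals:
        return 0.0, 0.0
    wall_ms = max(end for _, end in intervals) - min(start for start, _ in intervals)
    busy_ms = sum(end - start for start, end in intervals)
    parallelism = (busy_ms / wall_ms) if wall_ms else 0.0
    return wall_ms / 1000.0, parallelism


# Hypothetical three-step log: two compiles in parallel, then one link.
sample = (
    "# ninja log v5\n"
    "0\t4000\t0\ta.o\tabc\n"
    "0\t4000\t0\tb.o\tdef\n"
    "4000\t6000\t0\tlibx.so\t123\n"
)
wall, par = average_parallelism(sample)
print(f"wall time: {wall:.1f} s, average parallelism: {par:.2f}")
```

On a real build one would pass in the contents of the `.ninja_log` file from the build directory instead of `sample`. An average parallelism far below the number of available threads points at the kind of bottleneck described above.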

Setup

As mentioned earlier, I used ROOT 6.20/08 for this particular test. But the behaviour has been like this since forever. I performed the build on Ubuntu 20.04 with GCC 9, but that should make little difference to the overall behaviour.

Additional context

N/A

Axel-Naumann commented 4 years ago

I'm aware of this. This is mostly caused by dictionary dependencies. I have a prototype that fixes this; I need to invest some dev time to get this into PR quality. I.e. thanks for the report, problem acknowledged!

vgvassilev commented 7 months ago

What can be done here is rather simple. The bottleneck last time I checked is rootcling (dictionary generation). There are two reasons:

Axel-Naumann commented 7 months ago

> This is mostly caused by dictionary dependencies. I have a prototype that fixes this; I need to invest some dev time to get this into PR quality.

Moved on, giving up on this - here's what I ended up with last time I looked at it. I added some comments to explain what's happening.

(It also fixes the "changed a header included by a header that's passed to rootcling" transitive dependency issue...)

dpiparo commented 6 months ago

I personally do not think that the runtime of rootcling is the problem here, but rather the dependency tree. Of course making anything faster is good.