hcho3 / xgboost-fast-hist-perf-lab

Deeper look into performance of tree_method='hist' for multi-core CPUs

Some initial profiling output #2

Open thvasilo opened 5 years ago

thvasilo commented 5 years ago

I did some basic profiling with callgrind, running with 6 threads like so: valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes ./perflab record/ 6

I've attached the call graph visualization from KCachegrind and the callgrind output file for anyone who wants to explore on their own. Since a lot of the work is done through OpenMP, it might be hard to see clearly what's happening without recompiling it to add debug information.

Note that due to the overhead introduced by valgrind, the data loading numbers may be exaggerated. In this case it seems to take ~50% of the time, which is definitely not representative of a real run.

The output was

[14:54:24] /home/tvas/xgboost-fast-hist-perf-lab/src/main.cc:61: Data loaded in 1387.56 seconds
[15:33:42] /home/tvas/xgboost-fast-hist-perf-lab/src/main.cc:71: Gradient histograms computed in 2358 seconds

which shows the level of overhead: data loading normally takes ~18 seconds and gradient computation ~11 seconds.
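
As a rough check on those figures, the slowdown factor can be computed directly from the numbers quoted above. This is only a sketch: `slowdown` is a hypothetical helper name, and the baselines (~18 s and ~11 s) are the approximate values mentioned in this comment, so the resulting ratios are approximate too.

```shell
#!/bin/sh
# slowdown: hypothetical helper; prints profiled_time / normal_time to one decimal place.
slowdown() {
  awk -v profiled="$1" -v normal="$2" 'BEGIN { printf "%.1f\n", profiled / normal }'
}

# Timings from the valgrind log above, divided by the approximate unprofiled timings.
slowdown 1387.56 18   # data loading
slowdown 2358 11      # gradient histogram computation
```

On these numbers, data loading slows down by roughly 77x and histogram computation by roughly 214x. The two phases are inflated by different factors, so their relative shares under valgrind do not carry over to a real run.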

callgraph-6-threads

callgrind.out.txt

@hcho3 Do you have any tips for profiling multi-threaded applications?

hcho3 commented 5 years ago

You may want to try using Intel VTune profiler.

thvasilo commented 5 years ago

I looked into that, but it seems it's not free, so I'm stuck with open source tools.

hcho3 commented 5 years ago

A quick Google search gave me this page: https://stackoverflow.com/a/7190210.

Laurae2 commented 5 years ago

@hcho3 @thvasilo Intel VTune is free only if you are a student, educator, researcher, or a contributor to a significant open source project in computing. xgboost probably falls into the last category, which means you should have free access to Intel Parallel Studio Professional for Linux for 12 months (on Windows, the edition offered is not Professional).

thvasilo commented 5 years ago

Thanks @Laurae2, I had originally thought VTune was not free even for students. This should help!

Laurae2 commented 5 years ago

@thvasilo You have to make sure you are downloading the Professional (or Cluster) version of Intel Parallel Studio. The Composer edition (the default one) is provided without Intel VTune.

I get Intel Parallel Studio Cluster Edition from the Educator package; students should get the same version.

Open source contributors get Professional instead of Cluster when using Linux. On Windows, it is Composer. Intel does not make this obvious; it could just as well be the Cluster edition for everyone eligible...

Laurae2 commented 5 years ago

@thvasilo Small update: Intel Parallel Studio Cluster Edition is free for students for 12 months. It includes all Intel software, including Intel VTune (currently at 2018 Update 4, although there is a 2019 Initial Release).

If you use a recent OS with a recent kernel, make sure to use a newer hardware sampling driver, otherwise Intel VTune will report an error: https://software.intel.com/sites/default/files/managed/ac/c1/sepdk_v5_575421.tar.gz

Ubuntu users must also install libelf: sudo apt-get install libelf-dev.

thvasilo commented 5 years ago

@Laurae2 thanks that's the one I ended up with. I'll try it out this week.

Laurae2 commented 5 years ago

@thvasilo There is a bug in Intel's web interface that prevents the direct download of Intel VTune 2019, which is required for the most recent kernels / OS (it has updated sampling drivers).

You can download Intel Parallel Studio XE 2019 Cluster Edition for Linux, and use "Customize" from the GUI (sudo ./install_GUI.sh) to choose Intel VTune 2019 specifically. For profiling, I recommend the following:

Laurae2 commented 5 years ago

@thvasilo Did you manage to find anything interesting with the Intel software? (The tools are not too hard to understand if you are used to perf / valgrind, especially because they have a GUI for everything.)

If you use the Intel compilers, which you will need for profiling with maximum data gathering, run on a new terminal: source /opt/intel/compilers_and_libraries_2019.0.117/linux/bin/compilervars.sh intel64 and then pass the compilers to cmake: cmake -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc

I use the following flags to configure a profiling build with cmake: cmake -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-O3 -g -DNDEBUG -xHost -qopt-report=5 -qopt-zmm-usage=high -fopenmp -I/usr/include/x86_64-linux-gnu/c++/8" ..

Intel VTune Amplifier, on a new terminal: source /opt/intel/vtune_amplifier_2018.4.0.574913/amplxe-vars.sh then amplxe-gui

Intel Advisor, on a new terminal: source /opt/intel/advisor_2019.0.0.570901/advixe-vars.sh then advixe-gui

Intel Inspector, on a new terminal: source /opt/intel/inspector_2019.0.0.569751/inspxe-vars.sh then inspxe-gui
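
Since the versioned directory names above differ between installs, a small guard around the `source` step can make a missing script obvious. The following is a minimal sketch, not part of any Intel tooling: `source_if_present` is a hypothetical helper, and the path is simply the VTune one quoted above. It covers only scripts sourced without arguments, so it does not apply to compilervars.sh, which is given `intel64` in the command above.

```shell
#!/bin/sh
# source_if_present: hypothetical helper; sources an environment script only if the file exists.
source_if_present() {
  if [ -f "$1" ]; then
    . "$1"
  else
    echo "not found: $1" >&2
    return 1
  fi
}

# Path copied from the comment above; adjust the versioned directory to match your install.
source_if_present /opt/intel/vtune_amplifier_2018.4.0.574913/amplxe-vars.sh || true
```

If the script is found, the tool's environment is set up in the current shell and amplxe-gui can be started as described; otherwise the helper prints the missing path and returns a non-zero status.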

thvasilo commented 5 years ago

Hello @Laurae2, I've been working on some PhD stuff lately, so I haven't had time to look into this. Thanks for all the advice though; it will come in very handy when I get around to it.