The `hist` method of XGBoost scales poorly on multi-core CPUs: a demo script

Currently, the hist tree-growing algorithm (tree_method=hist) of XGBoost scales poorly on multi-core CPUs: for some datasets, performance deteriorates as the number of threads is increased. This issue was discovered by @Laurae2's Gradient Boosting Benchmark.

To make things easier for contributors, I went ahead and isolated the performance bottleneck. A vast majority of time (> 95 %) is spent in a stage known as gradient histogram construction. This repository isolates this stage so that it is easy to fix and improve.

How to compile and run

Compile the script by running CMake:
```
mkdir build
cd build
cmake ..
make
```
Download record.tar.bz2 into the same directory, then extract it by running `tar xvf record.tar.bz2`.

Run the script:

```
# Usage: ./perflab record/ [number of threads]
./perflab record/ 36
```

Running with different numbers of threads should produce the following performance trend:

[Figure: Performance scaling on C5.9xlarge]

What this script does

The script reads from record.tar.bz2, which was processed from the Bosch dataset. Its job is to compute histograms for gradient pairs, where each bin of the histogram is a partial sum.

Some background:

A gradient for a given instance (X_i, y_i) is a pair of double values that quantify the distance between the true label y_i and predicted label yhat_i.
There are as many gradient pairs as there are instances in a training dataset.
In order to find optimal splits for decision trees, we compute a histogram of gradients. Each bin of the histogram stands for a range of feature values. The value of the bin is given by the sum of gradients corresponding to the data points lying inside the range.
In each boosting iteration, we have to compute multiple histograms, each histogram corresponding to a set of instances.
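
The background above can be sketched in a few lines of C++. This is a minimal single-threaded illustration, not the code in src/main.cc: the names `GradientPair`, `QuantizedMatrix`, and `BuildHistogram`, the dense row-major layout, and the convention that every feature owns its own disjoint range of bin ids are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One pair of double values per training instance (hypothetical field names).
struct GradientPair {
  double grad;
  double hess;
};

// Hypothetical quantized feature matrix, dense and row-major:
// bin_index[row * num_features + f] is the bin id that the value of
// feature f in that row falls into. Bin ids are assumed to be global,
// i.e. each feature owns a disjoint range of ids.
struct QuantizedMatrix {
  std::size_t num_features;
  std::vector<std::uint32_t> bin_index;
};

// Build one histogram for the instances listed in row_set.
// Each bin ends up holding the sum of the gradient pairs of all
// instances in row_set whose feature value lies in that bin's range.
std::vector<GradientPair> BuildHistogram(
    const QuantizedMatrix& matrix,
    const std::vector<GradientPair>& gpair,
    const std::vector<std::size_t>& row_set,
    std::size_t num_bins) {
  std::vector<GradientPair> hist(num_bins, GradientPair{0.0, 0.0});
  for (std::size_t row : row_set) {
    const std::uint32_t* bins =
        matrix.bin_index.data() + row * matrix.num_features;
    for (std::size_t f = 0; f < matrix.num_features; ++f) {
      hist[bins[f]].grad += gpair[row].grad;
      hist[bins[f]].hess += gpair[row].hess;
    }
  }
  return hist;
}
```

Calling `BuildHistogram` once per set of instances gives the multiple histograms needed in a boosting iteration. Note that the inner loop writes to bins scattered across `hist`, so a parallel version has to decide how threads share or privatize the histogram.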

Setting build types

By default, 'Release' build type will be used, with flags -O3 -DNDEBUG.
For profiling, you may want to add debug symbols by choosing the 'RelWithDebInfo' build type instead:
```
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
```
This build type uses the following flags: -O2 -g -DNDEBUG.
For full control over the compilation flags, specify CMAKE_CXX_FLAGS_RELEASE:
```
cmake -DCMAKE_CXX_FLAGS_RELEASE="-O3 -g -DNDEBUG -march=native" ..
```
Here, we are compiling with the flags -O3 -g -DNDEBUG -march=native.

You can check whether they are applied using make VERBOSE=1 and looking at the C++ compilation lines for the existence of the flags you used:
```
/usr/bin/c++   -I/home/ubuntu/xgboost-fast-hist-perf-lab/include  -O3 -g -DNDEBUG -march=native
  -fopenmp -std=gnu++11 -o CMakeFiles/perflab.dir/src/main.cc.o
  -c /home/ubuntu/xgboost-fast-hist-perf-lab/src/main.cc
```
