hcho3 / xgboost-fast-hist-perf-lab

Deeper look into performance of tree_method='hist' for multi-core CPUs

The hist method of XGBoost scales poorly on multi-core CPUs: a demo script

Currently, the hist tree-growing algorithm (tree_method=hist) of XGBoost scales poorly on multi-core CPUs: for some datasets, performance deteriorates as the number of threads increases. This issue was discovered through @Laurae2's Gradient Boosting Benchmark.

To make things easier for contributors, I went ahead and isolated the performance bottleneck. The vast majority of the run time (> 95%) is spent in a stage known as gradient histogram construction. This repository isolates that stage so that it is easy to fix and improve.

How to compile and run

  1. Compile the script by running CMake:

    mkdir build
    cd build
    cmake ..
    make
  2. Download record.tar.bz2 into the same directory.

  3. Extract record.tar.bz2 by running tar xvf record.tar.bz2.

  4. Run the script:

    # Usage: ./perflab record/ [number of threads]
    ./perflab record/ 36

Running with different numbers of threads should produce the following performance trend:

[Figure: Performance scaling on C5.9xlarge]
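One way to reproduce this trend is to time the binary at several thread counts. The sketch below is a minimal helper, not part of this repository: the function names `perflab_command` and `time_command` are hypothetical, and it assumes `./perflab` and `record/` sit in the current directory as in the steps above. It measures wall-clock time for the whole process, so data loading is included along with histogram construction.

```python
import subprocess
import time


def perflab_command(num_threads, binary="./perflab", record_dir="record/"):
    """Build the argument list for one run: ./perflab record/ [number of threads]."""
    return [binary, record_dir, str(num_threads)]


def time_command(cmd):
    """Run a command once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep_threads(thread_counts):
    """Time one run per thread count and return {thread_count: seconds}."""
    return {n: time_command(perflab_command(n)) for n in thread_counts}


# Example (requires the compiled binary and the extracted record/ directory):
#   for n, secs in sweep_threads([1, 2, 4, 8, 16, 36]).items():
#       print(f"{n:>3} threads: {secs:.2f} s")
```

A single run per thread count can be noisy; repeating each run a few times and taking the minimum or median would likely give a steadier curve.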

What this script does

The script reads from record.tar.bz2, which was processed from the Bosch dataset. Its job is to compute histograms for gradient pairs, where each bin of the histogram is a partial sum of the gradient pairs that fall into that bin.
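To make "each bin is a partial sum" concrete, here is a simplified, single-threaded sketch in Python. It is not the code in this repository: the function name and data layout are hypothetical, and it assumes each row has already been reduced to a list of bin indices (one per feature value present) and carries one gradient pair (gradient, Hessian).

```python
def build_histogram(row_bins, gradients, hessians, num_bins):
    """Accumulate gradient pairs into histogram bins.

    row_bins[i] lists the bin indices hit by row i (assumed precomputed);
    gradients[i] and hessians[i] form the gradient pair of row i.
    Returns a list of [sum_gradient, sum_hessian], one entry per bin.
    """
    hist = [[0.0, 0.0] for _ in range(num_bins)]
    for bins, grad, hess in zip(row_bins, gradients, hessians):
        for b in bins:
            # Every bin touched by this row receives the row's gradient pair.
            hist[b][0] += grad
            hist[b][1] += hess
    return hist


# Three rows, four bins in total.
example_hist = build_histogram(
    row_bins=[[0, 2], [1, 2], [0, 3]],
    gradients=[1.0, 2.0, 3.0],
    hessians=[0.5, 0.5, 1.0],
    num_bins=4,
)
```

The inner update is a scattered read-modify-write into `hist`. When several threads run this loop at once, they need either private copies of the histogram that are merged afterwards or some form of synchronization, which is one plausible reason this stage is hard to scale.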

Some background:

Setting build types