hcho3 / xgboost-fast-hist-perf-lab

Deeper look into performance of tree_method='hist' for multi-core CPUs

Possible optimizations for Intel CPUs #15

Open SmirnovEgorRu opened 5 years ago

SmirnovEgorRu commented 5 years ago

Hi, I'm from the Intel® DAAL development team. We have our own implementation of gradient boosting that is highly optimized for CPUs. I took a look at this benchmark and applied the practices we use in DAAL's optimizations; the results are below.

My code is available in my fork. I measured all examples on Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz (2 sockets, 14 cores per socket).

Removing the floating-point conversion:

We found that XGBoost uses "float" for gradient pairs but "double" for histograms. When XGBoost builds histograms it therefore converts floats to doubles at runtime, and this conversion is expensive. If the histogram data type is changed from "double" to "float", performance is as follows (a short illustration of the conversion follows the table):

# of cores 1 2 4 8 14 20 24 28
original (-O2 option), s 3.4 3.6 7.1 7.5 12.6 19.1 20.1 25.3
w/o conversion, s 2.5 2.4 3.3 5.3 7.0 10.3 11.5 12.7
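
A minimal illustration of where the conversion bites, with simplified types that only stand in for the real XGBoost classes: with double histogram bins every Add() converts two floats to doubles inside the hottest loop, while float bins keep the accumulation in single precision.

// Simplified, hypothetical types for illustration only (not the actual XGBoost classes).
struct GradientPair { float grad, hess; };        // per-row gradient pair (float in XGBoost)

struct DoubleHistEntry {                          // current layout: double bins
  double grad = 0.0, hess = 0.0;
  void Add(const GradientPair& g) { grad += g.grad; hess += g.hess; }  // two float->double conversions per call
};

struct FloatHistEntry {                           // proposed layout: float bins
  float grad = 0.f, hess = 0.f;
  void Add(const GradientPair& g) { grad += g.grad; hess += g.hess; }  // no conversion in the hot loop
};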

Threading:

In the current implementation XGBoost builds a histogram in blocks of 8 rows. The data sets "rowind-198-?.txt" contain the sample indices of individual tree nodes, and many of them hold only 10-500 samples. That is too little work to parallelize: the overhead of creating OMP tasks can exceed the useful work of the histogram computation itself. We tried limiting the block size to 512 rows, and with that change we no longer see performance degradation as the thread count grows (a sketch of the blocking follows the table):

# of cores 1 2 4 8 14 20 24 28
original (-O2 option), s 3.4 3.6 7.1 7.5 12.6 19.1 20.1 25.3
w/o conversion, s 2.5 2.4 3.3 5.3 7.0 10.3 11.5 12.7
512-rows blocking, s 2.6 2.2 2.8 2.6 2.9 3.2 2.8 2.0
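
One way the 512-row blocking could look (a hypothetical helper, not the actual patch): derive the number of threads for a node from its row count so that each OpenMP chunk covers at least ~512 rows; small nodes then stay on a single thread and avoid the task-creation overhead.

#include <algorithm>
#include <cstddef>

// Hypothetical helper: cap the thread count so that every chunk gets >= 512 rows.
inline int ThreadsForNode(std::size_t n_rows, int max_threads) {
  const std::size_t kMinBlockSize = 512;   // ~512 rows per chunk was the optimum in the DAAL experiments
  const std::size_t n_blocks = std::max<std::size_t>(std::size_t{1}, n_rows / kMinBlockSize);
  return static_cast<int>(std::min(n_blocks, static_cast<std::size_t>(max_threads)));
}

// Usage sketch: a node with 400 rows runs on one thread, a node with 1M rows uses all cores.
// #pragma omp parallel for num_threads(ThreadsForNode(n_rows, omp_get_max_threads()))
// for (std::size_t i = 0; i < n_rows; ++i) { /* accumulate row i into a per-thread histogram */ }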

However, with this blocking we utilize only one CPU core for the lower levels of the tree and don't achieve optimal performance on a multi-core CPU. How do we solve this in Intel® DAAL? We use two levels of parallelism: inside each node (present in XGBoost) and across nodes (not present in XGBoost). On the first levels of the tree most of the threading happens inside nodes; on the last levels it happens across nodes. Adding node-level parallelism to XGBoost is required for good thread scalability. In our DAAL experiments, 512 rows was the optimal block size for the data sets we used.
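
A sketch of how the two-level scheme could map onto level-wise growth (the BuildHist* functions below are hypothetical placeholders, not the DAAL or XGBoost implementation): deep levels with many small nodes are parallelized across nodes, while the first levels with a few large nodes keep the existing row-parallelism inside each node.

#include <cstddef>
#include <vector>
#include <omp.h>

struct NodeWork { std::size_t node_id; };    // hypothetical handle for one tree node's row set

void BuildHistParallel(const NodeWork&);     // hypothetical: row-parallel histogram build inside one node
void BuildHistSequential(const NodeWork&);   // hypothetical: single-threaded build for one node

void BuildLevel(const std::vector<NodeWork>& nodes) {
  if (nodes.size() >= static_cast<std::size_t>(omp_get_max_threads())) {
    // Deep level: many (mostly small) nodes -> one node per thread, no intra-node threading.
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t n = 0; n < static_cast<std::ptrdiff_t>(nodes.size()); ++n) {
      BuildHistSequential(nodes[static_cast<std::size_t>(n)]);
    }
  } else {
    // First levels: few large nodes -> parallelize over rows inside each node.
    for (const NodeWork& node : nodes) {
      BuildHistParallel(node);
    }
  }
}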

Using plain C arrays + removing the manual unroll:

To build histograms the current code uses STL containers and unrolls by 8 rows. STL access can add overhead, and the histogram build is the hotspot, so let's use the data() method to reach the internal arrays of the std::vector objects. The manual unroll below is also inefficient, because it copies data from the arrays into temporary buffers, which adds extra work and memory traffic.

size_t kUnroll = 8;
size_t rid[kUnroll];
size_t ibegin[kUnroll];
size_t iend[kUnroll];
GradientPair stat[kUnroll];
// Copy the row indices of the block into a temporary buffer.
for (int k = 0; k < kUnroll; ++k) {
  rid[k] = instance_set[i + k];
}
// Look up each row's entry range in the quantized matrix and buffer it.
for (int k = 0; k < kUnroll; ++k) {
  ibegin[k] = gmat.row_ptr[rid[k]];
  iend[k] = gmat.row_ptr[rid[k] + 1];
}
// Copy the gradient pairs of those rows into a temporary buffer.
for (int k = 0; k < kUnroll; ++k) {
  stat[k] = gpair[rid[k]];
}
// Accumulate each row's gradient pair into its histogram bins.
for (int k = 0; k < kUnroll; ++k) {
  for (size_t j = ibegin[k]; j < iend[k]; ++j) {
    const uint32_t bin = gmat.index[j];
    builder->data_[off + bin].Add(stat[k]);
  }
}

If we remove this unroll and use C-style arrays, we get a significant performance improvement and simpler code :) Modified code (build_hist.cc:171):

for(size_t i = iStart; i < iEnd; ++i)
{
    const size_t iColStart = row_ptr[rid[i]];
    const size_t iColEnd = row_ptr[rid[i]+1];
    const auto gh = gpair[rid[i]];

    for (size_t j = iColStart; j < iColEnd; ++j)
        data_local_hist[index[j]].Add(gh); // will be inlined by the compiler with the -O2 option
}
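
For reference, one way the raw pointers used above could be obtained from the STL containers of the first snippet (same member names assumed; the fork may differ in the details):

const size_t*   row_ptr         = gmat.row_ptr.data();          // row offsets of the quantized matrix
const uint32_t* index           = gmat.index.data();            // global bin index of each non-zero entry
const size_t*   rid             = instance_set.data();          // rows belonging to the current node
auto*           data_local_hist = builder->data_.data() + off;  // this node's slice of the histogram buffer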

Let's take the indices of one node, rowind-198-1 (20.6k samples out of the 1M source samples, the sparse variant), and measure the performance for it.

# of cores 1 2 4 8 14 20 24 28
original (with unroll + STL, -O2 option), ms 11.5 6.5 3.9 3.2 3.5 5.3 6.0 6.6
w/o unroll and STL, ms 10.8 4.7 3.2 2.0 1.6 1.7 1.8 2.0

Enabling modern instruction sets (AVX-512):

AVX-512 introduces a "scatter" instruction (stores to memory at a vector of indices), which together with "gather" (loads from memory at a vector of indices, available since AVX2) is useful for histogram computation. Unfortunately, the compiler can't emit these instructions in our case, so we have to write them with intrinsics. A "gather-add-scatter" template written with intrinsics gives some performance improvement over the baseline code generated by the compiler. The AVX-512 code is in the fork (at the moment it builds with the Intel compiler; building it with gcc is possible but needs some extra work).
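
As a rough illustration of the gather-add-scatter idea (a sketch only, assuming float histograms stored as (grad, hess) pairs; the actual kernel in the fork may differ): within a single row all bin indices are distinct, because every feature owns its own bin range, so a plain scatter does not lose colliding updates inside one vector. Requires AVX-512F.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct GradientPairF { float grad, hess; };   // assumed 8-byte float histogram entry

// Hypothetical kernel: add one row's (grad, hess) to hist at the bins index[ibegin, iend).
inline void AddRowAVX512(GradientPairF* hist, const uint32_t* index,
                         size_t ibegin, size_t iend, GradientPairF gh) {
  // Broadcast the 8-byte (grad, hess) pair into all 8 lanes, viewed as 16 floats.
  double gh_bits;
  std::memcpy(&gh_bits, &gh, sizeof(gh_bits));
  const __m512 vgh = _mm512_castpd_ps(_mm512_set1_pd(gh_bits));

  size_t j = ibegin;
  for (; j + 8 <= iend; j += 8) {
    const __m256i vbins = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(index + j));
    __m512i vhist = _mm512_i32gather_epi64(vbins, hist, 8);                        // gather 8 bins (8 bytes each)
    vhist = _mm512_castps_si512(_mm512_add_ps(_mm512_castsi512_ps(vhist), vgh));   // add grad/hess per lane
    _mm512_i32scatter_epi64(hist, vbins, vhist, 8);                                // scatter the updated bins back
  }
  for (; j < iend; ++j) {                                                          // scalar tail
    hist[index[j]].grad += gh.grad;
    hist[index[j]].hess += gh.hess;
  }
}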

# of cores 1 2 4 8 14 20 24 28
original (with unroll + STL, -O2 option), ms 11.5 6.5 3.9 3.2 3.5 5.3 6.0 6.6
w/o unroll and STL, ms 10.8 4.7 3.2 2.0 1.6 1.7 1.8 2.0
w/o unroll and STL + AVX512, ms 10.2 4.7 3.2 2.0 1.5 1.6 1.7 1.8
thvasilo commented 5 years ago

We found that XGBoost uses "float" for gradient pairs but "double" for histograms.

I remember noticing that when going through the codebase as well. Is there any reason for the added accuracy, @hcho3? Using floats for histograms would also cut the communication cost in half for distributed training. That's what I was using, for example; Tencent Angel does the same, and LightGBM's default is also float.

Laurae2 commented 5 years ago

@SmirnovEgorRu Here are some results for 2x Xeon Gold 6154: 2x 50 runs with gcc (your 512-row block) and icc (your 512-row block + the extra optimizations you did). The server was rebooted between runs, with NUMA mode on. I got rid of the forced -O2 optimization flag in cmake.

Past 18 threads it's expected to be slower.

Threads gcc (s) icc (s) Imp. (gcc/icc)
1 01.674 01.469 1.1x
2 01.350 01.120 1.2x
4 01.095 00.975 1.1x
6 01.018 00.917 1.1x
9 00.969 00.870 1.1x
17 00.995 00.811 1.2x
18 01.008 00.830 1.2x
19 01.221 00.952 1.3x
27 01.465 01.181 1.2x
35 01.498 02.026 0.7x
36 01.560 02.041 0.8x
37 02.558 01.733 1.5x
54 03.032 03.362 0.9x
71 03.148 05.202 0.6x
72 03.378 05.513 0.6x


Full table:

Threads gcc (s) icc (s) Imp. (gcc/icc)
1 01.674 01.469 1.1x
2 01.350 01.120 1.2x
3 01.207 01.014 1.2x
4 01.095 00.975 1.1x
5 01.056 00.926 1.1x
6 01.018 00.917 1.1x
7 00.988 00.899 1.1x
8 00.979 00.891 1.1x
9 00.969 00.870 1.1x
10 00.967 00.886 1.1x
11 00.972 00.891 1.1x
12 00.966 00.885 1.1x
13 00.978 00.810 1.2x
14 00.984 00.850 1.2x
15 00.974 00.869 1.1x
16 00.999 00.885 1.1x
17 00.995 00.811 1.2x
18 01.008 00.830 1.2x
19 01.221 00.952 1.3x
20 01.362 01.010 1.3x
21 01.370 01.069 1.3x
22 01.525 01.109 1.4x
23 01.393 01.156 1.2x
24 01.425 01.170 1.2x
25 01.413 01.209 1.2x
26 01.494 01.187 1.3x
27 01.465 01.181 1.2x
28 01.589 01.182 1.3x
29 01.392 01.230 1.1x
30 01.683 01.242 1.4x
31 01.674 01.232 1.4x
32 01.533 01.264 1.2x
33 01.773 01.984 0.9x
34 01.675 02.209 0.8x
35 01.498 02.026 0.7x
36 01.560 02.041 0.8x
37 02.558 01.733 1.5x
38 02.762 01.710 1.6x
39 02.744 01.676 1.6x
40 02.641 01.804 1.5x
41 02.700 01.411 1.9x
42 02.891 01.393 2.1x
43 02.789 01.389 2.0x
44 02.750 01.389 2.0x
45 02.738 01.377 2.0x
46 02.749 01.339 2.1x
47 02.757 01.329 2.1x
48 02.738 01.327 2.1x
49 02.785 03.307 0.8x
50 02.826 03.283 0.9x
51 02.760 03.228 0.9x
52 02.787 03.278 0.9x
53 02.805 03.399 0.8x
54 03.032 03.362 0.9x
55 02.888 03.219 0.9x
56 02.856 03.153 0.9x
57 02.876 03.195 0.9x
58 02.868 03.135 0.9x
59 02.854 02.978 1.0x
60 02.892 02.925 1.0x
61 02.907 03.428 0.8x
62 02.981 03.294 0.9x
63 02.932 03.062 1.0x
64 02.928 02.637 1.1x
65 02.958 05.008 0.6x
66 03.036 05.040 0.6x
67 02.943 05.089 0.6x
68 02.968 04.961 0.6x
69 03.035 05.026 0.6x
70 03.049 05.029 0.6x
71 03.148 05.202 0.6x
72 03.378 05.513 0.6x
RAMitchell commented 5 years ago

Thanks everyone for your great benchmarks and insights :)

Removing the floating-point conversion

I have tried summing into single-precision floats in my GPU version, and the accuracy degrades significantly on larger data sets (1M rows). We need accuracy first and speed second, so this will not be possible unless you have a scheme that reduces the loss of accuracy (e.g. tree reduction).
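
For reference, a tree (pairwise) reduction of floats can look like the minimal sketch below (not XGBoost code): the rounding error grows roughly with log n instead of n compared with a straight left-to-right float sum.

#include <cstddef>

// Pairwise (tree) reduction: split the range, sum each half recursively, add the partial sums.
float PairwiseSum(const float* x, std::size_t n) {
  if (n <= 8) {                        // small base case: plain accumulation
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
  }
  const std::size_t half = n / 2;
  return PairwiseSum(x, half) + PairwiseSum(x + half, n - half);
}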

Threading

I don't think parallelism by node is possible if we are using the "lossguide" style algorithm that greedily opens one node at a time. This is a problem I also struggled with for the GPU version.