hcho3 / xgboost-fast-hist-perf-lab

Deeper look into performance of tree_method='hist' for multi-core CPUs

Possible optimizations for Intel CPUs #15

Open SmirnovEgorRu opened 5 years ago

SmirnovEgorRu commented 5 years ago

Hi, I'm from the Intel® DAAL development team. We have our own implementation of gradient boosting that is highly optimized for CPUs. I took a look at this benchmark and applied the practices we use in DAAL's optimizations; the results are below.

My code is available in my fork. I measured all examples on Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz (2 sockets, 14 cores per socket).

Removing the floating-point conversion:

We found that XGBoost uses "float" for gradient pairs but "double" for histograms. When XGBoost builds histograms it therefore converts floats to doubles at runtime, and this conversion is expensive. If the histogram data type is changed from "double" to "float", performance is as follows (a short illustration of the conversion follows the table):

# of cores 1 2 4 8 14 20 24 28
original (-O2 option), s 3.4 3.6 7.1 7.5 12.6 19.1 20.1 25.3
w/o conversion, s 2.5 2.4 3.3 5.3 7.0 10.3 11.5 12.7
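
A minimal illustration of where the conversion bites, with simplified types that only stand in for the real XGBoost classes: with double histogram bins every Add() converts two floats to doubles inside the hottest loop, while float bins keep the accumulation in single precision.

// Simplified, hypothetical types for illustration only (not the actual XGBoost classes).
struct GradientPair { float grad, hess; };        // per-row gradient pair (float in XGBoost)

struct DoubleHistEntry {                          // current layout: double bins
  double grad = 0.0, hess = 0.0;
  void Add(const GradientPair& g) { grad += g.grad; hess += g.hess; }  // two float->double conversions per call
};

struct FloatHistEntry {                           // proposed layout: float bins
  float grad = 0.f, hess = 0.f;
  void Add(const GradientPair& g) { grad += g.grad; hess += g.hess; }  // no conversion in the hot loop
};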

Threading:

In the current implementation XGBoost builds a histogram in blocks of 8 rows. The data sets "rowind-198-?.txt" contain the sample indices of individual tree nodes, and many of them hold only 10-500 samples. That is too little work to parallelize: the overhead of creating OMP tasks can exceed the useful work of the histogram computation itself. We tried limiting the block size to 512 rows, and with that change we no longer see performance degradation as the thread count grows (a sketch of the blocking follows the table):

# of cores 1 2 4 8 14 20 24 28
original (-O2 option), s 3.4 3.6 7.1 7.5 12.6 19.1 20.1 25.3
w/o conversion, s 2.5 2.4 3.3 5.3 7.0 10.3 11.5 12.7
512-rows blocking, s 2.6 2.2 2.8 2.6 2.9 3.2 2.8 2.0
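
One way the 512-row blocking could look (a hypothetical helper, not the actual patch): derive the number of threads for a node from its row count so that each OpenMP chunk covers at least ~512 rows; small nodes then stay on a single thread and avoid the task-creation overhead.

#include <algorithm>
#include <cstddef>

// Hypothetical helper: cap the thread count so that every chunk gets >= 512 rows.
inline int ThreadsForNode(std::size_t n_rows, int max_threads) {
  const std::size_t kMinBlockSize = 512;   // ~512 rows per chunk was the optimum in the DAAL experiments
  const std::size_t n_blocks = std::max<std::size_t>(std::size_t{1}, n_rows / kMinBlockSize);
  return static_cast<int>(std::min(n_blocks, static_cast<std::size_t>(max_threads)));
}

// Usage sketch: a node with 400 rows runs on one thread, a node with 1M rows uses all cores.
// #pragma omp parallel for num_threads(ThreadsForNode(n_rows, omp_get_max_threads()))
// for (std::size_t i = 0; i < n_rows; ++i) { /* accumulate row i into a per-thread histogram */ }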

However, with this blocking we utilize only one CPU core for the lower levels of the tree and don't achieve optimal performance on a multi-core CPU. How do we solve this in Intel® DAAL? We use two levels of parallelism: inside each node (present in XGBoost) and across nodes (not present in XGBoost). On the first levels of the tree most of the threading happens inside nodes; on the last levels it happens across nodes. Adding node-level parallelism to XGBoost is required for good thread scalability. In our DAAL experiments, 512 rows was the optimal block size for the data sets we used.
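
A sketch of how the two-level scheme could map onto level-wise growth (the BuildHist* functions below are hypothetical placeholders, not the DAAL or XGBoost implementation): deep levels with many small nodes are parallelized across nodes, while the first levels with a few large nodes keep the existing row-parallelism inside each node.

#include <cstddef>
#include <vector>
#include <omp.h>

struct NodeWork { std::size_t node_id; };    // hypothetical handle for one tree node's row set

void BuildHistParallel(const NodeWork&);     // hypothetical: row-parallel histogram build inside one node
void BuildHistSequential(const NodeWork&);   // hypothetical: single-threaded build for one node

void BuildLevel(const std::vector<NodeWork>& nodes) {
  if (nodes.size() >= static_cast<std::size_t>(omp_get_max_threads())) {
    // Deep level: many (mostly small) nodes -> one node per thread, no intra-node threading.
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t n = 0; n < static_cast<std::ptrdiff_t>(nodes.size()); ++n) {
      BuildHistSequential(nodes[static_cast<std::size_t>(n)]);
    }
  } else {
    // First levels: few large nodes -> parallelize over rows inside each node.
    for (const NodeWork& node : nodes) {
      BuildHistParallel(node);
    }
  }
}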

Using plain C arrays + removing the manual unroll:

To build histograms the current code uses STL containers and unrolls by 8 rows. STL access can add overhead, and the histogram build is the hotspot, so let's use the data() method to reach the internal arrays of the std::vector objects. The manual unroll below is also inefficient, because it copies data from the arrays into temporary buffers, which adds extra work and memory traffic.

size_t kUnroll = 8;
size_t rid[kUnroll];
size_t ibegin[kUnroll];
size_t iend[kUnroll];
GradientPair stat[kUnroll];
// Copy the row indices of the block into a temporary buffer.
for (int k = 0; k < kUnroll; ++k) {
  rid[k] = instance_set[i + k];
}
// Look up each row's entry range in the quantized matrix and buffer it.
for (int k = 0; k < kUnroll; ++k) {
  ibegin[k] = gmat.row_ptr[rid[k]];
  iend[k] = gmat.row_ptr[rid[k] + 1];
}
// Copy the gradient pairs of those rows into a temporary buffer.
for (int k = 0; k < kUnroll; ++k) {
  stat[k] = gpair[rid[k]];
}
// Accumulate each row's gradient pair into its histogram bins.
for (int k = 0; k < kUnroll; ++k) {
  for (size_t j = ibegin[k]; j < iend[k]; ++j) {
    const uint32_t bin = gmat.index[j];
    builder->data_[off + bin].Add(stat[k]);
  }
}

If we remove this unroll and use C-style arrays, we get a significant performance improvement and simpler code :) Modified code (build_hist.cc:171):

for(size_t i = iStart; i < iEnd; ++i)
{
    const size_t iColStart = row_ptr[rid[i]];
    const size_t iColEnd = row_ptr[rid[i]+1];
    const auto gh = gpair[rid[i]];

    for (size_t j = iColStart; j < iColEnd; ++j)
        data_local_hist[index[j]].Add(gh); // will be inlined by the compiler with the -O2 option
}
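
For reference, one way the raw pointers used above could be obtained from the STL containers of the first snippet (same member names assumed; the fork may differ in the details):

const size_t*   row_ptr         = gmat.row_ptr.data();          // row offsets of the quantized matrix
const uint32_t* index           = gmat.index.data();            // global bin index of each non-zero entry
const size_t*   rid             = instance_set.data();          // rows belonging to the current node
auto*           data_local_hist = builder->data_.data() + off;  // this node's slice of the histogram buffer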

Let's take the indices of one node, rowind-198-1 (20.6k samples out of the 1M source samples, the sparse variant), and measure the performance for it.

# of cores 1 2 4 8 14 20 24 28
original (with unroll + STL, -O2 option), ms 11.5 6.5 3.9 3.2 3.5 5.3 6.0 6.6
w/o unroll and STL, ms 10.8 4.7 3.2 2.0 1.6 1.7 1.8 2.0

Enabling modern instruction sets (AVX-512):

AVX-512 introduces a "scatter" instruction (stores to memory at a vector of indices), which together with "gather" (loads from memory at a vector of indices, available since AVX2) is useful for histogram computation. Unfortunately, the compiler can't emit these instructions in our case, so we have to write them with intrinsics. A "gather-add-scatter" template written with intrinsics gives some performance improvement over the baseline code generated by the compiler. The AVX-512 code is in the fork (at the moment it builds with the Intel compiler; building it with gcc is possible but needs some extra work).
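
As a rough illustration of the gather-add-scatter idea (a sketch only, assuming float histograms stored as (grad, hess) pairs; the actual kernel in the fork may differ): within a single row all bin indices are distinct, because every feature owns its own bin range, so a plain scatter does not lose colliding updates inside one vector. Requires AVX-512F.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct GradientPairF { float grad, hess; };   // assumed 8-byte float histogram entry

// Hypothetical kernel: add one row's (grad, hess) to hist at the bins index[ibegin, iend).
inline void AddRowAVX512(GradientPairF* hist, const uint32_t* index,
                         size_t ibegin, size_t iend, GradientPairF gh) {
  // Broadcast the 8-byte (grad, hess) pair into all 8 lanes, viewed as 16 floats.
  double gh_bits;
  std::memcpy(&gh_bits, &gh, sizeof(gh_bits));
  const __m512 vgh = _mm512_castpd_ps(_mm512_set1_pd(gh_bits));

  size_t j = ibegin;
  for (; j + 8 <= iend; j += 8) {
    const __m256i vbins = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(index + j));
    __m512i vhist = _mm512_i32gather_epi64(vbins, hist, 8);                        // gather 8 bins (8 bytes each)
    vhist = _mm512_castps_si512(_mm512_add_ps(_mm512_castsi512_ps(vhist), vgh));   // add grad/hess per lane
    _mm512_i32scatter_epi64(hist, vbins, vhist, 8);                                // scatter the updated bins back
  }
  for (; j < iend; ++j) {                                                          // scalar tail
    hist[index[j]].grad += gh.grad;
    hist[index[j]].hess += gh.hess;
  }
}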

# of cores 1 2 4 8 14 20 24 28
original (with unroll + STL, -O2 option), ms 11.5 6.5 3.9 3.2 3.5 5.3 6.0 6.6
w/o unroll and STL, ms 10.8 4.7 3.2 2.0 1.6 1.7 1.8 2.0
w/o unroll and STL + AVX512, ms 10.2 4.7 3.2 2.0 1.5 1.6 1.7 1.8
thvasilo commented 5 years ago

We found that XGBoost uses "float" for gradient pairs but "double" for histograms.

I remember noticing that when going through the codebase as well. Is there any reason for the added accuracy, @hcho3? Using floats for histograms would also cut the communication cost in half for distributed training. That's what I was using, for example; Tencent Angel does the same, and LightGBM's default is also float.

Laurae2 commented 5 years ago

@SmirnovEgorRu Here are some results for 2x Xeon Gold 6154: 2x 50 runs with gcc (your 512-row block) and icc (your 512-row block + the extra optimizations you did). The server was rebooted between runs, with NUMA mode on. I got rid of the forced -O2 optimization flag in cmake.

Past 18 threads it's expected to be slower.

Threads gcc (s) icc (s) Imp. (gcc/icc)
1 01.674 01.469 1.1x
2 01.350 01.120 1.2x
4 01.095 00.975 1.1x
6 01.018 00.917 1.1x
9 00.969 00.870 1.1x
17 00.995 00.811 1.2x
18 01.008 00.830 1.2x
19 01.221 00.952 1.3x
27 01.465 01.181 1.2x
35 01.498 02.026 0.7x
36 01.560 02.041 0.8x
37 02.558 01.733 1.5x
54 03.032 03.362 0.9x
71 03.148 05.202 0.6x
72 03.378 05.513 0.6x


Full table:

Threads gcc (s) icc (s) Imp. (gcc/icc)
1 01.674 01.469 1.1x
2 01.350 01.120 1.2x
3 01.207 01.014 1.2x
4 01.095 00.975 1.1x
5 01.056 00.926 1.1x
6 01.018 00.917 1.1x
7 00.988 00.899 1.1x
8 00.979 00.891 1.1x
9 00.969 00.870 1.1x
10 00.967 00.886 1.1x
11 00.972 00.891 1.1x
12 00.966 00.885 1.1x
13 00.978 00.810 1.2x
14 00.984 00.850 1.2x
15 00.974 00.869 1.1x
16 00.999 00.885 1.1x
17 00.995 00.811 1.2x
18 01.008 00.830 1.2x
19 01.221 00.952 1.3x
20 01.362 01.010 1.3x
21 01.370 01.069 1.3x
22 01.525 01.109 1.4x
23 01.393 01.156 1.2x
24 01.425 01.170 1.2x
25 01.413 01.209 1.2x
26 01.494 01.187 1.3x
27 01.465 01.181 1.2x
28 01.589 01.182 1.3x
29 01.392 01.230 1.1x
30 01.683 01.242 1.4x
31 01.674 01.232 1.4x
32 01.533 01.264 1.2x
33 01.773 01.984 0.9x
34 01.675 02.209 0.8x
35 01.498 02.026 0.7x
36 01.560 02.041 0.8x
37 02.558 01.733 1.5x
38 02.762 01.710 1.6x
39 02.744 01.676 1.6x
40 02.641 01.804 1.5x
41 02.700 01.411 1.9x
42 02.891 01.393 2.1x
43 02.789 01.389 2.0x
44 02.750 01.389 2.0x
45 02.738 01.377 2.0x
46 02.749 01.339 2.1x
47 02.757 01.329 2.1x
48 02.738 01.327 2.1x
49 02.785 03.307 0.8x
50 02.826 03.283 0.9x
51 02.760 03.228 0.9x
52 02.787 03.278 0.9x
53 02.805 03.399 0.8x
54 03.032 03.362 0.9x
55 02.888 03.219 0.9x
56 02.856 03.153 0.9x
57 02.876 03.195 0.9x
58 02.868 03.135 0.9x
59 02.854 02.978 1.0x
60 02.892 02.925 1.0x
61 02.907 03.428 0.8x
62 02.981 03.294 0.9x
63 02.932 03.062 1.0x
64 02.928 02.637 1.1x
65 02.958 05.008 0.6x
66 03.036 05.040 0.6x
67 02.943 05.089 0.6x
68 02.968 04.961 0.6x
69 03.035 05.026 0.6x
70 03.049 05.029 0.6x
71 03.148 05.202 0.6x
72 03.378 05.513 0.6x
RAMitchell commented 5 years ago

Thanks everyone for your great benchmarks and insights :)

Removing the floating-point conversion

I have tried summing into single-precision floats in my GPU version, and the accuracy degrades significantly on larger data sets (1M rows). We need accuracy first and speed second, so this will not be possible unless you have a scheme that reduces the loss of accuracy (e.g. tree reduction).
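
For reference, a tree (pairwise) reduction of floats can look like the minimal sketch below (not XGBoost code): the rounding error grows roughly with log n instead of n compared with a straight left-to-right float sum.

#include <cstddef>

// Pairwise (tree) reduction: split the range, sum each half recursively, add the partial sums.
float PairwiseSum(const float* x, std::size_t n) {
  if (n <= 8) {                        // small base case: plain accumulation
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
  }
  const std::size_t half = n / 2;
  return PairwiseSum(x, half) + PairwiseSum(x + half, n - half);
}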

Threading

I don't think parallelism by node is possible if we are using the "lossguide" style algorithm that greedily opens one node at a time. This is a problem I also struggled with for the GPU version.