hcho3 / xgboost-fast-hist-perf-lab

Deeper look into performance of tree_method='hist' for multi-core CPUs

build_hist is (heavily) latency bounded #12

Closed Laurae2 closed 5 years ago

Laurae2 commented 5 years ago

After the enhancements in #11, I am noticing that build_hist is heavily latency-bound when running in parallel.

Specifically, this line just eats all the CPU time in parallel: https://github.com/hcho3/xgboost-fast-hist-perf-lab/blob/master/src/build_hist.cc#L12

Multiple solutions from there:

According to VTune:

image

image

In general:

image

The memory latency issue on the data allocation:

image

Laurae2 commented 5 years ago

With a loop instead of fill.
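For readers without the source open, here is a minimal sketch of the kind of change being compared: resetting a histogram buffer with `std::fill` versus an explicit loop. The `HistEntry` struct and both function names are hypothetical stand-ins, not the actual code in `build_hist.cc`, and whether the two variants compile to different machine code depends on the compiler and flags.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one histogram bin (gradient and hessian sums).
// The real entry type in build_hist.cc may differ.
struct HistEntry {
  double sum_grad = 0.0;
  double sum_hess = 0.0;
};

// Variant A: reset every bin with std::fill.
void ResetWithFill(std::vector<HistEntry>& hist) {
  std::fill(hist.begin(), hist.end(), HistEntry());
}

// Variant B: reset every bin with an explicit loop over the raw buffer.
void ResetWithLoop(std::vector<HistEntry>& hist) {
  const std::size_t n = hist.size();
  HistEntry* data = hist.data();
  for (std::size_t i = 0; i < n; ++i) {
    data[i].sum_grad = 0.0;
    data[i].sum_hess = 0.0;
  }
}
```

Both variants leave the buffer in the same state, so any timing difference comes from how the compiler lowers the writes rather than from the result.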

gcc:

image

image

icc:

image

image

Laurae2 commented 5 years ago

Found a workaround; a new issue describing it is incoming. However, I had to stop keeping the BuildHist variable private (it is now public to the main thread).