We will look at this code (it alone takes 25% of the time with 36 threads, the remaining 75% is spent before it; in total that is about 30 seconds of CPU time):
/* reduction */
const uint32_t nbin = nbin_;
#pragma omp parallel for num_threads(nthread) schedule(static)
for (dmlc::omp_uint bin_id = 0; bin_id < dmlc::omp_uint(nbin); ++bin_id) {
  for (dmlc::omp_uint tid = 0; tid < nthread; ++tid) {
    (*hist)[bin_id].Add(data_[tid * nbin_ + bin_id]);
  }
}
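For reference, here is a minimal self-contained sketch of the same reduction pattern (the names GradStat and nbin, and the stub filling phase, are assumptions for illustration, not the lab's actual types). It shows the layout behind data_[tid * nbin_ + bin_id]: each thread owns a contiguous block of nbin entries, so summing one bin reads nthread values that are nbin entries apart, touching a distant cache line per thread.

#include <cstdint>
#include <omp.h>
#include <vector>

// Simplified stand-in for the per-bin gradient statistics type
// (hypothetical; the real type lives in the lab's src/build_hist.cc).
struct GradStat {
  double sum_grad = 0.0;
  double sum_hess = 0.0;
  void Add(const GradStat& other) {
    sum_grad += other.sum_grad;
    sum_hess += other.sum_hess;
  }
};

int main() {
  const uint32_t nbin = 256;                  // number of histogram bins (assumed)
  const int nthread = omp_get_max_threads();

  // Per-thread partial histograms in one flat array:
  // thread `tid` owns entries [tid * nbin, (tid + 1) * nbin).
  std::vector<GradStat> data(static_cast<size_t>(nthread) * nbin);

  // ... each thread fills its own block of `data` here ...

  std::vector<GradStat> hist(nbin);

  // Reduction with the same access pattern as the quoted loop: for a
  // given bin_id, the nthread values being summed are nbin entries
  // apart, one per thread block.
  #pragma omp parallel for num_threads(nthread) schedule(static)
  for (int64_t bin_id = 0; bin_id < static_cast<int64_t>(nbin); ++bin_id) {
    for (int tid = 0; tid < nthread; ++tid) {
      hist[bin_id].Add(data[static_cast<size_t>(tid) * nbin + bin_id]);
    }
  }
  return 0;
}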
Because I have not yet set up Intel VTune to pinpoint the exact reason behind this (I have not installed the kernel drivers yet), I used Intel Inspector to look at that specific loop and check what happens, and it seems to create a huge amount of blocking at this specific line:
With Intel VTune, the issue seems to be located here: https://github.com/hcho3/xgboost-fast-hist-perf-lab/blob/master/src/build_hist.cc#L56-L62 - with 36 threads, it seems to spend about 1000 seconds of CPU time there (nearly 100% of the CPU time), with 85% waiting time and only 15% effective computation time (more on this in another issue).
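As a cross-check independent of the profilers, the 25% / 75% split quoted above can be confirmed with plain wall-clock timers around the two phases. This is only a sketch: BuildPartialHistograms and ReducePartialHistograms are hypothetical stubs standing in for the real phases in src/build_hist.cc.

#include <cstdio>
#include <omp.h>

// Hypothetical stand-ins for the two phases; stub bodies so the sketch
// compiles. In the lab they would call the real code.
static void BuildPartialHistograms() { /* per-thread accumulation phase */ }
static void ReducePartialHistograms() { /* the reduction loop quoted above */ }

int main() {
  const double t0 = omp_get_wtime();
  BuildPartialHistograms();            // reportedly ~75% of the time with 36 threads
  const double t1 = omp_get_wtime();
  ReducePartialHistograms();           // reportedly ~25% of the time with 36 threads
  const double t2 = omp_get_wtime();
  std::printf("build: %.3f s, reduction: %.3f s\n", t1 - t0, t2 - t1);
  return 0;
}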
(all the other "data race" reports below P2 are just the same issue on other threads, and P1 is only a general complaint about this line: https://github.com/hcho3/xgboost-fast-hist-perf-lab/blob/master/src/build_hist.cc#L19, but that one is expected)
The exact assembly code causing the race issue is shown below:
Full details are below (warning, big):