We will look at this code (it alone takes 25% of the time with 36 threads, the remaining 75% is spent before it; in total that is about 30 seconds of CPU time):
/* reduction */
const uint32_t nbin = nbin_;
#pragma omp parallel for num_threads(nthread) schedule(static)
for (dmlc::omp_uint bin_id = 0; bin_id < dmlc::omp_uint(nbin); ++bin_id) {
  for (dmlc::omp_uint tid = 0; tid < nthread; ++tid) {
    (*hist)[bin_id].Add(data_[tid * nbin_ + bin_id]);
  }
}
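For reference, here is a minimal self-contained sketch of the same reduction pattern (the names GradStat and nbin, and the stub filling phase, are assumptions for illustration, not the lab's actual types). It shows the layout behind data_[tid * nbin_ + bin_id]: each thread owns a contiguous block of nbin entries, so summing one bin reads nthread values that are nbin entries apart, touching a distant cache line per thread.

#include <cstdint>
#include <omp.h>
#include <vector>

// Simplified stand-in for the per-bin gradient statistics type
// (hypothetical; the real type lives in the lab's src/build_hist.cc).
struct GradStat {
  double sum_grad = 0.0;
  double sum_hess = 0.0;
  void Add(const GradStat& other) {
    sum_grad += other.sum_grad;
    sum_hess += other.sum_hess;
  }
};

int main() {
  const uint32_t nbin = 256;                  // number of histogram bins (assumed)
  const int nthread = omp_get_max_threads();

  // Per-thread partial histograms in one flat array:
  // thread `tid` owns entries [tid * nbin, (tid + 1) * nbin).
  std::vector<GradStat> data(static_cast<size_t>(nthread) * nbin);

  // ... each thread fills its own block of `data` here ...

  std::vector<GradStat> hist(nbin);

  // Reduction with the same access pattern as the quoted loop: for a
  // given bin_id, the nthread values being summed are nbin entries
  // apart, one per thread block.
  #pragma omp parallel for num_threads(nthread) schedule(static)
  for (int64_t bin_id = 0; bin_id < static_cast<int64_t>(nbin); ++bin_id) {
    for (int tid = 0; tid < nthread; ++tid) {
      hist[bin_id].Add(data[static_cast<size_t>(tid) * nbin + bin_id]);
    }
  }
  return 0;
}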
Because I have not yet set up Intel VTune to pinpoint the exact reason behind this (I have not installed the kernel drivers yet), I used Intel Inspector to look at that specific loop and check what happens, and it seems to create a huge amount of blocking at this specific line:
With Intel VTune, the issue seems to be located here: https://github.com/hcho3/xgboost-fast-hist-perf-lab/blob/master/src/build_hist.cc#L56-L62 - with 36 threads, it seems to spend about 1000 seconds of CPU time there (nearly 100% of the CPU time), with 85% waiting time and only 15% effective computation time (more on this in another issue).
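As a cross-check independent of the profilers, the 25% / 75% split quoted above can be confirmed with plain wall-clock timers around the two phases. This is only a sketch: BuildPartialHistograms and ReducePartialHistograms are hypothetical stubs standing in for the real phases in src/build_hist.cc.

#include <cstdio>
#include <omp.h>

// Hypothetical stand-ins for the two phases; stub bodies so the sketch
// compiles. In the lab they would call the real code.
static void BuildPartialHistograms() { /* per-thread accumulation phase */ }
static void ReducePartialHistograms() { /* the reduction loop quoted above */ }

int main() {
  const double t0 = omp_get_wtime();
  BuildPartialHistograms();            // reportedly ~75% of the time with 36 threads
  const double t1 = omp_get_wtime();
  ReducePartialHistograms();           // reportedly ~25% of the time with 36 threads
  const double t2 = omp_get_wtime();
  std::printf("build: %.3f s, reduction: %.3f s\n", t1 - t0, t2 - t1);
  return 0;
}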
(all the other "data race" reports below P2 are just the same issue on other threads, and P1 is only a general complaint about this line: https://github.com/hcho3/xgboost-fast-hist-perf-lab/blob/master/src/build_hist.cc#L19, but that one is expected)
The exact assembly code causing the race issue is shown below:
Full details are below (warning, big):