xgboost multicore scaling and NUMA improvement by xgboost version

szilard commented 3 years ago

xgboost's multicore scaling and NUMA behavior improved significantly starting with version 1.0.0.

Training times by version on r4.16xlarge (2 sockets, 16 cores + hyperthreading each) on 1 core, 16 cores (first socket, no hyperthreading) and all 64 logical cores, on 1M rows:

version	date	t 1c [s]	t 16c [s]	t 64c [s]
0.70	2017-12-31	30	15	37
0.72	2018-06-01	36	14	37
0.80	2018-08-13	27	14	34
0.82	2019-03-04	27	19	53
0.90	2019-05-20	27	18.7	53
1.0.0	2020-02-20	25	5.7	7.5
1.1.0	2020-05-17	30	3.8	3.6
1.2.0	2020-08-22	33	4.6	5
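
To make the scaling easier to read off the table, here is a minimal sketch that turns each row into speedups relative to the 1-core time. The function name `speedups` is hypothetical (not part of GBM-perf); it only assumes the rows are tab-separated in the column order shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads table rows (tab-separated, no header) on stdin,
# columns: version, date, t 1c [s], t 16c [s], t 64c [s].
# Prints: version, speedup 1c->16c, speedup 1c->64c.
speedups() {
  awk -F'\t' '{ printf "%s\t%.1f\t%.1f\n", $1, $3 / $4, $3 / $5 }'
}

# Two rows copied from the table above.
printf '0.90\t2019-05-20\t27\t18.7\t53\n1.1.0\t2020-05-17\t30\t3.8\t3.6\n' | speedups
```

For 0.90 this gives about 1.4x on 16 cores and 0.5x on 64 cores (slower than a single core), while 1.1.0 gets about 7.9x and 8.3x.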


git clone --recursive https://github.com/dmlc/xgboost 

VER=v1.2.0
cd xgboost && git checkout tags/$VER && git submodule init && git submodule update && cd R-package && R CMD INSTALL . && cd /

taskset -c 0 R < GBM-perf/cpu/run/2-xgboost.R
taskset -c 0-15 R < GBM-perf/cpu/run/2-xgboost.R
R < GBM-perf/cpu/run/2-xgboost.R

szilard commented 3 years ago

To make it easier to reproduce my numbers and to get new ones in the future and/or on other hardware, I made a separate Dockerfile for this:

https://github.com/szilard/GBM-perf/tree/master/analysis/xgboost_cpu_by_version

You'll need to set the CPU core ids of the physical (non-hyperthreaded) cores on the first socket (e.g. 0-15 on r4.16xlarge, which has 2 sockets with 16 cores + 16 hyperthreads each) and the xgboost version:

VER=v1.2.0
CORES_1SO_NOHT=0-15    ## set physical core ids on first socket, no hyperthreading
sudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) --build-arg VER=$VER -t gbmperf_xgboost_cpu_ver .
sudo docker run --rm -e CORES_1SO_NOHT=$CORES_1SO_NOHT gbmperf_xgboost_cpu_ver
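
On other hardware the right value for CORES_1SO_NOHT may not be obvious. A rough sketch, assuming a Linux host with `lscpu` from util-linux: the helper below (the name `first_socket_physical_cores` is hypothetical) keeps one logical CPU per physical core on socket 0. It prints a comma-separated list rather than a range, on the assumption that the value ends up in something like `taskset -c`, which accepts both forms.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lscpu -p=CPU,CORE,SOCKET` output on stdin and
# prints one logical CPU id per physical core on socket 0, comma-separated.
first_socket_physical_cores() {
  awk -F',' '
    /^#/ { next }                          # skip the commented header lines
    $3 == 0 && !seen[$2]++ {               # first logical CPU seen on each socket-0 core
      ids = ids (ids == "" ? "" : ",") $1
    }
    END { print ids }
  '
}

# Only query the real topology if lscpu is available on this machine.
if command -v lscpu >/dev/null 2>&1; then
  CORES_1SO_NOHT=$(lscpu -p=CPU,CORE,SOCKET | first_socket_physical_cores)
  echo "CORES_1SO_NOHT=$CORES_1SO_NOHT"
fi
```

It is worth checking the result against the full `lscpu -e` listing before relying on it, since CPU numbering differs between machines.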

It might be worth running the script several times: the training times on all cores usually show somewhat higher variability. I'm not sure whether that is because of the virtualization environment (EC2) or because of NUMA.
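
One simple way to look at that variability is to note the training time from each repeated run and summarize them. A minimal sketch, assuming the times (in seconds) have been collected by hand into a file with one value per line; the function name `summarize_times` and the file name in the comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one training time (seconds) per line on stdin
# and prints the number of runs plus min, median and max.
summarize_times() {
  LC_ALL=C sort -n | awk '
    { t[NR] = $1 }
    END {
      if (NR == 0) exit 1                  # nothing to summarize
      # Median of the sorted values: middle element, or mean of the two middle ones.
      med = (NR % 2) ? t[(NR + 1) / 2] : (t[NR / 2] + t[NR / 2 + 1]) / 2
      printf "n=%d min=%g median=%g max=%g\n", NR, t[1], med, t[NR]
    }'
}

# Usage (hypothetical file name):
#   summarize_times < times_all_cores.txt
```

Reporting the median together with the min/max spread makes it easier to compare the all-cores runs with the more stable single-socket runs.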

szilard commented 3 years ago

Discussion continued here: https://github.com/dmlc/xgboost/issues/3810#issuecomment-694715060
