Using your docker image (thanks!), on an AMD 3970X, I get this, so "just" slow, but not super-linear:
1 core (`ENV R_CMD="taskset -c 0 R --slave"`):
100k: 21s
1M: 101s

2 cores (`ENV R_CMD="taskset -c 0-1 R --slave"`):
100k: 14.4s
1M: 55s
Will have a look into the absolute speed; it looks a bit slow to me (probably because it's doing 64-bit math by default to be 100% reproducible on all hardware/software platforms, and because it's not self-regularizing as much by default and hence making more tree splits). Options to try are `nbins_cats = 100` and `min_split_improvement = 1e-3`.
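For reference, a minimal sketch of how those two options would be passed to `h2o.gbm()` from R; the frame and column names here are placeholders, not the benchmark's actual setup:

```r
library(h2o)
h2o.init()

# `train` is a placeholder H2OFrame with a binary target column "target";
# nbins_cats and min_split_improvement are standard h2o.gbm arguments.
md <- h2o.gbm(x = setdiff(names(train), "target"),
              y = "target",
              training_frame = train,
              nbins_cats = 100,              # default is 1024; fewer bins for categorical splits
              min_split_improvement = 1e-3)  # default is 1e-5; skips weak splits (more self-regularizing)
```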
Awesome, thanks @arnocandel for looking into this. Interesting that on your CPU (AMD) it's not super-linear. I should check on another Intel CPU (EC2). Will report back here.
On m5.xlarge (Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz) 10M rows:
10:h2o:1:0::2232.079:0.7763276
10:h2o:1:0::2224.741:0.7763276
10:h2o:1:0::2212.771:0.7763276
10:h2o:2:0-1::920.554:0.7763224
10:h2o:2:0-1::953.029:0.7763224
10:h2o:2:0-1::939.754:0.7763224
so super-linear (2.3x).
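For reference, the 2.3x figure is just the ratio of the median train times above (assuming the 2232.079-style field is the train time in seconds):

```r
# median train time on 1 core vs 2 cores, from the three runs above
t1 <- c(2232.079, 2224.741, 2212.771)   # 1 core  (taskset -c 0)
t2 <- c(920.554, 953.029, 939.754)      # 2 cores (taskset -c 0-1)
median(t1) / median(t2)                 # ~2.37, i.e. > 2x going from 1 to 2 cores
```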
Btw I noticed you ran only 0.1M and 1M sizes and I got super-linearity only for the 10M size. Maybe you can run it for 10M as well on your CPU?
@arnocandel I found this super-linear scaling also in lightgbm on certain CPUs (c5 instances) but in that case the reason was:
guolinke commented:
In version 3.0.0, LightGBM implements two different algorithms for tree learning:
one is better for single-thread, the other is good for multi-thread.
It runs a small test of the two before training and chooses the faster one.
So it is possible.
If you output the LightGBM logs, the [Warning] messages will say which algorithm was chosen.
Is it maybe something along the same lines in h2o?
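For comparison, a minimal sketch (assumed setup, placeholder data) of surfacing those LightGBM log lines from R:

```r
library(lightgbm)

# X: numeric feature matrix, y: 0/1 labels -- placeholders, not the benchmark data
dtrain <- lgb.Dataset(data = X, label = y)
params <- list(objective = "binary", verbosity = 1)  # verbosity >= 1 keeps [Info]/[Warning] output
bst <- lgb.train(params = params, data = dtrain, nrounds = 100)
# per guolinke's comment above, the [Warning] lines report which tree-learning algorithm was picked
```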
For 10M rows, h2o is 24x faster on 16 cores compared to 1 core. Any ideas why? @arnocandel
Timings (boxplot of 3 runs) on 0.1, 1, and 10M rows on 1, 2, 4, 8, 16 cores on r4.16xlarge, restricted to physical cores (no HT) on 1 socket only:
Speedups from n/2 to n cores: these should be <2, that is, below the red line.
How can the speedup from 1 core to 2 cores be >2 (2.4) on 10M rows?
Code:
run as:
where `$LCORES` is 0, 0-1, 0-3, 0-7, 0-15.