szilard / GBM-perf

Performance of various open source GBM implementations
MIT License

h2o weird multi-core scaling #39

Closed szilard closed 3 years ago

szilard commented 4 years ago

For 10M rows, h2o is 24x faster on 16 cores than on 1 core, i.e. super-linear scaling. Any ideas why? @arnocandel

Timings (boxplots of 3 runs) for 0.1M, 1M, and 10M rows on 1, 2, 4, 8, and 16 cores on r4.16xlarge, restricted to the physical cores (no HT) of 1 socket only:

[Screenshot: boxplots of run times by data size and number of cores]

Speedups from n/2 to n cores:

[Screenshot: speedups from n/2 to n cores by data size]

These should be <2, i.e. below the red line, since doubling the number of cores can at most double the speed.

How can the speedup from 1 core to 2 cores be >2 (2.4) on 10M rows?

Code:

library(h2o)

h2o.init()

## load the training and test data
dx_train <- h2o.importFile("train.csv")
dx_test <- h2o.importFile("test.csv")

## use all columns except the target as features
Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]

## print the elapsed training time (seconds)
cat(system.time({
  md <- h2o.gbm(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, 
          distribution = "bernoulli", 
          ntrees = 100, max_depth = 10, learn_rate = 0.1, 
          nbins = 100)
})[[3]]," ",sep="")
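
(dx_test is not used in the snippet above; it is presumably what produces the test AUC shown in the timing lines further down. A sketch of that step, not necessarily the exact benchmark code:)

h2o.auc(h2o.performance(md, newdata = dx_test))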

run as:

taskset -c $LCORES R --slave < $TOOL.R $NCORES 

where $LCORES is 0, 0-1, 0-3, 0-7, or 0-15 (for 1, 2, 4, 8, and 16 cores, respectively).
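
A minimal driver loop along these lines (sketch only; the script name h2o.R and any extra arguments of the actual harness are placeholders):

# pin the same R script to 1, 2, 4, 8, 16 physical cores in turn
for LCORES in 0 0-1 0-3 0-7 0-15; do
  taskset -c $LCORES R --slave < h2o.R
done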

arnocandel commented 4 years ago

Using your docker image (thanks!), on an AMD 3970X, I get this, so "just" slow, but not super-linear:

1 core (ENV R_CMD="taskset -c 0 R --slave"): 100k: 21s 1M: 101s

2 cores (ENV R_CMD="taskset -c 0-1 R --slave"): 100k: 14.4s 1M: 55s

Will have a look into the absolute speed, it looks a bit slow to me (probably because it's doing 64-bit math by default to be 100% reproducible on all hardware/software platforms, and because it's not self-regularizing as much by default and hence making more tree splits). Options to try are nbins_cats = 100 and min_split_improvement = 1e-3.
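
For example, added to the h2o.gbm call from above (just a sketch, everything else kept as in the original run):

md <- h2o.gbm(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train,
        distribution = "bernoulli",
        ntrees = 100, max_depth = 10, learn_rate = 0.1,
        nbins = 100,
        nbins_cats = 100,              ## fewer bins for categorical splits
        min_split_improvement = 1e-3)  ## require a minimum gain before splitting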

szilard commented 4 years ago

Awesome, thanks @arnocandel for looking into this. Interesting that on your CPU (AMD) it's not super-linear. I should check on another Intel CPU (EC2) and will report back here.

szilard commented 4 years ago

On m5.xlarge (Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz) 10M rows:

10:h2o:1:0::2232.079:0.7763276
10:h2o:1:0::2224.741:0.7763276
10:h2o:1:0::2212.771:0.7763276

10:h2o:2:0-1::920.554:0.7763224
10:h2o:2:0-1::953.029:0.7763224
10:h2o:2:0-1::939.754:0.7763224

so super-linear (2.3x).
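
Quick arithmetic on the mean of the 3 runs above:

mean(c(2232.079, 2224.741, 2212.771)) / mean(c(920.554, 953.029, 939.754))
## ~2.37, above the theoretical maximum of 2 for going from 1 to 2 cores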

Btw, I noticed you ran only the 0.1M and 1M sizes, and I got super-linearity only for the 10M size. Maybe you can run it for 10M as well on your CPU?

szilard commented 4 years ago

@arnocandel I found this super-linear scaling also in lightgbm on certain CPUs (c5 instances) but in that case the reason was:

guolinke commented 7 hours ago
In the 3.0.0 version, LightGBM implements 2 different algorithms for tree learning:
one is better for single-thread, the other is better for multi-thread.
It runs a small test of the two before training and chooses the faster one,
so this is possible.
If you output LightGBM's [Warning] logs, there will be information about the chosen algorithm.

Is it maybe something along the same lines in h2o?
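
For reference, in LightGBM that choice can be pinned (or simply observed in the logs) like this; a sketch with the R package, where X_train / y_train are placeholders for the feature matrix and label vector:

library(lightgbm)

dtrain <- lgb.Dataset(data = X_train, label = y_train)

params <- list(objective = "binary",
               num_threads = 2,
               force_col_wise = TRUE)   ## or force_row_wise = TRUE; leave both unset to let
                                        ## LightGBM run its pre-training test and log which
                                        ## algorithm it auto-chose

md <- lgb.train(params = params, data = dtrain, nrounds = 100)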