szilard / GBM-perf

Performance of various open source GBM implementations

Spark MLlib GBT #17

Open szilard opened 5 years ago

szilard commented 5 years ago

https://github.com/szilard/GBM-perf/tree/master/wip-testing/spark
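
A minimal sketch of what is being timed in the runs below (the actual script is in the linked directory; `d_train`/`d_test` with a `label` column and an assembled `features` column are assumed here, and `res1`/`res2` mirror the REPL values quoted below):

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// d_train / d_test: DataFrames with "label" and assembled "features" columns (assumed)
val gbt = new GBTClassifier()
  .setMaxIter(10)   // number of trees: 10 or 100 in the runs below
  .setMaxDepth(10)  // tree depth used in the depth-10 runs below

val t0 = System.nanoTime()
val model = gbt.fit(d_train)
val res1 = (System.nanoTime() - t0) / 1e9   // training time [s]

val res2 = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")            // AUC on the test set
  .evaluate(model.transform(d_test))
```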

szilard commented 5 years ago

First run, with 10 trees (time in seconds, AUC):

| size | time [s] | AUC    | train partitions observed |
|------|----------|--------|---------------------------|
| 0.1m | 85.6     | 0.7112 | 2 (auto)                  |
| 1m   | 117.1    | 0.7282 | 12                        |
| 10m  | 857.7    | 0.7307 | 32 (r4.8xlarge, 32 cores) |
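
The partition counts above can be read off the training DataFrame; a minimal check (`d_train` assumed as in the sketch above):

```scala
// Number of partitions Spark uses for the training data:
println(d_train.rdd.getNumPartitions)
```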

       
| size | LightGBM [s] | Spark, 10 trees [s] | adjusted ratio |
|------|--------------|---------------------|----------------|
| 0.1m | 2.4          | 85                  | 354            |
| 1m   | 5.2          | 117                 | 225            |
| 10m  | 42           | 857                 | 204            |

(the ratio is adjusted assuming 100 trees take 10x the time of 10 trees, i.e. Spark's 10-tree time is multiplied by 10 before dividing by LightGBM's 100-tree time; e.g. for 0.1m: 85 × 10 / 2.4 ≈ 354)

szilard commented 5 years ago

100 trees (time in seconds, AUC):

| size | time [s] | AUC    |
|------|----------|--------|
| 0.1m | 1020.2   | 0.7213 |
| 1m   | 1383.3   | 0.7485 |
| 10m  | 8390.8   | 0.7553 |

| size | time lgbm [s] | time Spark [s] | ratio | AUC lgbm | AUC Spark |
|------|---------------|----------------|-------|----------|-----------|
| 0.1m | 2.4           | 1020           | 425   | 0.730    | 0.721     |
| 1m   | 5.2           | 1380           | 265   | 0.764    | 0.748     |
| 10m  | 42            | 8390           | 200   | 0.774    | 0.755     |

szilard commented 5 years ago

old results (2015-2016) summary for comparison:

random forest:

max_depth = 20, n_trees = 500

| tool    | size | time [s] | RAM [GB] | AUC         |
|---------|------|----------|----------|-------------|
| h2o     | 1M   | 600      | 5        | 0.755       |
| spark   | 1M   | 2000     | 130      | 0.714 (bug) |
| xgboost | 1M   | 170      | 2        | 0.753       |

GBM:

learn_rate = 0.1, max_depth = 6, n_trees = 300

| tool    | size | time [s] | RAM [GB] | AUC   |
|---------|------|----------|----------|-------|
| h2o     | 1M   | 60       | 2        | 0.743 |
| spark   | 1M   | 6000     | 30       | 0.738 |
| xgboost | 1M   | 45       | 1        | 0.745 |
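
For reference, the old-benchmark GBM settings expressed in the Spark ML API (a sketch of the parameter mapping, not the code that produced those numbers):

```scala
import org.apache.spark.ml.classification.GBTClassifier

val gbtOld = new GBTClassifier()
  .setStepSize(0.1)  // learn_rate
  .setMaxDepth(6)    // max_depth
  .setMaxIter(300)   // n_trees
```
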
szilard commented 5 years ago

various numbers of trees and RAM usage:

10M recs (2GB RDD)

| trees | depth | time [s] | AUC    | RAM [GB] |
|-------|-------|----------|--------|----------|
| 1     | 10    | 93.0     | 0.7125 | 112      |
| 10    | 10    | 828.7    | 0.7308 | 125      |
| 100   | 10    | 8069.0   | 0.7553 | 235      |
| 1     | 1     | 70.3     | 0.6347 | 110      |
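
The cached data size ("2GB RDD") can be read off Spark's storage info once the data is materialized (a sketch; `d_train` assumed as before, and the RAM [GB] figures above are presumably process-level memory, which this does not show):

```scala
// Materialize the cache, then inspect what Spark reports for it
d_train.cache()
d_train.count()
spark.sparkContext.getRDDStorageInfo.foreach { i =>
  println(f"${i.name}: ${i.memSize / 1e9}%.1f GB in memory, ${i.numCachedPartitions} cached partitions")
}
```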

szilard commented 5 years ago

| trees | depth | 100M: time [s] | 100M: AUC | 100M: RAM [GB] | 10M: time [s] | 10M: AUC | 10M: RAM [GB] |
|-------|-------|----------------|-----------|----------------|---------------|----------|---------------|
| 1     | 1     | 1150           | 0.634     | 620            | 70            | 0.635    | 110           |
| 1     | 10    | 1350           | 0.712     | 620            | 90            | 0.712    | 112           |
| 10    | 10    | 7850           | 0.731     | 780            | 830           | 0.731    | 125           |
| 100   | 10    | crash (OOM)    |           | >960 (OOM)     | 8070          | 0.755    | 230           |

100M ran on: x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

10M ran on: r4.8xlarge (32 cores, 1 NUMA, 240GB RAM)

szilard commented 5 years ago

Spark multicore scaling (and NUMA):

| server      | cores | partitions | time [s] | speedup vs 1 core |
|-------------|-------|------------|----------|-------------------|
| r4.8xlarge  | 32    | 32         | 830      | 7.9               |
| r4.8xlarge  | 16    | 16         | 810      | 8.1               |
| r4.8xlarge  | 8     | 8          | 1040     | 6.3               |
| r4.8xlarge  | 2     | 2 (m)      | 3520     | 1.9               |
| r4.8xlarge  | 1     | 1 (m)      | 6560     | 1.0               |
| r4.16xlarge | 64    | 63 (a)     | 680      | 9.6               |
| r4.8xlarge  | 32    | 64 (m)     | 830      | 7.9               |
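
A sketch of how one such scaling run can be set up, assuming "(m)" marks a manually set partition count and "(a)" the automatic one:

```scala
// Pin Spark to N cores at shell launch, e.g.:
//   spark-shell --master "local[16]"

// "(m)" rows: set the number of training partitions by hand;
// otherwise Spark picks the count automatically ("(a)" row).
val d_train16 = d_train.repartition(16)
println(d_train16.rdd.getNumPartitions)   // should print 16
```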

(Spark does not slow down on multi-socket/NUMA servers; it's already so slow that NUMA is not a factor)