szilard / GBM-perf

Performance of various open source GBM implementations
MIT License

Spark MLlib GBT 100M dataset #18

Open szilard opened 5 years ago

szilard commented 5 years ago
```
du -sm *.csv
467     train-10m.csv
47      train-1m.csv
5       train-0.1m.csv
```

```
du -sm *.parquet
2385    spark_ohe-train-100m.parquet
239     spark_ohe-train-10m.parquet
25      spark_ohe-train-1m.parquet
3       spark_ohe-train-0.1m.parquet
```
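The `spark_ohe-*.parquet` files are one-hot encoded versions of the CSVs, reduced to a numeric `label` column and an assembled `features` vector. The prep step presumably looks something like the sketch below (a hedged illustration, not the repo's actual prep script; the column names `cat1`, `cat2`, `num1`, `target` are hypothetical placeholders):

```scala
// Hedged sketch of producing the spark_ohe-* parquet files: StringIndexer +
// one-hot encoding for categorical columns, VectorAssembler for the "features"
// vector, and an indexed numeric "label". Column names cat1, cat2, num1, target
// are hypothetical placeholders for the dataset's real columns.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}

val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("train-1m.csv")

val catCols = Array("cat1", "cat2")
val indexers = catCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
val ohe = new OneHotEncoderEstimator()            // renamed to OneHotEncoder in Spark 3.x
  .setInputCols(catCols.map(_ + "_idx"))
  .setOutputCols(catCols.map(_ + "_ohe"))
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(_ + "_ohe") :+ "num1")
  .setOutputCol("features")
val labelIdx = new StringIndexer().setInputCol("target").setOutputCol("label")

val stages: Array[PipelineStage] = indexers ++ Array(ohe, assembler, labelIdx)
val prepped = new Pipeline().setStages(stages).fit(raw).transform(raw).select("label", "features")

prepped.write.mode("overwrite").parquet("spark_ohe-train-1m.parquet")
```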

```
free -m
              total        used        free      shared  buff/cache   available
Mem:         245854         568      244920           8         365      244043
```

```
lscpu
CPU(s):                32
```

```
${SPARK_ROOT}/bin/spark-shell --master local[*] --driver-memory 220G --executor-memory 220G
```

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val d_train = spark.read.parquet("spark_ohe-train.parquet").cache()
val d_test = spark.read.parquet("spark_ohe-test.parquet").cache()
(d_train.count(), d_test.count())
```

```
free -m
              total        used        free      shared  buff/cache   available
Mem:         245854       64579      178405           8        2868      180025
```

(screenshot: 2019-05-08, 5:50 AM)


```scala
val gbt = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").
  setMaxIter(100).setMaxDepth(10).setStepSize(0.1).
  setMaxBins(100).setMaxMemoryInMB(10240)     // max possible setMaxMemoryInMB (otherwise errors out)
val pipeline = new Pipeline().setStages(Array(gbt))

val now = System.nanoTime
val model = pipeline.fit(d_train)
```
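For reference, the time and AUC figures quoted in the comments below are presumably computed along these lines once `fit()` returns (a sketch using the `BinaryClassificationEvaluator` imported above, not the exact evaluation code from this run):

```scala
// Sketch: elapsed training time from the System.nanoTime captured above,
// plus test AUC via the BinaryClassificationEvaluator imported earlier.
val elapsedSec = (System.nanoTime - now) / 1e9

val scored = model.transform(d_test)
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")   // GBTClassificationModel emits rawPrediction in Spark 2.2+
  .setMetricName("areaUnderROC")
  .evaluate(scored)

println(s"train time: $elapsedSec s,  test AUC: $auc")
```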
szilard commented 5 years ago

(screenshots: 2019-05-08, 5:55 AM)

starts spilling to disk

(screenshots: 2019-05-08, 6:00–6:08 AM)

no more disk writes

(screenshots: 2019-05-08, 6:25–6:27 AM)

job fails

(screenshots: 2019-05-08, 6:36 AM)

szilard commented 5 years ago

moar RAM:

x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
Stepping:              4
CPU MHz:               2699.984
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.10
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida
```
```
~/spark-2.4.2-bin-hadoop2.7/bin/spark-shell --master local[*] --driver-memory 940G --executor-memory 940G
```

(screenshot: 2019-05-08, 1:41 PM)

```
scala> val model = pipeline.fit(d_train)
[Stage 443:>                                                      (0 + 32) / 32]
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007eb838e80000, 51384942592, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 51384942592 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/GBM-perf/wip-testing/spark/hs_err_pid2301.log
```

(screenshot: 2019-05-08, 11:51 PM)

szilard commented 5 years ago

Let's try to learn only 1 tree of depth 1:
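Presumably the only change versus the setup above is the tree parameters, along these lines (sketch):

```scala
// Same pipeline as above, with a single tree (maxIter = 1) of depth 1;
// all other settings unchanged.
val gbt_1_1 = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").
  setMaxIter(1).setMaxDepth(1).setStepSize(0.1).
  setMaxBins(100).setMaxMemoryInMB(10240)
val model_1_1 = new Pipeline().setStages(Array(gbt_1_1)).fit(d_train)
```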

runs in 1150 sec, AUC = 0.634, RAM usage 620 GB

(screenshots: 2019-05-09, 12:18–12:19 PM)

szilard commented 5 years ago

1 tree depth 10:

runs in 1350 sec, AUC = 0.712, RAM usage 620 GB

(screenshots: 2019-05-09, 12:56 PM)

szilard commented 5 years ago

10 trees depth 10:

runs in 7850 sec, AUC = 0.731, RAM usage 780 GB

szilard commented 5 years ago
| trees | depth | 100M time [s] | 100M AUC | 100M RAM [GB] | 10M time [s] | 10M AUC | 10M RAM [GB] |
|------:|------:|--------------:|---------:|--------------:|-------------:|--------:|-------------:|
| 1     | 1     | 1150          | 0.634    | 620           | 70           | 0.635   | 110          |
| 1     | 10    | 1350          | 0.712    | 620           | 90           | 0.712   | 112          |
| 10    | 10    | 7850          | 0.731    | 780           | 830          | 0.731   | 125          |
| 100   | 10    | crash (OOM)   |          | >960 (OOM)    | 8070         | 0.755   | 230          |

100M ran on: x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

10M ran on: r4.8xlarge (32 cores, 1 NUMA, 240GB RAM)