szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

upgrade H2O to 3.0 #13

Closed: szilard closed this issue 8 years ago

szilard commented 9 years ago

Spot check runtime/AUC/RAM for linear and RF for at least 1 size.

szilard commented 9 years ago

Random forests, n = 1M, 500 trees:

- h2o 2.8: time 600s, RAM 5GB, AUC 75.5 (https://github.com/szilard/benchm-ml/blob/master/2-rf/4-h2o.R)
- h2o 3.0: time 450s, RAM 5GB, AUC 73.4 (https://github.com/szilard/benchm-ml/blob/master/2-rf/4-h2o-v3.R)

AUC is lower in 3.0. cc: @arnocandel
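For reference, a minimal sketch of what this RF run looks like with the h2o 3 R API. The file names, memory setting, and column indices below are placeholders rather than the benchmark's exact values; the linked 4-h2o-v3.R script has the real setup.

```r
library(h2o)

# start a local H2O cluster (memory size here is a placeholder)
h2o.init(max_mem_size = "16g", nthreads = -1)

# import train/test data; paths are placeholders for the airline-delay files
dx_train <- h2o.importFile("train-1m.csv")
dx_test  <- h2o.importFile("test.csv")

# assume the response is the last column and everything else is a predictor
y <- ncol(dx_train)
x <- 1:(ncol(dx_train) - 1)

# random forest with 500 trees, as in the runs above
system.time(
  md <- h2o.randomForest(x = x, y = y, training_frame = dx_train, ntrees = 500)
)

# AUC on the test set
h2o.auc(h2o.performance(md, dx_test))
```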

szilard commented 9 years ago

GLM, n = 10M:

- h2o 2.8: time 5s, RAM 3GB, AUC 71.0 (https://github.com/szilard/benchm-ml/blob/master/1-linear/4-h2o.R)
- h2o 3.0: time 25s, RAM 4GB, AUC 71.1 (https://github.com/szilard/benchm-ml/blob/master/1-linear/4-h2o-v3.R)

Run time is longer in 3.0.
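With the h2o 3 R API, the GLM run is a one-call change relative to the RF sketch above (same placeholder frames and x/y; the linked 4-h2o-v3.R script has the exact arguments):

```r
# logistic regression on the same placeholder frames as in the RF sketch
system.time(
  md_glm <- h2o.glm(x = x, y = y, training_frame = dx_train, family = "binomial")
)
h2o.auc(h2o.performance(md_glm, dx_test))
```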

szilard commented 9 years ago

GLM: @arnocandel says GLM is slower because "the models now also compute training/validation metrics such as AUC while building the model".

szilard commented 9 years ago

Random forests, n = 1M, 500 trees, h2o 3.0.0.16: time 600s, RAM 5GB, AUC 75.2 (AUC is better now) (https://github.com/szilard/benchm-ml/blob/master/2-rf/4-h2o-v3.R)

szilard commented 9 years ago

GBM, n = 1M, learn_rate = 0.1, max_depth = 6, n_trees = 300 (experiment B in the main README):

- h2o-2: time 60s, RAM 5GB, AUC 74.3
- h2o-3.0.0.16: time 40s, RAM 10GB, AUC 75.1 (+++)

GBM, n = 1M, learn_rate = 0.01, max_depth = 16, n_trees = 1000 (experiment A in the main README):

- h2o-2: time 900s, RAM 9GB, AUC 75.9
- h2o-3.0.0.16: time 900s, RAM 10GB, AUC 76.0

GBM, n = 10M, learn_rate = 0.01, max_depth = 20, n_trees = 5000, nbins = 1000:

- h2o-2: time 7.5hrs, AUC 79.8
- h2o-3.0.0.16: time 9.5hrs, AUC 81.2 (+++)
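As a rough sketch of how the experiment A settings above map onto the h2o 3 R API (reusing the placeholder frames and x/y from the RF sketch earlier in this thread; the repo's GBM scripts have the exact setup):

```r
# GBM with the experiment A settings quoted above
system.time(
  md_gbm <- h2o.gbm(x = x, y = y, training_frame = dx_train,
                    ntrees = 1000, max_depth = 16, learn_rate = 0.01)
)
h2o.auc(h2o.performance(md_gbm, dx_test))

# the n = 10M run above additionally raises the histogram resolution via nbins = 1000
```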