A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
Splitting https://github.com/szilard/benchm-ml/issues/5 in two: random forest here, logistic regression in a separate issue.
Summary: Random forest in Spark has low AUC (and is also slower, with a larger memory footprint).
For n = 100K, Spark gets AUC = 0.65 vs e.g. 0.72/0.73 with H2O/xgboost.

Code here: https://github.com/szilard/benchm-ml/blob/master/2-rf/5b-spark.txt
Train data here: https://s3.amazonaws.com/benchm-ml--spark/spark-train-0.1m.csv
Test data here: https://s3.amazonaws.com/benchm-ml--spark/spark-test-0.1m.csv
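For reference, the AUC figures above are area under the ROC curve computed from each model's predicted scores on the test set. A minimal, dependency-free sketch of that computation using the rank-sum (Mann-Whitney U) formulation, for anyone reproducing the comparison (the `auc` function below is my own illustration, not part of the benchmark code):

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored higher than a randomly chosen negative one,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: two positives, two negatives; one discordant pair -> AUC = 0.75
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Note that this requires continuous scores (class probabilities): if a model only emits hard 0/1 predictions, the AUC it can achieve is capped well below what the same model would get from probabilities.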
Originally ran on Spark 1.3.0; same results on 1.4.0 (a bit faster, but same AUC).
Can you guys look at the code and optimize/improve it, especially to get a better AUC?