A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
Splitting https://github.com/szilard/benchm-ml/issues/5 in two: random forest here, logistic regression in a separate issue.
Summary: Random forest in Spark has low AUC (and is also slower, with a larger memory footprint).
For n = 100K, Spark gets AUC = 0.65 vs e.g. 0.72/0.73 with H2O/xgboost.

Code here: https://github.com/szilard/benchm-ml/blob/master/2-rf/5b-spark.txt
Train data here: https://s3.amazonaws.com/benchm-ml--spark/spark-train-0.1m.csv
Test data here: https://s3.amazonaws.com/benchm-ml--spark/spark-test-0.1m.csv
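For reference, the AUC figures above are area under the ROC curve computed from each model's predicted scores on the test set. A minimal, dependency-free sketch of that computation using the rank-sum (Mann-Whitney U) formulation, for anyone reproducing the comparison (the `auc` function below is my own illustration, not part of the benchmark code):

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored higher than a randomly chosen negative one,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: two positives, two negatives; one discordant pair -> AUC = 0.75
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Note that this requires continuous scores (class probabilities): if a model only emits hard 0/1 predictions, the AUC it can achieve is capped well below what the same model would get from probabilities.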
Originally ran on Spark 1.3.0; same results on 1.4.0 (a bit faster, but same AUC).
Can you guys look at the code and optimize/improve it, especially to get a better AUC?