szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

Spark random forest low AUC etc #16

Closed. szilard closed this issue 8 years ago

szilard commented 9 years ago

Splitting https://github.com/szilard/benchm-ml/issues/5 in two: random forest here, logistic regression in a different issue.

Summary: Random forest in Spark has low AUC (and is also slower, with a larger memory footprint).

For n = 100K Spark gets AUC = 0.65 vs e.g. 0.72/0.73 in H2O/xgboost.
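For readers unfamiliar with the metric being compared, here is a minimal sketch of how AUC can be computed from true labels and predicted scores, using the rank-sum (Mann-Whitney) identity. It is an illustration only, not the code used in this benchmark: the function name `auc` and the toy data are made up for the example, and the benchmark's own evaluation may be implemented differently.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 true classes; scores: predicted scores, where a
    higher score means "more likely positive". Tied scores get average ranks.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the end (exclusive) of the run of scores tied with pairs[i].
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j all receive their average rank.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    # U statistic for the positives, normalised by the number of pos/neg pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy data (hypothetical, not from the benchmark files).
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a gap such as 0.65 vs 0.72/0.73 on the same test set is worth investigating.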

Code here: https://github.com/szilard/benchm-ml/blob/master/2-rf/5b-spark.txt
Train data here: https://s3.amazonaws.com/benchm-ml--spark/spark-train-0.1m.csv
Test data here: https://s3.amazonaws.com/benchm-ml--spark/spark-test-0.1m.csv

Originally ran on Spark 1.3.0, but the results are the same on 1.4.0 (a bit faster, but same AUC).

Can you guys look at the code and optimize it/make it better, especially get better AUC?

szilard commented 8 years ago

Continued here https://github.com/szilard/benchm-ml/issues/19