A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
Great initiative, thanks for making this public!
You might be interested in extending your benchmark to auto-sklearn: https://github.com/automl/auto-sklearn
I have created a script that takes a sparse dataset stored as a pandas DataFrame in HDF5 (.h5) format and runs binary classification on it with auto-sklearn on a multiprocessing cluster: https://github.com/Motorrat/autosklearn-zeroconf I will try to reproduce your benchmark myself, but in case you get to it first, you might want to try it out yourself.
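To give a rough idea of the workflow, here is a minimal sketch. It is not the actual autosklearn-zeroconf code: the file name `train.h5`, the HDF5 key `data`, the target column `label`, and the helper names are all placeholders I made up for illustration, and the time budget is arbitrary.

```python
import pandas as pd


def split_target(df, target_col):
    """Separate the feature columns from the binary label column."""
    return df.drop(columns=[target_col]), df[target_col]


def load_dataset(h5_path, key, target_col):
    """Read a DataFrame saved with DataFrame.to_hdf (needs the PyTables package)."""
    df = pd.read_hdf(h5_path, key=key)
    return split_target(df, target_col)


def fit_automl(X, y, time_budget_s=600, n_jobs=1):
    """Run an auto-sklearn search for a binary classifier within a time budget."""
    # Imported lazily so the helpers above work without auto-sklearn installed.
    from autosklearn.classification import AutoSklearnClassifier

    model = AutoSklearnClassifier(
        time_left_for_this_task=time_budget_s,  # total search budget in seconds
        n_jobs=n_jobs,                          # number of parallel workers
    )
    model.fit(X, y)
    return model
```

Usage would look something like `X, y = load_dataset("train.h5", key="data", target_col="label")` followed by `model = fit_automl(X, y, n_jobs=4)` and `model.predict_proba(X_test)`, with the placeholder names replaced by your own.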