szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

ranger - R RF package #34

Closed: szilard closed this 8 years ago

szilard commented 8 years ago

Request from Carlos Ortega http://datascience.la/benchmarking-random-forest-implementations/#comment-65711

Quick code here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/8b-ranger.R

Results (32 cores/1M records/100 trees): AUC 0.715, train time 90s

Maybe it can be improved...
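
For reference, here is a minimal sketch of that kind of run (not the linked 8b-ranger.R itself; the file names and the dep_delayed_15min target with "Y"/"N" coding are assumptions based on the benchmark setup):

```r
## Minimal sketch, not the linked 8b-ranger.R; file names, the
## dep_delayed_15min target and its "Y"/"N" coding are assumptions.
library(ranger)
library(ROCR)

d_train <- read.csv("train-1m.csv", stringsAsFactors = TRUE)
d_test  <- read.csv("test.csv",     stringsAsFactors = TRUE)

## train a 100-tree probability forest on 32 threads and time it
system.time(
  md <- ranger(dep_delayed_15min ~ ., data = d_train,
               num.trees = 100, probability = TRUE, num.threads = 32)
)

## AUC on the test set
phat <- predict(md, data = d_test)$predictions[, "Y"]
performance(prediction(phat, d_test$dep_delayed_15min), "auc")@y.values[[1]]
```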

PhilippPro commented 8 years ago

Why don't you run it with 500 trees, like the other implementations? ;)

szilard commented 8 years ago

Because I'm running the simplified "absolute minimum benchmark" https://github.com/szilard/benchm-ml/tree/master/z-other-tools for all the tools other than the ones of primary focus.

PhilippPro commented 8 years ago

Ok, that's reasonable. For the AUC benchmarking I see a problem: there is high variance (since the trees trained are not always the same). See for example my post, where you can see the quantiles across repetitions: http://stats.stackexchange.com/questions/194560/difference-in-randomforestsrc-and-randomforest-package-increasing-oob-error-cu

For more comparable results one should repeat the training several times or increase the number of trees, since the random forest converges as trees are added (see Breiman's paper). Moreover, I think ranger is better for high-dimensional problems (see http://arxiv.org/pdf/1508.04409v1.pdf) and Rborist for a high number of observations (in R).
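
For illustration, a minimal sketch of the repeat-the-training idea (assuming d_train/d_test and the dep_delayed_15min target as in the benchmark scripts):

```r
## Sketch only: repeat the ranger fit a few times and look at the AUC spread.
## d_train, d_test and the dep_delayed_15min target are assumed.
library(ranger)
library(ROCR)

auc <- replicate(10, {
  md <- ranger(dep_delayed_15min ~ ., data = d_train,
               num.trees = 100, probability = TRUE)
  phat <- predict(md, data = d_test)$predictions[, "Y"]
  performance(prediction(phat, d_test$dep_delayed_15min), "auc")@y.values[[1]]
})
quantile(auc)   # spread across repetitions
```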

I am doing a benchmark experiment with many datasets and parameter settings and will also compare different R and other software packages. As your post is very helpful for using several software packages, I will cite it (if you have a preference for how it should be cited, let me know).

suiji commented 8 years ago

PhilippPro,

The Wright/Ziegler preprint reports comparisons using the initial CRAN release of the Arborist, a version now quite out of date. The Github version, on the other hand, is quite recent and features a number of improvements, especially with respect to memory footprint and core occupancy. In fact, Szilard's results are obtained using the Github version.

That being said, the premise of Marvin and Andreas' preprint still holds, but I would modify your claim as follows:

i) For high-dimensional problems, Ranger is faster when 'mtry' is low as a percentage of the predictor count. The relative advantage decreases as 'mtry' approaches the full predictor count, a fact which may or may not be useful depending, for example, on how highly correlated the predictors are expected to be.

ii) Ranger's relative speed advantage diminishes with row count. Again, this may or may not be useful for a given data set.

The reason the Arborist tends to do well at higher row count is that it restages the data at each level of the tree. That is, it employs a stable partition to effect a re-sorting of each predictor at each node. This scheme improves data locality and is quite amenable to SIMD parallelization, which makes porting to the GPU very straightforward. The drawback to restaging is that it is not particularly work-efficient when node/predictor pairs have a low probability of splitting, such as when 'mtry' is a small fraction of the predictor count.
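
As a toy illustration of the stable-partition idea (illustrative R only, not Arborist code; the names are made up for the example):

```r
## Toy illustration of stable-partition restaging: each predictor is
## pre-sorted once; at a split, samples are re-partitioned by child node
## while keeping their sorted order within each node.
set.seed(1)
x      <- runif(10)                              # one predictor
staged <- order(x)                               # sample indices in sorted-x order
node   <- sample(c(1L, 2L), 10, replace = TRUE)  # child node of each sample

restaged <- unlist(split(staged, node[staged]), use.names = FALSE)
tapply(x[restaged], node[restaged], is.unsorted) # FALSE FALSE: still sorted per node
```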

Very early versions of the Arborist did not employ restaging, but instead used an approach based on accumulators. These versions scaled well with predictor count but did not promote data locality, nor did they port well to GPUs. Ultimately, though, the Arborist should probably re-introduce such an approach for the low mtry-as-a-fraction-of-predictor-count regime.

szilard commented 8 years ago

@PhilippPro the variability should not be very high, especially with a large number of records. For example, in the "absolute minimum benchmark" with 1M records and only 100 trees, running xgboost 10 times gives AUC:

```
> mean(auc)
[1] 0.7514123
> sd(auc)
[1] 0.001860182
> 
> summary(auc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7474  0.7508  0.7519  0.7514  0.7523  0.7538 
```

(and with 500 trees the variability should be even smaller).

The difference between tools/implementations is larger than this: e.g. ranger's 0.715 above vs xgboost's ~0.751 is roughly 20 times that standard deviation.

PhilippPro commented 8 years ago

Ok. But as you pointed out in your blog post, for a real benchmarking study more datasets are needed than just this one, so I think it is dangerous to draw strong conclusions about the performance of the different packages from a single dataset. The speed comparison is probably more interesting, since the variability (of rankings) should not be as high as the performance variability.

szilard commented 8 years ago

Yes, "more datasets are needed...". Yes, "speed comparison..."

One thing a single dataset is good for, AUC-wise: if there are 5 implementations and, say, 4 of them give you an AUC between 0.70 and 0.71 but one gets you 0.61, that can point to a potential bug or other issue.