Open andirey opened 6 years ago
The sklearn implementation is quite fast, too. It was fully reimplemented in C++ (Cython) not too long ago. My own experiments with the diamonds data in ggplot showed it was even faster than ranger if no OOB scoring was done (regression problem, not classification). With OOB, they were about equal. But of course it depends not only on the data set tested but also on the system. It would be great to see an updated version of the benchmark in your link, which is from 2015.
Here is a link to updated benchmarks with source code you can modify to compare ranger: https://github.com/szilard/benchm-ml
Link to benchmarking code: https://github.com/szilard/benchm-ml/tree/master/z-other-tools
Thx for the update @samleegithub. I could not find the word "ranger" on that page. Is it really part of that benchmark?
I didn't see any benchmarks for ranger specifically, but they do have code that you can modify to compare ranger versus other implementations.
For data with 1M observations and 100 trees on a MacBook Pro with 4 cores:
ranger: 128 seconds
scikit-learn: 509 seconds
That's exactly what I also see - R "ranger" is much faster than Python.
Another fact to keep in mind while benchmarking is that ranger offers respect.unordered.factors = "partition", which handles categorical variables by evaluating all possible two-way splits of a k-level unordered factor (2^(k-1) - 1 possibilities); 'ignore' is nearly label encoding. By contrast, the trees, and thereby the random forest classifier/regressor, in sklearn 0.22 do not support categorical variables, so categorical variables need some encoding.
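To get a feel for how quickly the "partition" search space grows, here is a minimal Python sketch that enumerates the distinct two-way splits of an unordered factor, counting a split and its mirror image (left and right groups swapped) only once. The helper name `two_way_partitions` is made up for illustration and is not part of ranger or sklearn; the five levels are those of the `cut` variable in the diamonds data.

```python
from itertools import combinations


def two_way_partitions(levels):
    """Yield each distinct split of `levels` into two non-empty groups.

    Hypothetical helper for illustration only; assumes the levels are distinct.
    """
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    # Pin the first level to the left group so that a split and its mirror
    # image (left and right swapped) are only produced once.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(level for level in levels if level not in left)
            if right:  # skip the trivial split with an empty right side
                yield left, right


cuts = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
splits = list(two_way_partitions(cuts))
print(len(splits))  # 2**(5 - 1) - 1 = 15
```

The count roughly doubles with every extra level, so a factor with a few dozen levels is already expensive to split exhaustively. That extra work is worth keeping in mind when comparing timings against an implementation that only sees encoded numeric columns.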
Just wanted to throw my experience in here, simply because it differs heavily from yours. Data set: 39,668 observations, 166 features. Windows 10, Ryzen 5 3600, scikit-learn run in a Jupyter notebook and ranger run from Ubuntu 20.04 through WSL. Both forests using all cores and 500 trees.
ranger: ~1 minute
scikit-learn: ~6 seconds
randomForest (R): ~2.5 minutes
Ranger is a huge improvement over the original Random Forest implementation for R, but in my experience it appears to be immensely slower than scikit-learn's implementation.
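Since the numbers in this thread vary so much between machines and data sets, it may help to time fits the same way everywhere. Below is a minimal Python sketch of a timing helper; the name `time_fit` and the dummy workload are assumptions for illustration, not part of any library mentioned here.

```python
import statistics
import time


def time_fit(fit, repeats=3):
    """Return the median wall-clock seconds of calling `fit()` `repeats` times."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs without any ML library installed.
def dummy_fit():
    sum(i * i for i in range(100_000))


elapsed = time_fit(dummy_fit)
print(f"median fit time: {elapsed:.4f} s")
```

With scikit-learn installed, `fit` could be something like `lambda: RandomForestRegressor(n_estimators=500, n_jobs=-1).fit(X, y)` on your own `X` and `y`. For a fair comparison the ranger run should use matching settings (`num.trees`, `num.threads`) and the same OOB and categorical-variable handling.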
For anyone who wants to try this, skranger is a Python ranger implementation (in my experience it matches or beats sklearn)
I know, and have tested, that 'ranger' is the fastest R package. But after reading http://datascience.la/benchmarking-random-forest-implementations/ I have this question: are there benchmarks showing that ranger is also the fastest across both the R and Python worlds?