imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Is ranger faster than Python RandomForestClassifier? #248

Open andirey opened 6 years ago

andirey commented 6 years ago

I know, and have tested, that 'ranger' is the fastest R package. But after reading http://datascience.la/benchmarking-random-forest-implementations/ I have this question. Maybe some benchmarks show that ranger is also the fastest across both the R and Python worlds.

mayer79 commented 6 years ago

The sklearn implementation is quite fast, too. It was fully reimplemented in Cython (compiled C) not long ago. My own experiments with the diamonds data from ggplot2 showed it was even faster than ranger when no OOB scoring was done (a regression problem, not classification). With OOB, they were about equal. Of course, this depends not only on the data set tested but also on the system. It would be great to see an updated version of the benchmark in your link; it is from 2015.
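A minimal sketch of the kind of comparison described above: timing scikit-learn's RandomForestRegressor with and without OOB scoring. The synthetic data and sizes here are illustrative stand-ins, not the diamonds data from the original experiment.

```python
# Hedged sketch: measure the fit-time cost of OOB scoring in sklearn.
# Dataset, sizes, and tree count are illustrative assumptions.
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=5000, n_features=10, random_state=0)


def fit_time(oob: bool) -> float:
    """Fit a forest and return the elapsed wall-clock time in seconds."""
    rf = RandomForestRegressor(
        n_estimators=50,
        oob_score=oob,
        bootstrap=True,  # OOB scoring requires bootstrap sampling
        n_jobs=-1,
        random_state=0,
    )
    t0 = time.perf_counter()
    rf.fit(X, y)
    return time.perf_counter() - t0


t_no_oob = fit_time(False)
t_oob = fit_time(True)
print(f"without OOB: {t_no_oob:.2f}s, with OOB: {t_oob:.2f}s")
```

Scaling `n_samples` and `n_estimators` up, and running the equivalent ranger call in R, would give a rough side-by-side comparison on one's own hardware.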

samleegithub commented 6 years ago

Here is a link to updated benchmarks with source code you can modify to compare ranger: https://github.com/szilard/benchm-ml

samleegithub commented 6 years ago

Link to benchmarking code: https://github.com/szilard/benchm-ml/tree/master/z-other-tools

mayer79 commented 6 years ago

Thanks for the update @samleegithub. I could not find the word "ranger" on that page. Is it really part of that benchmark?

samleegithub commented 6 years ago

I didn't see any benchmarks for ranger specifically, but they do have code that you can modify to compare ranger versus other implementations.

pnavaro commented 4 years ago

For data with 1M observations and 100 trees on a MacBook Pro with 4 cores.

andirey commented 4 years ago

For data with 1M observations and 100 trees on a MacBook Pro with 4 cores:

  • ranger: 128 seconds
  • scikit-learn: 509 seconds

That's exactly what I also see: R's "ranger" is much faster than Python's.

talegari commented 4 years ago
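A scaled-down sketch of the scikit-learn side of the benchmark above, for anyone who wants to reproduce it. The synthetic data here is an assumption; the original run used 1M rows, so `n_samples` would need to be scaled up accordingly to approximate it.

```python
# Hedged sketch: time a 100-tree RandomForestClassifier fit on
# synthetic data. Sizes are far smaller than the 1M-row benchmark
# so this finishes quickly.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
t0 = time.perf_counter()
clf.fit(X, y)
elapsed = time.perf_counter() - t0

print(f"scikit-learn fit: {elapsed:.2f}s for {X.shape[0]} rows, 100 trees")
```

The matching R run would call `ranger(..., num.trees = 100, num.threads = 4)` on the same data to get a like-for-like timing.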

Another fact to keep in mind while benchmarking: ranger's respect.unordered.factors = "partition" option ('ignore' is effectively label encoding) handles categorical variables by evaluating all possible splits (2^(k-1) - 1 possibilities) for a k-level unordered factor. The trees, and therefore the random forest classifier/regressor, in sklearn 0.22 do not support categorical variables; they require some encoding first.
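As a sanity check on that split count: a k-level unordered factor admits 2^(k-1) - 1 distinct ways to split its levels into two non-empty groups. A brute-force enumeration (a sketch, not ranger's actual code) confirms the formula for small k:

```python
# Count unordered binary partitions of a set of factor levels by
# brute force, and check against the closed form 2**(k - 1) - 1.
from itertools import combinations


def n_binary_partitions(levels):
    """Number of splits of `levels` into two non-empty unordered groups."""
    seen = set()
    for r in range(1, len(levels)):
        for left in combinations(levels, r):
            right = tuple(sorted(set(levels) - set(left)))
            # frozenset makes {left, right} order-insensitive,
            # so each partition is counted once
            seen.add(frozenset([tuple(sorted(left)), right]))
    return len(seen)


for k in range(2, 7):
    count = n_binary_partitions(list(range(k)))
    assert count == 2 ** (k - 1) - 1
    print(f"k={k}: {count} possible splits")
```

This exponential growth is why exhaustive partitioning is only practical for factors with a modest number of levels.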

Devilmoon commented 3 years ago

Just wanted to throw my experience in here, simply because it differs heavily from yours: a data set of 39,668 observations and 166 features, on Windows 10 with a Ryzen 5 3600; scikit-learn run in a Jupyter notebook and ranger run from Ubuntu 20.04 through WSL. Both forests used all cores and 500 trees.

Ranger is a huge improvement over the original random forest implementation for R, but in my experience it appears to be immensely slower than scikit-learn's implementation.

alexis-intellegens commented 3 years ago

For anyone who wants to try this, skranger provides Python bindings for ranger (in my experience it matches or beats sklearn).