imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Honest ranger #547

Open ellenxtan opened 3 years ago

ellenxtan commented 3 years ago

I was wondering if there is an implementation of an honest ranger? I know there is an honest regression forest in the grf package, but their package does not have the support for ordinal/categorical variables that ranger does (i.e., respect.unordered.factors).

Thank you!

sheffe commented 3 years ago

There's a reasonably straightforward way to implement all kinds of ranger variations, including the honest random forest algorithm, by using custom sampling and the terminal node memberships.

The strategy I would use:

The full terminal node membership tables can get large for extremely large forests, but in my (never formalized!) experience, large forests show the smallest differences between honest and regular RFs in the first place. You can run checks over a subset of trees to see how much difference it makes in the outcome.
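The idea sketched above (grow trees on one sample, fill the leaves from another via terminal node memberships) can be illustrated outside of ranger. Below is a rough Python/scikit-learn sketch of the honesty mechanism, not ranger's API; all names and data are made up, and in ranger itself the leaf IDs would come from `predict(..., type = "terminalNodes")` instead of `apply()`:

```python
# Hypothetical sketch of an "honest" random forest via terminal node
# memberships. Honesty: one half of the data grows the tree structure,
# the other half (never seen by the split search) supplies the leaf means.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] + 0.1 * rng.normal(size=400)

# Split into a structure sample and an estimation sample.
X_str, X_est, y_str, y_est = train_test_split(X, y, random_state=0)

# Grow the forest on the structure sample only.
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_str, y_str)

# Terminal node membership of the estimation sample in every tree.
leaves_est = rf.apply(X_est)  # shape (n_est, n_trees)

def honest_predict(X_new):
    """Average, over trees, of estimation-sample leaf means."""
    leaves_new = rf.apply(X_new)  # shape (n_new, n_trees)
    preds = np.zeros(len(X_new))
    for t in range(rf.n_estimators):
        for i, leaf in enumerate(leaves_new[:, t]):
            mask = leaves_est[:, t] == leaf
            if mask.any():
                preds[i] += y_est[mask].mean()
            else:
                # Empty honest leaf: fall back to the tree's own prediction.
                preds[i] += rf.estimators_[t].predict(X_new[i:i + 1])[0]
    return preds / rf.n_estimators

print(honest_predict(X[:5]))
```

The terminal-node tables here are the memory hot spot mentioned above: they are `n_obs × n_trees`, so checking a subset of trees first is a cheap way to gauge how much honesty changes the predictions.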

('Honest Ranger' should definitely be called 'Lawful Good Ranger' ....)

ellenxtan commented 3 years ago

Thank you! Sorry, I am not familiar with the C++ that ranger is implemented in. I was wondering if this could be a future feature? Really appreciate it!

swager commented 3 years ago

Just in case it's relevant -- another way to get an honest random forest with factor variables is to first appropriately expand out the factor variables by hand (this paper discusses several options for doing so: https://arxiv.org/pdf/1908.09874.pdf), and then use the honest regression forest functionality in grf. This wouldn't require any extra coding in C++.
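One of the simplest expansions of a factor column "by hand" is one-hot encoding, after which every column is numeric and can be fed to grf. A small illustration in Python/pandas (the column names and data are made up; in R the analogue would be `model.matrix` or similar before calling grf):

```python
# Hypothetical example: expand a factor column into 0/1 indicator
# columns so a forest that only accepts numeric inputs can use it.
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east", "south"],
                   "y": [1.0, 2.0, 1.5, 2.2]})

# One indicator column per factor level.
X = pd.get_dummies(df[["region"]], prefix="region")
print(sorted(X.columns))  # region_east, region_north, region_south
```

Note that with a high-cardinality factor this blows the design matrix up to one column per level, which is part of why other encodings are discussed in the linked paper.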

ellenxtan commented 3 years ago

Yes, I actually tried that, but in my scenario I found that the honest RF in grf with the factor variable re-encoded performs considerably worse than plain ranger (probably because my factor variable has many categories, >20). That's why I am curious about a similar honest implementation in ranger. Anyway, thanks, Professor!

swager commented 3 years ago

Got it, thanks for sharing! Yes -- overall it seems like forests are quite sensitive to how one encodes and processes factor variables, and there's no one-size-fits-all solution. (Among the methods we tried in the pre-print, one-hot encoding was not very good at all. The "means encoding" seemed more promising in a number of applications.)
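For readers unfamiliar with the term, "means encoding" (also called target encoding) replaces each factor level with the mean outcome observed for that level, so a 20-level factor becomes a single numeric column. A minimal Python/pandas sketch, with made-up data; in practice the level means should be computed on a held-out fold or with shrinkage to avoid overfitting:

```python
# Hypothetical means-encoding example: map each level of a factor
# to the average outcome y observed for that level.
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "north", "east"],
                   "y": [1.0, 3.0, 2.0, 4.0]})

# Mean outcome per level, then substitute it back into the column.
level_means = df.groupby("region")["y"].mean()
df["region_enc"] = df["region"].map(level_means)
print(df["region_enc"].tolist())  # [1.5, 3.0, 1.5, 4.0]
```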