Open ellenxtan opened 3 years ago
There's a reasonably straightforward way to implement all kinds of ranger variations, including the honest random forest algorithm, by using custom sampling and the terminal node memberships.
The strategy I would use:
The full terminal node membership tables can get large for extremely large forests, but in my (never formalized!) experience, large forests show the smallest differences between honest and regular RFs in the first place. You can run checks over a subset of trees to see how much difference it makes in the outcome.
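As a rough illustration of the terminal-node idea, here is a minimal sketch in plain Python (not ranger code, and not necessarily the exact steps intended above). It assumes you already have two terminal node membership tables, one for a held-out estimation sample and one for the points to predict, with one column per tree. In ranger these could presumably be obtained with `predict(..., type = "terminalNodes")`, with the per-tree growing samples controlled through the `inbag` argument; check both against the ranger documentation. The names `honest_predict`, `est_nodes`, `query_nodes`, `in_estimation` and `trees` are hypothetical. The `trees` argument is there so the check over a subset of trees can be run cheaply.

```python
import math
from collections import defaultdict


def honest_predict(est_nodes, est_y, query_nodes, in_estimation=None, trees=None):
    """Re-estimate leaf means on an estimation sample and predict with them.

    est_nodes[i][t]     : terminal node ID of estimation observation i in tree t
    est_y[i]            : response of estimation observation i
    query_nodes[j][t]   : terminal node ID of query observation j in tree t
    in_estimation[i][t] : optional flag, True if observation i may be used to
                          estimate tree t (i.e. it was NOT used to grow tree t)
    trees               : optional list of tree indices to restrict to
    """
    n_trees = len(est_nodes[0])
    tree_ids = list(range(n_trees)) if trees is None else list(trees)

    # For every tree, accumulate [sum of y, count] per terminal node.
    leaf_stats = {t: defaultdict(lambda: [0.0, 0]) for t in tree_ids}
    for i, (nodes, y) in enumerate(zip(est_nodes, est_y)):
        for t in tree_ids:
            if in_estimation is not None and not in_estimation[i][t]:
                continue  # observation i helped grow tree t, so skip it
            stats = leaf_stats[t][nodes[t]]
            stats[0] += y
            stats[1] += 1

    predictions = []
    for nodes in query_nodes:
        tree_means = []
        for t in tree_ids:
            stats = leaf_stats[t].get(nodes[t])
            if stats is not None:  # leaf contains estimation observations
                tree_means.append(stats[0] / stats[1])
        # Average over trees whose leaf is non-empty; NaN if none are.
        predictions.append(
            sum(tree_means) / len(tree_means) if tree_means else math.nan
        )
    return predictions


# Toy data: 3 estimation observations, 3 query points, 2 trees.
est_nodes = [[1, 4], [1, 5], [2, 4]]
est_y = [1.0, 3.0, 10.0]
query_nodes = [[1, 4], [2, 5], [3, 6]]

preds = honest_predict(est_nodes, est_y, query_nodes)
preds_first_tree = honest_predict(est_nodes, est_y, query_nodes, trees=[0])
```

Comparing `preds` against the forest's ordinary predictions, or against `preds_first_tree`-style runs on a handful of trees, gives a quick sense of how much honesty changes the outcome before building the full tables.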
('Honest Ranger' should definitely be called 'Lawful Good Ranger' ....)
Thank you! Sorry, I am not familiar with the C++ that ranger is implemented in. I was wondering if this could be a future feature? Really appreciate it!
Just in case it's relevant -- another way to get an honest random forest with factor variables is to first appropriately expand out the factor variables by hand (this paper discusses several options for doing so: https://arxiv.org/pdf/1908.09874.pdf), and then use the honest regression forest functionality in grf. This wouldn't require any extra coding in C++.
Yes, I actually tried that, but in my scenario I found that the honest RF in grf with the factor variable re-encoded performs far worse than the regular ranger (this could basically be due to the fact that my factor variable has many categories (>20)). That's why I am curious about a similar implementation in ranger for the honest version. Anyway, thanks, Professor!
Got it, thanks for sharing! Yes -- overall it seems like forests are quite sensitive to how one encodes and processes factor variables, and there's no one-size-fits-all solution. (Among the methods we tried in the pre-print, one-hot encoding was not very good at all. The "means encoding" seemed more promising in a number of applications.)
I was wondering if there is an implementation of an honest ranger? I know there is an honest regression forest in the grf package, but that package does not have the support for ordinal/categorical variables that exists in ranger (i.e., respect.unordered.factors).
Thank you!