Open A-Pai opened 2 years ago
respect.unordered.factors = "order" forces string-valued variables (categorical variables) to be treated as "ordered". This allows ranger to skip the expensive re-encoding of such variables as contrasts, dummies, or indicators. ranger achieves this by using only ordered cuts in its underlying trees, which is equivalent to re-encoding the categorical variable as its numeric order codes. Basically, think of it as applying as.numeric(factor(x, ordered = TRUE)) to every unordered categorical variable x in your data frame. These variables are thus essentially treated as numeric, and ranger appears to run faster over fairly complicated variables.
Although we would never want to treat unordered categorical variables as ordered for linear models, tree-based models are typically undeterred by this.
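To make the integer-code analogy above concrete, here is a minimal base R sketch. The vector x and its values are made up for illustration, and the snippet only shows what the factor-to-number conversion does; it is not a description of ranger's internals.

```r
# Hypothetical nominal variable; the values are made up for illustration.
x <- c("red", "green", "blue", "green", "red")

# factor() sorts the levels alphabetically by default: blue, green, red.
f <- factor(x, ordered = TRUE)
levels(f)

# The integer codes are what a tree would split on if the variable
# were simply treated as ordered by its level order.
codes <- as.numeric(f)
codes
```

Note that as.factor() has no ordered argument, so the sketch calls factor() directly.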
Thanks for your reply. I'd like to know more: what is the difference between "ignore" and "order"?
Hi @A-Pai. Finding the optimal split in a decision tree for a categorical variable with J categories would require searching through 2^(J−1) − 1 potential splits. Fortunately, for binary classification and regression (at least when using the standard split rules, like Gini, entropy, or SSE), a shortcut exists that reduces the search to J − 1 possibilities (a massive reduction for large J). The shortcut essentially requires mean/target encoding the categorical variable in question prior to each split in each tree, which is what's described here for respect.unordered.factors = "order". I can't, however, pretend to understand what is meant by "ignore" here. This question is probably more appropriate for the ranger package issues tab: https://github.com/imbs-hl/ranger/issues.
@A-Pai I'll add that for binary classification and regression, respect.unordered.factors = "order" is a good choice, as it corresponds to a quicker but still exact search through all possible splits involving unordered (i.e., nominal) factors. For multiclass classification (and I believe censored outcomes as well), no such shortcut exists, so setting it to "ignore" is likely a reasonable compromise between split quality and computational effort.
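The shortcut described in the replies above can be checked on a toy regression problem. The following is a minimal base R sketch under stated assumptions: the data, the helper split_sse(), and all variable names are made up for illustration, and none of this is ranger's own code. It compares a brute-force search over all 2^(J−1) − 1 two-way partitions with the J − 1 cut points obtained after ordering the levels by mean response, using SSE as the split criterion.

```r
# Toy regression data (made up): one nominal predictor with J = 4 levels.
g <- factor(c("a", "a", "b", "b", "c", "c", "d", "d"))
y <- c(1.0, 1.2, 5.0, 5.5, 0.9, 1.1, 3.0, 3.2)

# Hypothetical helper: total SSE after sending the levels in `left`
# to one child node and all remaining levels to the other.
split_sse <- function(left) {
  in_left <- g %in% left
  sum((y[in_left] - mean(y[in_left]))^2) +
    sum((y[!in_left] - mean(y[!in_left]))^2)
}

lev <- levels(g)
J <- length(lev)

# Brute force: every two-way partition, 2^(J - 1) - 1 in total.
# The first level always stays on the right, so each partition is
# counted exactly once.
masks <- seq_len(2^(J - 1) - 1)
brute <- sapply(masks, function(m) {
  bits <- (m %/% 2^(seq_len(J - 1) - 1)) %% 2 == 1
  split_sse(lev[-1][bits])
})

# Shortcut: order the levels by mean response, then try only the
# J - 1 cut points along that ordering.
means <- tapply(y, g, mean)
ord <- names(means)[order(means)]
short <- sapply(seq_len(J - 1), function(k) split_sse(ord[seq_len(k)]))

# Number of candidate splits examined by each approach.
c(n_brute = length(brute), n_short = length(short))

# Both searches should reach the same minimum SSE.
all.equal(min(brute), min(short))
```

With J = 4 the brute-force search evaluates 7 candidate splits and the ordered search evaluates 3; the gap grows exponentially as J increases.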
https://bradleyboehmke.github.io/HOML/random-forest.html What is the meaning of respect.unordered.factors = "order"?