nalzok / tree.interpreter

Decision tree interpreter for randomForest/ranger as described in
https://arxiv.org/abs/1906.10845

Categorical predictors are not handled correctly in randomForest #3

Open · nalzok opened this issue 3 years ago

nalzok commented 3 years ago

This issue was reported by Professor Markus Loecher. Here is the example he provided: tidyRF_Test.zip

I think the issue lies in

https://github.com/nalzok/tree.interpreter/blob/0a04a7a790aa128141fc3592ac70da20d90633d4/src/tidyRF.cpp#L298-L299

Let me begin by explaining why tree.interpreter behaves this way. The randomForest package provides a helper function named getTree, whose output includes a column named "split point". The Details section of its documentation states that "For numerical predictors, data with values of the variable less than or equal to the splitting point go to the left daughter node".

When dealing with randomForest, the current implementation of tree.interpreter reads the same underlying data as getTree (specifically, it copies RF$forest$xbestsplit[,k] into tidy.RF$split.values[[k]]) but incorrectly treats every predictor as numeric. A node with a "split point" of 3 therefore sends every sample whose pClass value is less than or equal to 3 to its left daughter node. This effectively puts all samples in the left daughter, which is why node 1 has exactly the same prediction as its left daughter node 3, whereas node 4 receives no samples and thus gets a prediction of 0. The sketch below illustrates the difference between the two interpretations.
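Here is a minimal sketch (not code from the package) of the two interpretations of a stored split value. It assumes a categorical predictor pClass whose factor levels have integer codes 1, 2, 3 and a stored split point of 3; base R's bitwShiftL/bitwAnd are used to read the binary expansion:

```r
split.value <- 3              # stored "split point" for a categorical predictor
pClass <- c(1, 2, 3)          # integer codes of the factor levels (assumed)

## Numeric interpretation (what tree.interpreter currently does):
## every sample satisfies pClass <= 3, so all of them go left.
goes.left.numeric <- pClass <= split.value

## Categorical interpretation (what randomForest actually means):
## the binary expansion of 3 is 011, so levels 1 and 2 go left, level 3 goes right.
goes.left.categorical <- bitwAnd(bitwShiftL(1L, pClass - 1L), split.value) > 0

goes.left.numeric        # TRUE TRUE TRUE
goes.left.categorical    # TRUE TRUE FALSE
```

Under the numeric interpretation every sample goes left, which matches the observed behavior of nodes 3 and 4 in the attached example.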

I will try implementing the correct behavior for categorical predictors, i.e. "For categorical predictors, the splitting point is represented by an integer, whose binary expansion gives the identities of the categories that goes to left or right." I am not sure how long that will take, though, since development of tree.interpreter has been paused for a while. As a temporary workaround, you can one-hot encode the categorical predictors to convert them into numerical ones, as sketched below.
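A rough sketch of that workaround, assuming a data frame df with a factor column pClass and a response column y (the names are placeholders, not taken from the attached example):

```r
library(randomForest)

## Expand the factor into 0/1 indicator columns so that every predictor
## randomForest sees is numeric; all split points are then plain thresholds.
onehot <- model.matrix(~ pClass - 1, data = df)
df.numeric <- cbind(df[setdiff(names(df), "pClass")], onehot)

rf <- randomForest(y ~ ., data = df.numeric)
```

Since the fitted forest then contains only numeric splits, tree.interpreter should handle it correctly.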