Closed cttnguyen closed 3 years ago
Additionally, another question I have is how random VIMP is computed. According to your documentation,
when an OOB case is dropped down a tree, the case goes left or right randomly whenever a split is encountered for the target variable.
However, in Ishwaran's paper, "Variable importance in binary regression trees and forests," he describes the following for random splitting.
To assign a terminal value to a case $x$, drop $x$ down $T$ and follow its path until either a terminal node is reached or a node with a split depending upon $x_v$ is reached. In the latter case choose the right or left daughter of the node with equal probability. Now continue down the tree, randomly choosing right and left daughter nodes whenever a split is encountered (whether the split depends upon $x_v$ or not) until reaching a terminal nod
Do you mind clarifying how the "random" VIMP is calculated?
Hi @kogalur, I was just wondering if there was any update on this? Thanks in advance.
Regarding "Additionally, another question I have is how random VIMP is computed. According to your documentation". Those two definitions are consistent with each other and say the same thing.
Regarding your question about subset vimp. It appears you are not using the "subset" option correctly. You set it to ".x" which is not clear. From the manual, subset should be defined as follows:
subset: Vector indicating which rows of the grow data to restrict VIMP calculations to; i.e. this option yields VIMP which is restricted to a specific subset of the data. Note that the vector should correspond to the rows of ‘object$xvar’ and not the original data passed in the grow call. All rows used if not specified.
@ishwaran Oh I must have misunderstood. The package documentation implies, to me, that only splits on the target variable are perturbed whereas the manuscript implies that splits on the target variable and any child split from a target variable split are all perturbed. Do you mind clarifying which of my interpretations is correct, my package documentation or my manuscript interpretation?
For subset vimp, do you mind to clarify how I would extract the VI for the variable Solar.R
for each individual observation? If it helps, my code snippet above to calculate single_vi
is equivalent to the following.
single_vi <- data.frame(Solar.R = rep(0, nrow(df)))
for(i in 1:nrow(df)){
single_vi$Solar.R[i] = vimp(rf, 'Solar.R', subset = i, importance = 'random')$importance
}
Sorry I didn't realize you were referencing the 2007 paper. In that paper, I simplified the definition of VIMP in order to study it theoretically, but that is not the version of randomized VIMP used in the code. The way we implement this in the code is the first definition which is the description given in the package documentation:
"When an OOB case is dropped down a tree, the case goes left or right randomly whenever a split is encountered for the target variable"
Regarding your question about VIMP and subsetting. I think I understand what you want now. It seems that you want the VIMP value for each individual case. However the subset option will not provide you with that value. Currently we do not output that information. The only thing I could suggest that gets you closer to your goal, is to use the predict.rfsrc command with the get.tree option. This will allow you to extract VIMP for each tree (but not each case). See the help file for more details.
Thank you!!
I was able to implement your suggestion below in combination with the membership
option from rfsrc
to extract trees for which an observation is OOB, but your previous message implies that this should have only gotten me "closer" to my goal. Is single_VI
not the observation-level OOB VI?
library(tidyverse)
library(randomForestSRC)
set.seed(2020)
df <- na.omit(airquality)
rf <- rfsrc(Ozone ~ ., df, importance = 'random', seed = 2020, membership = T)
single_VI <- map_dfr(1:nrow(df),
~ predict(object = rf,
newdata = slice(df, .x),
get.tree = which(rf$inbag[.x, ] == 0),
importance = 'random')$importance
)
We've taken a deep look at the code and realize there's an issue with the way subset VIMP is being calculated in the vimp() function. We are working to fix this and I will keep you posted on this (hopefully soon). Thanks again for your comments.
Please see the latest version 2.11 of randomForestSRC on CRAN which now includes case-specific VIMP (what you are calling observation-level VI statistics).
I would like to obtain the observation-level VI statistics to see how much the VI varies from observation to observation. These questions arose as I attempted to extract such data from the
rfsrc
package, which, I really appreciate your maintenance of. Please let me know if you believe there is a better/easier way to extract this information.I noticed that the vimp measures on subsets of the data (particularly subsets of size 1) are orders of magnitude smaller than vimp measures on the entire dataset. For example,
single_vi
ranges from -5.0 to 13.2 whilefull_vi
forSolar.R
is 103.4. I tried to make sense of the 2007 Ishwaran paper, "Variable importance in binary regression trees and forests," but I still was not able to figure out why the observation-level VI would be so drastically different from the overall. I suppose I expected for the average observation-level VI to be close to the VI from the overall forest.Additionally, and this may be related to #67, the
vimp
function can return different results fromfull_vi
, which seems quite confusing to me.From the related issue, I understand that the above line of code will return different results each time I run it, even if I set the seed. However, I am not sure why this should return different results from the forest output VI.