kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0
115 stars 18 forks source link

Question on subset vimp #77

Closed cttnguyen closed 3 years ago

cttnguyen commented 3 years ago

I would like to obtain the observation-level VI statistics to see how much the VI varies from observation to observation. These questions arose as I attempted to extract such data from the rfsrc package, which, I really appreciate your maintenance of. Please let me know if you believe there is a better/easier way to extract this information.

I noticed that the vimp measures on subsets of the data (particularly subsets of size 1) are orders of magnitude smaller than vimp measures on the entire dataset. For example,

library(tidyverse)
library(randomForestSRC)
set.seed(2020)
df <- na.omit(airquality)
rf <- rfsrc(Ozone ~ ., df, importance = 'random', seed = 2020)
full_vi <- rf$importance
single_vi <- map_dfr(1:nrow(df), ~ vimp(rf, 'Solar.R', subset = .x, importance = 'random')$importance)

single_vi ranges from -5.0 to 13.2 while full_vi for Solar.R is 103.4. I tried to make sense of the 2007 Ishwaran paper, "Variable importance in binary regression trees and forests," but I still was not able to figure out why the observation-level VI would be so drastically different from the overall. I suppose I expected for the average observation-level VI to be close to the VI from the overall forest.

Additionally, and this may be related to #67, the vimp function can return different results from full_vi, which seems quite confusing to me.

vimp(rf, importance = 'random')$importance

From the related issue, I understand that the above line of code will return different results each time I run it, even if I set the seed. However, I am not sure why this should return different results from the forest output VI.

cttnguyen commented 3 years ago

Additionally, another question I have is how random VIMP is computed. According to your documentation,

when an OOB case is dropped down a tree, the case goes left or right randomly whenever a split is encountered for the target variable.

However, in Ishwaran's paper, "Variable importance in binary regression trees and forests," he describes the following for random splitting.

To assign a terminal value to a case $x$, drop $x$ down $T$ and follow its path until either a terminal node is reached or a node with a split depending upon $x_v$ is reached. In the latter case choose the right or left daughter of the node with equal probability. Now continue down the tree, randomly choosing right and left daughter nodes whenever a split is encountered (whether the split depends upon $x_v$ or not) until reaching a terminal nod

Do you mind clarifying how the "random" VIMP is calculated?

cttnguyen commented 3 years ago

Hi @kogalur, I was just wondering if there was any update on this? Thanks in advance.

ishwaran commented 3 years ago

Regarding "Additionally, another question I have is how random VIMP is computed. According to your documentation". Those two definitions are consistent with each other and say the same thing.

Regarding your question about subset vimp. It appears you are not using the "subset" option correctly. You set it to ".x" which is not clear. From the manual, subset should be defined as follows:

subset: Vector indicating which rows of the grow data to restrict VIMP calculations to; i.e. this option yields VIMP which is restricted to a specific subset of the data. Note that the vector should correspond to the rows of ‘object$xvar’ and not the original data passed in the grow call. All rows used if not specified.

cttnguyen commented 3 years ago

@ishwaran Oh I must have misunderstood. The package documentation implies, to me, that only splits on the target variable are perturbed whereas the manuscript implies that splits on the target variable and any child split from a target variable split are all perturbed. Do you mind clarifying which of my interpretations is correct, my package documentation or my manuscript interpretation?

For subset vimp, do you mind to clarify how I would extract the VI for the variable Solar.R for each individual observation? If it helps, my code snippet above to calculate single_vi is equivalent to the following.

single_vi <- data.frame(Solar.R = rep(0, nrow(df)))
for(i in 1:nrow(df)){
   single_vi$Solar.R[i] = vimp(rf, 'Solar.R', subset = i, importance = 'random')$importance
}
ishwaran commented 3 years ago

Sorry I didn't realize you were referencing the 2007 paper. In that paper, I simplified the definition of VIMP in order to study it theoretically, but that is not the version of randomized VIMP used in the code. The way we implement this in the code is the first definition which is the description given in the package documentation:

"When an OOB case is dropped down a tree, the case goes left or right randomly whenever a split is encountered for the target variable"

Regarding your question about VIMP and subsetting. I think I understand what you want now. It seems that you want the VIMP value for each individual case. However the subset option will not provide you with that value. Currently we do not output that information. The only thing I could suggest that gets you closer to your goal, is to use the predict.rfsrc command with the get.tree option. This will allow you to extract VIMP for each tree (but not each case). See the help file for more details.

cttnguyen commented 3 years ago

Thank you!!

I was able to implement your suggestion below in combination with the membership option from rfsrc to extract trees for which an observation is OOB, but your previous message implies that this should have only gotten me "closer" to my goal. Is single_VInot the observation-level OOB VI?

library(tidyverse)
library(randomForestSRC)
set.seed(2020)
df <- na.omit(airquality)
rf <- rfsrc(Ozone ~ ., df, importance = 'random', seed = 2020, membership = T)
single_VI <- map_dfr(1:nrow(df),
                                   ~ predict(object = rf, 
                                                   newdata = slice(df, .x), 
                                                   get.tree = which(rf$inbag[.x, ] == 0), 
                                                   importance = 'random')$importance
                      )
ishwaran commented 3 years ago

We've taken a deep look at the code and realize there's an issue with the way subset VIMP is being calculated in the vimp() function. We are working to fix this and I will keep you posted on this (hopefully soon). Thanks again for your comments.

ishwaran commented 3 years ago

Please see the latest version 2.11 of randomForestSRC on CRAN which now includes case-specific VIMP (what you are calling observation-level VI statistics).