sorhawell / forestFloor

R package to visualize mapping structures of random forests with feature contributions
http://forestFloor.dk
GNU General Public License v2.0

How to use forestFloor() on randomForest output from caret package? #22

Closed brianstock closed 7 years ago

brianstock commented 7 years ago

Hello,

First, thank you for an awesome package! Everything was going great until I tried to use your forestFloor function on a random forest output from the caret package. I need to use caret instead of randomForest for my dataset because I have severely imbalanced classes, so I am using the SMOTE sampling strategy. I fixed a couple of things: I passed keep.inbag=TRUE and keep.forest=TRUE into caret::train, and then found the randomForest object hidden in the train class object, $finalModel.

I still get the error: Error in eval(substitute(expr), envir, enclos) : index out of bounds

Please see the following minimal working example, taken from your Pima Indians diabetes post (http://stats.stackexchange.com/questions/183852/can-i-see-the-contribution-way-of-an-input-variable-in-random-forest-model/184000#184000). Thank you in advance!

library(mlbench)
library(randomForest)
library(forestFloor)
library(caret)

data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes
X = X[,!names(X)=="diabetes"]
rf.randomForest  = randomForest(X,y,sampsize=25,ntree=5000,mtry=4,
 keep.inbag = T,keep.forest = T)

## Use forestFloor on randomForest output, works great
ff = forestFloor(rf.randomForest,X,binary_reg = T,calc_np=T)
Col = fcol(ff,cols=1,outlier.lim = 2.5)
plot(ff,col=Col,plot_GOF = T)

## Now fit random forest using SMOTE from caret package
ctrl <- trainControl(method = "cv", number=5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote")
rf.SMOTE <- train(x=X,y=y,
                      method = "rf",
                      tuneGrid = data.frame(mtry = 3),
                      metric = "ROC",
                      trControl = ctrl,
                      keep.inbag=TRUE,
                      keep.forest=TRUE)
rf.caret <- rf.SMOTE$finalModel

## Use forestFloor on caret output, throws error
ff = forestFloor(rf.caret,X,binary_reg = T,calc_np=T)

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

other attached packages:
[1] DMwR_0.4.1          caret_6.0-70        ggplot2_2.1.0      
[4] lattice_0.20-34     forestFloor_1.9.5   randomForest_4.6-12
[7] mlbench_2.1-1

sorhawell commented 7 years ago

I see that the inbag matrix in the SMOTE caret version is transposed and the training set has been resampled. I can work on a simple adaptor; basically, the inbag matrix must match the dimensionality of the matrices in the forest list. I'm on vacation right now, but will return home at the end of January. Maybe I'll do it on vacation, on a rainy day.

Thx for feedback :)

Check that rf.caret is not structured like rf.randomForest:

length(predict(rf.caret))
length(predict(rf.randomForest))
sapply(rf.caret,dim)
sapply(rf.randomForest,dim)
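
The comparison above can also be written as an explicit check. This is only a minimal sketch: it assumes the fitted object stores `inbag` with one row per training observation and one column per tree, as randomForest does with keep.inbag=TRUE, and `check_inbag` is a hypothetical helper, not part of forestFloor or caret.

```r
# Hypothetical helper (not part of any package): report whether a fitted
# forest's inbag matrix lines up with the data you plan to pass to forestFloor().
check_inbag <- function(rf, X) {
  if (is.null(rf$inbag)) {
    return("no inbag matrix: refit with keep.inbag = TRUE")
  }
  if (nrow(rf$inbag) == nrow(X)) {
    return("ok: one inbag row per observation in X")
  }
  if (ncol(rf$inbag) == nrow(X)) {
    return("inbag looks transposed relative to X")
  }
  "inbag does not match X: the forest was trained on different rows"
}

# Toy stand-in for a fitted forest: 6 observations, 3 trees
toy_rf <- list(inbag = matrix(1L, nrow = 6, ncol = 3))
toy_X  <- data.frame(a = 1:6)
check_inbag(toy_rf, toy_X)
```

On real objects the call would be `check_inbag(rf.caret, X)`; the last message is the one to expect when the training rows were resampled before fitting.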

brianstock commented 7 years ago

Ah, thanks for the quick reply and the hint. Enjoy your vacation, I'll see if I can make it work on my own...

sorhawell commented 7 years ago

So the immediate problem is the SMOTE resampling of the training data. To estimate feature contributions, you need to pass forestFloor the same data set (X) that was used for training. SMOTE generates another training set, and I'm not sure whether that is saved anywhere. Feature contribution calculations with forestFloor rely on out-of-bag sampling, so the inbag matrix must match the provided training set. If you disable SMOTE, as in the code below, everything works fine. If you have to downsample, maybe just use the 'strata' parameter within the randomForest model, or run SMOTE first outside caret to create a fixed resampled data set. See also the Xtest parameter of forestFloor() for computing non-out-of-bag feature contributions.

all the best

library(mlbench)
library(randomForest)
library(forestFloor)
library(caret)

data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes
X = X[,!names(X)=="diabetes"]
rf.randomForest  = randomForest(X,y,sampsize=25,ntree=500,mtry=4,
                                keep.inbag = T,keep.forest = T)

## Use forestFloor on randomForest output, works great
ff = forestFloor(rf.randomForest,X,binary_reg = T,calc_np=T)
Col = fcol(ff,cols=1,outlier.lim = 2.5)
plot(ff,col=Col,plot_GOF = T)

## Now fit random forest with caret, SMOTE disabled (sampling = NULL)
ctrl <- trainControl(method = "cv", number=5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = NULL)
rf.SMOTE <- train(x=X,y=y,
                  method = "rf",
                  tuneGrid = data.frame(mtry = 3),
                  metric = "ROC",
                  trControl = ctrl,
                  keep.inbag=TRUE,
                  keep.forest=TRUE)

rf.caret = rf.SMOTE$finalModel

sapply(rf.caret,dim)
sapply(rf.randomForest,dim)

## Use forestFloor on caret output; no error once SMOTE is disabled
ff = forestFloor(rf.caret,X,binary_reg = T,calc_np=T)
plot(ff,col=fcol(ff,1))
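
The 'strata' suggestion above could look roughly like the following. This is a sketch, not a tested recipe: it assumes balanced per-tree sampling is an acceptable substitute for SMOTE, and the choice of `sampsize` (the minority class count for both classes) is an illustrative assumption. Because randomForest does the resampling internally, the inbag matrix still refers to the rows of the original X.

```r
library(mlbench)
library(randomForest)
library(forestFloor)

data(PimaIndiansDiabetes)
y <- PimaIndiansDiabetes$diabetes
X <- PimaIndiansDiabetes[, names(PimaIndiansDiabetes) != "diabetes"]

# Draw the same number of cases from each class for every tree.
# n_min (size of the smaller class) is an assumed, illustrative choice.
n_min <- min(table(y))
rf.strata <- randomForest(X, y, ntree = 500, mtry = 3,
                          strata = y, sampsize = c(n_min, n_min),
                          keep.inbag = TRUE, keep.forest = TRUE)

# Same forestFloor call as above, on the original (not resampled) X
ff <- forestFloor(rf.strata, X, binary_reg = TRUE, calc_np = TRUE)
plot(ff, col = fcol(ff, 1))
```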


brianstock commented 7 years ago

Hi Soren,

Just wanted to let you know I figured this out and you can close this issue. Thanks for your time and a great package.

It's possible to use the SMOTE function in the DMwR package instead of using SMOTE within the caret package. DMwR::SMOTE returns the re-sampled training data that you then can put into randomForest and then forestFloor.

library(randomForest)
library(DMwR)
library(forestFloor)
# dat = data frame holding the features and the response
# covar = covariate/feature column names (character vector)
# target = response column name (character)
X <- cbind(dat[,covar], factor(dat[,target]))
names(X) <- c(covar, target)
bin.formula <- formula(paste0(target," ~ ",paste0(covar,collapse=" + ")))
X.SMOTE <- SMOTE(bin.formula, data=X, k=5)
rf.SMOTE <- randomForest(x=X.SMOTE[,covar], y=X.SMOTE[,target], keep.inbag=T, ...)
ff.bin = forestFloor(rf.fit = rf.SMOTE, X = X.SMOTE[,covar])

nick-s89 commented 5 years ago

I know this is fixed (or at least worked around) from a technical standpoint and the issue is closed, but from a more theoretical standpoint, I wonder how over/undersampling with SMOTE (or more generally) affects the interpretation of the feature contributions and the shape of the observed relationship. Is some further fix required to rescale the feature contributions based on the original odds observed in the data?
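
For reference, the standard correction for predicted probabilities (not for feature contributions) rescales the odds by the ratio of the original to the resampled class odds. The sketch below shows only that prediction-level step. It assumes resampling changed the class proportions but not the feature distribution within each class, which holds only approximately for SMOTE since it synthesizes new minority points. `correct_prior` is a hypothetical helper, and how such a shift should be apportioned across individual feature contributions is left open here.

```r
# Hypothetical helper: map probabilities from a model trained on resampled
# data back to the original class balance by rescaling the odds.
# p         : predicted probability of the positive class (resampled scale)
# prev_orig : positive-class fraction in the original data
# prev_res  : positive-class fraction in the resampled training data
correct_prior <- function(p, prev_orig, prev_res) {
  odds_ratio <- (prev_orig / (1 - prev_orig)) / (prev_res / (1 - prev_res))
  odds <- p / (1 - p) * odds_ratio
  odds / (1 + odds)
}

# A 50/50 prediction from a model trained on balanced data,
# when the positive class is 10% of the original data
correct_prior(0.5, prev_orig = 0.1, prev_res = 0.5)
```

Because the correction acts on the odds, it is not a constant shift on the probability scale, so it does not translate into simply adding a fixed offset to each feature contribution.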