Closed: brianstock closed this issue 7 years ago
I see that the caret/SMOTE version of the inbag matrix is transposed and that the training set has been resampled. I can work on a simple adaptor; basically, the inbag matrix must match the dimensionality of the matrices in the forest list. I'm on vacation right now, but will return home at the end of January. Maybe I'll do it on vacation, on a rainy day.
Thanks for the feedback :)
Check whether rf.caret is structured differently from rf.randomForest:
```r
length(predict(rf.caret))
length(predict(rf.randomForest))
sapply(rf.caret, dim)
sapply(rf.randomForest, dim)
```
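For concreteness, here is a minimal sketch of the kind of dimensionality check such an adaptor would rely on. It assumes both models were fit with keep.inbag = TRUE and that rf.caret$inbag is transposed and row-resampled as described above; the transpose is easy to undo, the SMOTE resampling is not:

```r
## Minimal sketch (assumption: both fits used keep.inbag = TRUE)
dim(rf.randomForest$inbag)   # expected: nrow(X) rows x ntree columns
dim(rf.caret$inbag)          # reportedly transposed / resampled under SMOTE

## A naive adaptor could transpose the inbag matrix back ...
inbag.fixed <- t(rf.caret$inbag)
## ... but its rows must still correspond one-to-one to the rows of the X
## passed to forestFloor, which SMOTE resampling breaks.
nrow(inbag.fixed) == nrow(X)
```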
Ah, thanks for the quick reply and the hint. Enjoy your vacation, I'll see if I can make it work on my own...
So the immediate problem is the 'smote' resampling of the training data. To estimate feature contributions you need to pass the same data set (X) to forestFloor as was used for training. SMOTE generates a different training set, and I am not sure that it is saved anywhere. The feature contribution calculations in forestFloor rely on out-of-bag sampling, and the inbag matrix must match the provided training set. If you disable SMOTE, as in the code below, everything works fine. If you have to down-sample, maybe just use the 'strata' parameter within the randomForest model, or run SMOTE once outside caret to create a fixed resampled data set (see the sketch after the example below). See also the Xtest parameter of forestFloor for computing non-out-of-bag feature contributions.
all the best
```r
library(mlbench)
library(randomForest)
library(forestFloor)
library(caret)
data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes
X = X[,!names(X)=="diabetes"]
rf.randomForest = randomForest(X,y,sampsize=25,ntree=500,mtry=4,
keep.inbag = T,keep.forest = T)
## Use forestFloor on randomForest output, works great
ff = forestFloor(rf.randomForest,X,binary_reg = T,calc_np=T)
Col = fcol(ff,cols=1,outlier.lim = 2.5)
plot(ff,col=Col,plot_GOF = T)
## Now fit the random forest via caret (sampling = NULL disables SMOTE;
## set sampling = "smote" in trainControl to reproduce the error)
ctrl <- trainControl(method = "cv", number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
sampling = NULL)
rf.SMOTE <- train(x=X,y=y,
method = "rf",
tuneGrid = data.frame(mtry = 3),
metric = "ROC",
trControl = ctrl,
keep.inbag=TRUE,
keep.forest=TRUE)
rf.caret = rf.SMOTE$finalModel
sapply(rf.caret,dim)
sapply(rf.randomForest,dim)
## Use forestFloor on the caret finalModel: works with sampling = NULL,
## but throws an error when SMOTE resampling is enabled
ff = forestFloor(rf.caret,X,binary_reg = T,calc_np=T)
plot(ff,col=fcol(ff,1))
```
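And a minimal sketch of the strata-based down-sampling mentioned above, reusing X and y from the example. The equal per-class sampsize and the reuse of the first 50 training rows as a stand-in test set for Xtest are illustrative assumptions, not part of the original example:

```r
## Class-balanced down-sampling with randomForest's own strata/sampsize,
## so X is unchanged and the inbag matrix still matches the training set.
n.min = min(table(y))                       # size of the minority class
rf.strata = randomForest(X, y,
                         strata = y,
                         sampsize = c(n.min, n.min),  # equal draws per class
                         ntree = 500,
                         keep.inbag = TRUE, keep.forest = TRUE)
ff.strata = forestFloor(rf.strata, X, binary_reg = TRUE)
plot(ff.strata, col = fcol(ff.strata, 1))

## Feature contributions for new observations via the Xtest argument
## (here the first 50 training rows stand in for a test set).
ff.test = forestFloor(rf.strata, X, Xtest = X[1:50, ], binary_reg = TRUE)
```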
Hi Soren,
Just wanted to let you know I figured this out and you can close this issue. Thanks for your time and a great package.
It's possible to use the `SMOTE` function in the `DMwR` package instead of using SMOTE within the `caret` package. `DMwR::SMOTE` returns the re-sampled training data, which you can then pass into `randomForest` and then `forestFloor`.
```r
library(randomForest)
library(DMwR)
library(forestFloor)

# covar  = names of covariate/feature columns (character vector)
# target = name of the response column (character)
X <- cbind(dat[, covar], factor(dat[, target]))
names(X) <- c(covar, target)

# Run SMOTE once, outside caret, to get a fixed re-sampled training set
bin.formula <- formula(paste0(target, " ~ ", paste0(covar, collapse = " + ")))
X.SMOTE <- SMOTE(bin.formula, data = X, k = 5)

# Fit randomForest on the SMOTE'd data and pass that same data to forestFloor
rf.SMOTE <- randomForest(x = X.SMOTE[, covar], y = X.SMOTE[, target], keep.inbag = TRUE, ...)
ff.bin <- forestFloor(rf.fit = rf.SMOTE, X = X.SMOTE[, covar])
```
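As a concrete illustration, here is a hedged version of the same workaround using the PimaIndiansDiabetes data from the example above; the `perc.over`/`perc.under` defaults of `DMwR::SMOTE` are assumed and the object names are my own:

```r
library(mlbench)
library(DMwR)
library(randomForest)
library(forestFloor)

data(PimaIndiansDiabetes)
covar <- setdiff(names(PimaIndiansDiabetes), "diabetes")

# SMOTE outside caret -> one fixed, re-sampled training set
pima.SMOTE <- SMOTE(diabetes ~ ., data = PimaIndiansDiabetes, k = 5)

# Train and interpret on exactly the same (re-sampled) data
rf.pima <- randomForest(x = pima.SMOTE[, covar], y = pima.SMOTE$diabetes,
                        keep.inbag = TRUE, keep.forest = TRUE)
ff.pima <- forestFloor(rf.pima, pima.SMOTE[, covar], binary_reg = TRUE)
plot(ff.pima, col = fcol(ff.pima, 1))
```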
I know this is fixed (or at least worked around) from a technical standpoint and the issue is closed, but from a more theoretical standpoint I wonder how over/under-sampling with SMOTE (or resampling more generally) affects the interpretation of the feature contributions and the shape of the observed relationship. Is some further fix required to rescale the feature contributions based on the class odds observed in the original data?
Hello,
First, thank you for an awesome package! Everything was going great until I tried to use your `forestFloor` function on a random forest output from the `caret` package. I need to use `caret` instead of `randomForest` for my dataset because I have severely imbalanced classes, so I am using the SMOTE sampling strategy. I fixed a couple of things, passing `keep.inbag=TRUE` and `keep.forest=TRUE` into `caret::train`, and then finding the `randomForest` object hidden in the `train` class object, `$finalModel`. I still get the error:
```
Error in eval(substitute(expr), envir, enclos) : index out of bounds
```
Please see the following minimal working example, taken from your Pima Indians diabetes post (http://stats.stackexchange.com/questions/183852/can-i-see-the-contribution-way-of-an-input-variable-in-random-forest-model/184000#184000). Thank you in advance!