Open olivierroncalez opened 6 years ago
The default scoring functions assume that the data are numeric. That's not documented, so I'll fix that.

Further, it uses

    scores <- apply(x, 2, sbfControl$functions$score, y = y)

so any non-numeric column triggers a conversion of the whole data set to character. I'll fix that so that it works and put some error traps in with the default scoring functions. The bad news is that I won't get this done for a few weeks.
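To see why (a base-R sketch, separate from the caret internals): apply() coerces a data frame to a matrix before looping over columns, so a single non-numeric column turns every column into character.

```r
# A mixed-type data frame: one numeric column, one factor column
df <- data.frame(num = c(1.2, 3.4, 5.6, 7.8),
                 fac = factor(c("a", "b", "a", "b")))

# apply() first calls as.matrix(), so the whole data set becomes
# character and numeric scoring functions such as t.test() break
m <- apply(df, 2, identity)
typeof(m[, "num"])   # "character"

# Iterating over the columns directly preserves each column's type
sapply(df, typeof)   # num is still "double"
```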
Thanks @topepo. I'll just stick with numeric predictors for now.
I did notice some other strange behavior when fitting glm models: the full five-class summary statistics show when printing the resampling results, but not when examining the model fit object. Furthermore, the resampling is different. Using the simulated data above:
set.seed(337)
my_glm_model <- sbf(train[, -ncol(train)],
                    train$y,
                    method = 'glm',
                    family = 'binomial',
                    preProcess = c('center', 'scale'),
                    sbfControl = sbfCtrl)

# Five class summary statistics
my_glm_model

# Different resampling and only Kappa & Accuracy
my_glm_model$fit
Is this normal/expected behavior?
my_glm_model resamples the entire process (including feature selection), while the resampling stats in my_glm_model$fit only know about the model-fit variation (for the fixed, final feature set).
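A sketch of where each set of statistics lives (the field names here assume the usual sbf and train object layouts; worth checking against ?sbf):

```r
# The two levels of resampling on an sbf object, using the model above:
my_glm_model$resample      # outer resamples: filter + fit repeated together
my_glm_model$fit$resample  # inner resamples from train(), with the final
                           # feature set held fixed
```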
Thank you @topepo. Is there a way to include those additional statistics? They're present for rf after the final feature selection, but that model includes the trCtrl element for model tuning, which I'm assuming is the reason.
While on the topic, I noticed there isn't an option to subsample for class imbalance prior to the feature selection. I thought about including that in the myFunc$fit function, but I'm assuming that would happen after the feature selection, correct? Is there an easier way to combine, say, SMOTE with SBF?
myFunc$fit <- function (x, y, ...) {
  # SMOTE subsampling
  checkInstall("DMwR")
  library(DMwR)
  dat <- if (is.data.frame(x)) {
    if (inherits(x, "tbl_df"))
      as.data.frame(x)
    else
      x
  } else {
    as.data.frame(x)
  }
  dat$.y <- y
  dat <- SMOTE(.y ~ ., data = dat)
  x <- dat[, !grepl(".y", colnames(dat), fixed = TRUE), drop = FALSE]
  y <- dat$.y
  if (ncol(x) > 0) {
    train(x, y, ...)
  } else {
    nullModel(y = y)
  }
}
Is there a way to include those additional statistics? They're present for rf after the final feature selection, but that includes the trCtrl element for model tuning, which I'm assuming is the reason.
Not sure what you mean. Can you elaborate?
While on the topic, I noticed there isn't an option to subsample for class imbalance prior to the feature selection. I thought about including that in the myFunc$fit function, but I'm assuming that would happen after the feature selection, correct? Is there an easier way to combine, say, SMOTE with SBF?
Yes, you could include it in the fit code and the score code.
It will be easier soon(ish) when I'm done integrating recipes with the feature selection routines. I was going to start working on sbf today or tomorrow in that branch. I've got the model fitting pieces done for rfe, gafs, and safs, but the predict methods are not changed yet.
Amazing, definitely looking forward to that integration.
Yes, you could include it in the fit code and the score code.
Would there be support for univariate filtering with multivariate subsampling? Setting multivariate = TRUE should allow me to first subsample the data and then run a Relief filter in both fit and score. However, subsampling the full set of predictors and then running a univariate t-test filter on each predictor one at a time is not possible when multivariate = FALSE, due to the apply call in the base code (which clashes with the multivariate subsampling process). I can't think of a workaround at this time, short of modifying the source code.
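One possible workaround, sketched under the assumption that sbfControl(multivariate = TRUE, ...) hands the score function the entire predictor data frame at once: do the subsampling and the one-predictor-at-a-time tests both inside score, replacing caret's internal apply() with an explicit loop. The SMOTE call mirrors the fit code above.

```r
# Sketch: multivariate score() that subsamples first, then computes a
# univariate t-test p-value per predictor. Assumes the DMwR package and
# the myFunc list defined earlier; the exists() guard only lets the
# sketch run standalone.
if (!exists("myFunc")) myFunc <- list()

myFunc$score <- function(x, y) {
  dat <- as.data.frame(x)
  dat$.y <- y
  dat <- DMwR::SMOTE(.y ~ ., data = dat)  # multivariate subsampling
  x_sub <- dat[, setdiff(colnames(dat), ".y"), drop = FALSE]
  y_sub <- dat$.y
  # univariate filtering: one named p-value per (numeric) predictor
  vapply(x_sub, function(col) t.test(col ~ y_sub)$p.value, numeric(1))
}
```

The existing filter (score <= 0.05) would then receive the whole named vector of p-values in one call.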
Not sure what you mean. Can you elaborate?
Sure. The end goal for me is to obtain the standard fiveClass summary statistics both for the resampling results and for fitting the final model on the final set of predictors. If I understand correctly, there is no need for a trControl element for logistic regression, as there are no parameters to be tuned. If it is omitted, the fiveClass summary will be available for the SBF resampling, but not when fitting the final model on the full training data. Including a trControl element, while redundant, allows these stats to be present.

I've included 3 models to showcase this. The first model, the random forest, has fiveClass summary stats for the resampled results and the final fit. The second model has fiveClass summary stats for the resampling results, but a different resampling method and stats for the final fit. The third model has the same methods and resampling stats for both SBF and the final fit.

Is there a way to include the fiveClass summary stats and methods for glm without including the trControl element in model 3?
As a quick follow-up question: if I understand correctly, SBF will fit the final set of predictors on the full set of training data at the end. Given that a different set of predictors may be selected in each cross-validation fold, how does caret choose the optimal set?
library(caret)
library(tidyverse)

# Simulate data
data <- twoClassSim(n = 500)
y <- data[, ncol(data)]

# Add binary data
set.seed(337)
data <- bind_cols(data[, -ncol(data)], LPH07_1(n = 500, factors = FALSE)[1:3])
data <- bind_cols(data, as.data.frame(y))

# Train/test split
idx <- createDataPartition(data$y, p = .8, list = FALSE)
train <- data[idx, ]
test <- data[-idx, ]

# Resampling index
dummy_index <- createMultiFolds(y = train$y, times = 3)

# Five stats summary
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

# Custom SBF functions
myFunc <- caretSBF
myFunc$summary <- twoClassSummary

myFunc$score <- function(x, y) {
  numX <- length(unique(x))
  if (numX > 2) {
    out <- t.test(x ~ y)$p.value
  } else {
    out <- fisher.test(factor(x), y)$p.value
  }
  out
}

myFunc$filter <- function(score, x, y) {
  keepers <- (score <= 0.05)
  keepers
}

# SBF control
sbfCtrl <- sbfControl(method = "repeatedcv",
                      repeats = 3,
                      verbose = TRUE,
                      returnResamp = 'final',
                      saveDetails = TRUE,
                      allowParallel = TRUE,
                      index = dummy_index,
                      functions = myFunc)

# Train control
rf_grid <- expand.grid(mtry = 1:7)
trCtrl <- trainControl(method = "cv",
                       number = 5,
                       classProbs = TRUE,
                       verboseIter = TRUE,
                       summaryFunction = fiveStats)

### Models below

# Model 1 (fiveClass summary statistics for final model)
set.seed(337)
rf_model1 <- sbf(train[, -ncol(train)],
                 train$y,
                 trControl = trCtrl,
                 sbfControl = sbfCtrl,
                 ## now arguments to `train`:
                 method = "rf",
                 tuneGrid = rf_grid,
                 metric = 'ROC')
rf_model1      # Five class summary
rf_model1$fit  # Five class summary

# Model 2 (no fiveClass summary statistics for final model)
set.seed(337)
my_glm_model <- sbf(train[, -ncol(train)],
                    train$y,
                    method = 'glm',
                    family = 'binomial',
                    preProcess = c('center', 'scale'),
                    sbfControl = sbfCtrl)
my_glm_model      # fiveClass stats
my_glm_model$fit  # Different resampling method (i.e., bootstrapping) & no fiveClass stats

# Model 3 (fiveClass summary statistics for final model)
set.seed(337)
my_glm_model2 <- sbf(train[, -ncol(train)],
                     train$y,
                     method = 'glm',
                     family = 'binomial',
                     preProcess = c('center', 'scale'),
                     sbfControl = sbfCtrl,
                     # New element here; technically not needed as no tuning is required
                     trControl = trCtrl)
my_glm_model2      # fiveClass stats
my_glm_model2$fit  # Same resampling method (i.e., CV) & fiveClass stats
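On the follow-up question about the final predictor set, my reading of the sbf documentation is that no cross-fold vote is taken: the filter is simply applied once more to the full training set, and that is the set the final model uses. A sketch for checking this against the models above (accessor names assume the usual sbf object; saveDetails = TRUE keeps the per-resample details):

```r
# Per-resample selections vs. the final predictor set
my_glm_model2$variables    # predictors that survived the filter in each resample
predictors(my_glm_model2)  # final set, chosen by filtering the full training data
```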
I'm currently trying to run a custom SBF using the random forest model with factor data. However, while my custom SBF works on the same data when it has not been coded as factors, it does not once the factor labels have been added. Given that I'm attempting to compare grouped vs. independent factor data, this is an issue.

The error I get is: Error in { : task 1 failed - "missing value where TRUE/FALSE needed"

Setting factors = FALSE in the # Add binary data step above allows the code to run, but when factors = TRUE, the aforementioned error is thrown. Is this normal, or am I doing something wrong? While I could simply use numeric inputs, some of my factor data are not binary, and I'd still like to examine them as grouped predictors rather than dummy coding them.
Session Info