topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

SBF issues with factor data #908

Open olivierroncalez opened 6 years ago

olivierroncalez commented 6 years ago

I'm currently trying to run a custom SBF with a random forest model on data that includes factor predictors. My custom SBF works when the same predictors are left numeric, but it fails once they are encoded as factors. Since I'm trying to compare grouped (factor) predictors against independent (dummy-coded) ones, this is an issue.

The error I get is: Error in { : task 1 failed - "missing value where TRUE/FALSE needed"

library(caret)
library(tidyverse)

# Simulate Data
data <- twoClassSim(n = 500)
y <- data[, ncol(data)]

# Add binary data
set.seed(337)
data <- bind_cols(data[, -ncol(data)], LPH07_1(n = 500, factors = FALSE)[1:3]) 
data <- bind_cols(data, as.data.frame(y))

# Train/test
idx <- createDataPartition(data$y, p = .8, list = FALSE) 
train <- data[idx, ]
test <- data[-idx, ]

# Index
dummy_index <- createMultiFolds(y = train$y, times = 3)

# Five stats summary
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

# Caret functions
myFunc <- caretSBF
myFunc$summary <- twoClassSummary
myFunc$score <- function(x, y) {
     numX <- length(unique(x))
     if(numX > 2) {
          out <- t.test(x ~ y)$p.value
     } else {
          out <- fisher.test(factor(x), y)$p.value
     }
     out
}
myFunc$filter <- function(score, x, y) {
     keepers <- (score <= 0.05)
     keepers
}

# SBF Control
sbfCtrl <- sbfControl(method = "repeatedcv",
                      repeats = 3,
                      verbose = TRUE,
                      returnResamp = 'final',
                      saveDetails = TRUE,
                      allowParallel = TRUE,
                      index = dummy_index,
                      functions = myFunc)

# Train control 
rf_grid <- expand.grid(mtry = c(1:7))

trCtrl <-  trainControl(method = "cv",
                        number = 5,
                        classProbs = TRUE,
                        verboseIter = TRUE,
                        summaryFunction = fiveStats)

set.seed(337)
my_model <- sbf(train[, -ncol(train)],
                train$y,
                trControl = trCtrl,
                sbfControl = sbfCtrl,
                ## now arguments to `train`:
                method = "rf",
                tuneGrid = rf_grid,
                metric = 'ROC')

With factors = FALSE in the # Add binary data step, the code runs; with factors = TRUE, the error above is thrown. Is this normal, or am I doing something wrong?

While I could simply use numeric inputs, some of my factor data are not binary, and I'd still like to examine it as a grouped predictor, rather than dummy coding it.

Session Info

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.3.0   stringr_1.3.1   dplyr_0.7.5     purrr_0.2.5     readr_1.1.1     tidyr_0.8.1    
 [7] tibble_1.4.2    tidyverse_1.2.1 caret_6.0-80    ggplot2_2.2.1   lattice_0.20-35

loaded via a namespace (and not attached):
 [1] httr_1.3.1          magic_1.5-8         ddalpha_1.3.3       sfsmisc_1.1-2      
 [5] jsonlite_1.5        splines_3.4.3       foreach_1.4.4       prodlim_2018.04.18 
 [9] modelr_0.1.2        assertthat_0.2.0    stats4_3.4.3        DRR_0.0.3          
[13] cellranger_1.1.0    yaml_2.1.19         robustbase_0.93-0   ipred_0.9-6        
[17] pillar_1.2.3        glue_1.2.0          randomForest_4.6-14 rvest_0.3.2        
[21] colorspace_1.3-2    recipes_0.1.2       Matrix_1.2-14       plyr_1.8.4         
[25] psych_1.8.4         timeDate_3043.102   pkgconfig_2.0.1     CVST_0.2-2         
[29] broom_0.4.4         haven_1.1.1         scales_0.5.0        gower_0.1.2        
[33] lava_1.6.1          withr_2.1.2         nnet_7.3-12         lazyeval_0.2.1     
[37] cli_1.0.0           mnormt_1.5-5        survival_2.42-3     magrittr_1.5       
[41] crayon_1.3.4        readxl_1.1.0        nlme_3.1-137        MASS_7.3-50        
[45] xml2_1.2.0          dimRed_0.1.0        foreign_0.8-70      class_7.3-14       
[49] tools_3.4.3         hms_0.4.2           kernlab_0.9-26      munsell_0.4.3      
[53] bindrcpp_0.2.2      e1071_1.6-8         compiler_3.4.3      RcppRoll_0.3.0     
[57] rlang_0.2.1         grid_3.4.3          iterators_1.0.9     rstudioapi_0.7     
[61] geometry_0.3-6      gtable_0.2.0        ModelMetrics_1.1.0  codetools_0.2-15   
[65] abind_1.4-5         reshape2_1.4.3      R6_2.2.2            lubridate_1.7.4    
[69] bindr_0.1.1         stringi_1.2.2       parallel_3.4.3      Rcpp_0.12.17       
[73] rpart_4.1-13        DEoptimR_1.0-8      tidyselect_0.2.4   
topepo commented 6 years ago

The default scoring functions assume that the data are numeric. That's not documented so I'll fix that.

Further, it uses

scores <- apply(x, 2, sbfControl$functions$score, y = y)

so any non-numeric column triggers a conversion of the whole data set to character.

I'll fix that so that it works and put some error traps in with the default scoring functions. The bad news is that I won't get this done for a few weeks.
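
For context, a minimal illustration of that coercion (a sketch using base R only, not caret internals beyond the apply() line quoted above): apply() converts a data frame to a matrix first, so a single factor column makes every column character, and the numeric tests in score() then break downstream.

# one numeric and one factor column
df <- data.frame(num = rnorm(4), fac = factor(c("a", "b", "a", "b")))

# apply() works on as.matrix(df), so both columns arrive as character vectors
apply(df, 2, class)
#>         num         fac
#> "character" "character"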

olivierroncalez commented 6 years ago

Thanks @topepo. I'll just stick with numeric predictors for now.

I did notice some other strange behavior when fitting glm models: the full fiveClass summary statistics show when printing the resampling results, but not when examining the model fit object. Furthermore, the resampling method is different. Using the simulated data above:

set.seed(337)
my_glm_model <- sbf(train[, -ncol(train)],
                   train$y, 
                   method = 'glm',
                   family = 'binomial',
                   preProcess = c('center', 'scale'),
                   sbfControl = sbfCtrl)

# Five class summary statistics
my_glm_model

# Different resampling and only kappa & accuracy
my_glm_model$fit

Is this normal/expected behavior?

topepo commented 6 years ago

my_glm_model resamples the entire process (including feature selection), while the resampling stats in my_glm_model$fit only reflect the model-fit variation (for the fixed, final feature set).
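
In other words (a short sketch, assuming the usual sbf and train object layout), the two sets of statistics live in different places:

# outer resamples: filter + fit repeated on every fold/repeat defined in sbfControl
my_glm_model$results

# inner resamples of the embedded train() fit only (bootstrap by default here),
# computed for the fixed, final predictor set
my_glm_model$fit$results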

olivierroncalez commented 6 years ago

Thank you @topepo. Is there a way to include those additional statistics? They're present for rf after the final feature selection, but that includes the trCtrl element for model tuning, which I'm assuming is the reason.

While on the topic, I noticed there isn't an option to subsample for class imbalance prior to the feature selection. I thought about including that in the myFunc$fit function, but I'm assuming that would happen after the feature selection, correct? Is there an easier way to combine, say, SMOTE with SBF?

myFunc$fit <- function(x, y, ...) {
     # SMOTE subsampling before model fitting
     checkInstall("DMwR")
     library(DMwR)
     # coerce tibbles/matrices to a plain data frame
     dat <- if (is.data.frame(x)) {
          if (inherits(x, "tbl_df")) as.data.frame(x) else x
     } else {
          as.data.frame(x)
     }
     dat$.y <- y
     dat <- SMOTE(.y ~ ., data = dat)
     # split the balanced data back into predictors and outcome
     x <- dat[, !grepl(".y", colnames(dat), fixed = TRUE), drop = FALSE]
     y <- dat$.y

     if (ncol(x) > 0) {
          train(x, y, ...)
     } else {
          nullModel(y = y)
     }
}
topepo commented 6 years ago

Is there a way to include those additional statistics? They're present for rf after the final feature selection, but that includes the trCtrl element for model tuning, which I'm assuming is the reason.

Not sure what you mean. Can you elaborate?

While on the topic, I noticed there isn't an option to subsample for class imbalance prior to the feature selection. I thought about including that in the myFunc$fit function, but I'm assuming that would happen after the feature selection correct? Is there an easier way to combine say SMOTE with SBF?

Yes, you could include it in the fit code and the score code.

It will be easier soon(ish) when I'm done integrating recipes with the feature selection routines. I was going to start working on sbf today or tomorrow in that branch. I've got the model fitting pieces done for rfe, gafs, and safs but the predict methods are not changed yet.

olivierroncalez commented 6 years ago

Amazing, definitely looking forward to that integration.

Yes, you could include it in the fit code and the score code.

Would there be support for univariate filtering combined with multivariate subsampling? Setting multivariate = TRUE should let me subsample the data first and then run a Relief filter in both fit and score. However, subsampling the full set of predictors and then running a univariate t-test filter on each predictor one at a time is not possible when multivariate = FALSE, because the apply() call in the underlying code clashes with the multivariate subsampling step. I can't think of a workaround at this time short of modifying the source code.
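
One possible sketch of this (assuming multivariate = TRUE, where score receives the full predictor set and should return one score per column): subsample once, then compute the univariate p-values column by column. caret::downSample stands in for SMOTE here just to keep the example self-contained.

myFunc$score <- function(x, y) {
     # balance the classes once, before any per-column testing
     dat <- caret::downSample(x = x, y = y)
     y2 <- dat$Class
     x2 <- dat[, setdiff(names(dat), "Class"), drop = FALSE]
     # univariate test per predictor; returns a named vector of p-values
     vapply(x2, function(col) {
          if (is.numeric(col) && length(unique(col)) > 2) {
               t.test(col ~ y2)$p.value
          } else {
               fisher.test(factor(col), y2)$p.value
          }
     }, numeric(1))
}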

Not sure what you mean. Can you elaborate?

Sure. My end goal is to obtain the standard fiveClass summary statistics both for the resampling results and for the final model fit on the final set of predictors. If I understand correctly, there is no need for a trControl element for logistic regression since there are no parameters to tune. If it is omitted, the fiveClass summary is available for the SBF resampling, but not when fitting the final model on the full training data. Including a trControl element, while redundant, makes those stats available.

I've included three models below to showcase this. The first (random forest) has fiveClass summary stats for both the resampled results and the final fit. The second has fiveClass summary stats for the resampling results, but a different resampling method and different stats for the final fit. The third has the same resampling method and stats for both the SBF resampling and the final fit.

Is there a way to get the fiveClass summary stats and matching resampling method for glm without including the trControl element used in model 3?

As a quick follow-up question: if I understand correctly, SBF fits the final set of predictors on the full training set at the end. Given that a different set of predictors may be selected in each cross-validation fold, how does caret choose the final set?


library(caret)
library(tidyverse)

# Simulate Data
data <- twoClassSim(n = 500)
y <- data[, ncol(data)]

# Add binary data
set.seed(337)
data <- bind_cols(data[, -ncol(data)], LPH07_1(n = 500, factors = FALSE)[1:3]) 
data <- bind_cols(data, as.data.frame(y))

# Train/test
idx <- createDataPartition(data$y, p = .8, list = FALSE) 
train <- data[idx, ]
test <- data[-idx, ]

# Index
dummy_index <- createMultiFolds(y = train$y, times = 3)

# Five stats summary
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))

# Caret functions
myFunc <- caretSBF
myFunc$summary <- twoClassSummary
myFunc$score <- function(x, y) {
     numX <- length(unique(x))
     if(numX > 2) {
          out <- t.test(x ~ y)$p.value
     } else {
          out <- fisher.test(factor(x), y)$p.value
     }
     out
}

myFunc$filter <- function(score, x, y) {
     keepers <- (score <= 0.05)
     keepers
}

# SBF Control
sbfCtrl <- sbfControl(method = "repeatedcv",
                      repeats = 3,
                      verbose = TRUE,
                      returnResamp = 'final',
                      saveDetails = TRUE,
                      allowParallel = TRUE,
                      index = dummy_index,
                      functions = myFunc)

# Train control 
rf_grid <- expand.grid(mtry = c(1:7))

trCtrl <-  trainControl(method = "cv",
                        number = 5,
                        classProbs = TRUE,
                        verboseIter = TRUE,
                        summaryFunction = fiveStats)

### Models below

# Model 1 (fiveClass summary statistics for final model)
set.seed(337)
rf_model1 <- sbf(train[, -ncol(train)],
                train$y,
                trControl = trCtrl,
                sbfControl = sbfCtrl,
                ## now arguments to `train`:
                method = "rf",
                tuneGrid = rf_grid,
                metric = 'ROC')
rf_model1 # Five class summary
rf_model1$fit # Five class summary

# Model 2 (No fiveClass summary statistics for final model)
set.seed(337)
my_glm_model <- sbf(train[, -ncol(train)],
                   train$y, 
                   method = 'glm',
                   family = 'binomial',
                   preProcess = c('center', 'scale'),
                   sbfControl = sbfCtrl)
my_glm_model # FiveClass stats
my_glm_model$fit # Different resampling method (i.e., bootstrapping) & no fiveClass stats

# Model 3 (fiveClass summary statistics for final model)
set.seed(337)
my_glm_model2 <- sbf(train[, -ncol(train)],
                    train$y, 
                    method = 'glm',
                    family = 'binomial',
                    preProcess = c('center', 'scale'),
                    sbfControl = sbfCtrl,
                    # New element here. Technically not needed as no tuning is required
                    trControl = trCtrl)
my_glm_model2 # FiveClass stats
my_glm_model2$fit # Same resampling methods (i.e., CV) & fiveClass stats