topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Feature Request: place ... in evalSummaryFunction() for train() #447

Closed. SteveBronder closed this issue 7 years ago.

SteveBronder commented 8 years ago

I'm currently testing out a new performance criterion, and some of the variables I need cannot currently be accessed by my custom summary function since evalSummaryFunction() does not allow ....

This change would allow the call in train() (line 510) to be something like

performance <- evalSummaryFunction(y, wts = weights, ctrl = trControl,
                                   lev = classLevels, metric = metric,
                                   method = method, extraStuff = TheStuff)

by changing evalSummaryFunction() in misc.R (line 31) to

evalSummaryFunction <- function(y, wts, ctrl, lev, metric, method, ...) {
...
}

Since train() already accepts ..., this would let extra parameters be passed through to the summary function.
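
For concreteness, a minimal sketch of the forwarding step. The body below is illustrative, not caret's actual internals; testOutput stands in for the dummy prediction frame that evalSummaryFunction() builds to test the summary function:

evalSummaryFunction <- function(y, wts, ctrl, lev, metric, method, ...) {
  # stand-in for the dummy prediction frame built internally
  testOutput <- data.frame(pred = sample(y), obs = sample(y))
  # forward the extra arguments on to the user's summary function
  ctrl$summaryFunction(testOutput, lev, method, ...)
}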

Thank you for your time,

Steve

topepo commented 8 years ago

That is an issue. I'd like to be able to pass arguments to the summary function, preProcess, etc. The problem is that parsing the ... is difficult and likely to lead to a lot of bugs, and a change like the one you mention would break a lot of backwards compatibility.

The solution, which I don't think is very onerous, is to create your own summary function and pass it in through trainControl.

SteveBronder commented 8 years ago

I've been trying to create my own summary function, but the criterion for model evaluation uses an outside variable, as in the second example below.

library(caret)

####
# EX: 1
# This works
####
scale_summary <- function(data = NULL, lev = NULL, model = NULL) {
  pred <- data$pred
  observ <- data$obs
  # scaled error: residual relative to each observation's deviation from the mean
  q_score <- (pred - observ) / abs(observ - mean(observ))
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit1Control <- trainControl(method = "cv",
                            number = 2,
                            summaryFunction = scale_summary,
                            allowParallel = FALSE)

xIris <- iris[,c(3:5)]
yIris <- iris[,1]

fit1 <- train(x = xIris,
              y = yIris,
              trControl = fit1Control,
              method = "lm",
              metric = "MASE")
fit1
###
# Everything works and is great
###
#  Linear Regression 
#
#  150 samples
#    3 predictor
#
#  No pre-processing
#  Resampling: Cross-Validated (2 fold) 
#  Summary of sample sizes: 77, 73 
#  Resampling results:
#
#  MASE     
#  0.8629896

## Arbitrary value for scoring against
alt_scoring <- iris[,2]

######
# EX: 2
# This does not work
######
alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = NULL) {
  pred <- data$pred
  observ <- data$obs
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2Control <- trainControl(method = "cv",
                            number = 2,
                            summaryFunction = alt_scale_summary,
                            allowParallel = FALSE)
###
# Passing alt_score through train() does not work; the argument gets forwarded
#  to the model fit instead (see the lm.fit warnings below)
###
fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "lm",
              metric = "MASE",
              alt_score = alt_scoring)
# Something is wrong; all the MASE metric values are missing:
#      MASE    
# Min.   : NA  
# 1st Qu.: NA  
# Median : NA  
# Mean   :NaN  
# 3rd Qu.: NA  
# Max.   : NA  
# NA's   :1    
# Error in train.default(x = xIris, y = yIris, trControl = fit2Control,  : 
#  Stopping
# In addition: Warning messages:
# 1: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
#  extra argument ‘alt_score’ will be disregarded 
# 2: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
#  extra argument ‘alt_score’ will be disregarded 
# 3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
#   There were missing values in resampled performance measures.

One partial solution is to reference alt_scoring directly in the summary function's default argument and let lexical scoping do its thing. This sort of works, but only if the scoring vector can be subset by whatever index the current resample came in with; otherwise you get the 'longer object length is not a multiple of shorter object length' error.

alt_scoring <- iris[,2]

####
# EX: 3
# This works sometimes
####
alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring) {
  pred <- data$pred
  observ <- data$obs
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2Control <- trainControl(method = "cv",
                            number = 10,
                            summaryFunction = alt_scale_summary,
                            allowParallel = FALSE)

fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "lm",
              metric = "MASE")
warnings()
# Warning messages:
# 1: In observ - alt_score :
#   longer object length is not a multiple of shorter object length
# 2: In (pred - observ)/abs(observ - alt_score) :
# ...  
# 12: In (pred - observ)/abs(observ - alt_score) :
#   longer object length is not a multiple of shorter object length

I've tried a few things but am having no luck. Is it possible to tell the summary function what the current index or resample is? Something like

alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring[index])
SteveBronder commented 8 years ago

After some snooping, I found that data$rowIndex can be used to subset the scoring vector to the current resample:


xIris <- iris[,c(3:5)]
yIris <- iris[,1]
alt_scoring <- iris[,2]

alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring, ...) {
  pred <- data$pred
  observ <- data$obs
  # keep only the scoring values for the rows in the current resample
  alt_score <- alt_score[data$rowIndex]
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2_folds <- createFolds(iris[, 2], k = 10, list = TRUE, returnTrain = TRUE)
fit2Control <- trainControl(index = fit2_folds,
                            summaryFunction = alt_scale_summary,
                            allowParallel = FALSE)

fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "lm",
              metric = "MASE")

Closing the issue as this does what I need.

SteveBronder commented 8 years ago

If I may reopen this request, the above script does not work in parallel.

library(doParallel)
library(caret)
xIris <- iris[,c(2:3)]
yIris <- iris[,1]
alt_scoring <- iris[,2]

alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring, ...) {
  pred <- data$pred
  observ <- data$obs
  alt_score <- alt_score[data$rowIndex]
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2_folds <- createFolds(iris[, 2], k = 10, list = TRUE, returnTrain = TRUE)
fit2Control <- trainControl(index = fit2_folds,
                            summaryFunction = alt_scale_summary,
                            allowParallel = TRUE)

cubeTune <- expand.grid(committees = c(20, 30, 50, 65, 75, 85, 90, 100),
                        neighbors = c(0, 1, 2, 3, 7, 9))

cl <- makeCluster(3)
registerDoParallel(cl)
fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "cubist",
              tuneGrid = cubeTune,
              metric = "MASE")

## Error in { : task 1 failed - "object 'alt_scoring' not found"
stopCluster(cl)

I'm guessing this is because the worker cannot walk up the environment chain to find alt_scoring. Is there a way to pass alt_scoring to the underlying workers?
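
One thing I have not tried yet is explicitly exporting the object to the socket workers with parallel::clusterExport() before calling train(). A sketch, assuming the workers then resolve alt_scoring in their global environment:

cl <- makeCluster(3)
registerDoParallel(cl)
# copy alt_scoring from the master's global environment into each worker's
clusterExport(cl, "alt_scoring")
fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "cubist",
              tuneGrid = cubeTune,
              metric = "MASE")
stopCluster(cl)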

Session Info:

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3     Cubist_0.0.18     caret_6.0-70      ggplot2_2.1.0     lattice_0.20-33  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5        magrittr_1.5       splines_3.3.0      MASS_7.3-45        munsell_0.4.3      colorspace_1.2-6   minqa_1.2.4       
 [8] stringr_1.0.0      car_2.1-2          plyr_1.8.3         tools_3.3.0        nnet_7.3-12        pbkrtest_0.4-6     grid_3.3.0        
[15] gtable_0.2.0       nlme_3.1-127       mgcv_1.8-12        quantreg_5.24      MatrixModels_0.4-1 lme4_1.1-12        Matrix_1.2-6      
[22] nloptr_1.0.4       reshape2_1.4.1     codetools_0.2-14   stringi_1.0-1      compiler_3.3.0     scales_0.4.0       stats4_3.3.0      
[29] SparseM_1.7 
topepo commented 7 years ago

This looks like it is dependent on the parallel processing technology:

> library(doMC)
> registerDoMC(cores=4)
> library(caret)
> xIris <- iris[,c(2:3)]
> yIris <- iris[,1]
> alt_scoring <- iris[,2]
> 
> alt_scale_summary <- function(data = NULL, lev = NULL,model = NULL,
+                           alt_score = alt_scoring,...){
+ 
+   pred <- data$pred
+   observ <- data$obs
+   alt_score <- alt_score[data$rowIndex]
+   q_score <- (pred - observ) / abs(observ - alt_score)
+   MASE <-  mean(q_score)
+   names(MASE) <- c("MASE")
+   MASE
+ }
> 
> fit2_folds <- createFolds(iris[,2],k=10,list=TRUE,returnTrain = TRUE)
> fit2Control <- trainControl(index = fit2_folds,
+                             summaryFunction = alt_scale_summary,
+                             allowParallel = TRUE)
> 
> cubeTune <- expand.grid(committees = c(20,30,50,65,75,85,90,100), neighbors = c(0,1,2,3,7,9))
> fit2 <- train( x = xIris,
+                y = yIris,
+                trControl = fit2Control,
+                method = "cubist",
+                tuneGrid = cubeTune,
+                metric = "MASE")
Loading required package: Cubist
>

Can you send the results of sessionInfo()?

SteveBronder commented 7 years ago

My apologies for not getting back to this sooner!

This looks like it is dependent on the parallel processing technology:

You are absolutely correct. Changing

cl <- makeCluster(3)

to

cl <- makeForkCluster(3)

allowed it to work, so sockets fail. It appears that alt_scoring is not available in the new R processes: the summary function itself is passed along, but the global variable it references is not. Using custom loss functions in parallel is probably not possible on Windows without a way to ship that object to each worker (though see the sketch below).
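
One way around relying on the global environment entirely might be a closure factory that captures alt_scoring: since a function's enclosing environment is serialized along with it, the captured copy should travel to each socket worker. A sketch (untested here):

make_alt_summary <- function(alt_score) {
  force(alt_score)  # capture the scoring vector in the closure's environment
  function(data, lev = NULL, model = NULL, ...) {
    alt_fold <- alt_score[data$rowIndex]
    q_score <- (data$pred - data$obs) / abs(data$obs - alt_fold)
    c(MASE = mean(q_score))
  }
}

fit2Control <- trainControl(index = fit2_folds,
                            summaryFunction = make_alt_summary(alt_scoring),
                            allowParallel = TRUE)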

Below is my session info

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Cubist_0.0.19     caret_6.0-73      ggplot2_2.2.1     lattice_0.20-34   doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3     BigVAR_1.0.1     
 [9] mlr_2.11          ParamHelpers_1.10 checkmate_1.8.2   testthat_1.0.2    devtools_1.12.0  

loaded via a namespace (and not attached):
 [1] zoo_1.7-14         reshape2_1.4.2     splines_3.3.2      colorspace_1.3-2   stats4_3.3.2       htmltools_0.3.5    mgcv_1.8-16        survival_2.40-1   
 [9] ModelMetrics_1.1.0 nloptr_1.0.4       withr_1.0.2        DBI_0.6            plyr_1.8.4         stringr_1.2.0      MatrixModels_0.4-1 munsell_0.4.3     
[17] gtable_0.2.0       codetools_0.2-15   memoise_1.0.0      parallelMap_1.3    SparseM_1.76       httpuv_1.3.3       quantreg_5.29      pbkrtest_0.4-6    
[25] xts_0.9.874        Rcpp_0.12.9        xtable_1.8-2       scales_0.4.1       backports_1.0.5    mime_0.5           lme4_1.1-12        ggvis_0.4.3       
[33] digest_0.6.12      stringi_1.1.2      BBmisc_1.11        dplyr_0.5.0        shiny_1.0.0        grid_3.3.2         tools_3.3.2        magrittr_1.5      
[41] lazyeval_0.2.0     tibble_1.2         crayon_1.3.2       car_2.1-4          MASS_7.3-45        Matrix_1.2-8       data.table_1.10.4  lubridate_1.6.0   
[49] rstudioapi_0.6     assertthat_0.1     minqa_1.2.4        R6_2.2.0           compiler_3.3.2     nnet_7.3-12        nlme_3.1-131    

PS: @topepo I'm the guy who, a while back, found your answer-to-life-the-universe-and-everything joke in Applied Predictive Modeling and posted it to Reddit. I ended up working on mlr for my master's thesis and did some nice things with it for forecasting, which you can find here. I think it would be nice to have more communication between the two packages' developers, since they handle a lot of similar tasks.

topepo commented 7 years ago

There is a new branch of the package ("recipe_integration"), which I hope to merge within two weeks, that uses the recipes package and can help here. An example of how the code works can be found here.
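
Roughly, the idea is that a recipe can give extra columns a non-predictor role so they travel with the resampled data into the summary function, sidestepping the global-variable problem above. A speculative sketch of the pattern; the role name "performance var" and other details here are assumptions, so see the linked example for the working code:

library(recipes)

# tag Sepal.Width so it rides along with the data instead of being a predictor;
# the role name is an assumption, not confirmed in this thread
rec <- recipe(Sepal.Length ~ Petal.Length + Petal.Width + Species + Sepal.Width,
              data = iris) %>%
  update_role(Sepal.Width, new_role = "performance var")

# if the tagged column then shows up in `data`, the summary function no longer
# needs any global variable
alt_summary_rec <- function(data, lev = NULL, model = NULL, ...) {
  q_score <- (data$pred - data$obs) / abs(data$obs - data$Sepal.Width)
  c(MASE = mean(q_score))
}

fit <- train(rec, data = iris,
             method = "lm",
             trControl = trainControl(summaryFunction = alt_summary_rec),
             metric = "MASE")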

If you take a look, please let me know if you see any issues by posting on the project page.

SteveBronder commented 7 years ago

Very cool! Random question: would recipes allow post-processing of the predictions?

topepo commented 7 years ago

No, it is mostly for preprocessing. I'll have a whole other package for that type of calibration (it's on my list).