Closed SteveBronder closed 7 years ago
That is an issue. I'd like to pass arguments to the summary function or preProcess etc. The problem is that parsing the ... argument is difficult and likely to lead to a lot of bugs, and a change like the one you mentioned would break a lot of backwards compatibility. The solution, which I don't think is very onerous, is to create your own summary function and pass it in through trainControl.
I've been trying to create my own summary function, but the criterion for model evaluation uses an outside variable, as in the second example below.
library(caret)
####
# EX: 1
# This works
####
scale_summary <- function(data = NULL, lev = NULL, model = NULL) {
  pred <- data$pred
  observ <- data$obs
  q_score <- (pred - observ) / abs(observ - mean(observ))
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit1Control <- trainControl(method = "cv",
                            number = 2,
                            summaryFunction = scale_summary,
                            allowParallel = FALSE)

xIris <- iris[, c(3:5)]
yIris <- iris[, 1]

fit1 <- train(x = xIris,
              y = yIris,
              trControl = fit1Control,
              method = "lm",
              metric = "MASE")
fit1
###
# Everything works and is great
###
# Linear Regression
#
# 150 samples
# 3 predictor
#
# No pre-processing
# Resampling: Cross-Validated (2 fold)
# Summary of sample sizes: 77, 73
# Resampling results:
#
# MASE
# 0.8629896
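Incidentally, a summary function like this can be sanity-checked outside of train() by calling it on a small hand-built data frame; caret passes a data frame with at least pred and obs columns. This is just a quick sketch; the numbers are made up for illustration:

```r
scale_summary <- function(data = NULL, lev = NULL, model = NULL) {
  pred <- data$pred
  observ <- data$obs
  q_score <- (pred - observ) / abs(observ - mean(observ))
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

# Mimic the shape of the data frame caret hands to summaryFunction,
# using made-up predictions and observations.
toy <- data.frame(pred = c(1.2, 2.1, 2.9), obs = c(1, 2.5, 3))
scale_summary(toy)  # a single numeric named "MASE"
```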
## Arbitrary value for scoring against
alt_scoring <- iris[,2]
######
# EX: 2
# This does not work
######
alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = NULL) {
  pred <- data$pred
  observ <- data$obs
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2Control <- trainControl(method = "cv",
                            number = 2,
                            summaryFunction = alt_scale_summary,
                            allowParallel = FALSE)
###
# Passing alt_score through train() does not work
# and breaks a bunch of other things on its way to failure
###
fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "lm",
              metric = "MASE",
              alt_score = alt_scoring)
# Something is wrong; all the MASE metric values are missing:
# MASE
# Min. : NA
# 1st Qu.: NA
# Median : NA
# Mean :NaN
# 3rd Qu.: NA
# Max. : NA
# NA's :1
# Error in train.default(x = xIris, y = yIris, trControl = fit2Control, :
# Stopping
# In addition: Warning messages:
# 1: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# extra argument ‘alt_score’ will be disregarded
# 2: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# extra argument ‘alt_score’ will be disregarded
# 3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
# There were missing values in resampled performance measures.
One partial solution is to reference alt_scoring directly from the summary function and let lexical scoping do its thing. This sort of works, but only if we can subset the scoring vector by whatever index the data came in with; otherwise you get the 'longer object length is not a multiple of shorter object length' error.
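That error is plain R vector recycling: the summary function only sees the holdout rows, so its obs vector is shorter than the full-length alt_scoring. A minimal base-R sketch (made-up numbers) reproduces the warning outside of caret:

```r
observ <- c(5.1, 4.9, 4.7, 4.6)                # holdout obs (length 4)
alt_score <- c(3.5, 3.0, 3.2, 3.1, 3.6, 3.9)   # full-length scoring vector (length 6)

# R recycles the shorter vector and warns, which is the error caret surfaces:
res <- observ - alt_score
# Warning: longer object length is not a multiple of shorter object length
length(res)  # 6
```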
alt_scoring <- iris[,2]
####
# EX: 3
# This works sometimes
####
alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring) {
  pred <- data$pred
  observ <- data$obs
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2Control <- trainControl(method = "cv",
                            number = 10,
                            summaryFunction = alt_scale_summary,
                            allowParallel = FALSE)

fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "lm",
              metric = "MASE")
warnings()
# Warning messages:
# 1: In observ - alt_score :
# longer object length is not a multiple of shorter object length
# 2: In (pred - observ)/abs(observ - alt_score) :
# ...
# 12: In (pred - observ)/abs(observ - alt_score) :
# longer object length is not a multiple of shorter object length
I've tried some things, but am having no luck. Is it possible to tell the summary function what the current index or sampling is? Something like
alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring[index])
After some snooping, I used data$rowIndex to make the appropriate cut/sample:
xIris <- iris[, c(3:5)]
yIris <- iris[, 1]
alt_scoring <- iris[, 2]

alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring, ...) {
  pred <- data$pred
  observ <- data$obs
  alt_score <- alt_score[data$rowIndex]
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2_folds <- createFolds(iris[, 2], k = 10, list = TRUE, returnTrain = TRUE)
fit2Control <- trainControl(index = fit2_folds,
                            summaryFunction = alt_scale_summary,
                            allowParallel = FALSE)

fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "lm",
              metric = "MASE")
Closing the issue as this does what I need.
If I may reopen this request, the above script does not work in parallel.
library(doParallel)
library(caret)
xIris <- iris[, c(2:3)]
yIris <- iris[, 1]
alt_scoring <- iris[, 2]

alt_scale_summary <- function(data = NULL, lev = NULL, model = NULL,
                              alt_score = alt_scoring, ...) {
  pred <- data$pred
  observ <- data$obs
  alt_score <- alt_score[data$rowIndex]
  q_score <- (pred - observ) / abs(observ - alt_score)
  MASE <- mean(q_score)
  names(MASE) <- c("MASE")
  MASE
}

fit2_folds <- createFolds(iris[, 2], k = 10, list = TRUE, returnTrain = TRUE)
fit2Control <- trainControl(index = fit2_folds,
                            summaryFunction = alt_scale_summary,
                            allowParallel = TRUE)

cubeTune <- expand.grid(committees = c(20, 30, 50, 65, 75, 85, 90, 100),
                        neighbors = c(0, 1, 2, 3, 7, 9))

cl <- makeCluster(3)
registerDoParallel(cl)

fit2 <- train(x = xIris,
              y = yIris,
              trControl = fit2Control,
              method = "cubist",
              tuneGrid = cubeTune,
              metric = "MASE")
## Error in { : task 1 failed - "object 'alt_scoring' not found"
stopCluster(cl)
I'm guessing this is because the worker cannot search up the environment tree to find alt_scoring. Is there a way to pass alt_scoring to the underlying worker?
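One possible workaround (a sketch, not an official caret mechanism) is to ship the variable to the socket workers yourself with clusterExport() from base R's parallel package, after creating the cluster and before calling train():

```r
library(doParallel)

alt_scoring <- iris[, 2]

cl <- makeCluster(3)
registerDoParallel(cl)

# PSOCK workers start as fresh R sessions with empty global environments,
# so any global the summary function relies on must be exported explicitly.
clusterExport(cl, varlist = "alt_scoring")

# ... call train() as above with allowParallel = TRUE ...

stopCluster(cl)
```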
Session Info:
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3 Cubist_0.0.18 caret_6.0-70 ggplot2_2.1.0 lattice_0.20-33
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 magrittr_1.5 splines_3.3.0 MASS_7.3-45 munsell_0.4.3 colorspace_1.2-6 minqa_1.2.4
[8] stringr_1.0.0 car_2.1-2 plyr_1.8.3 tools_3.3.0 nnet_7.3-12 pbkrtest_0.4-6 grid_3.3.0
[15] gtable_0.2.0 nlme_3.1-127 mgcv_1.8-12 quantreg_5.24 MatrixModels_0.4-1 lme4_1.1-12 Matrix_1.2-6
[22] nloptr_1.0.4 reshape2_1.4.1 codetools_0.2-14 stringi_1.0-1 compiler_3.3.0 scales_0.4.0 stats4_3.3.0
[29] SparseM_1.7
This looks like it is dependent on the parallel processing technology:
> library(doMC)
> registerDoMC(cores=4)
> library(caret)
> xIris <- iris[,c(2:3)]
> yIris <- iris[,1]
> alt_scoring <- iris[,2]
>
> alt_scale_summary <- function(data = NULL, lev = NULL,model = NULL,
+ alt_score = alt_scoring,...){
+
+ pred <- data$pred
+ observ <- data$obs
+ alt_score <- alt_score[data$rowIndex]
+ q_score <- (pred - observ) / abs(observ - alt_score)
+ MASE <- mean(q_score)
+ names(MASE) <- c("MASE")
+ MASE
+ }
>
> fit2_folds <- createFolds(iris[,2],k=10,list=TRUE,returnTrain = TRUE)
> fit2Control <- trainControl(index = fit2_folds,
+ summaryFunction = alt_scale_summary,
+ allowParallel = TRUE)
>
> cubeTune <- expand.grid(committees = c(20,30,50,65,75,85,90,100), neighbors = c(0,1,2,3,7,9))
> fit2 <- train( x = xIris,
+ y = yIris,
+ trControl = fit2Control,
+ method = "cubist",
+ tuneGrid = cubeTune,
+ metric = "MASE")
Loading required package: Cubist
>
Can you send the results of sessionInfo()?
I terribly apologize for not getting back to this!
This looks like it is dependent on the parallel processing technology:
You are absolutely correct. Changing
cl <- makeCluster(3)
to
cl <- makeForkCluster(3)
allowed it to work. So socket clusters fail: it appears that alt_scale_summary's enclosing environment is not being shipped to each new R process. Since fork clusters are unavailable on Windows, using custom loss functions in parallel there is probably not possible without a way to pass the function and its dependencies along to each worker.
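The fork/socket difference can be demonstrated with base R's parallel package alone (a minimal sketch, no caret involved): a PSOCK worker is a fresh R session and only sees what you export to it, whereas a forked worker inherits the parent's environment (forking is unavailable on Windows):

```r
library(parallel)

alt_scoring <- iris[, 2]

cl <- makeCluster(2)  # PSOCK (socket) cluster: workers are fresh R sessions
found_before <- parSapply(cl, 1:2, function(i) exists("alt_scoring"))
clusterExport(cl, "alt_scoring")  # ship the global to every worker
found_after <- parSapply(cl, 1:2, function(i) exists("alt_scoring"))
stopCluster(cl)

found_before  # FALSE FALSE -- workers cannot see the parent's globals
found_after   # TRUE TRUE   -- visible once explicitly exported
```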
Below is my session info
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Cubist_0.0.19 caret_6.0-73 ggplot2_2.2.1 lattice_0.20-34 doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3 BigVAR_1.0.1
[9] mlr_2.11 ParamHelpers_1.10 checkmate_1.8.2 testthat_1.0.2 devtools_1.12.0
loaded via a namespace (and not attached):
[1] zoo_1.7-14 reshape2_1.4.2 splines_3.3.2 colorspace_1.3-2 stats4_3.3.2 htmltools_0.3.5 mgcv_1.8-16 survival_2.40-1
[9] ModelMetrics_1.1.0 nloptr_1.0.4 withr_1.0.2 DBI_0.6 plyr_1.8.4 stringr_1.2.0 MatrixModels_0.4-1 munsell_0.4.3
[17] gtable_0.2.0 codetools_0.2-15 memoise_1.0.0 parallelMap_1.3 SparseM_1.76 httpuv_1.3.3 quantreg_5.29 pbkrtest_0.4-6
[25] xts_0.9.874 Rcpp_0.12.9 xtable_1.8-2 scales_0.4.1 backports_1.0.5 mime_0.5 lme4_1.1-12 ggvis_0.4.3
[33] digest_0.6.12 stringi_1.1.2 BBmisc_1.11 dplyr_0.5.0 shiny_1.0.0 grid_3.3.2 tools_3.3.2 magrittr_1.5
[41] lazyeval_0.2.0 tibble_1.2 crayon_1.3.2 car_2.1-4 MASS_7.3-45 Matrix_1.2-8 data.table_1.10.4 lubridate_1.6.0
[49] rstudioapi_0.6 assertthat_0.1 minqa_1.2.4 R6_2.2.0 compiler_3.3.2 nnet_7.3-12 nlme_3.1-131
PS: @topepo I'm the guy who, a while back, found your 'answer to life, the universe, and everything' joke in Applied Predictive Modeling and posted it to Reddit. I ended up working on mlr for my master's thesis and did some nice things with it for forecasting, which you can find here. I think it would be nice to have more communication between the two packages' developers, since they handle a lot of similar tasks.
There is a new branch of the package ("recipe_integration") that I hope to merge within two weeks; it uses the recipes package, which can help here. An example of how the code works can be found here. If you take a look, please let me know if you see any issues by posting something on the project page.
Very cool! Random question: would recipes allow post-processing on the predictions?
No, it is mostly for preprocessing. I'll have a whole other package for that type of calibration (it's on my list).
I'm currently testing out a new performance criterion, and some of the variables I need cannot currently be accessed by my custom summary functions, since evalSummaryFunction() does not allow ...
This change would allow train() (line 510) to be something like ... by changing evalSummaryFunction() in misc (line 31) to ... Since train() allows ..., this would allow extra parameters to be passed to the summary function.
Thank you for your time,
Steve