custom holdout indexes: train reports wrong samples, calculations are fine

bleutner commented 8 years ago

When specifying the holdout samples indexOut for cross-validation myself, the index values reported in the train object are wrong. They are are created as if indexOut was not provided at all.

library(caret)
data <- data.frame(response = c("a","b"), x = rnorm(10), y = rnorm(10))
out  <- list(c(1:4), c(5:10))
set.seed(1)
m <- train(response~.,data = data, trControl = trainControl(method = "cv", number = 2, indexOut = out, savePredictions = "final"))
m$control$index      # --> wrong, should be 5:10 and 1:4 respectively
#> $Fold1
#> [1]  1  2  5  8 10
#> 
#> $Fold2
#> [1] 3 4 6 7 9

m$control$indexOut  # --> is correct
#> $Resample1
#> [1] 1 2 3 4
#> 
#> $Resample2
#> [1]  5  6  7  8  9 10

However, looking at the resampled predictions it looks like the train workflow itself is not affected, thank goodness (I guess it relies on ìndexOutinstead ofindex`, right?):

m$pred
#>  mtry pred obs rowIndex Resample
#> 1     2    a   a        1    Fold1
#> 2     2    b   b        2    Fold1
#> 3     2    b   a        3    Fold1
#> 4     2    b   b        4    Fold1
#> 5     2    a   a        5    Fold2
#> 6     2    b   b        6    Fold2
#> 7     2    a   a        7    Fold2
#> 8     2    b   b        8    Fold2
#> 9     2    a   a        9    Fold2
#> 10    2    b   b       10    Fold2

topepo commented 8 years ago

That's a use case that I did not expect. I'll add some logic to the code to account for this.

bleutner commented 8 years ago

thanks. The use-case is pre-defined, spatially clustered, hold-out folds.

lbcommer commented 7 years ago

I was obtaining unexpected results and opened a stackoverflow question. Investigating it looked like a bug and I was going to report when I have seen it is actually an open bug.

http://stackoverflow.com/questions/43214712/caret-and-custom-validation-set-with-indexout-produces-strange-result

maciejrosolowski commented 1 year ago

This bug is probably in the function "make_resamples" in the file: https://github.com/topepo/caret/blob/master/pkg/caret/R/createDataPartition.R

The line 289 (see the first bold line below) checks if index is NULL. If this is the case, the variable index is created depending on the value of method. For example, for method = "cv" this is done in line 297 using the function createFolds (see the second bold line below). But it is not ckecked if indexOut is NULL first. The consequence is that even if indexOut is provided, index will be created using, e. g. the function createFolds.

make_resamples <- function(ctrl_obj, outcome) { n <- length(outcome) if(is.null(ctrl_obj$index)) { if(ctrl_obj$method == "custom") stop("'custom' resampling is appropriate when the trControl argument index is used", call. = FALSE) ctrl_obj$index <- switch(tolower(ctrl_obj$method), oob = NULL, none = list(seq(along = outcome)), apparent = list(all = seq(along = outcome)), alt_cv =, cv = createFolds(outcome, ctrl_obj$number, returnTrain = TRUE), repeatedcv =, adaptive_cv = createMultiFolds(outcome, ctrl_obj$number, ctrl_obj$repeats), loocv = createFolds(outcome, n, returnTrain = TRUE), boot =, boot632 =, optimism_boot =, boot_all =, adaptive_boot = createResample(outcome, ctrl_obj$number), test = createDataPartition(outcome, 1, ctrl_obj$p), adaptive_lgocv =, lgocv = createDataPartition(outcome, ctrl_obj$number, ctrl_obj$p), timeslice = createTimeSlices(seq(along = outcome), initialWindow = ctrl_obj$initialWindow, horizon = ctrl_obj$horizon, fixedWindow = ctrl_obj$fixedWindow, skip = ctrl_obj$skip)$train, stop("Not a recognized resampling method.", call. = FALSE)) } else {

This bug can lead to false results. I checked this running train with method = "glmnet" on a random matrix and setting indexOut. My current workaround is not to use indexOut but use index instead, if I want to fix the split into folds.

topepo / caret

custom holdout indexes: train reports wrong samples, calculations are fine #348