bleutner opened this issue 8 years ago
That's a use case that I did not expect. I'll add some logic to the code to account for this.
Thanks. The use case is pre-defined, spatially clustered hold-out folds.
I was getting unexpected results and opened a Stack Overflow question. On investigation it looked like a bug, and I was about to report it when I saw that it is already an open bug.
This bug is probably in the function `make_resamples` in this file: https://github.com/topepo/caret/blob/master/pkg/caret/R/createDataPartition.R
Line 289 (the first marked line in the snippet below) checks whether `index` is `NULL`. If so, the variable `index` is created depending on the value of `method`; for `method = "cv"`, for example, this happens in line 297 via the function `createFolds` (the second marked line below). However, it is not checked first whether `indexOut` is `NULL`. The consequence is that even if `indexOut` is provided, `index` is still created using, e.g., `createFolds`.
```r
make_resamples <- function(ctrl_obj, outcome) {
  n <- length(outcome)
  if(is.null(ctrl_obj$index)) {   # line 289: only `index` is checked, not `indexOut`
    if(ctrl_obj$method == "custom")
      stop("'custom' resampling is appropriate when the trControl argument index is used",
           call. = FALSE)
    ctrl_obj$index <-
      switch(tolower(ctrl_obj$method),
             oob = NULL,
             none = list(seq(along = outcome)),
             apparent = list(all = seq(along = outcome)),
             alt_cv =, cv = createFolds(outcome, ctrl_obj$number, returnTrain = TRUE),  # line 297
             repeatedcv =, adaptive_cv = createMultiFolds(outcome, ctrl_obj$number, ctrl_obj$repeats),
             loocv = createFolds(outcome, n, returnTrain = TRUE),
             boot =, boot632 =, optimism_boot =, boot_all =,
             adaptive_boot = createResample(outcome, ctrl_obj$number),
             test = createDataPartition(outcome, 1, ctrl_obj$p),
             adaptive_lgocv =, lgocv = createDataPartition(outcome, ctrl_obj$number, ctrl_obj$p),
             timeslice = createTimeSlices(seq(along = outcome),
                                          initialWindow = ctrl_obj$initialWindow,
                                          horizon = ctrl_obj$horizon,
                                          fixedWindow = ctrl_obj$fixedWindow,
                                          skip = ctrl_obj$skip)$train,
             stop("Not a recognized resampling method.", call. = FALSE))
  } else {
    # ... (rest of the function omitted)
```
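For what it's worth, here is a minimal sketch of the kind of guard I have in mind (my own suggestion, not an actual caret patch): if `indexOut` is supplied but `index` is not, derive the training rows as the complement of each hold-out set instead of falling back to `createFolds`:

```r
## Hypothetical guard, to run before the switch() above (sketch only):
## if the user supplied indexOut but not index, build each training
## fold as the complement of the corresponding hold-out set.
if (is.null(ctrl_obj$index) && !is.null(ctrl_obj$indexOut)) {
  ctrl_obj$index <- lapply(ctrl_obj$indexOut,
                           function(out) setdiff(seq(along = outcome), out))
}
```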
This bug can lead to wrong results. I checked this by running `train` with `method = "glmnet"` on a random matrix while setting `indexOut`. My current workaround is not to use `indexOut` at all, but to use `index` instead whenever I want to fix the split into folds.
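A minimal sketch of that workaround (the data and fold assignment here are made up for illustration): supply the training rows via `index`, so `createFolds` is never invoked:

```r
library(caret)

## Made-up data: a random predictor matrix and a numeric outcome.
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- rnorm(100)

## Pre-defined hold-out folds, e.g. five spatial clusters of 20 rows each.
hold_out <- split(seq_along(y), rep(1:5, each = 20))

## Workaround: pass the complements as `index` (training rows per fold)
## instead of passing hold_out as `indexOut`.
training <- lapply(hold_out, function(out) setdiff(seq_along(y), out))
ctrl <- trainControl(method = "cv", index = training)

fit <- train(x, y, method = "glmnet", trControl = ctrl)
```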
When I specify the hold-out samples `indexOut` for cross-validation myself, the `index` values reported in the train object are wrong: they are created as if `indexOut` had not been provided at all. However, looking at the resampled predictions, the train workflow itself does not seem to be affected, thank goodness (I guess it relies on `indexOut` instead of `index`, right?).
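One way to see both symptoms at once (reusing the made-up data from the sketch above): `fit$control$index` shows the auto-generated training folds, while the saved predictions, split by resample, still follow the supplied hold-out sets:

```r
## Supply only indexOut; `number` must match the number of folds.
ctrl <- trainControl(method = "cv", number = length(hold_out),
                     indexOut = hold_out, savePredictions = "final")
fit  <- train(x, y, method = "glmnet", trControl = ctrl)

## Symptom: these training folds were built by createFolds(),
## ignoring the hold_out sets supplied above.
str(fit$control$index)

## Reassurance: the held-out predictions do follow indexOut.
## Compare visually: rows predicted per resample vs. supplied hold-outs.
str(split(fit$pred$rowIndex, fit$pred$Resample))
str(hold_out)
```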