topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html
1.61k stars 634 forks source link

caret 6.0-86: Function groupKFold does not return the requested number of folds #1150

Open mkammer19 opened 4 years ago

mkammer19 commented 4 years ago

The function groupKFold in caret 6.0-86 does not return the requested number of folds, even though it is possible according to the number of groups. caret version 6.0-84 returns the correct number of folds.

Example:

R Version 4.0.1 Caret Version 6.0-86

require(caret) set.seed(1)

Vector with 10 groups group <- as.factor(sort(rep(seq(1, 10), 10)))

folds <- caret::groupKFold(group, k = 10)

Number of folds I expected: 10 folds with 1 group missing in each Number of folds I got: 7

length(folds)

Caret Version 6.0-84

require(caret) set.seed(1)

Vector with 10 groups

group <- as.factor(sort(rep(seq(1, 10), 10)))

folds <- caret::groupKFold(group, k = 10)

Number of folds I expected: 10 folds with 1 group missing in each Number of folds I got: 10

length(folds)

mattansb commented 4 years ago

Having the same problem, here is an example - I request 15 folds, and get less, and an inconstant number. Note that all groups in this example are of the same size:

set.seed(1)
x <- rep(letters[1:20], each = 20)
length(caret::groupKFold(x, k = 15))
#> [1] 12
length(caret::groupKFold(x, k = 15))
#> [1] 11
length(caret::groupKFold(x, k = 15))
#> [1] 8
length(caret::groupKFold(x, k = 15))
#> [1] 11

Created on 2020-06-17 by the reprex package (v0.3.0)

mattansb commented 4 years ago

It seems that groupKFold creates folds under some constraints, and I suspect that this why this happens:

  1. Number of folds => k
  2. Folds should be of roughly equal size (not sure about this one?)
  3. Each group should appear the same number of times across all folds:
set.seed(1)
x <- rep(letters[1:20], each = 20)
folds <- caret::groupKFold(x, k = 15)

folds_i <- lapply(folds, function(i) unique(x[i]))
folds_i <- unname(unlist(folds_i))
table(folds_i)
#> folds_i
#>  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t 
#> 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11

Created on 2020-06-17 by the reprex package (v0.3.0)

In the docs, point 3 can be seen here:

For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits.

Perhaps these constrains can be controlled by the user? If for example number of folds in more important than group balance?

Thanks!

pnb commented 4 years ago

@mattansb I'm not sure reason 3 could be the cause of this odd behavior, because even with k set to the number of groups in the data (i.e., the default value, and like what @mkammer19 did) the number of folds produced can be less than k. Also, the class distribution is not involved in groupKFold (y is not passed to the function).

I think you could be right about reason 2 being the cause though. It seems possible, but I haven't tested it.

In any case, at the very least I would consider this a bug because it doesn't generate a warning to the user when fewer than k folds are generated. I spent hours trying to figure out why my data apparently had many fewer groups in it than I thought, only to realize my data were fine and groupKFold was simply ignoring k.

In the meantime I'm using this code for leave one group out:

leave_one_group_out <- function(group_list) {
  result <- list()
  fold_i <- 0
  for (left_out_group in unique(group_list)) {
    fold_i <- fold_i + 1
    result[[fold_i]] <- which(group_list != left_out_group, arr.ind = TRUE)
  }
  return(result)
}
AnthonyTedde commented 4 years ago

In my - possibly falsy - comprehension, something weird appears in the call of the sample function inside groupKFold:

function (group, k = length(unique(group))) 
{
    g_unique <- unique(group)
    m <- length(g_unique)
    if (k > m) {
        stop("`k` should be less than ", m)
    }
    g_folds <- sample(k, size = m, replace = TRUE)
    out <- split(seq_along(group), g_folds[match(group, g_unique)])
    names(out) <- paste0("Fold", gsub(" ", "0", 
        format(seq_along(out))))
    lapply(out, function(z) seq_along(group)[-z])
}

Why do not use the sample function that way: (The right number of folds would be returned at each function call, as expected)

function (group, k = length(unique(group))) 
{
  g_unique <- unique(group)
  m <- length(g_unique)
  if (k > m) {
    stop("`k` should be less than ", m)
  }
  g_folds <- sample(m, size = k, replace = FALSE)
  out <- split(seq_along(group), g_folds[match(group, g_unique)])
  names(out) <- paste0("Fold", gsub(" ", "0", 
                                    format(seq_along(out))))
  lapply(out, function(z) seq_along(group)[-z])
}

Furthermore, in my opinion, I would turn the error message into:

stop("`k` should be less than or equal to", m)

instead of:

stop("`k` should be less than ", m)

Anthony

YuhaoGit commented 4 years ago

I encountered the same problem. I think @AnthonyTedde is right - it is due to the sample() function inside groupKFold.

Say m = 30, k = 10. The current code selects 30 elements from range 1:10, so if replace = TRUE, some numbers in 1:10 may never be chosen or have lower frequencies. (So there may be less than k folds and each fold not of the same size)

In order to set replace = False, I created a vector pre_group specifying that each number in 1:10 will appear the same times on the condition that k is a divisor of m. This solved my problem.

function (group, k = length(unique(group))) 
{
g_unique <- unique(group) 
m <- length(g_unique)
if (k > m) {
  stop("`k` should be less than ", m)
}

pre_group <- rep(1:k, each=m/k)
g_folds <- sample(pre_group, size = m, replace = FALSE)

out <- split(seq_along(group), g_folds[match(group, g_unique)])
names(out) <- paste0("Fold", gsub(" ", "0", 
                                  format(seq_along(out))))
lapply(out, function(z) seq_along(group)[-z])
}
fjones2222 commented 2 years ago

This issue still is a problem, I think simply changing the sample to replace = FALSE solves the problem

ThomasF59 commented 2 years ago

I used other answers to code functions for group K fold COMPLETE ( it means gives all splits possible and all combinations in these):


library(caret)
leave_one_group_out <- function(group_list) {
  result <- list()
  fold_i <- 0
  for (left_out_group in unique(group_list)) {
    fold_i <- fold_i + 1
    result[[fold_i]] <- which(group_list != left_out_group, arr.ind = TRUE)
  }
  return(result)
}
require(combinat)
cv_Kfold <- function(group_list, k) {
  result <- list()
  fold_i <- 0
  k_fold = floor(( 1 - 1/k) * length(unique(group_list)))
  matrix_comb = data.frame(combn(as.character(unique(group_list)), k_fold))
  for(i in 1:ncol(matrix_comb)){
    result_tmp =c()
    for(j in unique(matrix_comb[,i])){
      result_tmp = c(result_tmp, which(group_list == j, arr.ind = TRUE))
    }
    result[[i]] <- result_tmp
  }
  return(result)
}

Example: group = my data frame index = cv_Kfold(group$group, k = k.fold) train.control = trainControl(method="cv", number = k.fold, index = index) model <- train(Fit ~ . , data = group, method = "lm", trControl = train.control) model

ifellows commented 1 year ago

This still appears to be an issue 2 years later? If groupKFold works as expected, the documentation is not helpful. My initial take is that the function does not implement grouped k-fold cross validation in any way that I am familiar with.

For example, under any definition of grouped cross validation, how would this happen?:

> groupKFold(c(1,1,2,2,2,3,3))
$Fold1
integer(0)

IMO, grouped k-fold, should just do k-fold cross validation on the groups. Something like:

gkf <- function(g, k = length(unique(g))){
  lapply(createFolds(unique(g), k=k), 
         function(x) unlist(lapply(x, function(i) which(g==i))))
}

g <- c(1,1,2,2,2,3,3)
gkf(g)

Am I missing something?

mike-levine commented 1 year ago

I am still encountering this problem in caret v 6.0-93.

I get what I expected- where the number of folds is the number I requested- using an 'old' version of this function found in #540 and copied here (function from topepo):

group_cv <- function(x, k = length(unique(x))) {
  dat <- data.frame(index = seq(along = x), group = x)
  groups <- data.frame(group = unique(dat$group))
  group_folds <- createFolds(groups$group, returnTrain = TRUE, k = k)
  group_folds <- lapply(group_folds, function(x, y) y[x,,drop = FALSE], y = groups)
  dat_folds <- lapply(group_folds, function(x, y) merge(x, y), y = dat)
  lapply(dat_folds, function(x) sort(x$index))
}

groups <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

set.seed(242)
folds <- group_cv(groups)

Perhaps the change in #1108 is where things start to act differently. Regardless, thanks topepo for the group_cv function as it accomplishes exactly what I was looking for (number of folds is what I requested).

Tianmaru commented 1 month ago

Still encountering this problem in 6.0-94, which caught me a somewhat off guard.

Also, I think it is inconsistent that createFolds returns the holdout indices, while groupKFold returns the modeling indices.