Open mkammer19 opened 4 years ago
Having the same problem, here is an example - I request 15 folds, and get less, and an inconstant number. Note that all groups in this example are of the same size:
set.seed(1)
x <- rep(letters[1:20], each = 20)
length(caret::groupKFold(x, k = 15))
#> [1] 12
length(caret::groupKFold(x, k = 15))
#> [1] 11
length(caret::groupKFold(x, k = 15))
#> [1] 8
length(caret::groupKFold(x, k = 15))
#> [1] 11
Created on 2020-06-17 by the reprex package (v0.3.0)
It seems that groupKFold
creates folds under some constraints, and I suspect that this why this happens:
set.seed(1)
x <- rep(letters[1:20], each = 20)
folds <- caret::groupKFold(x, k = 15)
folds_i <- lapply(folds, function(i) unique(x[i]))
folds_i <- unname(unlist(folds_i))
table(folds_i)
#> folds_i
#> a b c d e f g h i j k l m n o p q r s t
#> 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
Created on 2020-06-17 by the reprex package (v0.3.0)
In the docs, point 3 can be seen here:
For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits.
Perhaps these constrains can be controlled by the user? If for example number of folds in more important than group balance?
Thanks!
@mattansb I'm not sure reason 3 could be the cause of this odd behavior, because even with k
set to the number of groups in the data (i.e., the default value, and like what @mkammer19 did) the number of folds produced can be less than k
. Also, the class distribution is not involved in groupKFold
(y
is not passed to the function).
I think you could be right about reason 2 being the cause though. It seems possible, but I haven't tested it.
In any case, at the very least I would consider this a bug because it doesn't generate a warning to the user when fewer than k
folds are generated. I spent hours trying to figure out why my data apparently had many fewer groups in it than I thought, only to realize my data were fine and groupKFold
was simply ignoring k
.
In the meantime I'm using this code for leave one group out:
leave_one_group_out <- function(group_list) {
result <- list()
fold_i <- 0
for (left_out_group in unique(group_list)) {
fold_i <- fold_i + 1
result[[fold_i]] <- which(group_list != left_out_group, arr.ind = TRUE)
}
return(result)
}
In my - possibly falsy - comprehension, something weird appears in the call of the sample function inside groupKFold:
function (group, k = length(unique(group)))
{
g_unique <- unique(group)
m <- length(g_unique)
if (k > m) {
stop("`k` should be less than ", m)
}
g_folds <- sample(k, size = m, replace = TRUE)
out <- split(seq_along(group), g_folds[match(group, g_unique)])
names(out) <- paste0("Fold", gsub(" ", "0",
format(seq_along(out))))
lapply(out, function(z) seq_along(group)[-z])
}
Why do not use the sample function that way: (The right number of folds would be returned at each function call, as expected)
function (group, k = length(unique(group)))
{
g_unique <- unique(group)
m <- length(g_unique)
if (k > m) {
stop("`k` should be less than ", m)
}
g_folds <- sample(m, size = k, replace = FALSE)
out <- split(seq_along(group), g_folds[match(group, g_unique)])
names(out) <- paste0("Fold", gsub(" ", "0",
format(seq_along(out))))
lapply(out, function(z) seq_along(group)[-z])
}
Furthermore, in my opinion, I would turn the error message into:
stop("`k` should be less than or equal to", m)
instead of:
stop("`k` should be less than ", m)
Anthony
I encountered the same problem. I think @AnthonyTedde is right - it is due to the sample() function inside groupKFold.
Say m = 30, k = 10. The current code selects 30 elements from range 1:10, so if replace = TRUE
, some numbers in 1:10 may never be chosen or have lower frequencies. (So there may be less than k folds and each fold not of the same size)
In order to set replace = False
, I created a vector pre_group
specifying that each number in 1:10 will appear the same times on the condition that k is a divisor of m. This solved my problem.
function (group, k = length(unique(group)))
{
g_unique <- unique(group)
m <- length(g_unique)
if (k > m) {
stop("`k` should be less than ", m)
}
pre_group <- rep(1:k, each=m/k)
g_folds <- sample(pre_group, size = m, replace = FALSE)
out <- split(seq_along(group), g_folds[match(group, g_unique)])
names(out) <- paste0("Fold", gsub(" ", "0",
format(seq_along(out))))
lapply(out, function(z) seq_along(group)[-z])
}
This issue still is a problem, I think simply changing the sample to replace = FALSE solves the problem
I used other answers to code functions for group K fold COMPLETE ( it means gives all splits possible and all combinations in these):
library(caret)
leave_one_group_out <- function(group_list) {
result <- list()
fold_i <- 0
for (left_out_group in unique(group_list)) {
fold_i <- fold_i + 1
result[[fold_i]] <- which(group_list != left_out_group, arr.ind = TRUE)
}
return(result)
}
require(combinat)
cv_Kfold <- function(group_list, k) {
result <- list()
fold_i <- 0
k_fold = floor(( 1 - 1/k) * length(unique(group_list)))
matrix_comb = data.frame(combn(as.character(unique(group_list)), k_fold))
for(i in 1:ncol(matrix_comb)){
result_tmp =c()
for(j in unique(matrix_comb[,i])){
result_tmp = c(result_tmp, which(group_list == j, arr.ind = TRUE))
}
result[[i]] <- result_tmp
}
return(result)
}
Example:
group = my data frame index = cv_Kfold(group$group, k = k.fold) train.control = trainControl(method="cv", number = k.fold, index = index) model <- train(Fit ~ . , data = group, method = "lm", trControl = train.control) model
This still appears to be an issue 2 years later? If groupKFold works as expected, the documentation is not helpful. My initial take is that the function does not implement grouped k-fold cross validation in any way that I am familiar with.
For example, under any definition of grouped cross validation, how would this happen?:
> groupKFold(c(1,1,2,2,2,3,3))
$Fold1
integer(0)
IMO, grouped k-fold, should just do k-fold cross validation on the groups. Something like:
gkf <- function(g, k = length(unique(g))){
lapply(createFolds(unique(g), k=k),
function(x) unlist(lapply(x, function(i) which(g==i))))
}
g <- c(1,1,2,2,2,3,3)
gkf(g)
Am I missing something?
I am still encountering this problem in caret v 6.0-93.
I get what I expected- where the number of folds is the number I requested- using an 'old' version of this function found in #540 and copied here (function from topepo):
group_cv <- function(x, k = length(unique(x))) {
dat <- data.frame(index = seq(along = x), group = x)
groups <- data.frame(group = unique(dat$group))
group_folds <- createFolds(groups$group, returnTrain = TRUE, k = k)
group_folds <- lapply(group_folds, function(x, y) y[x,,drop = FALSE], y = groups)
dat_folds <- lapply(group_folds, function(x, y) merge(x, y), y = dat)
lapply(dat_folds, function(x) sort(x$index))
}
groups <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
set.seed(242)
folds <- group_cv(groups)
Perhaps the change in #1108 is where things start to act differently. Regardless, thanks topepo for the group_cv
function as it accomplishes exactly what I was looking for (number of folds is what I requested).
Still encountering this problem in 6.0-94
, which caught me a somewhat off guard.
Also, I think it is inconsistent that createFolds
returns the holdout indices, while groupKFold
returns the modeling indices.
The function groupKFold in caret 6.0-86 does not return the requested number of folds, even though it is possible according to the number of groups. caret version 6.0-84 returns the correct number of folds.
Example:
R Version 4.0.1 Caret Version 6.0-86
require(caret) set.seed(1)
Vector with 10 groups group <- as.factor(sort(rep(seq(1, 10), 10)))
folds <- caret::groupKFold(group, k = 10)
Number of folds I expected: 10 folds with 1 group missing in each Number of folds I got: 7
length(folds)
Caret Version 6.0-84
require(caret) set.seed(1)
Vector with 10 groups
group <- as.factor(sort(rep(seq(1, 10), 10)))
folds <- caret::groupKFold(group, k = 10)
Number of folds I expected: 10 folds with 1 group missing in each Number of folds I got: 10
length(folds)