group_initial_split() with a small group fails even after adjusting the `prop` parameter?

MatthieuStigler commented 2 months ago

The problem

Summary: group_initial_split() often fails when one group has a low frequency, even when prop is adjusted to match that group's share.

I'm using group_initial_split() with a small number (4) of groups. As one group has a low frequency (10%), my intuition was that setting prop=0.9 would put this group in the test sample and the other three in the training sample. However, very often (about 80% of the time in the example below) I get error messages such as:

> Error in group_mc_cv():

> ! Some assessment sets contained zero rows

> ℹ Consider using a non-grouped resampling method

Why does this happen even though I adjusted prop? It fails even when I set prop to exactly the complement of the small group's share (1 - freq(small_group)). Am I misunderstanding the prop argument?

Thanks!

Reproducible example

library(rsample)
dat <- data.frame(group = sample(LETTERS[1:4], prob = c(0.3, 0.3, 0.3, 0.1), replace = TRUE, size=1000),
                  x = rnorm(1000))
table(dat$group)
#> 
#>   A   B   C   D 
#> 340 270 298  92

set.seed(123)
dat_split <- group_initial_split(dat, group, prop=0.9)
#> Error in `group_mc_cv()`:
#> ! Some assessment sets contained zero rows
#> ℹ Consider using a non-grouped resampling method

# This fails about 80% of the time:
set.seed(1234)
mean(sapply(1:100, \(x) inherits(try(group_initial_split(dat, group, prop=0.9), silent = TRUE), "try-error")))
#> [1] 0.79

Created on 2024-09-08 with reprex v2.1.1

hfrick commented 2 months ago

Hi @MatthieuStigler

From the docs:

group_initial_split() creates splits of the data based on some grouping variable, so that all data in a "group" is assigned to the same split.

while you are

trying to get each group at least once in the test sample.

Since groups as a whole get allotted to training or testing, they can't be all represented in the test set, otherwise there would be no observations left for the training set.
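As a rough illustration of that property, the sketch below uses a hypothetical toy data set (20 equally sized groups, so the split is unlikely to hit the error discussed in this issue) and checks that no group appears on both sides of a grouped split. The data and the object names `toy`, `toy_split`, `train_groups` and `test_groups` are assumptions made for this example only.

```r
library(rsample)

set.seed(1)
# Hypothetical toy data: 20 groups of 50 rows each (not from the issue)
toy <- data.frame(
  group = rep(sprintf("g%02d", 1:20), each = 50),
  x = rnorm(1000)
)

toy_split <- group_initial_split(toy, group, prop = 0.75)

# Groups that ended up on each side of the split
train_groups <- unique(training(toy_split)$group)
test_groups <- unique(testing(toy_split)$group)

# Whole groups are assigned to one side, so the overlap should be empty
intersect(train_groups, test_groups)
```

With many groups there are plenty of combinations that land close to the requested `prop`; with only four groups there are very few.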

Stratification (as opposed to grouped resampling) aims to ensure that the proportion of each group is the same in the training and testing set as it is in the full dataset. So if you have a small group and want a training and testing set which both contain all groups, including that small group, stratification is typically what you want to use. This can be done with the strata argument for initial_split(), see example below.

Does this help?

library(rsample)

set.seed(123)
dat <- data.frame(group = sample(LETTERS[1:4], prob = c(0.3, 0.3, 0.3, 0.1), replace = TRUE, size=1000),
                  x = rnorm(1000))
# proportion of each group in the data
table(dat$group) / nrow(dat)
#> 
#>     A     B     C     D 
#> 0.296 0.301 0.311 0.092

dat_split <- initial_split(dat, strata = "group", prop = 0.75)
dat_train <- training(dat_split)
dat_test <- testing(dat_split)

# preserved proportions
table(dat_train$group) / nrow(dat_train)
#> 
#>          A          B          C          D 
#> 0.29906542 0.29773031 0.30841121 0.09479306
table(dat_test$group) / nrow(dat_test)
#> 
#>          A          B          C          D 
#> 0.28685259 0.31075697 0.31872510 0.08366534

# what the prop argument does
nrow(dat_train) / nrow(dat)
#> [1] 0.749

Created on 2024-09-12 with reprex v2.1.0

MatthieuStigler commented 2 months ago

Hi @hfrick

Thanks a lot for the answer. Sorry, that last statement was a bit misleading (I meant that, when running K times, I want one group in the test sample each time), so I removed that part.

The main question remains: with one group at frequency 0.1, why does setting prop=0.9 fail so consistently, instead of assigning the 10% group to the test sample?

Thanks!

hfrick commented 2 months ago

Ah, I see. Thanks for clarifying!

I would say this could be loosely answered with "the error happens because we are sampling, not optimizing". In your example, we have 4 groups, one of which is about the size of the test set. So a grouped split with prop = 0.9 only works if we assign that smallest group, D, to the test set. But we have 4 groups to choose from, so it should fail in 3/4 of the attempts.

If you increase the number of attempts in your last illustration, you should be able to see it move towards 0.75.
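A minimal sketch of that check, assuming the same data-generating setup as the reprex above; the helper `split_fails()`, the seed, and the choice of 1000 attempts are made up for this illustration. No expected output is shown because the exact rate depends on the seed and the rsample version.

```r
library(rsample)

set.seed(2024)
# Same setup as the reprex: four groups, one with roughly 10% of the rows
dat <- data.frame(
  group = sample(LETTERS[1:4], size = 1000, replace = TRUE,
                 prob = c(0.3, 0.3, 0.3, 0.1)),
  x = rnorm(1000)
)

# Hypothetical helper: TRUE when the grouped split throws an error
split_fails <- function() {
  res <- try(group_initial_split(dat, group, prop = 0.9), silent = TRUE)
  inherits(res, "try-error")
}

n_attempts <- 1000
fails <- replicate(n_attempts, split_fails())

# Observed failure rate, to compare against the 3/4 argument
fail_rate <- mean(fails)
fail_rate

# Rough binomial standard error of that estimate
sqrt(fail_rate * (1 - fail_rate) / n_attempts)
```

The standard error gives a sense of how far from 0.75 the estimate can sit by chance; with only 100 attempts it is roughly 0.04, which is compatible with the 0.79 reported earlier.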
