Open MatthieuStigler opened 2 months ago
Hi @MatthieuStigler
From the docs, group_initial_split() "creates splits of the data based on some grouping variable, so that all data in a 'group' is assigned to the same split", while you are "trying to get each group at least once in the test sample."
Since groups as a whole get allotted to training or testing, they can't be all represented in the test set, otherwise there would be no observations left for the training set.
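To make that concrete, here is a minimal base-R sketch of what assigning whole groups means. It is a hand-rolled illustration, not rsample's implementation: the name test_groups and the choice of holding out exactly one randomly picked group are assumptions made for the example.

```r
set.seed(123)

# Toy data: four groups, one of them small
dat <- data.frame(
  group = sample(LETTERS[1:4], size = 1000, replace = TRUE,
                 prob = c(0.3, 0.3, 0.3, 0.1)),
  x = rnorm(1000)
)

# Hypothetical grouped split: hold out one whole group, picked at random
test_groups <- sample(unique(dat$group), size = 1)
dat_test  <- dat[dat$group %in% test_groups, ]
dat_train <- dat[!dat$group %in% test_groups, ]

# Groups present on both sides of the split (there are none by construction)
shared <- intersect(unique(dat_train$group), unique(dat_test$group))
length(shared)
#> [1] 0
```

Because every group lands entirely on one side, putting all four groups into the test set would leave nothing for training.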
Stratification (as opposed to grouped resampling) aims to ensure that the proportion of each group is the same in the training and testing set as it is in the full dataset. So if you have a small group and want a training and testing set which both contain all groups, including that small group, stratification is typically what you want to use. This can be done with the strata argument for initial_split(), see example below.
Does this help?
library(rsample)
set.seed(123)
dat <- data.frame(group = sample(LETTERS[1:4], prob = c(0.3, 0.3, 0.3, 0.1), replace = TRUE, size=1000),
x = rnorm(1000))
# proportion of each group in the data
table(dat$group) / nrow(dat)
#>
#> A B C D
#> 0.296 0.301 0.311 0.092
dat_split <- initial_split(dat, strata = "group", prop = 0.75)
dat_train <- training(dat_split)
dat_test <- testing(dat_split)
# preserved proportions
table(dat_train$group) / nrow(dat_train)
#>
#> A B C D
#> 0.29906542 0.29773031 0.30841121 0.09479306
table(dat_test$group) / nrow(dat_test)
#>
#> A B C D
#> 0.28685259 0.31075697 0.31872510 0.08366534
# what the prop argument does
nrow(dat_train) / nrow(dat)
#> [1] 0.749
Created on 2024-09-12 with reprex v2.1.0
Hi @hfrick
Thanks a lot for the answer. Sorry, that last statement was a bit misleading (I meant that, when running this K times, I want one group in the test sample each time), so I removed that part.
The main question remains: how come, with one group at frequency 0.1, setting prop=0.9 fails consistently (instead of assigning the 10% group to the test sample)?
Thanks!
Ah, I see. Thanks for clarifying!
I would say this could be loosely answered with "the error happens because we are sampling, not optimizing". In your example, we have 4 groups, with one group about the size of the test set. So a grouped split with prop = 0.9 only works if we assign that smallest group, D, to the test set. But we have 4 groups to choose from, so it should fail in about 3/4 of the attempts.
If you increase the number of attempts in your last illustration, you should be able to see it move towards 0.75.
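The 3/4 figure can be checked with a small base-R simulation. This is a simplified model of "sampling, not optimizing" and an assumption on my part, not rsample's actual code: the helper split_fails() is hypothetical, and it supposes that groups are shuffled, added to the training set one by one, and cut off at the cumulative share closest to prop.

```r
set.seed(123)

# Group shares from the example: three large groups and one small one
shares <- c(A = 0.3, B = 0.3, C = 0.3, D = 0.1)
prop <- 0.9

# Hypothetical model of a single grouped split:
# shuffle the groups, accumulate their shares, and keep as many groups in
# training as brings the cumulative share closest to `prop`.
split_fails <- function(shares, prop) {
  shuffled <- sample(shares)
  cum_share <- cumsum(shuffled)
  n_train <- which.min(abs(cum_share - prop))
  # Failure: every group went to training, so the test set is empty
  n_train == length(shares)
}

# Share of failed attempts over many tries; should be close to 0.75
fails <- replicate(10000, split_fails(shares, prop))
mean(fails)
```

Under this model the split only succeeds when D happens to come last in the shuffle, which is 1 ordering position out of 4, so the failure rate settles near 0.75 as the number of attempts grows.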
The problem
Summary: group_initial_split() fails often with small-frequency groups even if adjusting prop to reflect the small-frequency group?

I'm using group_initial_split() with a small number (4) of groups. As I have one group with low frequency (10%), my intuition was that by setting prop=0.9, this group would be selected within the training sample. However, I very often (around 70% of the time) get error messages such as:

How come this happens even if I adjusted prop? This fails even if I use the exact proportion of the group (1-freq(small_group))!? Am I misunderstanding the prop argument?

Thanks!
Reproducible example
Created on 2024-09-08 with reprex v2.1.1