FredrikKarlssonSpeech closed this issue 6 years ago
It occurred to me that what I am actually asking is for the cross-validation functions to honor grouping factors. So that

mydata %>% group_by(ParticipantGroup) %>% crossv_kfold(k = 5) -> fold

would create folds that retain the balance present in the original data set. So, if 60% of the data is in group A and 40% in group B in the full data set, each fold would also have this composition.
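For illustration, here is a small sketch with made-up data showing that plain crossv_kfold() splits on rows only, so each fold's A/B composition is left to chance:

library(modelr)
library(dplyr)
library(purrr)

# Made-up example data: 60% of rows in group A, 40% in group B.
mydata <- tibble(
  ParticipantGroup = rep(c("A", "B"), times = c(60, 40)),
  y = rnorm(100)
)

# crossv_kfold() ignores grouping, so inspect each test fold's
# group composition to see the drift:
fold <- crossv_kfold(mydata, k = 5)
fold$test %>% map(~ table(as.data.frame(.x)$ParticipantGroup))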
I've been thinking about this too for bootstrapping.
What if you try the following?

mydata %>% group_by(ParticipantGroup) %>% tidyr::nest() %>% crossv_kfold(k = 5) -> fold

You could then unnest(data) at the appropriate points in your process to "unpack" each participant group's data.
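Roughly, as a sketch (assuming the nesting yields enough rows, e.g. one per participant rather than one per two-level group, for k = 5 folds to make sense):

library(dplyr)
library(tidyr)
library(modelr)

# Nest so that each row of the nested frame carries one group's data:
mydata %>% group_by(ParticipantGroup) %>% nest() -> nested
nested %>% crossv_kfold(k = 5) -> fold

# Later, materialise a resample and unnest() to recover the
# row-level data for modelling:
fold$train[[1]] %>% as.data.frame() %>% unnest(data) -> train1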
This is out of scope for modelr; try more comprehensive resampling tools like https://topepo.github.io/rsample/
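For example, rsample's vfold_cv() supports stratified folds directly via its strata argument:

library(rsample)

# Stratified 5-fold CV: each fold keeps roughly the full data's
# group composition.
folds <- vfold_cv(mydata, v = 5, strata = ParticipantGroup)

# analysis()/assessment() extract one fold's training and test data:
train1 <- analysis(folds$splits[[1]])
test1 <- assessment(folds$splits[[1]])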
Hi,
It seems that it would be advantageous to have the option to force resample() (and the cross-validation functions) to produce balanced training and test data sets.

If I have an imbalance in the number of participants in two groups (ParticipantGroup = c("A", "B")), I would like to be able to do something like this:

mydata %>% crossv_kfold(k = 5, ParticipantGroup) -> fold

so that each of folds 1-5 has the same relative number of participants in its training and evaluation subsets.
Splitting per group and recombining,

mydata %>% dplyr::filter(ParticipantGroup == "A") %>% crossv_kfold(k = 5) -> A_fold
mydata %>% dplyr::filter(ParticipantGroup == "B") %>% crossv_kfold(k = 5) -> B_fold
fold <- A_fold %>% rbind(B_fold)

does not work, because the resample objects store row indices: after filtering, each fold's indices refer to its own subset rather than the full data set, so the stacked folds cannot be combined into a single cross-validation scheme.
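A sketch of a workaround that keeps the indices valid: assign fold numbers within each group, then build the resample objects against the full data set. (crossv_kfold_stratified is a hypothetical helper here, not part of modelr.)

library(modelr)
library(dplyr)
library(purrr)

# Hypothetical stratified k-fold: balanced fold labels within each
# group level, with resample objects indexed into the full data.
crossv_kfold_stratified <- function(data, k, group_var) {
  g <- data[[group_var]]
  # Shuffled, balanced fold labels within each group, reassembled
  # into the original row order:
  fold_id <- unsplit(
    lapply(split(seq_along(g), g),
           function(i) sample(rep_len(seq_len(k), length(i)))),
    g
  )
  tibble(
    train = map(seq_len(k), ~ resample(data, which(fold_id != .x))),
    test = map(seq_len(k), ~ resample(data, which(fold_id == .x))),
    .id = as.character(seq_len(k))
  )
}

fold <- crossv_kfold_stratified(mydata, k = 5, "ParticipantGroup")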