tidyverse / modelr

Helper functions for modelling
https://modelr.tidyverse.org
GNU General Public License v3.0

An option to have balanced resampling across subgroups of data #59

Closed FredrikKarlssonSpeech closed 6 years ago

FredrikKarlssonSpeech commented 7 years ago

Hi,

It seems that it would be advantageous to have the option to force resample() (and the cross-validation functions) to produce balanced training and test data sets.

If I have an imbalance in the number of participants across two groups (ParticipantGroup = c("A", "B")), I would like to do something like this:

mydata %>% crossv_kfold(k=5, ParticipantGroup) -> fold

So that folds 1, 2, 3, 4 and 5 always have the same relative number of participants from each group in their training and evaluation subsets.

mydata %>% dplyr::filter(ParticipantGroup == "A") %>% crossv_kfold(k=5) -> A_fold
mydata %>% dplyr::filter(ParticipantGroup == "B") %>% crossv_kfold(k=5) -> B_fold

fold <- A_fold %>% rbind(B_fold)

does not work, as the resample objects only store indices into their own (filtered) source data frames.
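A possible workaround (a sketch only; the helper name is hypothetical and none of this is part of modelr) is to run crossv_kfold() within each group and then recombine the pieces that share a fold id, materialising the train/test sets as plain data frames with as.data.frame(), since the per-group resample objects cannot be combined directly:

library(dplyr)
library(modelr)
library(purrr)

# Hypothetical helper: build k folds within each level of `group`, then
# row-bind the train/test sets that share a fold id as plain data frames,
# so every fold keeps the group proportions of the original data.
stratified_kfold_df <- function(data, group, k = 5) {
  data %>%
    split(data[[group]]) %>%
    map(crossv_kfold, k = k) %>%
    bind_rows() %>%
    group_by(.id) %>%
    summarise(
      train = list(map_dfr(train, as.data.frame)),
      test  = list(map_dfr(test, as.data.frame))
    )
}

# folds <- stratified_kfold_df(mydata, "ParticipantGroup", k = 5)

The train and test columns then hold ordinary data frames rather than resample objects, which modelling functions such as lm() accept directly.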

FredrikKarlssonSpeech commented 7 years ago

It occurred to me that what I am actually saying is that it would be nice if the cross-validation functions would honor grouping factors.

So that

mydata %>% group_by(ParticipantGroup) %>% crossv_kfold(k=5) -> fold

would create folds that retain the group balance present in the original data set. So, if 60% of the data is in group A and 40% in group B in the full data set, each fold would also have this composition.
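A sketch of what such group-aware folds could look like, built on modelr's resample() constructor (the helper name crossv_kfold_grouped is hypothetical, not part of modelr): fold ids are assigned within each level of the grouping variable, and the train/test resample objects index into the full data set, so every fold keeps the original 60/40 composition.

library(modelr)
library(purrr)
library(tibble)

# Hypothetical helper: assign fold ids within each level of `group`, then
# build train/test resample objects that index into the full data frame.
crossv_kfold_grouped <- function(data, group, k = 5) {
  # Stratified fold assignment: each group's rows are spread evenly over k folds.
  fold_id <- ave(
    seq_len(nrow(data)),
    data[[group]],
    FUN = function(i) sample(rep(seq_len(k), length.out = length(i)))
  )
  map_df(seq_len(k), function(i) {
    tibble(
      train = list(resample(data, which(fold_id != i))),
      test  = list(resample(data, which(fold_id == i))),
      .id   = as.character(i)
    )
  })
}

# folds <- crossv_kfold_grouped(mydata, "ParticipantGroup", k = 5)

Unlike the split-and-rbind attempt above, the indices here all refer to the same data frame, so the resulting folds combine cleanly.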

danfredman commented 7 years ago

I've been thinking about this too for bootstrapping.

What if you try:

mydata %>% group_by(ParticipantGroup) %>% tidyr::nest() %>% crossv_kfold(k = 5) -> fold

You could then unnest(data) at the appropriate points in your process to "unpack" each participant group's data.

hadley commented 6 years ago

This is out of scope for modelr; try more comprehensive resampling tools like https://topepo.github.io/rsample/
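For reference, rsample's vfold_cv() has a strata argument that performs this kind of stratified partitioning; a minimal sketch, with toy data standing in for the mydata object used above:

library(rsample)
library(tibble)

# Toy data with the 60/40 imbalance discussed above.
mydata <- tibble(
  ParticipantGroup = rep(c("A", "B"), times = c(60, 40)),
  y = rnorm(100)
)

# Stratify the 5 folds on ParticipantGroup so each fold keeps roughly the
# original group proportions.
folds <- vfold_cv(mydata, v = 5, strata = ParticipantGroup)

# analysis()/assessment() extract the training and held-out parts of a split.
train_1 <- analysis(folds$splits[[1]])
test_1  <- assessment(folds$splits[[1]])
table(train_1$ParticipantGroup)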