mlr-org / mlr3

mlr3: Machine Learning in R - next generation
https://mlr3.mlr-org.com
GNU Lesser General Public License v3.0
Resampling: grouping and stratification in mlr3 #925

Closed: NicoD1995 closed this issue 1 year ago

NicoD1995 commented 1 year ago

Hi everybody, I'm quite new to mlr3 and am using the great mlr3 book to learn it. Here is my problem: does anyone have a solution for grouping and stratifying the same data at once? This is very common with medical data, because we have many patients with repeated measures and imbalanced classes (by a factor of 3-4 in my field). Here is a small example of my data set (it's just an example; my real data has 830 features and 1100 rows).

structure(list(
  PatientID = c("P1", "P1", "P1", "P1", "P1", "P1", "P2", "P2", "P3", "P4", "P5", "P5", "P5", "P5", "P5", "P6", "P6", "P6"),
  LesionResponse = structure(c(2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
  F1 = c(1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 1.25, 0.625, 0.625, 1.25, 1.25, 1.25),
  F2 = c(1, 5, 3, 2, 1, 1, 6, 9, 0, 5, 0, 4, 4, 4, 5, 2, 1, 1),
  F3 = c(0, 4, 3, 1, 1, 0, 3, 8, 4, 5, 0, 4, 4, 3, 5, 2, 0, 0),
  F4 = c(0, 9, 0, 7, 4, 0, 3, 8, 4, 5, 9, 1, 1, 3, 5, 3, 9, 0)
), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L), class = "data.frame")

I found a workaround using a custom resampling: a function splits my data into three parts, with 60/20/20 proportions for the training, validation and test sets, and a loop then checks the proportion of my majority class. If it falls outside the desired interval, the split is redone. But this takes a very long time, and I'm looking for a faster way to do it.
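For illustration, the retry loop described above might look like the following base R sketch. The function name split_patients, the tolerance tol and the cap max_tries are assumptions made for this example, not code from the original post. Patients are assigned to sets at random, so the 60/20/20 proportions hold only approximately, and at the patient level rather than the row level.

```r
set.seed(1)

# Patient IDs and labels copied from the example data above (features omitted).
patient <- rep(c("P1", "P2", "P3", "P4", "P5", "P6"), times = c(6, 2, 1, 1, 5, 3))
response <- factor(c(1, 0, 1, 1, 0, 1,  1, 1,  1,  1,  1, 0, 0, 1, 1,  1, 1, 0))

# Hypothetical helper: draw patient-level splits at random until the share of
# the majority class in every set is within `tol` of the overall share.
split_patients <- function(patient, y,
                           probs = c(train = 0.6, valid = 0.2, test = 0.2),
                           tol = 0.1, max_tries = 1000) {
  ids <- unique(patient)
  majority <- names(which.max(table(y)))   # label of the majority class
  target <- mean(y == majority)            # overall majority share
  for (i in seq_len(max_tries)) {
    # Assign each patient (not each row) to one of the three sets.
    part <- sample(names(probs), length(ids), replace = TRUE, prob = probs)
    set_of_row <- part[match(patient, ids)]
    # Majority share per set; an empty set yields NA and is rejected.
    share <- tapply(y == majority, factor(set_of_row, levels = names(probs)), mean)
    if (!anyNA(share) && all(abs(share - target) <= tol)) {
      return(set_of_row)
    }
  }
  stop("No acceptable split found within max_tries; consider a larger tol.")
}

sets <- split_patients(patient, response)
table(sets, response)
```

Because every draw is independent, the number of retries grows quickly as tol shrinks or the number of patients per set gets small, which is consistent with the slowness reported here.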

Thanks

be-marc commented 1 year ago

There is an implementation in sklearn. To implement that we would have to change the way we group and stratify internally. I would say that is a special case that is better solved with ResamplingCustom. @NicoD1995 Maybe the code from sklearn will help you. If you have translated the code to R, feel free to post it here. Maybe we can make a gallery post out of it.
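As a possible starting point for such a translation, here is a minimal base R sketch of a greedy group-level fold assignment. It is loosely in the spirit of sklearn's StratifiedGroupKFold but is not a port of it. The helper assign_group_folds and its scoring rule are assumptions for this example. Whole patients are assigned to folds, largest first, each going to the fold that keeps the per-class proportions most even. If mlr3 is installed, the resulting row indices are then handed to rsmp("custom").

```r
# Patient IDs and labels copied from the example data in this thread (features omitted).
df <- data.frame(
  PatientID = rep(c("P1", "P2", "P3", "P4", "P5", "P6"), times = c(6, 2, 1, 1, 5, 3)),
  LesionResponse = factor(c(1, 0, 1, 1, 0, 1,  1, 1,  1,  1,  1, 0, 0, 1, 1,  1, 1, 0))
)

# Hypothetical helper: returns one fold number per row, keeping groups intact.
# Assumes k >= 2.
assign_group_folds <- function(groups, y, k) {
  counts <- unclass(table(groups, y))                    # rows: groups, cols: classes
  counts <- counts[, colSums(counts) > 0, drop = FALSE]  # drop unused classes
  totals <- colSums(counts)
  fold_counts <- matrix(0, nrow = k, ncol = ncol(counts))
  fold_of_group <- setNames(integer(nrow(counts)), rownames(counts))
  # Visit groups from largest to smallest.
  for (g in order(rowSums(counts), decreasing = TRUE)) {
    # Score each candidate fold by how unevenly the classes would be spread.
    scores <- vapply(seq_len(k), function(f) {
      cand <- fold_counts
      cand[f, ] <- cand[f, ] + counts[g, ]
      sum(apply(sweep(cand, 2, totals, "/"), 2, sd))
    }, numeric(1))
    best <- which.min(scores)
    fold_counts[best, ] <- fold_counts[best, ] + counts[g, ]
    fold_of_group[g] <- best
  }
  unname(fold_of_group[as.character(groups)])
}

fold <- assign_group_folds(df$PatientID, df$LesionResponse, k = 3)

# Optional: pass the folds to mlr3 as a custom resampling.
if (requireNamespace("mlr3", quietly = TRUE)) {
  # In real use, PatientID should not remain a feature of the task.
  task <- mlr3::as_task_classif(df, target = "LesionResponse")
  custom <- mlr3::rsmp("custom")
  custom$instantiate(
    task,
    train_sets = lapply(1:3, function(f) which(fold != f)),
    test_sets  = lapply(1:3, function(f) which(fold == f))
  )
}
```

By construction each patient lands in exactly one fold; table(fold, df$LesionResponse) shows how evenly the two classes end up spread across folds.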

tdhock commented 6 months ago

Hi, I have a similar issue, which I solved by defining my own Resampling class, based on https://github.com/mlr-org/mlr3/blob/main/R/Resampling.R (my fix was to remove the stop/error "Cannot combine stratification with grouping").

YoonGeonWook commented 3 months ago

> hi, I have a similar issue, which I solved by defining my own Resampling class, based on https://github.com/mlr-org/mlr3/blob/main/R/Resampling.R (my fix was to remove the stop/error Cannot combine stratification with grouping)

Hi @tdhock, do you mean that you simply removed the if(!is.null(groups)) stopf("Cannot combine stratification with grouping") part from the existing Resampling.R code?

tdhock commented 3 months ago

right! another work-around would be to use mlr3resampling::ResamplingSameOtherSizesCV which supports both stratification and groups, see "Use with auto_tuner on a task with stratification and grouping" in https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/ResamplingSameOtherSizesCV.html

YoonGeonWook commented 3 months ago

Thanks so much, I'll read and try it out!