Stratify folds by sparse outcome - question

Ofran-a commented 1 year ago

Hello, the outcome in my dataset is binary and quite sparse: I have 2,700 positive outcomes in a dataset of 44,500 observations with 28 covariates. Is there a way to make sure that the cross-validation folds are stratified by the outcome, in the same way as can be done in glmnet? For some of the learners I was able to add the option stratify_cv = TRUE, as below:

Lrnr_glmnet$new(stratify_cv = TRUE, family = "binomial", alpha = 1, use_min = TRUE)

When I extract the predictions from the tmle fit object and then convert them into classes, I get only ~500 positive outcomes, and the rest are classified as negative.

Many thanks, Ofran

rachaelvp commented 1 year ago

Yes. Stratified CV is done automatically for binary and categorical outcomes in the devel branch of sl3 (which will eventually be merged to master), so if you install that branch, you don’t need to do anything else to achieve this. Otherwise, you can create the folds yourself with the make_folds function from the origami R package, passing the variable you want to stratify by (i.e., your outcome) as its strata_ids argument. You can then pass the returned object to the folds argument of make_sl3_Task.
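For reference, the manual route could look roughly like the sketch below. It is only an illustration under stated assumptions: the data are simulated, and the names dat, W1, W2, and Y are hypothetical stand-ins for your own dataset and columns. The sl3 step runs only if sl3 is installed.

```r
library(origami)

set.seed(1)
n <- 1000

# Simulated stand-in for the real data (hypothetical names):
# two covariates and a sparse binary outcome with roughly 6% positives
dat <- data.frame(
  W1 = rnorm(n),
  W2 = rnorm(n),
  Y  = rbinom(n, 1, 0.06)
)

# 10-fold cross-validation, stratified on the binary outcome
folds <- make_folds(
  n = nrow(dat),
  fold_fun = folds_vfold,
  V = 10,
  strata_ids = dat$Y
)

# Sanity check: number of positive outcomes in each validation set
pos_per_fold <- sapply(folds, function(f) sum(dat$Y[f$validation_set]))
print(pos_per_fold)

# Hand the stratified folds to sl3 when building the task
if (requireNamespace("sl3", quietly = TRUE)) {
  task <- sl3::make_sl3_Task(
    data = dat,
    covariates = c("W1", "W2"),
    outcome = "Y",
    folds = folds
  )
}
```

Printing pos_per_fold is a quick way to confirm that the positives are spread evenly across the validation sets before fitting anything.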

Ofran-a commented 1 year ago

Thank you so much :)

rachaelvp commented 1 year ago

My pleasure! This is a common Q, and now we have an answer to refer to when it comes up again. Thanks for filing the issue ☺️

tlverse / sl3

Stratify folds by sparse outcome - question #405