greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Cross-validation for imbalanced label case #4

Open jjc2718 opened 3 years ago

jjc2718 commented 3 years ago

If labels are highly imbalanced (for example, TP53 in ovarian cancer), ROC can break because some cross-validation splits will contain only one class, and ROC AUC is undefined when either the positive or the negative class is absent.
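A minimal sketch of how the failure case arises, assuming scikit-learn's `KFold` and a made-up label vector (3 positives out of 100 samples; not real TCGA data). The helper `single_class_folds` is hypothetical and is not part of this repository. It only counts how many test folds end up with a single class, which is the situation where ROC cannot be computed.

```python
import numpy as np
from sklearn.model_selection import KFold


def single_class_folds(y, n_splits=5, seed=0):
    """Count test folds that contain only one class under plain k-fold."""
    y = np.asarray(y)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # KFold ignores the labels, so a placeholder array of the right
    # length is enough to generate the splits
    return sum(
        len(np.unique(y[test_idx])) < 2
        for _, test_idx in cv.split(np.zeros(len(y)))
    )


# hypothetical, highly imbalanced labels: 3 positives, 97 negatives
y = np.array([1] * 3 + [0] * 97)
n_bad = single_class_folds(y)
print(f"{n_bad} of 5 test folds contain a single class")
```

With only 3 positives spread over 5 folds, at least 2 test folds cannot contain a positive sample, whatever the random seed.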

Maybe using StratifiedKFold instead of standard k-fold CV is the best solution here?
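For reference, a small sketch of what stratifying by label would look like with scikit-learn's `StratifiedKFold`. The labels and the placeholder feature matrix are made up for illustration, and this is not the repository's code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical labels: 5 positives, 95 negatives
y = np.array([1] * 5 + [0] * 95)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# unlike KFold, split() takes the labels so class proportions are preserved
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
print(positives_per_fold)
```

Each test fold receives one of the five positives here. Note that stratification only guarantees both classes in every fold when the rarer class has at least `n_splits` members; with fewer, scikit-learn emits a warning and some folds still lack positives.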

jjc2718 commented 3 years ago

After thinking about this, I'm not planning to stratify cross-validation folds by label. I think if there are so few positive labels that splits only have one class sometimes by chance (e.g. the TP53/OV case described above), we're not going to be able to train effective models anyway due to the extreme label imbalance.

In general, I think there are downsides to stratifying by label (see, e.g. this CrossValidated post or this one). I want to make sure our cross-validation is as representative of external datasets as it can be (some of which may have different label proportions than TCGA), and generating CV folds randomly many times seems like a better way to evaluate generalization than forcing every test dataset to have the same label proportion.
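One way the "random folds, many times" idea could be sketched, assuming scikit-learn's `RepeatedKFold`, synthetic data from `make_classification`, and a logistic regression stand-in for the actual models. Skipping folds where a class is missing is an assumption made for this sketch, not something decided in this thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedKFold

# synthetic, imbalanced data (roughly 10% positives); not TCGA
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

# 4 unstratified folds, reshuffled 5 times, for 20 train/test splits in total
cv = RepeatedKFold(n_splits=4, n_repeats=5, random_state=0)
aucs, n_skipped = [], 0
for train_idx, test_idx in cv.split(X):
    # ROC AUC is undefined without both classes in the test set, and the
    # classifier cannot be fit without both classes in the training set
    if len(np.unique(y[test_idx])) < 2 or len(np.unique(y[train_idx])) < 2:
        n_skipped += 1
        continue
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"evaluated {len(aucs)} folds, skipped {n_skipped}")
```

Because the folds are not stratified, the positive fraction varies from fold to fold, so the spread of `aucs` reflects some of the label-proportion variability that might be seen on external datasets.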

I may revisit this in the future, but closing for now.

jjc2718 commented 3 years ago

Reopening this in light of https://github.com/greenelab/pancancer-evaluation/pull/31#discussion_r508034341 . I think stratification by label may be the best solution to the issue described there - need to think about it a bit.