jjc2718 opened this issue 4 years ago
After thinking about this, I'm not planning to stratify cross-validation folds by label. If there are so few positive labels that some splits contain only one class by chance (e.g. the TP53/OV case described above), we're unlikely to be able to train effective models anyway, given the extreme label imbalance.
In general, I think there are downsides to stratifying by label (see, e.g., this CrossValidated post or this one). I want our cross-validation to be as representative as possible of external datasets (some of which may have different label proportions than TCGA), and generating CV folds randomly many times seems like a better way to evaluate generalization than forcing every test split to have the same label proportion.
I may revisit this in the future, but closing for now.
Reopening this in light of https://github.com/greenelab/pancancer-evaluation/pull/31#discussion_r508034341. I think stratification by label may be the best solution to the issue described there - need to think about it a bit.
If labels are highly imbalanced (for example, TP53 in ovarian cancer), ROC AUC is undefined for some cross-validation splits, because those splits will contain only one class.
Maybe using StratifiedKFold instead of standard k-fold CV is the best solution here?
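A minimal sketch of the difference, using a toy dataset (the counts here are made up to illustrate the failure mode, not taken from TCGA): with extreme imbalance and unshuffled data, plain `KFold` can put every positive into one fold, leaving the other test folds single-class, while `StratifiedKFold` preserves the label proportion in every fold.

```python
# Sketch: single-class test folds under plain KFold vs. StratifiedKFold.
# The 10-positive / 90-negative split is a hypothetical example.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((100, 2))                 # dummy features
y = np.array([1] * 10 + [0] * 90)      # extreme imbalance, unshuffled

# Plain 5-fold CV on unshuffled data splits into consecutive chunks,
# so the first test fold captures all 10 positives and the rest have none.
kf_counts = [int(y[test].sum())
             for _, test in KFold(n_splits=5).split(X)]

# Stratified 5-fold CV keeps ~10% positives in every test fold (2 of 20).
skf_counts = [int(y[test].sum())
              for _, test in StratifiedKFold(n_splits=5).split(X, y)]
```

Any fold with zero positives makes ROC AUC undefined for that split, which is exactly the TP53/OV failure described above.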