This is the first of (probably) several PRs setting up drug response prediction on CCLE data as a use case for our feature selection methods. Here, we took the resistant/sensitive cell line classifications from Iorio et al. 2016 for a few drugs, and we're trying to predict them from gene expression data. For now we're stratifying CV folds by cancer type (so each cancer type is represented in similar proportions in the train and test sets), which should be the "easy" case compared to holding out entire cancer types.
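To make the stratification concrete, here is a minimal sketch of assigning cell lines to folds so that each cancer type is spread across folds as evenly as possible. It is an illustration only, not the code in this PR: the function name `stratified_folds`, the fold count, and the toy cancer type labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(cancer_types, n_folds=4, seed=42):
    """Return a fold index per cell line, balancing cancer types across folds.

    Hypothetical helper for illustration; not the implementation in this PR.
    """
    rng = random.Random(seed)

    # Group cell line indices by cancer type.
    by_type = defaultdict(list)
    for idx, ctype in enumerate(cancer_types):
        by_type[ctype].append(idx)

    folds = [None] * len(cancer_types)
    for ctype in sorted(by_type):
        indices = by_type[ctype]
        rng.shuffle(indices)
        # Random starting fold, so small cancer types don't all land in fold 0.
        offset = rng.randrange(n_folds)
        # Deal the shuffled cell lines out to folds round-robin.
        for i, idx in enumerate(indices):
            folds[idx] = (i + offset) % n_folds
    return folds


# Toy labels (made up): 8 + 4 + 4 cell lines across three cancer types.
cancer_types = ["LUAD"] * 8 + ["BRCA"] * 4 + ["SKCM"] * 4
folds = stratified_folds(cancer_types, n_folds=4)

for k in range(4):
    held_out = sorted(cancer_types[i] for i, f in enumerate(folds) if f == k)
    print(f"fold {k}: {held_out}")
```

Holding out entire cancer types (the "hard" case) would instead put every cell line of a given type into the same fold.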
In general, we don't get great performance for the six drugs we're looking at: many AUPR values are near 0 or only slightly better.
These are pretty similar to the AUPR values reported in https://arxiv.org/abs/2208.14822 (see Table 5; they're using multi-omics data, so their results are slightly better than ours, but not by much). So this is a slightly harder problem than we expected, and honestly we don't see much separation between feature selection methods: most of them perform comparably to random features.
We'll have to think about whether this is the right problem, or whether regression on continuous drug response values (e.g. IC50) would be a better way to go. Our labels are pretty imbalanced here (the vast majority of cell lines are resistant to the vast majority of drugs), so that could be one issue.
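One thing to keep in mind when reading AUPR under this kind of imbalance: for a classifier that ranks cell lines at random, the expected AUPR is roughly the fraction of positive labels, so "near 0" may not be far from chance. The sketch below illustrates this with a small pure-Python average precision function. The numbers are made up (1000 cell lines, 50 sensitive), and treating "sensitive" as the positive class is an assumption, not something taken from this PR.

```python
import random


def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank cell lines from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Made-up imbalanced labels: 50 sensitive (1) out of 1000 cell lines.
n_lines, n_sensitive = 1000, 50
labels = [1] * n_sensitive + [0] * (n_lines - n_sensitive)
prevalence = n_sensitive / n_lines

# Average precision of random scores, averaged over repeated draws.
rng = random.Random(0)
random_aps = []
for _ in range(50):
    scores = [rng.random() for _ in labels]
    random_aps.append(average_precision(labels, scores))
mean_random_ap = sum(random_aps) / len(random_aps)

# A perfect ranking (scores equal to the labels) for comparison.
perfect_ap = average_precision(labels, labels)

print(f"prevalence:     {prevalence:.3f}")
print(f"random AUPR:    {mean_random_ap:.3f}")
print(f"perfect AUPR:   {perfect_ap:.3f}")
```

Comparing each drug's AUPR to its own positive rate, rather than to 0, would make it easier to tell how far above chance each feature selection method actually is.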