Closed jjc2718 closed 1 year ago
ben-heil commented on 2022-09-09T21:18:53Z ----------------------------------------------------------------
It never ceases to amaze me how strong "use more genes" is as a feature selection technique. I know that e.g. median f test is better in some cases, but even in a dataset like this where everything performs roughly the same, going from 100 genes to 1000 doubles the mean AUPR.
jjc2718 commented on 2022-09-12T16:06:38Z ----------------------------------------------------------------
Yeah! I would expect it to taper off fairly quickly because there's so much redundancy/collinearity in gene expression, but it really doesn't, at least not here. Maybe I'll look at some learning curves for one or two genes to get an idea of when adding features stops improving performance for this problem.
ben-heil commented on 2022-09-09T21:19:41Z ----------------------------------------------------------------
Can probably remove (or alternatively replace with interpretation)
jjc2718 commented on 2022-09-12T16:00:38Z ----------------------------------------------------------------
Good catch! I think I'll just remove it for now, I'm planning to make some tweaks to this in the coming weeks so any interpretation I add now is likely to change.
PR description:
This PR implements mutation status prediction (sample has/does not have a mutation in the given gene) for cell line data from CCLE, using gene expression. In general this is harder than it was for TCGA because there are about an order of magnitude fewer cell lines than there were tumor samples in TCGA (see #51 for some analysis of cancer types and mutated proportion of cell lines in the CCLE data).
Likely because of this, mutation prediction doesn't work as well on CCLE data as it did for TCGA, both in terms of overall performance and in terms of effectiveness of our feature selection method. Here are the results for EGFR, a gene that worked well on TCGA as shown in #48:
So we can see that there's very little difference between `pancan_f_test` and `median_f_test`, and also that most of the `delta_aupr` values are close to 0, suggesting that our models aren't doing much better than one with randomly permuted labels.

Code changes:

- `08_cell_line_prediction/run_ccle_mutation_prediction.py` to run experiments
- `pancancer_evaluation/data_models/ccle_data_model.py` and `pancancer_evaluation/utilities/ccle_data_utilities.py` to load the CCLE data using the same preprocessing pipeline we used for the TCGA data
- `pancancer_evaluation/utilities/data_utilities.py` to correctly load oncogene/TSG info
- `08_cell_line_prediction/plot_mutation_prediction_results.ipynb` to visualize results
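
For readers unfamiliar with the `delta_aupr` comparison mentioned in the results above, here is a minimal sketch of the idea. It assumes `delta_aupr` is the AUPR of a model trained on the true labels minus the AUPR of a model trained on randomly permuted labels, both scored against the true labels. The function names and toy scores are hypothetical and are not taken from the code in this PR.

```python
def average_precision(y_true, y_score):
    """Average precision (a common AUPR estimate).

    Ranks samples by descending score and averages the precision
    observed at each positive sample. Ties are broken by input order,
    which is a simplification of what library implementations do.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / n_pos


def delta_aupr(y_true, signal_scores, shuffled_scores):
    """AUPR with true labels minus AUPR of the permuted-label baseline.

    Assumption: both score vectors are predictions on the same held-out
    samples, evaluated against the same true labels.
    """
    return (average_precision(y_true, signal_scores)
            - average_precision(y_true, shuffled_scores))


# Toy example (made-up scores, not results from this PR)
y_true = [1, 0, 1, 0, 0, 0]
signal_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3]    # model fit on true labels
shuffled_scores = [0.2, 0.9, 0.1, 0.8, 0.4, 0.3]  # model fit on permuted labels
print(delta_aupr(y_true, signal_scores, shuffled_scores))
```

Under this definition, a value near 0 means the model trained on real labels ranks mutated samples about as well as the permuted-label baseline does, which is the pattern described for most of the CCLE results.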