[x] use_pancancer should be false for all_other_cancers in process_data_for_gene() (i.e. there shouldn't be unused dummy/one-hot variables)
[ ] Label filtering should happen after we take the intersection of samples between gene expression and mutation (this will make the proportions in 08_cell_line_prediction/download_data.ipynb match what we actually see when the scripts run)
[ ] tcga_utilities should probably be renamed to something more general, or split
[ ] Add CNV data for cell lines, in ccle_data_model_generate_labels()
[x] remove unknown/non-cancerous samples in load_sample_info()
[x] maybe try sklearn LogisticRegression with elastic net penalty rather than SGDClassifier
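
For reference, the LogisticRegression item above could look roughly like the sketch below. It is only an illustration: the data is synthetic, and the hyperparameter values (`l1_ratio`, `C`, `max_iter`) are placeholder assumptions, not settings taken from this project. In scikit-learn, the elastic net penalty for `LogisticRegression` is only supported by the `saga` solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical; not the project's expression matrix).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Binary label driven by the first two features plus noise.
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Elastic net logistic regression. The values below are placeholders
# that would normally be chosen by cross-validation.
clf = LogisticRegression(
    penalty="elasticnet",
    solver="saga",      # the solver that supports the elastic net penalty
    l1_ratio=0.5,       # 0 = pure L2, 1 = pure L1
    C=1.0,              # inverse regularization strength
    max_iter=5000,      # saga can need many iterations to converge
    random_state=0,
)
clf.fit(X, y)

# One coefficient per feature; the L1 part of the penalty can zero some out.
n_nonzero = int(np.count_nonzero(clf.coef_))
train_accuracy = clf.score(X, y)
```

Unlike SGDClassifier, which is parameterized by `alpha`, this estimator takes the inverse regularization strength `C`, so hyperparameter grids are not directly interchangeable between the two.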