As the next step for the feature selection work, we want to see whether our conclusions generalize to a different dataset. The notebook `08_cell_line_prediction/download_data.ipynb` downloads cell line expression and mutation data from CCLE, along with metadata for each cell line such as its cancer type and tissue of origin.
The download notebook also visualizes the number and proportion of cell lines in each cancer type that have a mutation in a given gene. This will help us set cancer type inclusion thresholds that make sense for CCLE, since there are far fewer cell lines in CCLE than tumor samples in TCGA. Requiring at least 5 mutated samples and at least 10% of samples mutated seems to make sense for most genes we've been looking at, so we'll probably go with that.
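
For reference, here is a minimal sketch of how the thresholds described above could be applied for a single gene. It is not the code from the notebook: the function name `filter_cancer_types`, the input layout (one mutation flag and one cancer type label per cell line), and the inclusive `>=` comparisons are all assumptions made for illustration.

```python
import pandas as pd


def filter_cancer_types(mutated, cancer_type, min_count=5, min_prop=0.1):
    """Return cancer types that pass both thresholds for one gene.

    Hypothetical helper: `mutated` holds one 0/1 flag per cell line and
    `cancer_type` holds the matching cancer type label.
    """
    df = pd.DataFrame({"cancer_type": cancer_type, "mutated": mutated})
    # Per cancer type: number of mutated lines (sum) and fraction mutated (mean)
    summary = df.groupby("cancer_type")["mutated"].agg(count="sum", prop="mean")
    # Assumption: both thresholds are inclusive and both must be met
    keep = (summary["count"] >= min_count) & (summary["prop"] >= min_prop)
    return summary[keep].index.tolist()


# Toy data (made up, not CCLE): three cancer types with different mutation rates
cancer_type = ["type_A"] * 20 + ["type_B"] * 100 + ["type_C"] * 10
mutated = (
    [1] * 6 + [0] * 14    # type_A: 6 of 20 mutated, passes both thresholds
    + [1] * 5 + [0] * 95  # type_B: 5 of 100 mutated, fails the 10% threshold
    + [1] * 3 + [0] * 7   # type_C: 3 of 10 mutated, fails the 5-sample threshold
)

print(filter_cancer_types(mutated, cancer_type))
```

Requiring both conditions guards against two different failure cases: small cancer types where a high proportion reflects only a handful of mutated lines, and large cancer types where 5 mutated lines would still leave a heavily imbalanced label set.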