Add microRNA data to all data types experiments

Overview of PR:

Previously we avoided including the TCGA miRNA data in our comparisons because we thought we would lose too many samples (we generally use only samples that have data for all data types, and drop samples with missing data). It turns out that including miRNA only rules out about 50 samples (leaving ~5200), so we decided to add it to our comparisons.

This doesn't change the overall story much: mutation prediction using miRNA data is a little worse than prediction using RPPA data, and still considerably worse than prediction using standard RNA-seq.

Description of changes:

No huge code changes here. The primary ones are:

Added a script to preprocess miRNA data (pretty similar to the script for preprocessing RNA-seq data): 00_download_data/1E_preprocess_mirna_data.ipynb
Added a script to run experiments using overlapping samples from all data types: 02_classify_mutations/scripts/run_all.sh
Modified code in 02_classify_mutations/plot_mutation_results.ipynb to visualize results

Looks good to me. One conceptual question, since it's been a little while: can you tell me if my understanding here is correct?

So for a given gene, you're predicting whether it is mutated or not (a binary label) using mRNA expression, methylation, miRNA expression, or one of the other data types.

Yep, exactly right
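
To make the setup concrete, here is a toy sketch of how the binary label for one gene could be built. The sample IDs, gene sets, and the `mutation_labels` helper are invented for illustration and are not taken from the mpmp code:

```python
def mutation_labels(mutated_genes_by_sample, gene):
    """Map each sample ID to 1 if `gene` is mutated in that sample, else 0."""
    return {
        sample: int(gene in genes)
        for sample, genes in mutated_genes_by_sample.items()
    }


# Toy data (hypothetical): the set of mutated genes observed in each sample.
mutated_genes_by_sample = {
    "sample_1": {"TP53", "KRAS"},
    "sample_2": {"PTEN"},
    "sample_3": set(),
}

# One label per sample for the target gene; this is what each model predicts.
tp53_labels = mutation_labels(mutated_genes_by_sample, "TP53")
print(tp53_labels)
```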

Are you using the same samples in each dataset (i.e. do you have mRNA, methylation and miRNA measurements for the same sample)? I saw you mentioned matching sample IDs, so I'm not sure if this is what it's referring to. So is each dataset the same size?

There are 3 main comparisons ("experiments") we want to look at in the paper we're working on:

Comparing different gene sets for mutation prediction, using only gene expression data
Comparing gene expression and DNA methylation
Comparing all data types (expression, methylation, RPPA, miRNA, mutational signatures)

For each of these experiments, we use the set of samples where all of the relevant data types were measured. So for the first experiment we use all the samples with expression and somatic mutation data (because we're predicting mutations we need to have that data), for the second experiment we use samples with expression/mutation/27K/450K methylation, and for the third experiment we use samples where all of the data types were measured.

So each of these 3 experiments uses a different set of samples (of decreasing size as more data types are added), but within each experiment the same samples are used for every data type. For the plot I showed above, for instance, all of the models used the same set of samples, to make a fair comparison between data types.
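
As a rough illustration of the sample filtering described above, here is a minimal sketch using plain Python sets. The `shared_samples` helper, the data type keys, and the toy sample IDs are all assumptions made up for this example; the actual mpmp code may be organized quite differently:

```python
def shared_samples(sample_ids_by_type, data_types):
    """Return the sorted sample IDs that were measured in every listed data type."""
    id_sets = [set(sample_ids_by_type[dt]) for dt in data_types]
    return sorted(set.intersection(*id_sets))


# Toy sample IDs (hypothetical): which samples each data type was measured on.
sample_ids_by_type = {
    "mutation": ["s1", "s2", "s3", "s4", "s5"],
    "expression": ["s1", "s2", "s3", "s4", "s5"],
    "me_27k": ["s1", "s2", "s3", "s4"],
    "me_450k": ["s1", "s2", "s3", "s4"],
    "rppa": ["s1", "s2", "s3"],
    "mirna": ["s1", "s2", "s4"],
    "mut_sigs": ["s1", "s2", "s3", "s4"],
}

# Each experiment keeps only samples with all of its relevant data types.
experiments = {
    "gene_sets": ["mutation", "expression"],
    "expression_vs_methylation": ["mutation", "expression", "me_27k", "me_450k"],
    "all_data_types": list(sample_ids_by_type),
}

for name, data_types in experiments.items():
    samples = shared_samples(sample_ids_by_type, data_types)
    print(name, len(samples), samples)
```

In this toy data the sample set shrinks as data types are added, and every data type within an experiment is evaluated on the same IDs.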

And you are using separate models for each data type, correct? Since they each have different distributions I assume that is the case, but I just wanted to double check.

Right, here we're using a separate model for each data type. We're also doing a small experiment with combining data types into "multi-omics" models (see #38) but our paper is mostly focusing on prediction from each data type separately, and this PR just applies to the individual data type models.
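
Structurally, "a separate model for each data type" looks something like the sketch below. Everything here is hypothetical: `fit_one_model_per_data_type` is an invented helper, the feature values are toy numbers, and `majority_class_model` is a trivial stand-in for whatever classifier is really used, included only so the example runs on its own:

```python
from collections import Counter


def majority_class_model(features, labels):
    """Stand-in trainer: ignores the features and always predicts the most common label."""
    most_common_label = Counter(labels).most_common(1)[0][0]
    return lambda rows: [most_common_label for _ in rows]


def fit_one_model_per_data_type(features_by_type, labels, train_fn):
    """Train an independent model on each data type, all against the same labels."""
    return {
        data_type: train_fn(features, labels)
        for data_type, features in features_by_type.items()
    }


# Same three samples (rows) in every data type; only the features differ.
labels = [1, 0, 0]
features_by_type = {
    "expression": [[0.2, 1.5], [0.1, 0.3], [0.4, 0.2]],
    "mirna": [[3.0], [2.5], [2.8]],
}

models = fit_one_model_per_data_type(features_by_type, labels, majority_class_model)
print(sorted(models))
```

The point is only that the labels and sample order are shared, while each data type gets its own independently fitted model.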
