greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Add microRNA data to all data types experiments #47

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

Overview of PR:

Previously, we avoided including the TCGA miRNA data in our comparisons because we thought we would lose samples by including it (we generally use only samples that have data for all data types, and drop samples with missing data). But it turns out that including miRNA only rules out about 50 samples (leaving ~5200), so we decided to add it to our comparisons.

It doesn't affect the overall story much - mutation prediction using miRNA is a little worse than RPPA data, and still considerably worse than standard RNA-seq.

[Image: plot comparing mutation prediction performance across data types]

Description of changes:

No huge code changes here, the primary ones are:

jjc2718 commented 3 years ago

Looks good to me. One conceptual question, since it's been a little while: can you tell me if my understanding is correct here?

So for a given gene, you're predicting mutated or not (binary label) using either mRNA expression, methylation or miRNA expression data, etc.

Yep, exactly right
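As a rough illustration of the binary labeling described above, here is a minimal sketch. The function name `mutation_labels`, the sample IDs, and the list of (sample, gene) mutation calls are all hypothetical toy values, not the repo's actual code or data format.

```python
def mutation_labels(sample_ids, mutated_pairs, gene):
    """Return {sample_id: 1 if `gene` is mutated in that sample, else 0}."""
    # Collect the samples that carry at least one mutation in the target gene.
    mutated = {sample for sample, g in mutated_pairs if g == gene}
    return {sample: int(sample in mutated) for sample in sample_ids}


# Hypothetical toy data: four samples and three (sample, gene) mutation calls.
sample_ids = ["s1", "s2", "s3", "s4"]
mutated_pairs = [("s1", "TP53"), ("s3", "TP53"), ("s3", "KRAS")]

labels = mutation_labels(sample_ids, mutated_pairs, "TP53")
print(labels)
```

The labels depend only on the mutation data, so the same label vector can be paired with features from any data type (expression, methylation, miRNA, and so on).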

Are you using the same samples in each dataset (i.e. do you have mRNA, methylation and miRNA measurements for the same sample)? I saw you mentioned matching sample IDs, so I'm not sure if this is what it's referring to. So each dataset is the same size?

There are 3 main comparisons ("experiments") we want to look at in the paper we're working on:

For each of these experiments, we use the set of samples where all of the relevant data types were measured. So for the first experiment we use all the samples with expression and somatic mutation data (because we're predicting mutations we need to have that data), for the second experiment we use samples with expression/mutation/27K/450K methylation, and for the third experiment we use samples where all of the data types were measured.

So each of these 3 experiments uses a different set of samples (decreasing in size as more data types are added), but within each experiment the same samples are used for every data type. For the plot I showed above, for instance, all of the models used the same set of samples, to make a fair comparison between data types.
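The sample selection described above can be sketched as a set intersection over sample IDs. This is only an illustration: the data type keys, experiment names, and sample IDs below are made up, and the repo's actual implementation may differ.

```python
def shared_samples(samples_by_type, data_types):
    """Return the sorted sample IDs present in every listed data type."""
    id_sets = [set(samples_by_type[dt]) for dt in data_types]
    return sorted(set.intersection(*id_sets))


# Hypothetical toy sample IDs per data type (not real TCGA identifiers).
samples_by_type = {
    "mutation": ["s1", "s2", "s3", "s4", "s5"],
    "expression": ["s1", "s2", "s3", "s4", "s5"],
    "me_27k": ["s1", "s2", "s3", "s4"],
    "me_450k": ["s1", "s2", "s3", "s4"],
    "rppa": ["s1", "s2", "s3"],
    "mirna": ["s1", "s2", "s4"],
}

# Each experiment keeps only samples measured for all of its data types.
experiments = {
    "expression_only": ["mutation", "expression"],
    "expression_methylation": ["mutation", "expression", "me_27k", "me_450k"],
    "all_data_types": list(samples_by_type),
}

shared = {
    name: shared_samples(samples_by_type, data_types)
    for name, data_types in experiments.items()
}
for name, ids in shared.items():
    print(name, len(ids), ids)
```

In this toy example the sample set shrinks from 5 to 4 to 2 as data types are added, mirroring how the three experiments use progressively smaller sample sets while keeping the samples identical across data types within each experiment.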

And you are using separate models for each data type, correct? Since they each have different distributions, I assume that is the case, but I just wanted to double check.

Right, here we're using a separate model for each data type. We're also doing a small experiment with combining data types into "multi-omics" models (see #38) but our paper is mostly focusing on prediction from each data type separately, and this PR just applies to the individual data type models.