greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Positive control/cancer type holdout experiments follow-up #40

Closed · jjc2718 closed this 3 years ago

jjc2718 commented 3 years ago

This PR implements a few of the ideas that were brought up in my braintrust presentation last week.

Specific changes in this PR:

I ran the PCA/UMAP script a few times to compare different genes/cancer types, and put the results in these Google slides (should be globally viewable, I think).

These results give me slightly more confidence that what we're seeing in these experiments is an actual property of the data for certain genes/cancer types, and not just a bug or a limitation of our performance metrics.
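For context, here is a minimal sketch of the kind of projection the script produces; the function name, arguments, and plotting choices are illustrative, not the actual script's interface (assumes a samples-by-genes expression matrix `X`, binary mutation labels `y`, and the umap-learn package):

```python
# Minimal sketch, not the actual script: project expression data to 2D
# with PCA and UMAP and color points by mutation status.
import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA

def plot_embeddings(X, y, title):
    """Plot side-by-side PCA and UMAP projections of the samples in X,
    colored by the binary mutation labels in y."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    embeddings = {
        'PCA': PCA(n_components=2).fit_transform(X),
        'UMAP': umap.UMAP(n_components=2, random_state=42).fit_transform(X),
    }
    for ax, (name, emb) in zip(axes, embeddings.items()):
        ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='coolwarm', s=10)
        ax.set_title(f'{name}: {title}')
    plt.show()
```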

jjc2718 commented 3 years ago

Responses to @ajlee21 comments:

  • Just to clarify: when you say "percent_flipped", do you mean the percent held out of training? I assume this is just borrowing old terminology, so I wanted to confirm.

Yeah, sorry, this terminology is left over from when I was actually flipping labels from 1 to 0. I'll update it to percent_holdout, which I think is clearer.
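For concreteness, here's a hypothetical sketch of what percent_holdout controls; the function name and arguments are mine, not the repo's actual API:

```python
# Hypothetical sketch of the percent_holdout logic (not the repo's actual
# API): drop a fixed fraction of one cancer type's samples from training.
import numpy as np

def holdout_cancer_type(cancer_types, target_type, percent_holdout, seed=42):
    """Return a boolean training mask over samples, with percent_holdout
    of the target cancer type's samples excluded from training."""
    cancer_types = np.asarray(cancer_types)
    train_mask = np.ones(len(cancer_types), dtype=bool)
    target_idx = np.where(cancer_types == target_type)[0]
    rng = np.random.default_rng(seed)
    n_holdout = int(len(target_idx) * percent_holdout)
    held_out = rng.choice(target_idx, size=n_holdout, replace=False)
    train_mask[held_out] = False
    return train_mask
```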

  • By eye it looks like the trends are similar between the old line plots (proportion held out) vs. the new line plots (raw number held out). Can you remind me of the logic for this additional plot? I assume that different cancer types will have different numbers of total samples, so proportion seems like the more natural thing to look at here. I'm sure this was mentioned during your BT and I've forgotten. Say 95% of the data in cancer A is 400 samples but in cancer B it's 100. What does looking at the raw numbers tell you? I'd expect the curves to just drop off as you move from left to right, but I'm not sure what else...

I think our hypothesis was that the cancer types whose performance doesn't drop off could be maintaining it simply because they have more samples in absolute terms. To give a slightly more obvious example: if you train on 5% of a 5,000-sample dataset, that's still 250 samples (maybe enough to fit a decently generalizable model), but 5% of a 100-sample dataset is only 5 samples (probably not enough to learn anything generalizable).
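The same arithmetic, spelled out (the sample counts are illustrative, matching the example above):

```python
# Toy arithmetic: the same holdout proportion leaves very different
# absolute numbers of training samples per cancer type.
for n_samples in (5000, 100):
    for percent_holdout in (0.05, 0.50, 0.95):
        n_train = round(n_samples * (1 - percent_holdout))
        print(f'{n_samples} total, {percent_holdout:.0%} held out '
              f'-> {n_train} training samples')
```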

Looking at the results, though, your interpretation matches mine: the trends look similar to what we had before. There doesn't seem to be much difference in the shape of the curves between cancer types with many samples and those with fewer, so based on this it seems unlikely that sample size is the main factor behind the improved performance in certain cases.

jjc2718 commented 3 years ago

Responses to @ben-heil comments:

  • How appropriate are linear models for predicting mutations? It seems like training linear models on multiple cancer types wouldn't necessarily lead to better mutation prediction unless you had really stellar correction across cancer types. Do you expect to use more complex/nonlinear models now that your proof of concept works?

Yeah, it's possible that using more complex models would change things. This paper was published recently showing that non-linear models can work well for predicting Ras pathway activation. I'm not entirely convinced that their particular method will generalize well outside of Ras (a relatively easy example with lots of training data), but I think there's enough evidence there that the general idea is worth trying with a more established non-linear model.
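As a rough sketch of what that comparison might look like, using scikit-learn's gradient boosting as a stand-in for "a more established non-linear model" (not the method from the paper; the elastic net settings here are also just an assumption, not necessarily what the repo uses):

```python
# Sketch: compare a linear baseline against an off-the-shelf nonlinear
# model with cross-validated AUROC. Model settings are illustrative.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_models(X, y):
    """Print mean cross-validated AUROC for a linear and a nonlinear model."""
    models = {
        'elastic net logistic regression': LogisticRegression(
            penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=5000),
        'gradient boosting': GradientBoostingClassifier(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
        print(f'{name}: mean AUROC = {scores.mean():.3f}')
```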

  • Is the fall-off in performance of many of the cancer types' predictors encouraging for the pan-cancer side of things? Does it indicate that more data is needed in those cancers?

I definitely think these results indicate that more data would be beneficial in some cases, which is encouraging. What's unclear to me is whether pan-cancer data necessarily solves the problem in all cases, or whether the differences between cancer types overwhelm the benefits of having more data (especially in light of our earlier pan-cancer vs. single-cancer experiments). Interested in continuing to think about this going forward!

  • Is there a way to account for subtypes within a cancer confounding your results? For example, in the TP53_LGG plot there are fairly separable groups of TP53-positive and TP53-negative cancers, but there is an even bigger divide between what appear to be two types of glioma.

Good point. The two types of glioma we're seeing are almost certainly IDH1 mutants vs. non-mutants (like we saw in greenelab/mpmp#2). TCGA does have subtype annotations for certain cancers (IDH1 status in glioma among them), so using these as covariates may be worth trying in the future. But some of these subtypes are poorly defined or hotly debated, as Ariel's work on ovarian cancer has shown us, so I don't have a good sense of how helpful they'll actually be, or in which cases. If we did try it, the mechanics could look something like the sketch below.
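A minimal sketch, assuming a samples-by-genes feature DataFrame and a subtype annotation table indexed the same way; all names here are hypothetical, not an existing interface in the repo:

```python
# Illustrative sketch: append a one-hot encoded subtype annotation
# (e.g. IDH1 status) to the expression feature matrix as a covariate.
import pandas as pd

def add_subtype_covariate(X_df, subtype_df, covariate_col='IDH1_status'):
    """Join a subtype annotation onto the features by sample index and
    one-hot encode it, returning the augmented feature matrix."""
    merged = X_df.join(subtype_df[[covariate_col]], how='inner')
    dummies = pd.get_dummies(merged[covariate_col], prefix=covariate_col)
    return pd.concat([merged.drop(columns=covariate_col), dummies], axis=1)
```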