greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Add cancer types to single-cancer mutation prediction models #42

Closed · jjc2718 closed this 3 years ago

jjc2718 commented 3 years ago

The code changes for this PR aren't too extensive, but the results might be a bit dense/hard to parse. Let me know if you have questions on them, happy to clarify things.

Overview of PR goals:

As a reminder, in past experiments (#21 and others) I showed that training on a pan-cancer dataset doesn't improve mutation prediction in most cases compared to training only on the cancer type in the test set.

In the experiments in this PR, we hypothesized that we could do better than a pan-cancer model by adding only a subset of cancer types that are relevant to the target cancer. Our guess/hope was that in the pan-cancer data, some of the cancer types would be unrelated or have different mutational consequences than the target cancer, which could muddle the signal gained by adding more data.

Our general idea was to take the infrastructure from previous experiments (predicting mutations in a given gene for a given cancer type) and, instead of training only single-cancer or pan-cancer models, iteratively add other cancer types to the training set to see whether performance improves or gets worse.

To choose cancer types that are "similar" to the target cancer types, we used our confusion matrices from the experiments here: https://github.com/greenelab/mpmp/pull/13. We just used the gene expression matrix (subsampled version) for now.
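For concreteness, here's a rough sketch of the iteration loop. This is not the actual code in this repo: `confusion_df`, `load_data`, and `train_and_evaluate` are placeholders standing in for the confusion matrix from the mpmp experiments and our existing data-loading/training utilities.

```python
import pandas as pd

def rank_similar_cancer_types(confusion_df, target_cancer_type):
    """Rank other cancer types by confusion with the target cancer type.

    confusion_df is assumed to be a (cancer type x cancer type) confusion
    matrix from the expression-based classifier in greenelab/mpmp#13.
    """
    similarity = confusion_df.loc[target_cancer_type].drop(target_cancer_type)
    return similarity.sort_values(ascending=False).index.tolist()

def added_cancer_type_curve(gene, target_cancer_type, confusion_df,
                            load_data, train_and_evaluate):
    """AUPR as similar cancer types are progressively added to training.

    load_data and train_and_evaluate are hypothetical stand-ins for the
    existing single-cancer prediction utilities.
    """
    ranked = rank_similar_cancer_types(confusion_df, target_cancer_type)
    train_cancer_types = [target_cancer_type]
    results = []
    # step 0 is the single-cancer baseline; each later step adds the next
    # most similar cancer type to the training set
    for step, added in enumerate([None] + ranked):
        if added is not None:
            train_cancer_types.append(added)
        X_train, y_train, X_test, y_test = load_data(
            gene,
            train_cancer_types=train_cancer_types,
            test_cancer_type=target_cancer_type)
        aupr = train_and_evaluate(X_train, y_train, X_test, y_test)
        results.append({'num_added': step,
                        'added_cancer_type': added,
                        'aupr': aupr})
    return pd.DataFrame(results)
```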

Results:

You can see results for a variety of genes and cancer types here: https://docs.google.com/presentation/d/1Zd5bnUrTEdZDoPVFlCH5Yt1hcguUefv05_pWCNRh_IY/edit?usp=sharing . Unfortunately, adding cancer types by confusion matrix similarity doesn't make a drastic difference in performance in most cases. There are a few interesting cases where it does change performance, but not as many as we were expecting/hoping for.

Example here for TP53:

[figure: TP53 example, performance as cancer types are added to the training set]

For most cancer types, performance is essentially flat as more data is added (little difference between single-cancer and pan-cancer models). Some have a slight upward slope (the pan-cancer model helps), but few are "peaky" in the middle (adding a few cancer types helps, but the full pan-cancer model is worse), which is what we were hoping to see.

jjc2718 commented 3 years ago

Thanks for the feedback @ben-heil! These are good ideas.

> I forget, have you tested whether the AUPR decreases when you only use a subset of your given cancer type? If the cancer types were similar enough I'd expect the AUPR to increase just by virtue of having more data.

This is sort of what we were doing in #37. Just as a reminder, here are the results we got for progressively larger holdout sets (i.e. progressively smaller training sets) within each cancer type:

[figure: results from #37, performance for progressively larger holdout sets within each cancer type]

You can read these a bit like "reverse learning curves" (i.e. larger x-axis => less training data), but the size of the holdout set is changing as more data is held out, so it might be good to hold that constant and see what happens with smaller training sets.
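For reference, a minimal sketch of that "fixed holdout set, shrinking training set" version, assuming `X` and `y` are a NumPy expression matrix and binary mutation labels for one cancer type, and using elastic-net logistic regression as a stand-in for whatever classifier the pipeline actually uses (hyperparameters are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def shrinking_train_curve(X, y, train_fracs=(1.0, 0.75, 0.5, 0.25), seed=42):
    # Hold out a fixed test set once, so the evaluation set stays the same
    # as the training set shrinks.
    X_train_full, X_test, y_train_full, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    rng = np.random.default_rng(seed)
    results = {}
    for frac in train_fracs:
        # Subsample the training set to the given fraction.
        n = int(frac * X_train_full.shape[0])
        idx = rng.choice(X_train_full.shape[0], size=n, replace=False)
        clf = LogisticRegression(penalty='elasticnet', solver='saga',
                                 l1_ratio=0.5, C=1.0, max_iter=5000)
        clf.fit(X_train_full[idx], y_train_full[idx])
        y_score = clf.predict_proba(X_test)[:, 1]
        # AUPR on the fixed holdout set.
        results[frac] = average_precision_score(y_test, y_score)
    return results
```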

> My hypotheses for the flat prediction curves:

> 1. You're running into irreducible error. Mutations only have so much impact on gene expression, so the inherent noise makes it impossible to improve past the point you get to with your starting number of samples.

Yeah, I think this is definitely plausible. For some of these examples, the mutation probably doesn't affect gene expression at all (i.e. it's not a driver in that cancer type), which would explain a flat curve. We were just surprised to see so many examples where this is the case.

> 2. Linear models are insufficiently complex. Basically the same as 1, but potentially you could shift the curve up if you were able to introduce more complicated decision boundaries.

By "shift the curve up", you mean better performance for all training set sizes? Or more improvement as training data increases? Or both?

I think it's possible that this would help...will have to think about it.
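If we did try it, one quick sanity check would be something like the following: fit the linear model and an arbitrary more flexible model on the same split and compare AUPR. This is purely a sketch, not anything in this repo; gradient boosting is just one example of a nonlinear decision boundary.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def compare_model_complexity(X, y, seed=42):
    # Same train/test split for both models, so any AUPR difference
    # reflects model flexibility rather than the data split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    models = {
        'linear (elastic net LR)': LogisticRegression(
            penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=5000),
        'nonlinear (gradient boosting)': GradientBoostingClassifier(
            n_estimators=200, max_depth=3, random_state=seed),
    }
    auprs = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        auprs[name] = average_precision_score(
            y_test, model.predict_proba(X_test)[:, 1])
    return auprs
```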

> 3. Maybe some flavor of gradient starvation? I wouldn't expect that to be the case on a linear model, but maybe your model has an easy shortcut to learning the training data, so it misses some of the nuance that would be helpful?

This is interesting - I hadn't seen the paper. Based on some experiments I ran in the past, though, I do think our models' feature selection is a bit unstable (e.g. nonzero coefficients vary a lot between bootstrapped subsets of the training data). So it's possible, and seems intuitive, that better choices of features could result in models that generalize better.

I'm not sure exactly what I would try first as far as addressing this (there are tons of feature selection methods we haven't looked at), but it's definitely an idea for the future.
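For reference, the kind of stability check I mean is roughly this (the L1-only penalty and hyperparameters are illustrative, not what our pipeline uses):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nonzero_coef_stability(X, y, n_bootstraps=50, seed=0):
    """Fraction of bootstrap fits in which each feature has a nonzero coefficient."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    selected = np.zeros(n_features)
    for _ in range(n_bootstraps):
        # Refit an L1-regularized model on a bootstrap resample of the
        # training data and record which coefficients are nonzero.
        idx = rng.choice(n_samples, size=n_samples, replace=True)
        clf = LogisticRegression(penalty='l1', solver='liblinear',
                                 C=0.1, max_iter=5000)
        clf.fit(X[idx], y[idx])
        selected += (clf.coef_.ravel() != 0).astype(float)
    # Values near 0 or 1 indicate stable in/out decisions; values in the
    # middle indicate unstable feature selection.
    return selected / n_bootstraps
```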

> 4. Predictions from gene expression data aren't actually possible. In that case maybe your model learns some technical artifact, but the amount of biological signal you bring to the table doesn't really matter. I don't think that's true, but it's worth thinking about.
> 5. Applying Occam's Razor to 3 rather than invoking trendy DL papers, maybe the models are just learning to detect technical artifacts before biology.

Yeah, these are good points. I don't have any great ideas for how to separate technical artifacts from actual biological signal, but maybe I'll have to do some reading. Interested to hear if you have any ideas!

jjc2718 commented 3 years ago

Responses to @ajlee21 comments:

> 1. To clarify, you selected additional cancer types that your confusion matrix suggested have gene expression signatures similar to your training/target cancer types. And so you're hoping that the added data would boost your signal? I guess if you're just adding more of the same signal, I'm not sure how much you would improve in theory (I think this is similar to Ben's (1)). I'm curious if you would gain additional rare/complex signals using a more complex model, as Ben already mentioned too (2).

Yeah, that sounds exactly right. Particularly for small-sample-size cancers, we were hoping that adding more data (from a cancer type similar to the target cancer type) would clearly improve the predictions. I do think it would be good to subset some of these cancers and see whether the model seems to be saturated (similar to some of the experiments Ben has done in his work), which may give us a better idea of an upper bound on how fruitful adding more data is likely to be.
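One cheap way to check for saturation within a single cancer type would be scikit-learn's `learning_curve`. Again just a sketch of the idea, not part of our pipeline; `X`/`y` are assumed to be a NumPy expression matrix and mutation labels for one cancer type.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

def saturation_check(X, y, seed=42):
    clf = LogisticRegression(penalty='elasticnet', solver='saga',
                             l1_ratio=0.5, max_iter=5000)
    cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    # Score progressively larger training subsets against held-out folds.
    train_sizes, _, test_scores = learning_curve(
        clf, X, y, cv=cv, scoring='average_precision',
        train_sizes=np.linspace(0.25, 1.0, 4), n_jobs=-1)
    # If mean held-out AUPR has flattened by the largest training size,
    # adding more (similar) data is unlikely to help much.
    return dict(zip(train_sizes, test_scores.mean(axis=1)))
```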

> 2. Interesting behavior by some gene/cancer types in your slides. I'm wondering about the peak in BRAF_COAD (slide 3). And the case where you're seeing a steady increase for NF1 (slide 11). What do you think is contributing to the increase in performance in these cases? So is your next step to determine if these cherry-picked example trends are robust?

Yeah, I think these are interesting cases. BRAF mutations in colon cancer seem to be a bit of an odd case (they differ in some way from BRAF mutations in other cancers), and uterine cancer (the addition that forms the peak) may just be more similar to colon cancer than the other cancer types are; in other words, this could be a real, interesting signal. I'm not so sure about NF1.

I think we may eventually look into whether these trends are robust, but we were hoping to see a stronger signal across the board (i.e. this was a bit of a pilot study before trying more genes and more elaborate methods), so I'm not sure how much more time I'll spend pursuing this given the lukewarm outcome. Even if this did work well, it's unclear how useful it would be compared to (for example) the methylation project, where the signal/takeaway seems a bit clearer.