greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Mutational signatures prediction analysis #28

Status: Closed (jjc2718 closed this 3 years ago)

jjc2718 commented 3 years ago

Summary of PR changes:

In addition to the data types we're already using to predict mutation status (gene expression, 27K and 450K methylation, RPPA), we wanted to try mutational signatures data as well. We didn't necessarily expect this to work in most cases, but we thought there would be a few genes whose mutation status would correlate strongly with certain mutational signatures (e.g. genes involved in DNA damage repair).

However, we don't really see much predictive signal in practice:

[Image: Screen Shot 2021-03-02 at 11.33.10 AM]

TP53 does seem to have a strong mutational signatures signal, but this is true for all data types we've tested.

We looked directly at results for some DNA damage repair genes (see slide 6 of these slides). Our suspicion is that either there are too few samples for these examples (BRCA1/2 and ATM generally have very few somatically mutated examples), or that we need to look specifically at samples with germline mutations, which would be a much more complicated analysis. BRCA2 and ATM are somewhat close to being significantly predictable, but BRCA1 and FBXW7 don't seem particularly well predicted.

Another strange result is that some Ras pathway genes (KRAS, NRAS, BRAF) seem to be somewhat well-predicted from mutational signatures data. This was unexpected because mutations in those genes aren't really understood to have any coordinated effect on somatic mutation patterns, at least to my and Casey's knowledge. We are still using covariates for cancer type and sample mutation burden here, so it's possible that in some cancer types that information is sufficient to predict presence/absence of a particular Ras pathway mutation. We'll have to investigate this in the future.

Files to focus on in review:

Most of the relevant results are in the analysis notebooks in 01_classify_stratified (e.g. plot_mutation_results.ipynb, expression_vs_methylation.ipynb) and the preprocessing notebook in 00_download_data/1E_preprocess_mutational_signatures.ipynb.

Changes to other files are mostly small, but feel free to review in as much detail as you want.

jjc2718 commented 3 years ago

These are great questions! Responses below:

  1. As a sanity check, are the expected signatures informative in predicting TP53? I know that some signatures are reflective of sample prep/sequencing like 8-oxoguanine presence, so it may be interesting to see if there are possible other confounders you may be missing.

By "expected", do you mean the signatures on this page that aren't sequencing artifacts? Looks like the 8-oxoguanine signature is SBS45 there.

I do see SBS45 as one of the selected features in one of the 8 TP53 cross-validation folds (has a very small negative coefficient) but it has a 0 coefficient in the other 7 folds. None of the other sequencing artifact signatures are selected for any of the TP53 models.

Other than that, I'm not exactly sure which signatures I'd expect to find, since I didn't really expect TP53 mutation to lead to any mutational signatures directly. Some of the meaningful signatures that come up frequently are SBS15 (defective mismatch repair, has a negative coefficient) and SBS3 (defective homologous recombination, has a small positive coefficient). I guess these could make sense (maybe TP53 mutation and some types of mutations in DNA damage repair genes could be mutually exclusive), but I doubt that TP53 is actually causing any of them.
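As a rough illustration of the per-fold check described above (how often a signature gets a nonzero coefficient across the 8 CV folds, and with what sign), here is a minimal sketch. The coefficient values and the `summarize_coefficients` helper are made up for illustration; they are not taken from the actual models or from the mpmp code.

```python
from statistics import mean

# Hypothetical per-fold coefficients, one dict per CV fold.
# These numbers are invented for illustration, NOT real model output.
fold_coefs = [
    {"SBS15": -0.40, "SBS3": 0.05, "SBS45": -0.01},
    {"SBS15": -0.35, "SBS3": 0.04, "SBS45": 0.0},
    {"SBS15": -0.42, "SBS3": 0.06, "SBS45": 0.0},
    {"SBS15": -0.38, "SBS3": 0.03, "SBS45": 0.0},
    {"SBS15": -0.41, "SBS3": 0.05, "SBS45": 0.0},
    {"SBS15": -0.37, "SBS3": 0.04, "SBS45": 0.0},
    {"SBS15": -0.39, "SBS3": 0.06, "SBS45": 0.0},
    {"SBS15": -0.36, "SBS3": 0.05, "SBS45": 0.0},
]


def summarize_coefficients(fold_coefs):
    """Map each feature to (folds with a nonzero coefficient, mean coefficient)."""
    features = sorted({name for coefs in fold_coefs for name in coefs})
    summary = {}
    for name in features:
        # A feature missing from a fold is treated as not selected (coefficient 0).
        values = [coefs.get(name, 0.0) for coefs in fold_coefs]
        n_selected = sum(1 for v in values if v != 0.0)
        summary[name] = (n_selected, mean(values))
    return summary


summary = summarize_coefficients(fold_coefs)
for name, (n_selected, mean_coef) in summary.items():
    print(f"{name}: selected in {n_selected}/{len(fold_coefs)} folds, "
          f"mean coefficient {mean_coef:+.3f}")
```

A signature that shows up in only one fold with a tiny coefficient (like the hypothetical SBS45 row here) is probably not worth interpreting, while one selected in every fold with a consistent sign is at least a stable association.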

  2. I think it's not too surprising that the mutational signatures are not very informative. The mutational signatures are mostly cancer-type specific (UV / Smoking / carcinogen / APOBEC) and not specific to a mutation. Furthermore, you are directly correcting away any cancer type specificity.

Yeah, I agree. We weren't generally expecting them to be useful, but we did hope there were a few genes that would have strong signatures. Like you said, we expected that most of them wouldn't be useful because they're tied more directly to a cancer type than to a mutation or molecular subtype, and that does seem to check out.

  3. Related to the RAS prediction, from a brief search I found this: "Lung tumor KRAS and TP53 mutations in nonsmokers reflect exposure to PAH-rich coal combustion emissions" from this paper: "A Compendium of Mutational Signatures of Environmental Agents"

This is interesting! I hadn't seen that paper.

I guess something like what's described in this paper (smoky coal exposure tending to cause KRAS mutations, and also leading to an environmental exposure mutational signature) is the most likely explanation (i.e. confounders rather than a direct causal link). With the other data types, when we see a strong signal it's often reasonable to assume the mutation is causing the signal in the data set being used for prediction, but this is a good reminder to be careful about assuming a causal relationship between mutation and predictive signal in any of these experiments.

Looking at some of the mutational signatures that are predictive in Ras genes: SBS2 (APOBEC activity) has a negative coefficient in KRAS, SBS7a (UV exposure) has a positive coefficient in NRAS, SBS20 (defective MMR) has a positive coefficient in BRAF. None of this makes obvious sense to me at first glance, but I'll have to think about it.

  4. Can you remind me again how you calculate the p-value for the y-axis in the last set of plots in https://github.com/jjc2718/mpmp/blob/mut_sigs/01_classify_stratified/plot_mutation_results.ipynb ?

Say, for instance, we're comparing expression and mutational signatures (other comparisons work the same way). For each data type, we do 2x4-fold cross-validation (so 8 total train/holdout splits), then we do a t-test comparing the distribution of results (AUPR values for each CV holdout set) using expression with the distribution of results using mutational signatures; that test gives the p-value for the y-axis. The value on the x-axis is just the difference between the means of these CV result distributions.

Does that make sense to you? We're not totally convinced this is the exact right way of doing things, but it seems to work well enough.
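The comparison described above can be sketched roughly as follows. This is a minimal standard-library sketch, not the code from the notebook: the AUPR values are made up, `compare_cv_results` is a hypothetical helper, and the pooled-variance (Student's) form of the two-sample t-test is an assumption, since the exact variant isn't stated here.

```python
import math
import statistics

# Made-up AUPR values for the 8 CV holdout sets (2x4-fold CV), for illustration only.
aupr_expression = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59, 0.61, 0.64]
aupr_mut_sigs = [0.41, 0.45, 0.39, 0.44, 0.40, 0.43, 0.42, 0.38]


def compare_cv_results(results_a, results_b):
    """Return (difference of means, pooled two-sample t statistic, degrees of freedom)."""
    n_a, n_b = len(results_a), len(results_b)
    mean_diff = statistics.mean(results_a) - statistics.mean(results_b)
    df = n_a + n_b - 2
    # Pooled variance assumes both groups share a common variance (an assumption here).
    pooled_var = (
        (n_a - 1) * statistics.variance(results_a)
        + (n_b - 1) * statistics.variance(results_b)
    ) / df
    t_stat = mean_diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return mean_diff, t_stat, df


mean_diff, t_stat, df = compare_cv_results(aupr_expression, aupr_mut_sigs)
print(f"x-axis (difference of means): {mean_diff:.3f}")
print(f"t statistic: {t_stat:.3f} with {df} degrees of freedom")
```

The p-value itself comes from the t distribution with `df` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(aupr_expression, aupr_mut_sigs)` returns the same statistic together with a two-sided p-value.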

nrosed commented 3 years ago

For 1) yeah, those were the signatures I was thinking of: SBS15 and SBS3.

For 4) ok, that makes sense. I forgot how you were getting the variances to do the t-test. I think that makes sense. I guess if you have paired folds between data types you could do a paired-sample t-test, but I don't think it should really make a difference. Thanks for the clarification!
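For reference, the paired variant mentioned above could look something like this. It is only a sketch: it assumes the same 8 train/holdout splits are used for both data types, so that results can be matched fold by fold, and the AUPR values and the `paired_t_statistic` helper are made up for illustration.

```python
import math
import statistics

# Made-up AUPR values; index i in both lists is assumed to be the SAME CV fold.
aupr_expression = [0.62, 0.58, 0.65, 0.60, 0.63, 0.59, 0.61, 0.64]
aupr_mut_sigs = [0.41, 0.45, 0.39, 0.44, 0.40, 0.43, 0.42, 0.38]


def paired_t_statistic(results_a, results_b):
    """Return (mean per-fold difference, paired t statistic, degrees of freedom)."""
    if len(results_a) != len(results_b):
        raise ValueError("paired test needs one result per fold for each data type")
    # Work on per-fold differences, which removes fold-to-fold variation.
    diffs = [a - b for a, b in zip(results_a, results_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_err, n - 1


paired_diff, paired_t, paired_df = paired_t_statistic(aupr_expression, aupr_mut_sigs)
print(f"mean per-fold difference: {paired_diff:.3f}")
print(f"paired t statistic: {paired_t:.3f} with {paired_df} degrees of freedom")
```

With SciPy, `scipy.stats.ttest_rel(aupr_expression, aupr_mut_sigs)` gives the same statistic plus a two-sided p-value. The difference of means (the x-axis value) is the same in the paired and unpaired versions; only the test statistic and degrees of freedom change.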