greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Microsatellite instability (MSI) prediction #60

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

As a positive control for the mutational signatures data, we wanted to try predicting microsatellite instability status of TCGA samples from the -omics datasets that we're using for mutation prediction. MSI classes were experimentally determined in a few cancer types where it occurs frequently (COAD/READ, STAD, UCEC) from targeted sequencing panels as part of the TCGA Pan-Cancer Atlas; in other cancer types it's very rare and probably wouldn't be worth training a classifier for.
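
As a rough sketch of the label setup described above, the snippet below restricts a sample table to the cancer types with MSI calls and binarizes the status, with MSI-high as the positive class. The sample IDs, the `samples` records, the `build_msi_labels` helper, and the choice to group MSI-L with MSS as the negative class are illustrative assumptions, not taken from this PR's code.

```python
# Cancer types where MSI status was determined (per the description above).
MSI_CANCER_TYPES = {"COAD", "READ", "STAD", "UCEC"}

# Toy records; sample IDs and statuses are made up for illustration.
samples = [
    {"sample_id": "S1", "cancer_type": "COAD", "msi_status": "MSI-H"},
    {"sample_id": "S2", "cancer_type": "STAD", "msi_status": "MSS"},
    {"sample_id": "S3", "cancer_type": "UCEC", "msi_status": "MSI-L"},
    {"sample_id": "S4", "cancer_type": "BRCA", "msi_status": "MSS"},
    {"sample_id": "S5", "cancer_type": "READ", "msi_status": None},
]


def build_msi_labels(records, cancer_types=MSI_CANCER_TYPES):
    """Return {sample_id: 0 or 1}, with MSI-H as the positive class.

    Samples outside the selected cancer types, or without an MSI call,
    are dropped. Grouping MSI-L with MSS is an assumption of this sketch.
    """
    labels = {}
    for rec in records:
        if rec["cancer_type"] not in cancer_types:
            continue
        if rec["msi_status"] is None:
            continue
        labels[rec["sample_id"]] = int(rec["msi_status"] == "MSI-H")
    return labels


labels = build_msi_labels(samples)
```

Here `S4` is dropped because BRCA is not one of the selected cancer types, and `S5` is dropped because it has no MSI call.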

In previous work, this has been a fairly straightforward classification problem where genomic data generally performs well. In addition, the COSMIC single-base signatures that we're using have a few annotations for microsatellite instability, so we wanted to see if those show up as strong predictors.

Our results are pretty much what we expected. The one new finding is that standardizing the mutational signatures data before classification seems to improve MSI classification performance, so we're planning to make that change for the mutation prediction experiments as well.
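
For reference, here is a minimal sketch of what standardizing the signature features could look like. It assumes per-feature z-scoring with statistics estimated on the training split only; the `standardize` helper and the toy exposure matrix are hypothetical and not the code in this PR.

```python
import numpy as np


def standardize(train, test, eps=1e-8):
    """Z-score each feature column using training-set statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against constant columns (e.g. a signature that is zero
    # in every training sample) to avoid dividing by zero.
    std = np.where(std < eps, 1.0, std)
    return (train - mean) / std, (test - mean) / std


rng = np.random.default_rng(0)
# Toy stand-in for a samples x signatures matrix whose columns live on
# very different scales; the values are synthetic.
X = rng.gamma(shape=1.0, scale=[1.0, 50.0, 500.0], size=(100, 3))
X_train, X_test = X[:80], X[80:]

X_train_z, X_test_z = standardize(X_train, X_test)
```

Reusing the training mean and standard deviation for the held-out samples keeps test-set information out of the preprocessing step. In scikit-learn, `StandardScaler` inside a `Pipeline` gives the same behavior within cross-validation.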

jjc2718 commented 3 years ago

Looks like GitHub Actions is down right now, which is why the tests are failing - I'll rerun them when it's back up. https://www.githubstatus.com/incidents/7p1nnvkgh96y

jjc2718 commented 3 years ago

> Why is predicting MSI a positive control for predicting mutations?

Sorry, maybe this was a little confusing because we've been talking about predicting mutations for so long. We're using MSI prediction as a control for the mutational signatures -omics features (one of the data types we used in addition to expression, DNA methylation, etc.). Mutational signatures performed really poorly as features for the mutation prediction task, so we wanted to make sure that they work for a different prediction task where we know they should do well.

Tumors that are microsatellite instability-high (our positive class in these experiments) typically have mutations in DNA damage repair genes, which lead to distinct hypermutation patterns in "microsatellite" DNA fragments (and elsewhere, but the microsatellites are easy to find and assay using PCR so they're a convenient clinical marker of defective DNA repair). Since these patterns affect DNA directly, mutational signatures (essentially a reduced-dimensional version of the tumor's somatic mutation profile) should be a more direct readout than things like gene expression and DNA methylation.

So that's a very long-winded way of saying that MSI prediction is a different problem (i.e., a different set of labels) from mutation prediction, but it should be a good control for the mutational signatures input data. Hope that clarifies a bit!