im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0

Mutations observed in contiguous sites within a sample. #7

Closed madeleinedarbyshire closed 6 years ago

madeleinedarbyshire commented 6 years ago

Hi,

I'd like to improve the accuracy of my results by addressing the following warning:

Warning message:
In dndscv(df) : Mutations observed in contiguous sites within a sample. 
Please annotate or remove dinucleotide or complex substitutions for best results.

My data only contains SNVs. I tried looking for contiguous mutations (i.e. on the same chromosome and at positions x and x+1) but failed to find any. How close do mutations need to be to be considered contiguous? Is there anything else that might trigger this warning?

Many Thanks,

Maddie

im3sanger commented 6 years ago

Hi Maddie,

This warning should only be issued when you have two consecutive mutations. Could you check this again in your data? You can check this with the code below.

If you get this warning without true consecutive mutations, please send me an email with an example dataset where I can reproduce this warning.

Best wishes, Inigo

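# Sort by sample, chromosome and position so that neighbouring mutations sit in adjacent rows
mutations = mutations[order(mutations$sampleID,mutations$chr,mutations$pos),]
# Indices of rows whose position is exactly 1 bp below the position in the next row
ind = which(diff(mutations$pos)==1)
# Show both members of every flagged pair
mutations[unique(sort(c(ind,ind+1))),]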
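
One caveat when reading the output of the snippet above: `diff()` compares every pair of neighbouring rows, so two rows that straddle a sample or chromosome boundary can also be flagged if their positions happen to differ by 1. The following is a minimal sketch of a stricter check, not part of dndscv itself. The function name `find_contiguous` is hypothetical, and the columns `sampleID`, `chr` and `pos` are assumed to be named as in the snippet above.

```r
# Toy input (made up for illustration) with columns sampleID, chr, pos, ref, mut
mutations <- data.frame(
  sampleID = c("S1", "S1", "S1", "S2"),
  chr      = c("1", "1", "2", "1"),
  pos      = c(100, 101, 102, 103),
  ref      = c("C", "C", "G", "A"),
  mut      = c("T", "T", "A", "G"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return pairs of mutations at positions x and x+1
# that also share the same sample and the same chromosome.
find_contiguous <- function(mutations) {
  m <- mutations[order(mutations$sampleID, mutations$chr, mutations$pos), ]
  n <- nrow(m)
  if (n < 2) return(m[0, ])
  sid <- as.character(m$sampleID)
  chr <- as.character(m$chr)
  # Compare each row with the next one: 1 bp apart, same sample, same chromosome
  adjacent <- diff(m$pos) == 1 & sid[-1] == sid[-n] & chr[-1] == chr[-n]
  ind <- which(adjacent)
  m[unique(sort(c(ind, ind + 1))), ]
}

contiguous <- find_contiguous(mutations)
print(contiguous)
```

In the toy data every neighbouring pair of rows is 1 bp apart after sorting, but only the first two rows share both a sample and a chromosome, so only they are returned.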

madeleinedarbyshire commented 6 years ago

Hi Inigo,

Thank you for clarifying that. My apologies, there was a bug in my code that prevented me from finding the contiguous mutations. I'm getting much more accurate results now that I have cleared up this and my previous issue. Thanks for all your help.

Maddie

im3sanger commented 6 years ago

No problem, I am glad it helped!

Inigo

Slavatron commented 5 years ago

Wondering if there are any suggested practices when it comes to removing contiguous mutations? When I encountered them, my first instinct was to loop through each set of adjacent mutations, arbitrarily choose the one with the furthest upstream 'pos' value and discard the rest - my rationale was that this seemed like a relatively simple and unbiased way of solving the problem. Then I got the idea to add a rule prioritizing SNVs over INDELs when given a choice - my rationale was that INDELs introduce more uncertainty than SNVs. Then it occurred to me that it might be safer (and certainly easier!) to just remove ALL adjacent mutations. I can imagine there might not be any single "best" solution to this problem but am hoping there are at least some general guidelines or principles for how to decide what to do about contiguous mutations. Thanks.

im3sanger commented 5 years ago

Hello,

Contiguous mutations typically reflect a single mutational event, such as a dinucleotide substitution. They can be very common in melanomas (CC>TT) or lung cancers (CC>AA), for example. If you have evidence of many unannotated dinucleotide substitutions, the best way to deal with them is to annotate them as such: ref=CC, mut=TT, and dndscv will treat them separately (in the indel category). Certain mutational processes cause complex substitutions, such as CCT>AG. Those can be wrongly annotated as multiple substitutions and indels, and again, the best way to deal with them is to annotate them as a complex substitution (ref=CCT, mut=AG). Another common cause of apparently contiguous mutations is false positive substitutions near or adjacent to genuine indels, caused by misalignment of read ends carrying an indel (mapping algorithms can introduce one or two substitutions at the end of a read instead of opening an indel).
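
As a rough illustration of the dinucleotide annotation described above (ref=CC, mut=TT), here is a minimal sketch that merges isolated pairs of adjacent single-base substitutions within a sample into one row. It is not part of dndscv. The function name `merge_dinucleotides` is hypothetical, and the columns `sampleID`, `chr`, `pos`, `ref` and `mut` are an assumption about the input table. Without read-level data the code cannot tell whether the two changes sit on the same reads, so it simply assumes that every isolated adjacent pair is a single event. Runs of three or more adjacent substitutions are left untouched.

```r
# Toy input (made up for illustration)
mutations <- data.frame(
  sampleID = c("S1", "S1", "S1", "S2"),
  chr      = c("1", "1", "1", "1"),
  pos      = c(100, 101, 200, 101),
  ref      = c("C", "C", "G", "C"),
  mut      = c("T", "T", "A", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse isolated adjacent SNV pairs into dinucleotides
merge_dinucleotides <- function(mutations) {
  m <- mutations[order(mutations$sampleID, mutations$chr, mutations$pos), ]
  m$ref <- as.character(m$ref)
  m$mut <- as.character(m$mut)
  n <- nrow(m)
  if (n < 2) return(m)
  nt <- c("A", "C", "G", "T")
  is_snv <- m$ref %in% nt & m$mut %in% nt
  sid <- as.character(m$sampleID)
  chr <- as.character(m$chr)
  # adj[i] is TRUE when rows i and i+1 are SNVs 1 bp apart in the same sample and chromosome
  adj <- diff(m$pos) == 1 & sid[-1] == sid[-n] & chr[-1] == chr[-n] &
    is_snv[-1] & is_snv[-n]
  # Keep only pairs that are not part of a longer run of adjacent mutations
  prev_adj <- c(FALSE, adj[-length(adj)])
  next_adj <- c(adj[-1], FALSE)
  first <- which(adj & !prev_adj & !next_adj)
  if (length(first) == 0) return(m)
  # Concatenate the two bases into the first row, then drop the second row
  m$ref[first] <- paste0(m$ref[first], m$ref[first + 1])
  m$mut[first] <- paste0(m$mut[first], m$mut[first + 1])
  m <- m[-(first + 1), ]
  rownames(m) <- NULL
  m
}

merged <- merge_dinucleotides(mutations)
print(merged)
```

In the toy data the two S1 substitutions at positions 100 and 101 become a single CC>TT row at position 100, while the S2 mutation at position 101 stays separate because it belongs to a different sample.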

The best practice is to use mutation calling or mutation annotation software that annotates dinucleotides, complex substitutions and false positive substitutions near indels. If two mutations are adjacent but represent two genuinely independent events (i.e. they occur in different reads), they are best left separate when working with dndscv.

Some of these decisions require having access to bam files and I appreciate that these are not always available when working with public somatic mutation calls. In that case, if the quality of the calls is dubious, one could choose to filter out substitutions close to indels (say within 10bp of an indel). And if the mutation file contains frequent pairs of mutations and no annotation of dinucleotides or complex mutations, it may be safer to group these mutations into dinucleotides or complex events.
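
The filter suggested above for dubious calls (dropping substitutions within about 10bp of an indel) could be sketched as follows. This is an illustration under stated assumptions, not a dndscv function. The name `filter_near_indels` and its `window` argument are hypothetical. The columns `sampleID`, `chr`, `pos`, `ref` and `mut` are assumed. A row is treated as an indel when `ref` or `mut` is "-" or when the two alleles differ in length. Distance is measured from the recorded `pos` only, which ignores the length of long deletions.

```r
# Toy input (made up for illustration): one insertion in S1 at position 100
mutations <- data.frame(
  sampleID = c("S1", "S1", "S1", "S2"),
  chr      = c("1", "1", "1", "1"),
  pos      = c(100, 105, 150, 103),
  ref      = c("-", "C", "G", "C"),
  mut      = c("A", "T", "A", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove substitutions within `window` bp of an indel
# in the same sample and on the same chromosome. Indels themselves are kept.
filter_near_indels <- function(mutations, window = 10) {
  sid <- as.character(mutations$sampleID)
  chr <- as.character(mutations$chr)
  ref <- as.character(mutations$ref)
  mut <- as.character(mutations$mut)
  nt <- c("A", "C", "G", "T")
  is_snv <- ref %in% nt & mut %in% nt
  # Assumption: "-" or unequal allele lengths mark an indel (or complex event)
  is_indel <- ref == "-" | mut == "-" | nchar(ref) != nchar(mut)
  drop <- rep(FALSE, nrow(mutations))
  for (i in which(is_snv)) {
    nearby <- is_indel & sid == sid[i] & chr == chr[i] &
      abs(mutations$pos - mutations$pos[i]) <= window
    drop[i] <- any(nearby)
  }
  mutations[!drop, ]
}

filtered <- filter_near_indels(mutations, window = 10)
print(filtered)
```

In the toy data the S1 substitution at position 105 is removed because it lies 5 bp from the S1 insertion. The S1 substitution at 150 and the S2 substitution at 103 are kept, the latter because the indel is in a different sample.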

I hope this helps, Inigo