im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0
212 stars 48 forks source link

DNDS-cv dropping lots of mutations #34

Closed ndfriedman closed 5 years ago

ndfriedman commented 5 years ago

I am running DNDS-cv on targeted sequencing data and my some of my mutations are disappearing and not appearing in the output in a way that is affecting my results. I am looking at hypermutated cancers on a targeted panel so I run:

dndscv(maf, gene_list = myGeneList, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf). Ive also tried adding the parameter min_indels = 1. My variants were called with GRCh37

dndscv runs and only catches a handful of mutations that it excludes and does not exclude any samples. But when I go and look at the annotmuts or sel_cv objects that dndscv returns, many (between 30-50%) of my mutations have disappeared. Many of these were intronic mutations which makes sense. But I am still losing a substantial number of variants without any comment or documentation from dndscv. I've tried to see if there were any biases, but it seems like this loss of variants is occurring for all genes and all mutation types (missense, indel, silent, splice) and all samples. I also seem to be even losing hotspot mutations sometimes. My guess is the error some how relates to alignment because the loss of the variants seems very location specific. For example in my cohort dndscv keeps a KRAS p.Q61K (25380277 G->T) mutation but silently discards a KRAS p.Q61H mutation (25380275 T->G). Anyways, do you have any insight on the source of this error? It is pretty bizarre as even though dndscv is ditching many of my mutations it seems to do so in an unbiased way for most genes so my results still seem reasonable. But something is definitely up. Also for what its worth when dnds tells me '3 (0.027%) mutations have a wrong reference base' its basing that on the number of variants it will eventually deliver me in the annotmuts object rather than the true number of variants that were in my input maf.
Anyways, thanks so much for helping!

im3sanger commented 5 years ago

Hello,

Thank you for your message.

I am not sure what could be behind your problem as I have never seen this behaviour before. The example below using the two mutations that you mention in KRAS shows that dndscv should be annotating them correctly:

library(dndscv) m = data.frame(sampleID="Test", chr="12", pos=c(25380275,25380277), ref=c("T","G"), mut=c("G","T")) dndsout = dndscv(m, outp=1) # outp=1 to see only the annotation of variants

This gives the correct annotation for both mutations (dndsout$annotmuts): sampleID chr pos ref mut gene strand ref_cod mut_cod ref3_cod mut3_cod 1 Test 12 25380275 T G KRAS -1 A C AAG ACG 2 Test 12 25380277 G T KRAS -1 C A TCA TAA aachange ntchange codonsub impact pid 1 Q61H A183C CAA>CAC Missense ENSP00000256078 2 Q61K C181A CAA>AAA Missense ENSP00000256078

Is it possible that the mutation table that you are loading into R is truncated? Loading some text files into R using read.table can lead to truncated tables because of certain characters (the "comment.char" argument in read.table can help in some of these cases). If this is not the case, please check that the chromosome names are correct (for GRCh37, use "12" instead of "chr12") and that the format of the mutation in the input table is correct.

dndscv will always exclude non-coding variants, defined as those variants not within the GRanges object of all coding genes (incl. essential splice sites) listed in the RefCDS object and in the gene_list argument. But dndscv should not drop coding variants without a warning.

If the suggestions above do not solve your problem, please feel free to send me a small example file that I can use to reproduce and further investigate your problem.

I hope this helps, Inigo

ndfriedman commented 5 years ago

Inigo--that helps a lot, I think I may have been wrong about the KRAS mutations being dropped then. I think part of the confusion may be arising from the sel_cv object. When I look inside and see the columns n_syn, n_mis, n_ind etc are these reporting the number of distinct alleles of each class observed mutated in the gene or the total number of mutations observed?

im3sanger commented 5 years ago

Hi. For point mutations, in the sel_cv object the columns n_syn, n_mis, n_non and n_spl indicate the total number of synonymous, missense, nonsense and essential splice site mutations in the gene, annotated with respect to the reference transcript from each gene used by RefCDS. For indels, however, the column indicates the number of indels used by dndscv to study selection. By default, dndscv only counts the number of unique sites with indels, so that if the same site has an indel across multiple patients, they are only counted once. This is to protect against false positives due to microsatellites or recurrent artefacts (see this paper for more details). You can use the argument "use_indel_sites=F" to count all indels instead.

Could this explain the confusion?

Best, Inigo

ndfriedman commented 5 years ago

Yes, thank you that does resolve my confusion. Thanks for the help!