im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0

Error using dndscv #5

Closed Salvobioinfo closed 6 years ago

Salvobioinfo commented 6 years ago

Hello,

I am trying to run the package on a set of mutations that were originally mapped to the hg38 genome and then converted to hg19 using liftOver. Whenever I run dndscv, I get this error message:

[1] Loading the environment...
[2] Annotating the mutations...
Note: 21 mutations removed for exceeding the limit of mutations per gene per sample
Error in dndscv(DF_1, refdb = "hg19", max_muts_per_gene_per_sample = 3, : 13 mutations have a wrong reference base, please correct and rerun.

Is there a way to identify the rows containing the wrongly mapped mutations?

Thanks in advance. Salvatore

im3sanger commented 6 years ago

Hi Salvatore,

Thanks. Note that this is not a bug in dndscv but a problem with your input table of mutations, which contains mutation calls where the reference base does not match the reference base at this site in the reference genome. You will need to remove these mutations from your input table to run dndscv.

When implementing dndscv, I decided to use an "error" (stopping the execution) rather than a "warning" when some mutations in the input table have the wrong reference base annotation. I could change this, but I wanted the user to review and correct the input table of mutations to avoid confusion (otherwise one could get an output from dndscv running mutations from other assemblies or species by mistake).

You can use other software to find which mutations in your input table have a wrong reference annotation (I could also output this in a future version of dndscv). For example, you can use the scanFa function in the Rsamtools package to get the reference base for a set of sites (see example below).

I hope this helps.

Best wishes, Inigo

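library(GenomicRanges)
library(Rsamtools)
# genomeFile is the path to a fasta file for hg19 (it must be indexed, e.g. with indexFa)
sites = GRanges(mutations$chr, IRanges(mutations$pos, mutations$pos))
mutations$ref_hg19 = as.character(scanFa(genomeFile, sites))
# TRUE for every row whose annotated reference base differs from the hg19 base
wrong_refs = mutations$ref_hg19 != mutations$ref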
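
To answer the original question more directly, a logical vector like wrong_refs can be used to list the mismatching rows and to drop them before rerunning. The following is only a minimal sketch: the toy data frame stands in for a table that already carries the ref_hg19 column, and the names bad_rows and mutations_clean are illustrative. The toupper calls are a precaution, on the assumption that the fasta file may store soft-masked sequence in lowercase.

```r
# Toy stand-in for the annotated table (hypothetical values, for illustration only)
mutations = data.frame(
  sampleID = c("S1", "S1", "S2", "S2"),
  chr      = c("1", "1", "2", "X"),
  pos      = c(100, 250, 300, 400),
  ref      = c("A", "C", "G", "T"),
  mut      = c("G", "T", "A", "C"),
  ref_hg19 = c("A", "c", "T", "t"),  # as if returned by scanFa
  stringsAsFactors = FALSE
)

# Compare case-insensitively, in case the fasta file is soft-masked
wrong_refs = toupper(mutations$ref_hg19) != toupper(mutations$ref)

# Rows to inspect (or to lift over again)
bad_rows = mutations[wrong_refs, ]

# Rows that agree with the reference, restricted to the original five columns
mutations_clean = mutations[!wrong_refs, c("sampleID", "chr", "pos", "ref", "mut")]

print(bad_rows)
```

With a real table, bad_rows would be the place to look for sites that liftOver moved onto a different base or onto the opposite strand.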

im3sanger commented 6 years ago

Hi again,

Since others using liftOver have reported the same problem, I have just modified the dndscv package to tolerate some mutations with the wrong reference base (up to 10% of all input mutations). The dndscv function now excludes these mutations and continues the execution, while also outputting the table of mutations with wrong bases.

Best wishes, Inigo
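
The tolerance rule described in this comment can be pictured roughly as follows. This is only a sketch of the stated behaviour (exclude the mismatching mutations, stop if they exceed 10% of the input) and not the package's actual code. The function filter_wrong_refs, its arguments, and the names of the returned list elements are all hypothetical.

```r
# Hypothetical helper illustrating the described rule; this is not dndscv's source code
filter_wrong_refs = function(mutations, genome_base, max_wrong_frac = 0.10) {
  # genome_base: reference base at each mutated site, one per row of mutations
  wrong = toupper(genome_base) != toupper(mutations$ref)
  frac_wrong = sum(wrong) / max(1, length(wrong))

  # Too many mismatches suggests the wrong assembly or species, so stop
  if (frac_wrong > max_wrong_frac) {
    stop(sprintf("%d mutations (%.1f%%) have a wrong reference base.",
                 sum(wrong), 100 * frac_wrong))
  }

  # A small number of mismatches is excluded and reported instead
  if (any(wrong)) {
    warning(sprintf("%d mutations with a wrong reference base were excluded.",
                    sum(wrong)))
  }
  list(kept = mutations[!wrong, , drop = FALSE],
       excluded = mutations[wrong, , drop = FALSE])
}

# Toy example: 1 mismatch out of 20 mutations (5%) is tolerated and reported
muts = data.frame(pos = 1:20, ref = rep("A", 20), stringsAsFactors = FALSE)
genome_base = c("C", rep("A", 19))
res = suppressWarnings(filter_wrong_refs(muts, genome_base))
print(res$excluded)
```

Keeping a hard stop above the threshold preserves the original safeguard against running a table from another assembly or species by mistake, while a handful of liftOver mismatches no longer blocks the analysis.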