im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0
212 stars 48 forks source link

RefCDS discrepancy when loading human GRCh38? #32

Closed Slavatron closed 5 years ago

Slavatron commented 5 years ago

I'm trying to run dndscv with Human GRCh38 as the reference. I've filtered my data to only include exonic SNVs and formatted it like this:

> head(v2)
     sampleID chr      pos ref mut
567   X2969_A   1 13175471   G   A
777   X2969_A   1 16759300   G   T
1308  X2969_A   1 26780101   C   G
1465  X2969_A   1 32964119   C   G
1469  X2969_A   1 33013382   G   A
1588  X2969_A   1 38874801   T   C

...with all columns being "character" class, except for pos which is class "integer". I've downloaded the file you provided, RefCDS_human_GRCh38.p12.rda, and attempted to run the following command:

dndsout = dndscv(v2,refdb = "~/Downloads/RefCDS_human_GRCh38.p12.rda", cv=NULL)

...which returns an error:

Zero coding substitutions found in this dataset. Unable to run dndscv.

After reading this Biostars post and this issue, I decided to manually check for overlap between my mutations and the GRCh38 reference file you provided. This lead me to discover a strange discrepancy: with an empty environment I ran:

load("~/Downloads/RefCDS_human_GRCh38.p12.rda")

...which created two objects named "gr_genes" and "RefCDS". When I examine the RefCDS object in Rstudio by first clicking on it in the upper-right "environment" window then clicking on the 4th element it appears to show the gene A2ML1 with coordinates from GRCh37; specifically the "intervals_cds" vector contains the following values:

8975238 8975778 8976316 8982323 ...

However, when I query the same gene using the console it returns coordinates like I would expect:

> RefCDS[[4]]$gene_name
[1] "A2ML1"
> RefCDS[[4]]$intervals_cds
         [,1]    [,2]
 [1,] 8822652 8822713
 [2,] 8823182 8823365
...
[35,] 8874971 8875011

I'm wondering if this discrepancy could somehow be causing the error message I'm getting. My dataset definitely contains non-synonymous A2ML1 mutations; the following command demonstrates this using GRCh38 coordinates from the previous command:


> head(v2[which(v2$chr == 12 & v2$pos > 8822652 & v2$pos < 8875011),])
        sampleID chr     pos ref mut
145577   X2964_A  12 8841524   C   T
312738   X2965_A  12 8858075   G   A
1092784  X1107_A  12 8843160   A   G
1204223  X2979_A  12 8851916   G   A
1204224  X2979_A  12 8857183   C   T
1204225  X2979_A  12 8857224   G   A

Although only the bottom row of this output is a non-synonymous SNV, it might be worth mentioning that roughly half the dataset is non-synonymous SNVs.

Here's what I'm working with:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0  IRanges_2.18.1      
[4] S4Vectors_0.22.0     BiocGenerics_0.30.0  dndscv_0.0.1.0      
[7] reshape2_1.4.3      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             Rsamtools_2.0.0        Biostrings_2.52.0     
 [4] MASS_7.3-51.1          bitops_1.0-6           plyr_1.8.4            
 [7] magrittr_1.5           stringi_1.4.3          zlibbioc_1.30.0       
[10] XVector_0.24.0         seqinr_3.4-5           BiocParallel_1.18.0   
[13] tools_3.6.0            ade4_1.7-13            stringr_1.4.0         
[16] RCurl_1.95-4.12        compiler_3.6.0         GenomeInfoDbData_1.2.1
im3sanger commented 5 years ago

Hello,

Thank you for your interest in dNdScv.

I suspect that the problem is that you are using chromosome names as "1", "2"... while the GRCh38 convention in Ensembl is using the chr prefix: "chr1", "chr2". As a result, dndscv does not find any overlaps. I will add a warning in the code to flag this, and I will consider automatically converting chromosome names in the future, although I have been reluctant to do this to avoid possible problems with user-defined names for other species.

For now, just try: v2$chr = paste("chr",v2$chr,sep="") dndsout = dndscv(v2,refdb = "~/Downloads/RefCDS_human_GRCh38.p12.rda", cv=NULL)

Also, please note that you do not need to restrict your input table of mutations to coding mutations. The dndscv function does that at the start using its own set of transcripts. It is better to feed all mutations, coding and noncoding, to dndscv and let the function do the filtering for you to avoid introducing any biases accidentally by using different transcript definitions.

I hope this helps! Inigo

im3sanger commented 5 years ago

I have now added a more informative error message based on previous feedback from users: "Zero coding substitutions found in this dataset. Unable to run dndscv. Common causes for this error are inputting only indels or using chromosome names different to those in the reference database (e.g. chr1 vs 1)"

im3sanger commented 5 years ago

By the way, you can see the chromosome names used by RefCDS using: unique(sapply(RefCDS, function(x) x$chr))