im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0
212 stars 48 forks source link

dndscv for other organisms #9

Closed sadafambreen closed 6 years ago

sadafambreen commented 6 years ago

Hi, you said that you will improve dndscv. So, it can be used for other organisms e.g. Mice or Dog. Have you developed it? Or you are still working on it?

im3sanger commented 6 years ago

Hello,

I am afraid I am still working on it. This is definitely my top priority for the next release.

Best wishes, Inigo

sbamin commented 6 years ago

Hi @im3sanger , this would definitely of much interest.

I work on canine tumors and checking ways to compare cross-species mutational rates. For now, covariates is a major limitation for driver discovery approach, and at most I am thinking of factoring gene size and gene expression for mutsigcv. However, for estimating background mutation rate (both coding and global), I prefer to use approach based on dN/dS. I wonder if I can implement traditional dN/dS and dNdSloc model using your code. I read methods section in Cell 2017 paper (page e3) to use separate t parameter for each gene. I believe that is coming from refcds_hg19.rda, and I like to see if I can make similar refDatabase for CanFam3.1 genome. I can give a shot to make most of refCDS for canine except $L matrix. It seem like a trinucleotide context with observed frequency of one of four bases. If you provide helpful hints to get me started, I can build and test run on canine data.

> RefCDS[[1]][1:15]
$gene_name
[1] "A1BG"

$gene_id
[1] "ENSG00000121410"

$protein_id
[1] "ENSP00000263100"

$CDS_length
[1] 1488

$chr
[1] "19"

$strand
[1] -1

$intervals_cds
         [,1]     [,2]
[1,] 58858388 58858395
[2,] 58858719 58859006
[3,] 58861736 58862017
[4,] 58862757 58863053
[5,] 58863649 58863921
[6,] 58864294 58864563
[7,] 58864658 58864693
[8,] 58864770 58864803

$intervals_splice
 [1] 58858396 58858397 58858714 58858717 58858718 58859007 58859008 58861731
 [9] 58861734 58861735 58862018 58862019 58862752 58862755 58862756 58863054
[17] 58863055 58863644 58863647 58863648 58863922 58863923 58864289 58864292
[25] 58864293 58864564 58864565 58864653 58864656 58864657 58864694 58864695
[33] 58864765 58864768 58864769

$seq_cds
  1488-letter "DNAString" instance
seq: ATGTCCATGCTCGTGGTCTTTCTCTTGCTGTGGGGT...AGCGACCCTGTGGAGCTCCTGGTGGCAGAAAGCTGA

$seq_cds1up
  1488-letter "DNAString" instance
seq: CATGTCCATGCTCGTGGTCTTTCTCTTGCTGTGGGG...CAGCGACCCTGTGGAGCTCCTGGTGGCAGAAAGCTG

$seq_cds1down
  1488-letter "DNAString" instance
seq: TGTCCATGCTCGTGGTCTTTCTCTTGCTGTGGGGTG...GCGACCCTGTGGAGCTCCTGGTGGCAGGAAGCTGAT

$seq_splice
  35-letter "DNAString" instance
seq: GACTGGAGTGGAGTGGACTGGAGTGGAGTGGAGTG

$seq_splice1up
  35-letter "DNAString" instance
seq: ATAGGACAGGACAGGACAGGACAGAACAGTACAGG

$seq_splice1down
  35-letter "DNAString" instance
seq: AGGGTGGCGTAGCGTCGTGTAGCGTTGTGTGGGGT

$L
       [,1] [,2] [,3] [,4]
  [1,]    0    7    0    0
  [2,]    3    4    0    0
  [3,]    0    7    0    0
  [4,]    0   12    0    0
  [5,]    1   11    0    0
  [6,]    0   12    0    0
  ...
  [189,]    1    7    0    0
  [190,]    1    9    0    0
  [191,]    2    8    0    0
  [192,]    0   10    0    0

Would I also be needing to change substmodel R object? Can you explain what those suffixes are in it?

[191,] "t*TTT>TCT" "t*TTT>TCT*wmis" "t*TTT>TCT*wnon" "t*TTT>TCT*wspl"
[192,] "t"         "t*wmis"         "t*wnon"         "t*wspl"
im3sanger commented 6 years ago

Hi sbamin,

Thanks for your interest in dNdScv. Actually, I have already generated the CanFam3.1 RefCDS object for someone else and you can download it from here.

You can run dndscv with this object using the following line:

dndsout = dndscv(mutations, refdb="Path-to-RefCDS-file", cv=NULL)

Covariates are useful as they can increase sensitivity and specificity by reducing unexplained variation in the mutation rate across genes. But they are not essential and, in fact, their effect on dNdScv can be modest. You can see this with the example dataset provided:

data("dataset_simbreast", package="dndscv")
dndsout = dndscv(mutations)
dndsout_nocovariates = dndscv(mutations, cv=NULL)

Note that, in the example dataset, only one gene is lost from the list of significant genes when running dNdScv without covariates.

print(dndsout$sel_cv[dndsout$sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")])
print(dndsout_nocovariates$sel_cv[dndsout_nocovariates$sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")])

And the global dN/dS estimates remain unchanged since covariates are only used in the gene-wise regression model.

I hope to be able to provide the code to generate a RefCDS for any assembly in the very near future.

Best wishes, Inigo

sbamin commented 6 years ago

Thanks a lot, Iñigo!



Samir
im3sanger commented 6 years ago

Hi Samir,

Sorry, I just edited my post.

Also, to briefly expand on my message, dNdScv can perform well without covariates as it already accounts for the variation of the mutation rate across genes using a Gamma distribution to model unexplained variation in the density of synonymous mutations across genes. You can see how removing the covariates leads to a lower "theta" value, which means a wider Gamma distribution. But the impact can be modest in many datasets. The value of "theta" is printed when running the method and available from the output using dndsout$nbreg$theta.

Best, Inigo

sbamin commented 6 years ago

Great and I will keep those in mind. In fact, Cell 2017 methods section is very detailed and helpful to learn dNDScv approach and effect of modeling by Gamma vs Poisson distribution.

Let me run based on your comments, and will get back here with feedback.

im3sanger commented 6 years ago

For those working with mouse (GRCm38), here is a link to the RefCDS.

Inigo

im3sanger commented 6 years ago

Hi again,

I have released a new version of the package with a function to generate the RefCDS object for any species or assembly.

Best wishes, Inigo