Simulating DNA sequences with given mutation rate

jphill01 commented 6 years ago

I'm wondering if there is any interest in developing a general function that simulates DNA sequences of a given number, basepair length and mutation rate.

Something like (very rudimentary):

sim.seqs <- function(num.seqs, length.seqs, theta) {

nucl <- as.DNAbin(c("a", "c", "g", "t")) # define DNA alphabet
res <- replicate(length.seqs, sample(nucl, size = num.seqs, replace = TRUE)) # create the sequences
res <- res[sample(nrow(res), size = num.seqs, replace = TRUE), ] # duplicate some sequences (without specifying mu)
class(res) <- "DNAbin" # coerce to class DNAbin

}

theta would be the population mutation rate (4Nmu), where N is effective population size (or perhaps just num.seqs), mu is the per generation (forward) mutation rate.

It could start off by generating a single ancestral sequence (or maybe pulling a random one from GenBank or BOLD for a particular species).

The output of such a function would conveniently be a DNAbin object for direct analysis with the 'ape' and/or 'pegas' R packages.

I know of no such R functions or packages that can accomplish this (without, say, simulating a stochastic coalescent birth-death process and generating a phylogenetic tree).

emmanuelparadis commented 6 years ago

Yes there is undoubtedly interest in simulating DNA sequences under some population genetic models. Note that there is a new function in ape (currently in the testing version 5.1-4), so that the first two lines of your function could be replaced by:

res <- rDNAbin(nrow = num.seqs, ncol = length.seqs)

This will generate totally random sequences, so you'll have to introduce how much they diverged in some way at some stage.

jphill01 commented 6 years ago

Thanks! Do you know where I can find the source code for rDNAbin? I am not able to download version 5.1-4 from the ape website. I imagine though that the code is much like the first two lines of my rudimentary code above.

emmanuelparadis commented 6 years ago

See this link:

http://ape-package.ird.fr/ape_installation.html#versions

I've just uploaded ape 5.1-5 and updated the NEWS file:

http://ape-package.ird.fr/NEWS

jphill01 commented 6 years ago

Thanks again. Any idea what substitution models will be implemented (p-distance, Jukes-Cantor, Kimura-2-Parameter, etc.) and when this might occur?

emmanuelparadis commented 6 years ago

Generally in population genetics, the time frame is short so that multiple mutations/substitutions are not a problem when comparing sequences (hence the infinite alleles model). That said, simulating under JC69 or K80 model is as easy a under an uncorrected (infinite alleles) model. There is an implementation of these in phangorn with the function simSeq (though this requires a tree).

jphill01 commented 6 years ago

Here is Thibaut's haploGen() implementation of DNA sequence generation according to a K2P substitution model (tweaked a little at the beginning and end):

sim.seqs <- function(num.seqs, length.seqs, mu.transi, mu.transv) {
    nucl <- as.DNAbin(c("A", "C", "G", "T"))
    transiset <- list("A" = as.DNAbin("G"), 
                       "C" = as.DNAbin("T"),
                       "G" = as.DNAbin("A"), 
                       "T" = as.DNAbin("C"))
    transvset <- list("A" = as.DNAbin(c("C", "T")),
                         "C" = as.DNAbin(c("A", "G")),
                         "G" = as.DNAbin(c("C", "T")),
                         "T" = as.DNAbin(c("A", "G")))

 res <- sample(nucl, size = length.seqs, replace = TRUE) # generate a random DNA sequence

## create transitions 
transi <- function(res){
  unlist(transiset[as.character(res)])
}

## create transversions
transv <- function(res){
 sapply(transvset[as.character(res)], sample, 1)
}

seq.dupli <- function(res) { 

  ## transitions ##
  n.transi <- rbinom(n = 1, size = length.seqs, prob = mu.transi) # total number of transitions
  if (n.transi > 0) {
    idx <- sample(length.seqs, size = n.transi, replace = FALSE)
    res[idx] <- transi(res[idx])
  }

  ## transversions ##
  n.transv <- rbinom(n = 1, size = length.seqs, prob = mu.transv) # total number of transversions
  if (n.transv > 0) {
    idx <- sample(length.seqs, size = n.transv, replace = FALSE)
    res[idx] <- transv(res[idx])
  }

  res
}

res <- as.matrix.DNAbin(res)
res <- t(matrix(rep(seq.dupli(res), num.seqs), ncol = num.seqs, byrow = TRUE))

class(res) <- "DNAbin"
}

The issue is that the above code doesn't seem to work: it does generate completely random sequences BUT not under a K2P model. Basically, each site is segregating, which is not what one would want. This is verified by collapsing res into unique haplotypes:

h <- sort(haplotype(res), decreasing = TRUE, what = "frequencies")
rownames(h) <- 1:nrow(h)
h # this is equal to num.seqs

I thought this might be a way forward for rDNAbin(), once the bugs are worked out of course.

Any idea why the above code is behaving oddly?

Thanks.

emmanuelparadis commented 6 years ago

Maybe the purpose of the original code was to generate SNPs, so all sites are by definition polymorphic?

jphill01 commented 6 years ago

I checked the actual alignment generated by haploGen() in MEGA and only a few sites are variable; so, this means there is an error in my own code above which I will work to fix.

jphill01 commented 6 years ago

I think this issue can be closed now. The error has been fixed. Many thanks.

thibautjombart / adegenet

Simulating DNA sequences with given mutation rate #241