Closed jphill01 closed 6 years ago
Yes there is undoubtedly interest in simulating DNA sequences under some population genetic models. Note that there is a new function in ape (currently in the testing version 5.1-4), so that the first two lines of your function could be replaced by:
res <- rDNAbin(nrow = num.seqs, ncol = length.seqs)
This will generate totally random sequences, so you'll have to introduce how much they diverged in some way at some stage.
Thanks! Do you know where I can find the source code for rDNAbin? I am not able to download version 5.1-4 from the ape website. I imagine though that the code is much like the first two lines of my rudimentary code above.
See this link:
http://ape-package.ird.fr/ape_installation.html#versions
I've just uploaded ape 5.1-5 and updated the NEWS file:
Thanks again. Any idea what substitution models will be implemented (p-distance, Jukes-Cantor, Kimura-2-Parameter, etc.) and when this might occur?
Generally in population genetics, the time frame is short so that multiple mutations/substitutions are not a problem when comparing sequences (hence the infinite alleles model). That said, simulating under JC69 or K80 model is as easy a under an uncorrected (infinite alleles) model. There is an implementation of these in phangorn with the function simSeq (though this requires a tree).
Here is Thibaut's haploGen() implementation of DNA sequence generation according to a K2P substitution model (tweaked a little at the beginning and end):
sim.seqs <- function(num.seqs, length.seqs, mu.transi, mu.transv) {
nucl <- as.DNAbin(c("A", "C", "G", "T"))
transiset <- list("A" = as.DNAbin("G"),
"C" = as.DNAbin("T"),
"G" = as.DNAbin("A"),
"T" = as.DNAbin("C"))
transvset <- list("A" = as.DNAbin(c("C", "T")),
"C" = as.DNAbin(c("A", "G")),
"G" = as.DNAbin(c("C", "T")),
"T" = as.DNAbin(c("A", "G")))
res <- sample(nucl, size = length.seqs, replace = TRUE) # generate a random DNA sequence
## create transitions
transi <- function(res){
unlist(transiset[as.character(res)])
}
## create transversions
transv <- function(res){
sapply(transvset[as.character(res)], sample, 1)
}
seq.dupli <- function(res) {
## transitions ##
n.transi <- rbinom(n = 1, size = length.seqs, prob = mu.transi) # total number of transitions
if (n.transi > 0) {
idx <- sample(length.seqs, size = n.transi, replace = FALSE)
res[idx] <- transi(res[idx])
}
## transversions ##
n.transv <- rbinom(n = 1, size = length.seqs, prob = mu.transv) # total number of transversions
if (n.transv > 0) {
idx <- sample(length.seqs, size = n.transv, replace = FALSE)
res[idx] <- transv(res[idx])
}
res
}
res <- as.matrix.DNAbin(res)
res <- t(matrix(rep(seq.dupli(res), num.seqs), ncol = num.seqs, byrow = TRUE))
class(res) <- "DNAbin"
}
The issue is that the above code doesn't seem to work: it does generate completely random sequences BUT not under a K2P model. Basically, each site is segregating, which is not what one would want. This is verified by collapsing res into unique haplotypes:
h <- sort(haplotype(res), decreasing = TRUE, what = "frequencies")
rownames(h) <- 1:nrow(h)
h # this is equal to num.seqs
I thought this might be a way forward for rDNAbin(), once the bugs are worked out of course.
Any idea why the above code is behaving oddly?
Thanks.
Maybe the purpose of the original code was to generate SNPs, so all sites are by definition polymorphic?
I checked the actual alignment generated by haploGen() in MEGA and only a few sites are variable; so, this means there is an error in my own code above which I will work to fix.
I think this issue can be closed now. The error has been fixed. Many thanks.
I'm wondering if there is any interest in developing a general function that simulates DNA sequences of a given number, basepair length and mutation rate.
Something like (very rudimentary):
theta would be the population mutation rate (4Nmu), where N is effective population size (or perhaps just num.seqs), mu is the per generation (forward) mutation rate.
It could start off by generating a single ancestral sequence (or maybe pulling a random one from GenBank or BOLD for a particular species).
The output of such a function would conveniently be a DNAbin object for direct analysis with the 'ape' and/or 'pegas' R packages.
I know of no such R functions or packages that can accomplish this (without, say, simulating a stochastic coalescent birth-death process and generating a phylogenetic tree).