I cannot predict the number of non-silent mutations produced by phangorn::simSeq. Whatever I do, my prediction is off. I assume the problem is on my side, but it would be great to use another tool (seq-gen) to verify that phangorn is correct (and hopefully I will find out what I overlooked in the process).
What should the alignment length be to have one mutation per 1K years?
Hypothesis
Each nucleotide mutates once per 15M years
crown_age <- 15 # million years
mutation_rate <- f_dna_used * 1.0 / crown_age # chance per nucleotide per million years
1 nucleotide has a resolution of 15M years.
2 nucleotides have a resolution of 7.5M years.
15 nucleotides have a resolution of 1M years.
15K nucleotides have a resolution of 1K years.
sequence_length <- 15000 # base pairs
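The resolution figures above follow from dividing the crown age by the number of nucleotides. A small sketch of that arithmetic, where resolution_in_years is a hypothetical helper that is not part of the original analysis and one mutation per nucleotide per crown age is assumed:

```r
# Hypothetical helper: years per expected mutation for a given alignment length,
# assuming each nucleotide mutates once per crown age
resolution_in_years <- function(n_nucleotides, crown_age_my = 15) {
  crown_age_my * 1e6 / n_nucleotides
}

# 1, 2, 15 and 15K nucleotides, as in the list above
print(resolution_in_years(c(1, 2, 15, 15000)))
```

With 15K nucleotides this gives 1000 years, which is why sequence_length is set to 15000.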
Methods
1. Simulate a phylogeny with only two branches and the correct crown age
# set.seed(1)
sum_edge_length <- crown_age * 2
# Simulate trees until one has exactly two taxa and thus the desired summed edge length
repeat {
  tree <- TreeSim::sim.bd.age(
    age = crown_age, numbsim = 1, lambda = 1.0 / crown_age,
    mu = 0.0, frac = 1.0, mrca = TRUE, complete = FALSE
  )[[1]]
  if (length(tree$tip.label) == 2) break
}
testit::assert(length(tree$tip.label) == 2) # 2 taxa
# Compare with a tolerance, as edge lengths are floating-point numbers
testit::assert(isTRUE(all.equal(sum(tree$edge.length), sum_edge_length)))
ggtree::ggtree(tree) + ggtree::geom_treescale(x = 0, width = crown_age, color = "red", offset = 0.01)
2. Predict the number of non-silent mutations
On average, all nucleotides will change.
Of these mutations, each one has a 25% chance to be silent,
for example, to go from adenine to adenine.
The number of expected observable mutations is then:
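A minimal sketch of this first estimate, assuming f_dna_used is 1.0 (the full DNA is used) and using a variable name, expected_n_diffs_naive, that is not in the original code:

```r
sequence_length <- 15000 # base pairs
f_dna_used <- 1.0        # assumption: the full DNA is used

# Each site receives one mutation on average, of which 3 in 4 are observable
expected_n_diffs_naive <- sequence_length * f_dna_used * 0.75
print(expected_n_diffs_naive)
```

This estimate ignores that a site can be hit more than once or not at all.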
However, some nucleotides will be picked twice or more,
while others will never be picked.
Here we run a simple simulation to add this to our expectation:
calc_exp_n_diffs <- function(sequence_length, f_dna_used) {
  # Create an alignment of zeroes
  nucleotides <- rep(0, sequence_length)
  # One mutation per base pair, scaled by the fraction of DNA used
  n_mutations <- sequence_length * f_dna_used
  # Pick the indices that will have a mutation (an index can be picked more than once)
  random_indices <- 1 + sort(floor(runif(min = 0, max = sequence_length, n = n_mutations)))
  testit::assert(all(random_indices >= 1))
  testit::assert(all(random_indices <= length(nucleotides)))
  # Put a random nucleotide there; a draw of 0 leaves or restores the original state
  for (random_index in random_indices) {
    nucleotides[random_index] <- sample(x = 0:3, size = 1)
  }
  # Return the number of non-silent nucleotides
  testit::assert(sum(nucleotides != 0) > 0)
  sum(nucleotides != 0)
}
expected_n_diffs_sim <- mean(replicate(n = 100, calc_exp_n_diffs(sequence_length, f_dna_used)))
print(expected_n_diffs_sim)
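The simulation above can be cross-checked analytically. A site ends up non-silent if it is picked at least once and its last draw is not 0, so under the same picking scheme the expectation is sequence_length * 3/4 * (1 - (1 - 1/sequence_length)^n_mutations). A sketch, assuming f_dna_used is 1.0; calc_exp_n_diffs_analytic is a hypothetical helper name:

```r
# Hypothetical helper: closed-form expectation for the picking scheme simulated above
calc_exp_n_diffs_analytic <- function(sequence_length, f_dna_used = 1.0) {
  n_mutations <- sequence_length * f_dna_used
  # Chance that a given site is never picked in n_mutations uniform draws
  p_never_picked <- (1 - 1 / sequence_length)^n_mutations
  # A picked site ends in state 1, 2 or 3 with probability 3/4
  sequence_length * 0.75 * (1 - p_never_picked)
}

print(calc_exp_n_diffs_analytic(15000))
```

For 15000 base pairs this gives roughly 7111.5, so the simulated mean should land close to that value.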
From C++ I expect 7111.68 (the code is in the appendix):
expected_n_diffs_cpp_naive <- 7111.68 # Expects 15K mutations
expected_n_diffs_cpp_smart <- 7112.42 # Number of mutations also follows an exponential distribution
Problem
Here, we assume the full DNA is used; the fraction of DNA actually used can be modified through f_dna_used.
From Wikipedia:
3. Simulate some alignments
At the root of the tree, we put a sequence of just a's.
Here is the last alignment:
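A minimal sketch of how one such alignment could be simulated and compared with the root sequence. It assumes the ape and phangorn packages are installed, that f_dna_used is 1.0, and it uses a hand-written two-tip tree as a stand-in for the simulated one; the names sequences and n_diffs are mine:

```r
# Assumes the ape and phangorn packages are installed
sequence_length <- 15000         # base pairs
crown_age <- 15                  # million years
mutation_rate <- 1.0 / crown_age # assumes f_dna_used = 1.0

# Stand-in for the simulated tree: two tips, each edge as long as the crown age
tree <- ape::read.tree(text = "(t1:15,t2:15);")

# Root sequence of only a's, then one simulated alignment
alignment <- phangorn::simSeq(
  tree,
  l = sequence_length,
  rootseq = rep("a", sequence_length),
  rate = mutation_rate
)

# Count, per taxon, the sites that differ from the root sequence
sequences <- as.character(alignment)
n_diffs <- rowSums(sequences != "a")
print(n_diffs)
```

Repeating this over many alignments and averaging n_diffs gives the observed count to compare against the predictions.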
4. Plot the results
Appendix
C++ Naive
C++ smart