I cannot predict the number of non-silent mutations by phangorn::simSeq. Whatever I do, my prediction is off. I assume the problem is at my side, but it would be great to use another tool (seq-gen) to verify phangorn is correct (and hopefully I will find out what I overlooked in the process).

---
title: 'raket: mutation rate'
author: "Richel J.C. Bilderbeek"
date: "April 18, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggtree)

Problem

What is the best mutation rate to pick for a crown age of 15M years?

Here, we assume to use the full DNA, but modify the fraction of DNA actually used, f_dna_used, below:

f_dna_used <- 1.0
testit::assert(f_dna_used >= 0.0 && f_dna_used <= 1.0)

What should the alignment length be to have one mutation per 1K years?

Hypothesis

Once per 15M years

crown_age <- 15 # million years
mutation_rate <- f_dna_used * 1.0 / crown_age # chance per nucleotide per million years

1 nucleotide has a resolution of 15M years. 2 nucleotides have a resolution of 7.5M years 15 nucleotides have a resolution of 1M years 15K nucleotides have a resolution of 1K years

sequence_length <- 15000 # base pairs

Methods

1. Simulate a phylogeny with only two branches and the correct crown age

#set.seed(1)
sum_edge_length <- crown_age * 2

# Simulate a tree with two taxa and the desired summed edge length 
while (1) {
  tree <- TreeSim::sim.bd.age(age = crown_age, numbsim = 1, lambda = 1.0 / crown_age, mu = 0.0, frac = 1.0, mrca = TRUE, complete = FALSE)[[1]]
  if (length(tree$tip.label) == 2) break
}
testit::assert(length(tree$tip.label) == 2) # 2 taxa
testit::assert(sum(tree$edge.length) == sum_edge_length)
ggtree(tree) + geom_treescale(x = 0, width = crown_age, color = "red", offset = 0.01)

2. Predict the number of non-silent mutations

On average all nuceotides will change. Of these mutations, each one has a 25% chance to be silent, for example, to go from adenine to adenine.

The number of expected observable mutations is then:

expected_n_diffs_naive <- mutation_rate * sequence_length * crown_age * 0.75
print(expected_n_diffs_naive)

However, there will be some nucleotides that will be picked twice or more, as there will be nucleotides that will never be picked.

Here we run a simple simulation to add this to our expectation:

calc_exp_n_diffs <- function(sequence_length, f_dna_used) {
  # Create an alignment of zeroes
  nucleotides <- rep(0, sequence_length)
  # One mutation per base pair
  n_mutations <- sequence_length * f_dna_used
  # Pick the indices that will have a mutation
  random_indices <- 1 + (sort(floor(runif(min = 0, max = sequence_length, n = n_mutations))))
  # Put a random base pair there
  testit::assert(all(random_indices >= 1))
  testit::assert(all(random_indices <= length(nucleotides)))
  for (random_index in random_indices) {
    nucleotides[random_index] <- sample(x = 0:3, size = 1)
  }
  # Return the number of non-silent nucleotides
  testit::assert(sum(nucleotides != 0) > 0)
  sum(nucleotides != 0)
}
expected_n_diffs_sim <- mean(replicate(n = 100, calc_exp_n_diffs(sequence_length, f_dna_used)))
print(expected_n_diffs_sim)

From C++ I expect 7111.68 (code is at appendix):

expected_n_diffs_cpp_naive <- 7111.68 # Expects 15K mutations
expected_n_diffs_cpp_smart <- 7112.42 # Number of mutations also follows an exponential distribution

From Wikipedia:

# http://treethinkers.org/jukes-cantor-model-of-dna-substitution/
expected_n_diffs_tt <- (3/4)*(1 - exp(-8 * mutation_rate * crown_age / 2)) * sequence_length
print(expected_n_diffs_tt)

3. Simulate some alignments

At the root of the tree, we put a sequence of just as.

root_sequence <- rep("a", sequence_length)
n_replicates <- 100
diffs_1 <- rep(NA, n_replicates)
diffs_2 <- rep(NA, n_replicates)

for (i in seq(1, n_replicates))
{
  set.seed(i)
  data <- phangorn::simSeq(
    tree, 
    l = sequence_length, 
    rootseq = root_sequence, 
    type= "DNA", 
    rate = mutation_rate
  )
  diffs_1[i] <- sum(as.character(data)[1, ] != root_sequence)
  diffs_2[i] <- sum(as.character(data)[2, ] != root_sequence)
}
n_diffs_measured <- (sum(diffs_1) + sum(diffs_2)) / (n_replicates * 2)
print(n_diffs_measured)

Here is the last alignment:

ape::image.DNAbin(ape::as.DNAbin(data), col = c("white", "red", "green", "blue"))

4. Plot the results

library(ggplot2)

ggplot(
  data = data.frame(n = c(diffs_1, diffs_2)),
  aes(x = n)
) + geom_histogram(binwidth = 100) + 
  geom_vline(xintercept = n_diffs_measured, col = "black") + 
  geom_vline(xintercept = expected_n_diffs_naive, col = "pink") + 
  geom_vline(xintercept = expected_n_diffs_sim, col = "red") + 
  geom_vline(xintercept = expected_n_diffs_cpp, col = "green") + 
  geom_vline(xintercept = expected_n_diffs_tt, col = "yellow") + 
  scale_x_continuous(limits = c(0, sequence_length))

Appendix

C++ Naive

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>

int calc_non_silent(std::mt19937& rng_engine)
{
    const int sequence_lenghth{15000};
    std::vector<int> nucleotides(sequence_lenghth, 0);

    //Use -1, as distr is inclusive
    std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);

    std::uniform_int_distribution<int> nucleotide_distr(0, 3);

    const int n_mutations{sequence_lenghth};
    for (int i=0; i!=n_mutations; ++i)
    {
      const int index = index_distr(rng_engine);
      nucleotides.at(index) = nucleotide_distr(rng_engine);
    }
    return std::count_if(
      std::begin(nucleotides),
      std::end(nucleotides),
      [](const int x) { return x != 0; }
    );
}

int main()
{
  std::mt19937 rng_engine;

  const int n{1000};
  std::vector<int> v;
  for (int i=0; i!=n; ++i)
  {
    v.push_back(calc_non_silent(rng_engine));
  }
  const int sum = std::accumulate(std::begin(v), std::end(v), 0);
  const double average = static_cast<double>(sum) / static_cast<double>(n);
  std::cout << average << '\n';
}

C++ smart

#include <algorithm>
#include <iostream>
#include <iterator>
#include <numeric>
#include <vector>
#include <random>

int pick_n_mutations(
  std::mt19937& rng_engine,
  const double mut_rate)
{
  std::exponential_distribution<double> mut_distr(mut_rate);
  double t{0.0};
  int i{0};
  while (1)
  {
    //std::cout << i << ' ' << t << '\n';
    const double dt{mut_distr(rng_engine)};
    t += dt;
    ++i;
    if (t > 15.0) break;
  }
  return i;
}

int calc_non_silent(
  std::mt19937& rng_engine,
  const double mut_rate
)
{
    const int n_mutations = pick_n_mutations(rng_engine, mut_rate);

    const int sequence_lenghth{15000};
    std::vector<int> nucleotides(sequence_lenghth, 0);

    //Use -1, as distr is inclusive
    std::uniform_int_distribution<int> index_distr(0, sequence_lenghth - 1);

    std::uniform_int_distribution<int> nucleotide_distr(0, 3);

    for (int i=0; i!=n_mutations; ++i)
    {
      const int index = index_distr(rng_engine);
      nucleotides.at(index) = nucleotide_distr(rng_engine);
    }
    return std::count_if(
      std::begin(nucleotides),
      std::end(nucleotides),
      [](const int x) { return x != 0; }
    );
}

int main()
{
  std::mt19937 rng_engine;
  const double p_mutation = 1.0 / 15.0;
  const double mut_rate = 15000.0 * p_mutation;

  const int n{1000};
  std::vector<int> v;
  for (int i=0; i!=n; ++i)
  {
    v.push_back(calc_non_silent(rng_engine, mut_rate));
  }
  std::copy(std::begin(v), std::end(v), std::ostream_iterator<int>(std::cout, " "));
  const int sum = std::accumulate(std::begin(v), std::end(v), 0);
  const double average = static_cast<double>(sum) / static_cast<double>(n);
  std::cout << average << '\n';
}

richelbilderbeek / raket

Prerequisite: phangorn::simSeq is correct #11