More general mutation models

petrelharp commented 6 years ago

We'd like to provide support for other models of mutation. Here is a proposal for how to do this, in a two-stage protocol, that separates generating locations of the mutations (which can be done independently on each edge) from allelic states. We'd need to do two things:

Finite sites, parent-independent mutation: First, we need to modify the current algorithm to produce mutations at a finite set of sites.

I'd propose for this to be only at integer locations; we could make the set of possible locations more general, but why bother?
These mutations would be generated independently on each edge, and derived state could be random floats in [0,1) to ensure uniqueness (and stored as bytes).
We would not need to specify the mutation parent property, because it could be done afterwards with compute_mutation_parent() if we traversed the edges in order by forwards-time.

Transformation to finite-state model: The next step is to convert the mutations produced in the first step -- or the mutations produced by the infinite-sites model -- to a finite-state model, e.g., Jukes-Cantor. To do this:

We'd be provided with an alphabet of length $m$, a distribution $\pii$ for the root state, and a $m \times m$ matrix $Q{ij}$ that gives the rate at which allele $i$ mutates into allele $j$, with $Q_{ii} = 0$. We then define $\mu = \max_i \sumj Q{ij}$ and $P{ij} = Q{ij} / \mu$ for $i \neq j$ and $P{ii} = 1 - \sum{j \neq i} P_{ij}$.
We generate parent-independent mutations as above using mutation rate $\mu$, and use compute_mutation_parents to find the mutation parents.
We step through the sites:
- generating for each site an ancestral state $i$ from $\pi$, and
- generating a derived allele for each mutation from probabilities $P_{ij}$; if the derived state is $i$ then we discard the mutation.

Note that thanks to having mutation parents we don't need to generate the trees to do this. Also note that if the derived states produced by parent-independent mutation are random floats, we don't need to produce any more randomness to do this.

Advantages to this decoupled scheme:

The main advantage to this scheme is it simplifies bookkeeping:

mutation locations are chosen while only looking at the edge table
tree topology only needs to be parsed while computing the mutation parent column
allelic identity of mutations then only needs to look at the mutation and site table

This decoupling is totally general, and should make it easy to implement new models.

Disadvantages:

We end up generating more mutations than we need to.
Storing floats as bytes in the derived state seems inefficient.
Incorporating existing mutations makes this more complicated.

cc @molpopgen

petrelharp commented 6 years ago

Further notes:

Parent-independent mutation (step 1 above) can be pretty much done with the existing method, only replacing the "distance" in floating-point coordinates by the number of integers within the edge.
It is not at all obvious how to add new mutations to an existing set when not under the infinite sites mdoel. The only thing to do, really, is to suppose any existing mutations fall somewhere uniformly at random on the edge, so that a new mutations on the same edge fall randomly interspersed. But, why would we even want to leave existing mutations? The only use case I'm aware of is when the existing mutations are non-neutral, in which case we don't actually want to add more mutations to those sites.
So, my proposal is that the finite-sites mutation model, when preserving existing mutations, actually does not place any new mutations at already mutated sites. (And, not rejection sample, but when proposing to mutate an already-mutated site, should just do nothing.)

molpopgen commented 6 years ago

I'd propose for this to be only at integer locations; we could make the set of possible locations more general, but why bother?

This makes senses for finite-site schemes, but one has to keep in mind that not all int/double conversions are simple when the numbers get really big. For example:

#include <iostream>
#include <cstdint>
using namespace std;

int main(int argc, char ** argv)
{
    double x = 9007199254000000; //A bit less than 2^53.

    while (x+1 != x){++x;}
    std::cout << static_cast<uint64_t>(x-1) << ' ' << static_cast<uint64_t>(x) << ' ' << static_cast<uint64_t>(x+1) << '\n';
}

In practice, you'll probably never simulate a genome that large, but the fact that most simulations tend to handle positions as doubles means that the int/double casting becomes a nagging issue that may bite you eventually.

These mutations would be generated independently on each edge, and derived state could be random floats in [0,1) to ensure uniqueness (and stored as bytes).

This may be way overkill? I assume you mean double and not the actual C float? In either case, the number of possible values in [0,1) is quite large. It seems simpler to store derived state as something like int32_t and, for each replicate, store a map of [alphabet entry, int32_t] for later decoding. That's the same size as a float, smaller than a double, and possibly more portable.

We would not need to specify the mutation parent property, because it could be done afterwards with compute_mutation_parent() if we traversed the edges in order by forwards-time.

petrelharp commented 6 years ago

In practice, you'll probably never simulate a genome that large, but the fact that most simulations tend to handle positions as doubles means that the int/double casting becomes a nagging issue that may bite you eventually.

How do you propose doing this - taking interlocus_spacing as an argument; could default to 1.0? Would this fix the issue?

... random floats in [0,1)... This may be way overkill?

True, and we don't actually avoid the nonuniqueness problem anyhow. I was just trying to avoid keeping track of what alleles had been used at that site already, which we have to do because we've said that the derived state had to be different than the mutation.parent's derived state. So, I guess we'd have to have a way of looking up the alleles that have been assigned at each site already.

molpopgen commented 6 years ago

It is not at all obvious how to add new mutations to an existing set when not under the infinite sites mdoel. The only thing to do, really, is to suppose any existing mutations fall somewhere uniformly at random on the edge, so that a new mutations on the same edge fall randomly interspersed. But, why would we even want to leave existing mutations? The only use case I'm aware of is when the existing mutations are non-neutral, in which case we don't actually want to add more mutations to those sites. So, my proposal is that the finite-sites mutation model, when preserving existing mutations, actually does not place any new mutations at already mutated sites. (And, not rejection sample, but when proposing to mutate an already-mutated site, should just do nothing.)

Imagine the scenario where you evolve for some amount of time, and then do some analysis, and then evolve some more. There's a chance that the state will change at already-mutated sites, and it seems that this case should be supported.

molpopgen commented 6 years ago

How do you propose doing this - taking interlocus_spacing as an argument; could default to 1.0? Would this fix the issue?

I'm not sure what interlocus_spacing is? My bigger picture point is that you can convert lots of ints to floating-point values, but at some point the back conversion may stop "working", meaning that i and i+1 both equal i. This is just one of the annoying features of computing. I think you can be somewhat lazy here and rely on IEEE 754 support in any compiler that you're likely to deal with, meaning that positions > 2^53-ish should be disallowed.

petrelharp commented 6 years ago

Imagine the scenario where you evolve for some amount of time, and then do some analysis, and then evolve some more.

Oh, right, duh. Well, that case would be fine, because the proposed workflow only gets messed up if there are out-of-order mutations coming in.

So, I propose two flags: keep, for whether or not to keep existing mutations (as we have already), and new_sites_only, which if True would discard any proposed mutations at already-mutated sites. Then the two use cases would be:

simulate for a while, add mutations, simulate for time T longer, add new mutations: keep=True, new_sites_only=False, since=T
simulate with selected mutations, add neutral ones: keep=True, new_sites_only=True.

petrelharp commented 6 years ago

I think you can be somewhat lazy here and rely on IEEE 754 support in any compiler that you're likely to deal with, meaning that positions > 2^53-ish should be disallowed.

Sounds good. My other suggestion wouldn't have helped anything.

molpopgen commented 6 years ago

So, I propose two flags: keep, for whether or not to keep existing mutations (as we have already), and new_sites_only, which if True would discard any proposed mutations at already-mutated sites. Then the two use cases would be:

simulate for a while, add mutations, simulate for time T longer, add new mutations: keep=True, new_sites_only=False, since=T simulate with selected mutations, add neutral ones: keep=True, new_sites_only=True.

I think this works. For the case of selected mutations and someone is doing evolve-stop-evolve again, it is up to the forward simulation engine to properly update mutation/site tables w.r.to derived state at the relevant nodes.

petrelharp commented 6 years ago

General picture: we cannot generate mutations twice on the same chunk of the trees. Our two use cases therefore either (a) generate mutations on a different time slice, or (b) generate mutations on a different sequence slice.

So, it should be an error to: say new_sites_only=False but have any existing mutations before the since parameter.

jeromekelleher commented 6 years ago

Sounds good in general to me @petrelharp, but I'm a bit confused about some of the details. I'll need to think about it a bit more.

One high-level point though: Why are we trying so hard to avoid using the trees? Having a valid and indexed tree sequence is a pre-requisite for compute_mutation_parents anyway, so why not use the trees and evolve downwards with a traversal for each site (assuming you've picked your sites already, via your step 1)? OK it's slower, but it should be possible to state very general sequence evolution models this way. I'd always imagined that this was how SeqGen worked, for example.

petrelharp commented 6 years ago

Why are we trying so hard to avoid using the trees?

I'm just trying to factor things out into simple steps that are re-usable between simulation methods. Maybe we should think out the alternative method.

molpopgen commented 6 years ago

I think the scheme laid out by @petrelharp is the most straightforward for 'standard' cases. (I also think this is how seq-gen works, as it handles primarily standard models of molecular evolution, IIRC.)

The much more difficult case involves mutation models where the transition matrix for a site depends upon the k-mer in which the site is centered. For these models, I think you have to generate the ancestral sequence for each node for each marginal tree and account for the complication that k-mers cross marginal tree (stop, start) boundaries. Further, these models appear to present a complication for the idea of not simulating neutral sites 'as you go', as the mutation matrix for the selected sites will depend on the (unsimulated) neutral sites. This case is potentially pathological for how we've all been thinking of using tree sequences.

For @jeromekelleher's idea of using trees, under models that the matrix method can handle, the scheme would be to go site by site and assign a derived state to each node, taking care to avoid already-mutated (presumably selected?) sites along the way. This gets repeated from root to tip for each marginal.

For models of intermediate complexity (Markov models of dinucleotides), I assume that there's something out there in the phylogenetic literature. I'd wager that these are handleable via the @petrelharp scheme, and that the matrices are just larger and there are probably some simplifying symmetry assumptions.

jeromekelleher commented 6 years ago

OK, thanks for the clarification @petrelharp and @molpopgen, this confirms I'm just being thick. Peter's scheme is AOK with me.

petrelharp commented 6 years ago

The much more difficult case involves mutation models where the transition matrix for a site depends upon the k-mer in which the site is centered.

Yep. However, note that we could do these as above, only looking up haplotypes in step (3).

Further, these models appear to present a complication for the idea of not simulating neutral sites 'as you go', as the mutation matrix for the selected sites will depend on the (unsimulated) neutral sites.

Ack, you are right. Well, but then the neighboring neutral sites are sorta not neutral. I could imagine sorting this out. But agreed: there are many complications we should ignore atm.

petrelharp commented 6 years ago

Note: we should include #508 also when implementing this.

jeromekelleher commented 5 years ago

Thinking about this again (while dealing with #711), I came up with the following scheme which I think must be pretty efficient (although perhaps not as general as @petrelharp's scheme?):

Generate your sites and ancestral states as you want,
Given you are at a site which has an overall mutation rate mu. Let k = random.poisson(mu * tree.total_branch_length).
For each of the k mutations, do: choose a node u with probabability tree.branch_length(node) / tree.total_branch_length. Determine the state s at u by traversing up the tree --- if you encounter any mutations before the root, this is the state of u. Choose a new state s' according to the transition matrix of choice (JC69 is just uniform from all states S / {s}, and things like HKY85 and GTR seem fine).

There may be probabilistic subtleties here which I'm missing, so I'm not certain the approach is correct, but I am sure it can be done efficiently.

The cumulative branch length for each particular node can be stored efficiently using a Fenwick tree (like we do for cumulative recombination weight), and we can incrementally maintain this structure as the trees change. Using this Fenwick tree we can them implement the "choose a branch with a given weight" operation easily (right?).
Traversing upwards to get the state requires log(n) time, on average, so that's fine.

So, the cost is something like O(log n) per mutation O(log n) per tree transition which seems about as good as you're going to do.

This does treat sites as isolated (so not worrying about surrounding sites), but I guess you could add this complication later.

petrelharp commented 5 years ago

A scheme like this is totally general, with the caveat that since mutation rate may differ by identity of the site, mu must be the maximum mutation rate and some of those "mutations" might not change the state (and so be ignored). So, this would be fine.

The advantage to my proposal is it let us put down mutations in just the same way we do currently - independently by edge - so we only need to traverse trees at those sites where we have more than one mutation. It would therefore require writing less new code.

jeromekelleher commented 5 years ago

The advantage to my proposal is it let us put down mutations in just the same way we do currently - independently by edge - so we only need to traverse trees at those sites where we have more than one mutation. It would therefore require writing less new code.

And your scheme is also more efficient, as we just have to look at it edge-by-edge. My only issue with your method is that I don't understand it, which makes me reluctant to start implementing it! I'm deeply suspicious of random programs, and I assume anything that's not "obviously right" will have subtle problems... But I think I'm slowly seeing how it works. We're basically generating a bunch of 'stateless' mutations at finite sites along the edges, and then in a second pass choosing what the states actually are, right?

So, for every edge/slice of the mutation map we'd generate k mutations from poisson(rate), and then uniformly assign these into the site 'bins' along this continuous interval. In the second pass, we'd go through the trees and for every site (a) create an ancestral state; and (b) going down the tree, for each mutation generate a state according to the transition matrix. (We don't have to do a full traversal to implement b, we can sort the sites by time and traverse upwards to find previously assigned state.)

Is this right? What are the uniform doubles assigned to the derived state in your summary above for then?

jeromekelleher commented 5 years ago

Another question: what happens when we have multiple mutations on a branch for a give site? My take is that we should only have mutation (i.e., remove all the silent changes along the branch), although it would be best to do this afterwards in a postprocessing step, since the state changes matter.

It'll make our lives easier in tskit algorithms if we can assume there's one mutation per branch, right?

petrelharp commented 5 years ago

The methods are quite similar. Let me slightly rewrite yours to make this more clear:

Generate your sites and ancestral states as you want,
Let mu be the overall (i.e., maximum) mutation rate. Let k = random.poisson(mu * tree.total_branch_length).
For each of the k mutations, do: choose a node u with probabability tree.branch_length(node) / tree.total_branch_length. Determine the state s at u by traversing up the tree --- if you encounter any mutations before the root, this is the state of u. Choose a new state s' according to the transition matrix P of choice (see below); if s' = s then don't add the mutation.

M[a,b] is the probability per generation that a nucleotide a is inherited as b in the next generation, with $\sumb M[a,b] = 1$, then let $mu[a] = \sum{a \neq b} M[a,b]$ (the mutation rate for nucleotide a) and $\mu = \max_a \mu[a]$ (the overall mutation rate). Then the transition matrix P we use above is P[a,a] = 1 - mu[a]/mu and P[a,b] = M[a,b] / mu.

So: suppose there's two states, 0 and 1, and 0 -> 1 at rate 0.01, while 1 -> 0 at rate 0.005.

To do the "tree" method (yours), you lay down k abstract mutations on the tree with rate mu = 0.01, and then you go down the tree and set the derived states appropriately (i.e., if the state at that point in the tree was a 0, then derived state will be a 1), but throwing away half the mutations that happened on a 1 background.

To do the "edge" method (mine), you lay down abstract mutations on the whole tree sequence (edgewise) with rate mu = 0.01, and then you (as above) set the derived states appropriately, throwing away half the mutations that happened on a 1 background.

So, it's the same method: both are rejection sampling; just we're choosing the proposals differently.

petrelharp commented 5 years ago

Life would be easier if we could assume only one mutation per branch, but I don't think we can assume that. Well, we could, but then we'd need to deal with the cases when that's not true on input (e.g., from SLiM). That was the point of the whole mutation.parent thing. I vote to continue to allow multiple mutations per branch.

Whether to leave multiple mutations in: probably we don't usually need them, but it would make it easier to compare to theory if we left them in?

molpopgen commented 5 years ago

Regarding number of mutations per site/branch: the theory will make a statement about the state of the descendant node given the state of the ancestral node + branch length + transition matrix. The detailed mutation history wouldn't be helpful?

jeromekelleher commented 5 years ago

Life would be easier if we could assume only one mutation per branch, but I don't think we can assume that. Well, we could, but then we'd need to deal with the cases when that's not true on input (e.g., from SLiM). That was the point of the whole mutation.parent thing. I vote to continue to allow multiple mutations per branch.

That and being able to determine your state without having to do a full upwards traversal.

Whether to leave multiple mutations in: probably we don't usually need them, but it would make it easier to compare to theory if we left them in?

Hmmm, true. How about we provide a method squash_mutations (or something) that'll squash the mutations on a branch down to one. It's trivial to implement with the mutation parents. There's definitely something to be said for being able to get rid of the redundant state changes, even if it's just for compression efficiency.

jeromekelleher commented 5 years ago

So, it's the same method: both are rejection sampling; just we're choosing the proposals differently.

That was helpful, thanks. OK, I think I get it now. I'll have a go at implementing soon --- I'm working on LS in the general case of abitrary mutations, so I need to be able to simulate them at scale.

hyanwong commented 5 years ago

I'm not sure if this is useful , or where it fits in, but @marianne-aspbury and I are using something like the following code to generate finite sites mutations. It's basically the original suggestion by @petrelharp but (since it's a simple model) we don't need to throw out mutations (we just pick one from the remaining "atgc" letters, after removing the previous derived state).

The only difficult bit is assigning the derived states to mutations so that they aren't the same as the previous derived state on a lineage. There a bit of faffing with dtype='|S1' etc here.


import msprime
import tskit
import itertools
import random
import math
import numpy as np

# Construct a tree sequence with integerized breakpoints
length = 100
recomb_map = msprime.RecombinationMap.uniform_map(length, rate=1, num_loci=length)
ts = msprime.simulate(30, mutation_rate=1, recombination_map=recomb_map)

# sentinel_allele = b"\0" # changed following comments below
null_allele = b"\0"
states = np.array(['a','t','g','c'], dtype='|S1')

tables = ts.dump_tables()
tables.sites.clear()
tables.mutations.clear()
for integer_pos, sites_at_pos in itertools.groupby(ts.sites(), lambda x: math.floor(x.position)):
  ancestral_state = random.choice(states)
  site_id = tables.sites.add_row(integer_pos, ancestral_state)
  # Order mutations by time
  mutations = []
  for site in sites_at_pos:
    mutations.extend(site.mutations)
  mutations.sort(key = lambda x: ts.node(x.node).time, reverse=True)
  for m in mutations:
    # Assign mutations with null parents & derived_states (will be filled in later)
    tables.mutations.add_row(site_id, m.node, derived_state=null_allele, parent=tskit.NULL)

# Assign parents
tables.compute_mutation_parents()

# Assign derived states (use an array of chars)
ancestral_states = tables.sites.ancestral_state.view(dtype='|S1')
mutation_derived_state = np.full(tables.mutations.num_rows, null_allele, dtype='|S1')
for i, m in enumerate(tables.mutations):
    if m.parent == tskit.NULL:
        prev_state = ancestral_states[m.site]
    else:
        prev_state = mutation_derived_state[m.parent]
    # Pick a new state that is different from the old one
    new_state = random.choice([s for s in states if s != prev_state])
    mutation_derived_state[i] = new_state

tables.mutations.derived_state = mutation_derived_state.view(tables.mutations.derived_state.dtype)

finite_sites_ts = tables.tree_sequence()

# Try printing them out
for h in finite_sites_ts.haplotypes():
  print(h)

hyanwong commented 5 years ago

@jeromekelleher - I suspect that anyone using compute_mutation_parents could be bitten by the requirement to have a different derived_state for a mutation to that of the parent mutation. Would you like me to add a note to the documentation for compute_mutation_parents to point this out, maybe even with a snippet of code from my post immediately above to suggest one way to deal with this?

jeromekelleher commented 5 years ago

This is neat @hyanwong --- I haven't thought hard about the properties of the mutations it outputs, but it looks plausible to me. What's the issue with compute_mutation_parents --- we're not checking the allele values to see if they make sense, are we? I don't get why a string with an embedded null helps here, isn't it the same if you have an arbitrary one character string here?

hyanwong commented 5 years ago

What's the issue with compute_mutation_parents --- we're not checking the allele values to see if they make sense, are we?

It's only that if for some reason the parents have changed, or we didn't know them in the first place (which is presumably why we are running compute_mutation_parents) then there is the distinct possibility of setting the parent of a mutation to be one with the same derived_state simply by chance. If this happens, the sequence state can not be generated at that point (i.e. .haplotypes() and friends will fail). So we should warn users of compute_mutation_parents of this possibility, and the probable need to check or reset the derived_state for all their mutations after running the function.

I don't get why a string with an embedded null helps here, isn't it the same if you have an arbitrary one character string here?

Yes, anything could go here. I just stuck a \0 in so that it would be obvious if I messed up the logic.

jeromekelleher commented 5 years ago

It's only that if for some reason the parents have changed, or we didn't know them in the first place (which is presumably why we are running compute_mutation_parents) then there is the distinct possibility of setting the parent of a mutation to be one with the same derived_state simply by chance. If this happens, the sequence state can not be generated at that point (i.e. .haplotypes() and friends will fail). So we should warn users of compute_mutation_parents of this possibility, and the probable need to check or reset the derived_state for all their mutations after running the function.

We can add in a warning I guess, but the operation is low-level and intended for uses in which the user is in charge of making sure the state is consistent. This is a GIGO situation I think.

Yes, anything could go here. I just stuck a \0 in so that it would be obvious if I messed up the logic.

It looks like it's doing something special then though, and the name sentinel is confusing. A sentinel value is usually something you put at the end of an unbounded list to signify it's finished. In C strings, '\0' is the sentinel value used to terminate because they have no length associated with them.

hyanwong commented 5 years ago

We can add in a warning I guess

Given that this function is indeed documented, I think a brief note in the docs would be helpful (it would have been helpful to us, and saved us 1/2 hour of thinking). I'll bung it in a PR.

the name sentinel is confusing.

Right - my bad naming. I guess I should change this to "null_char" or something. It was just meant to be the character equivalent of None. Or I guess I could set the derived_state to ""

petrelharp commented 5 years ago

ping @carjed who just volunteered to do this! <3 <3

hyanwong commented 4 years ago

@carjed - any update on this. It would be great to get something kicked off?

petrelharp commented 4 years ago

Closed in #946

tskit-dev / msprime

More general mutation models #553