tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
177 stars 88 forks source link

adding microsatellite mutation models #2013

Closed andrewkern closed 2 years ago

andrewkern commented 2 years ago

i'm hoping to add some simple mutation models that would be appropriate for microsatellite data. i think this will be quite useful to folks who are working on non-model organisms, ecological genetics, and maybe even forensics?

The simplest, most classical model is the Stepwise Mutation Model of Kimura and Ohta (1978). I reckon that's a good place to start, although this nice paper by Calabrese and Durrett lays out a number of other models that would be simple enough to implement.

Here is how I would imagine the implementation to look for the stepwise microsat model. generally these models have a lo and hi parameter which governs the bounds on the number of repeat units of the microsatelllite

class SMM_MicrosatMutationModel(MatrixMutationModel):
    """
    Stepwise mutation model (Kimura and Ohta, 1978), 
    for microsatellite repeats.

    This is a :class:`.MatrixMutationModel` with alleles ``[lo, .. , hi]``,
    root distribution is uniform ``(lo, hi)``, and a transition matrix that reflects
    single-step mutation.

    :param int lo: lower bound of the microsatellite repeat length
    :param int hi: upper bound of the microsatellite repeat length
        See :ref:`??` for more details.
    """

    def __init__(self, lo, hi):
        alleles = list(range(lo, hi+1))
        root_distribution = [1/len(alleles)] * len(alleles)
        transition_matrix = np.zeros((len(alleles), len(alleles)))
        for i in range(len(alleles)):
            if i == 0:
                transition_matrix[i, i+1] = 1
            elif i == len(alleles)-1:
                transition_matrix[i, i-1] = 1
            else:
                transition_matrix[i, i-1] = 0.5
                transition_matrix[i, i+1] = 0.5

        super().__init__(alleles, root_distribution, transition_matrix)

what do others think?

jeromekelleher commented 2 years ago

LGTM. I'd make hi exclusive just to keep consistent with things like range, but otherwise looks very sensible.

So, to be concrete, the plan is to make alleles equal to [str(j) for j in range(lo, hi)] and then if you have allele j, you can only ever mutate to j - 1 or j + 1?

andrewkern commented 2 years ago

So, to be concrete, the plan is to make alleles equal to [str(j) for j in range(lo, hi)] and then if you have allele j, you can only ever mutate to j - 1 or j + 1?

yep that's the SMM model in a nutshell

petrelharp commented 2 years ago

This sounds fine, although we should make sure to tell people not to set lo and hi to the actual min and max observed in their data. Alternatively, we could modify the infinite alleles or SLiM model code (not use the mutation matrix model), so that we can set hi=Inf (or, something large). It would be straightforward.

jeromekelleher commented 2 years ago

Alternatively, we could modify the infinite alleles or SLiM model code (not use the mutation matrix model), so that we can set hi=Inf (or, something large). It would be straightforward.

That's not quite the same thing though is it? With infinite alleles you can only every go from j to j + 1, whereas the model above you can go back to j - 1 with equal proba.

petrelharp commented 2 years ago

That's not quite the same thing though is it? With infinite alleles you can only every go from j to j + 1, whereas the model above you can go back to j - 1 with equal proba.

It's totally not the same thing - infinite alleles maintains a global counter for the next allele, and this would look at the previous state to choose a new one, but it provides a template. (so, maybe the SLiM mutation model code is better, as it actually modifies the previous allele)

jeromekelleher commented 2 years ago

Right, gotcha

andrewkern commented 2 years ago

This sounds fine, although we should make sure to tell people not to set lo and hi to the actual min and max observed in their data. Alternatively, we could modify the infinite alleles or SLiM model code (not use the mutation matrix model), so that we can set hi=Inf (or, something large). It would be straightforward.

another issue is the ancestral distribution. I think uniformly distributed is probably wrong?

jeromekelleher commented 2 years ago

The matrix method @andrewkern outlined above would work, though, right? Probably a good place to start if someone is doing this rather than diving into the C code immediately.

andrewkern commented 2 years ago

Yeah i think it should work just fine.

jeromekelleher commented 2 years ago

Unless the number of alleles is 10_000 or something it may be good enough.

petrelharp commented 2 years ago

Good enough, but I don't think anyone's doing it right now, and it'd be nice to not clutter up this list of models with out of date ones? Re: the root distribution - it'd be natural to allow a slight left/right bias and use the stationary distribution (I think there's a Slatkin paper on this?).

andrewkern commented 2 years ago

Good enough, but I don't think anyone's doing it right now, and it'd be nice to not clutter up this list of models with out of date ones?

not sure what you mean? SMM is the most popular microsat model in analyses and simulations AFAIK

petrelharp commented 2 years ago

I mean if we implement another slightly different but better SMM later. But I suppose that allowing hi=Inf is backwards-compatible, as is adding left/right jump probs. So, never mind.