pmckenz1 / ipcoal

Simulation package for getting gene tree distributions and loci / SNPs from input tree topologies and migration / Ne scenarios. Wraps around msprime.
MIT License
12 stars 7 forks source link

Support gamma rate variation in SeqModel #17

Closed eaton-lab closed 3 years ago

eaton-lab commented 4 years ago

Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods https://link.springer.com/article/10.1007%2FBF00160154

"There have been many attempts to account for such rate variation in phylogenetic analy- sis. Two approaches are taken. The first assumes that rates over sites are random variables drawn from a con- tinuous distribution; for example, Nei and Gojobori (1986), Jin and Nei (1990), Li et al. (1990), and Tamu- ra and Nei (1993) used the gamma distribution for rates over sites when they constructed estimators of the dis- tance between two sequences. The second approach us- es several categories of rates. The simplest model of this sort assumes that a proportion of sites are invariable while others are changing at a constant rate (e.g., Hasegawa et al. 1985; Palumbi 1989; Hasegawa and Ho- rai 1991). In accounting for the extreme rate hetero- geneity of the control region of the human mtDNA, Hasegawa et al. (1993) adopted a three-rate-category model, wherein some sites are assumed to be invariable while others are either moderately or highly variable. Biologically, a continuous distribution may seem to be more reasonable, and indeed, when fitting several models to the control region of human mtDNAs, Wake- ley (1993) found that a two-rate-category model could not fit the data properly, while the fit of a gamma dis- tribution was statistically acceptable"

This paper is mostly about discrete approximations of the gamma distribution which when used make for a much faster method for estimating phylogenies that have gamma-distributed rate variation. However, for simulations it seems that using a continuous distribution would be most appropriate.

  1. Describe a continuous distribution of site rates using a single parameter gamma distribution.

    # get a 1-d array of rates to multiply mutation rate
    rates = np.random.gamma(alpha, size=nsites)
  2. Allow the jsubstitute() function to accept a 1-d array of rates equal in length to the array of sites. Multiply each site by its rate within the function.

jsubstitute(pseq, probmats, rates):
    ...
    for site, rate in zip(ns, rates):
        ...
  1. Similarly implementing invariant sites should be easy. default arg is None, no invariant sites. Accept values between 0 and 1. Use binomial to randomly apply mask to sites to set multiplier to zero.
    
    # get 1-d boolean array to mask invariant sites
    mask = np.random.binomial(n=1, p=invprob, size=nsites).astype(bool)

set rates to zero at invariant sites

rates[mask] = 0



4. Implement gamma and invariant arg in `substitution_kwargs`:
default gamma arg is None, no gamma rate variation, which generates an array of 1s as the rates. Invariant masking can still apply on top of this.
eaton-lab commented 4 years ago

This raises an interesting opportunity, actually, that you could input a custom site rate vector. For example, higher rates on every third site to represent codons, or a concave curved rate vector to represent UCE rate variation. The latter is particularly intriguing...

eaton-lab commented 3 years ago

Done.