Closed eaton-lab closed 3 years ago
This raises an interesting opportunity, actually, that you could input a custom site rate vector. For example, higher rates on every third site to represent codons, or a concave curved rate vector to represent UCE rate variation. The latter is particularly intriguing...
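A rough sketch of what such custom rate vectors could look like, using NumPy. The locus length, the 5x third-position multiplier, and the quadratic shape of the UCE curve are all arbitrary assumptions for illustration. The UCE vector is low in the conserved core and rises toward the flanks, which is one reading of the curved shape meant above.

```python
import numpy as np

nsites = 300  # hypothetical locus length

# Codon-like pattern: every third site evolves faster (5x is an arbitrary choice).
codon_rates = np.ones(nsites)
codon_rates[2::3] = 5.0

# UCE-like pattern: low rates in the conserved core, rising toward the flanks.
position = np.linspace(-1.0, 1.0, nsites)
uce_rates = 0.05 + position ** 2

# Rescale each vector to a mean of 1 so that branch lengths keep their
# usual meaning of expected substitutions per site, averaged over sites.
codon_rates /= codon_rates.mean()
uce_rates /= uce_rates.mean()
```

Normalizing to a mean of 1 is a design choice, not a requirement; it only keeps the overall amount of substitution comparable between rate vectors.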
Done.
Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods https://link.springer.com/article/10.1007%2FBF00160154
"There have been many attempts to account for such rate variation in phylogenetic analysis. Two approaches are taken. The first assumes that rates over sites are random variables drawn from a continuous distribution; for example, Nei and Gojobori (1986), Jin and Nei (1990), Li et al. (1990), and Tamura and Nei (1993) used the gamma distribution for rates over sites when they constructed estimators of the distance between two sequences. The second approach uses several categories of rates. The simplest model of this sort assumes that a proportion of sites are invariable while others are changing at a constant rate (e.g., Hasegawa et al. 1985; Palumbi 1989; Hasegawa and Horai 1991). In accounting for the extreme rate heterogeneity of the control region of the human mtDNA, Hasegawa et al. (1993) adopted a three-rate-category model, wherein some sites are assumed to be invariable while others are either moderately or highly variable. Biologically, a continuous distribution may seem to be more reasonable, and indeed, when fitting several models to the control region of human mtDNAs, Wakeley (1993) found that a two-rate-category model could not fit the data properly, while the fit of a gamma distribution was statistically acceptable"
This paper is mostly about discrete approximations of the gamma distribution, which make estimating phylogenies under gamma-distributed rate variation much faster. For simulations, however, drawing rates from the continuous distribution seems most appropriate.
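To make the contrast concrete, here is a minimal sketch of both options in NumPy. The shape value `alpha`, the number of categories, and all variable names are assumptions. The category rates are estimated by Monte Carlo (sorting a large sample and averaging within equal-sized bins), not by the exact calculation in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.5    # hypothetical shape; smaller alpha means stronger rate heterogeneity
nsites = 1000

# Continuous: one rate per site, drawn from a gamma distribution with
# shape alpha and scale 1/alpha so that the mean rate is 1.
continuous_rates = rng.gamma(shape=alpha, scale=1.0 / alpha, size=nsites)

# Discrete approximation: ncat equal-probability categories, each represented
# by the mean rate within it. The means are estimated here from a large
# sorted sample split into ncat equal parts.
ncat = 4
sample = np.sort(rng.gamma(shape=alpha, scale=1.0 / alpha, size=400_000))
category_rates = sample.reshape(ncat, -1).mean(axis=1)
category_rates /= category_rates.mean()  # enforce a mean rate of exactly 1

# Each site is assigned one of the ncat category rates with equal probability.
discrete_rates = rng.choice(category_rates, size=nsites)
```

In a simulator the continuous draw costs nothing extra, since every site gets its own rate either way; the discrete categories mainly pay off when computing likelihoods.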
Describe a continuous distribution of site rates using a single-parameter gamma distribution (shape alpha, with the mean rate fixed at 1).
Allow the jsubstitute() function to accept a 1-d array of rates equal in length to the array of sites. Within the function, scale each site's substitution rate by its entry in this array.
Set rates to zero at invariant sites: rates[mask] = 0
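The steps above could be sketched roughly as follows. This is not the actual jsubstitute() implementation: `substitute_with_site_rates` is a hypothetical stand-in that evolves an integer-coded sequence along a single branch under the Jukes-Cantor model, and the shape value, branch length, and 20% invariant-site proportion are arbitrary assumptions.

```python
import numpy as np

def substitute_with_site_rates(seq, branch_length, rates, rng):
    """Hypothetical stand-in for jsubstitute(): evolve `seq` (bases coded 0-3)
    along one branch under Jukes-Cantor, scaling each site by its rate."""
    seq = np.asarray(seq)
    rates = np.asarray(rates, dtype=float)
    if rates.shape != seq.shape:
        raise ValueError("rates must be a 1-d array with one entry per site")
    # Expected substitutions per site: branch length scaled by the site rate.
    dist = branch_length * rates
    # Jukes-Cantor probability that a site ends the branch in a different state.
    p_change = 0.75 * (1.0 - np.exp(-4.0 * dist / 3.0))
    changed = rng.random(seq.size) < p_change
    # Adding 1, 2, or 3 modulo 4 always yields a different base.
    shift = rng.integers(1, 4, size=seq.size)
    return np.where(changed, (seq + shift) % 4, seq)

rng = np.random.default_rng(0)
nsites = 500
alpha = 0.5                                 # hypothetical gamma shape
seq = rng.integers(0, 4, size=nsites)       # random starting sequence
rates = rng.gamma(shape=alpha, scale=1.0 / alpha, size=nsites)
mask = rng.random(nsites) < 0.2             # hypothetical: about 20% invariant sites
rates[mask] = 0                             # set rates to zero at invariant sites
new_seq = substitute_with_site_rates(seq, 0.1, rates, rng)
```

Because a rate of zero gives a change probability of exactly zero, masked sites never mutate, so no separate code path is needed for invariant sites.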