jjmccollum / teiphy

A Python package for converting TEI XML collations to NEXUS, BEAST 2.7 XML, and other formats
MIT License
11 stars 3 forks source link

Assignment of random starting rate parameter values, new prior distributions, and support for `--seed` option #75

Closed jjmccollum closed 1 year ago

jjmccollum commented 1 year ago

For larger collations with increasingly varied substitution matrices at different sites/variation units, it becomes increasingly likely that at least one of the substitution matrices will be singular under the initial assignments of transcriptional rate parameters (which presently all default to 2.0). This ceases to be a problem after BEAST 2's first evaluation of the site likelihoods, because in subsequent likelihood calculations, it samples these rate parameter values randomly according to a prior distribution, and the resulting substitution matrices will almost never be singular. Since BEAST 2 XML inputs require that every RealParameter element have an initial @value attribute specified, we need to supply these values, but we should be able to avoid singular matrices if we set these values randomly in to_beast.

This can be done by using distributions available in numpy, which is already a dependency of teiphy. I suggest replacing the current offset log-normal prior distribution assumed for transcriptional rate parameters with a simple gamma distribution (perhaps with alpha="5" and beta="2" as default values). Then we can sample random initial values for the rate parameters from this distribution using something like

rng = np.random.default_rng(seed)
sample = rng.gamma(5.0, 2.0)

To ensure replicability for generated outputs, we will need to add support for a --seed command-line argument.