jjmccollum / teiphy

A Python package for converting TEI XML collations to NEXUS, BEAST 2.7 XML, and other formats
MIT License
11 stars 3 forks source link

Add support for BEAST XML output #66

Closed jjmccollum closed 1 year ago

jjmccollum commented 1 year ago

For use cases with BEAST (2) as the target phylogenetic software, conversion to NEXUS followed by a second conversion through BEAUti is presently supported, but direct conversion to a BEAST XML input file would allow for the mapping of additional features, the most notable of these being variation unit-specific substitution models and additional parameters to be incorporated into these models.

Because of the extensive nature of BEAST XML files, the conversion process will involve starting with a template file and adding new elements for witnesses, including fields for their sequences and date calibrations, and root frequencies and substitution models for each variation unit.

This feature may will probably take extra effort to implement, so this effort should be undertaken on a dedicated branch.

rbturnbull commented 1 year ago

I think that's what you fixed in the patch

jjmccollum commented 1 year ago

Yeah, it is. Do you have to change the path to your beast executable?

jjmccollum commented 1 year ago

You might already be doing this, but you should be able to follow the commands in the beast.yml workflow for cloning the necessary repos, building the BEAST executable library, installing the necessary packages with packagemanager, and running beast with the XML output, right?

rbturnbull commented 1 year ago

I worked it out. I had to delete the Beast packages directory in my user directory. I think beast copies libraries there the first time it runs. The tipdates code is running for me. I'll try to get it going in the CI.

jjmccollum commented 1 year ago

@rbturnbull All right, I've switched over to a Jinja2 template for BEAST XML output, and I've also switched from a strict clock to a UCRelaxedClockModel. Test coverage is still at 100%, and the BEAST test is running end-to-end. Following the lead of the phylopaul example inputs, I've set the rateCategories input to the clock model to an IntegerParameter with dimension 2*|taxa| - 2. It looks like this so far:

<branchRateModel spec="UCRelaxedClockModel" id="clock" tree="@tree">
    <distr spec="LogNormalDistributionModel" M="0.0" S="1.0"/>
    <parameter spec="parameter.RealParameter" id="clock.rate" name="clock.rate" lower="0.0" upper="100.0" value="1.0"/>
    <rateCategories spec="parameter.IntegerParameter" id="clock.rateCategories"  dimension="{{ clock_rate_categories }}" value="1.0"/>
</branchRateModel>

Comparing this to your examples, I suspect that the clock.rate parameter is superfluous (the M and S inputs to the LogNormalDistributionModel distribution should be sufficient, and I can make both of them parameters to be estimated by the model).

jjmccollum commented 1 year ago

@rbturnbull Okay, I've adjusted the clock model to look more like what you have in the phylopaul examples:

<branchRateModel spec="UCRelaxedClockModel" id="clock" tree="@tree" clock.rate="@clock.mean" rateCategories="@clock.rateCategories">
    <distr spec="LogNormalDistributionModel" meanInRealSpace="true" S="@clock.stdDev">
        <M spec="parameter.RealParameter" id="clock.distr.M" lower="0.0" upper="1.0" value="1.0" estimate="false"/>
    </distr>
</branchRateModel>
...
<stateNode spec="parameter.RealParameter" id="clock.mean" lower="0.0" upper="Infinity" value="1.0"/>
<stateNode spec="parameter.RealParameter" id="clock.stdDev" lower="0.0" upper="Infinity" value="0.000001" estimate="false"/>
<stateNode spec="parameter.IntegerParameter" id="clock.rateCategories"  dimension="{{ clock_rate_categories }}" value="1"/>
rbturnbull commented 1 year ago

We need to do some work on the clock models. The value here for the stdDev is fixed to be very low. I'm not sure if we want that to be the default.

jjmccollum commented 1 year ago

Should we also make it a parameter to be estimated by the model, or should we just default it to 1.0?

rbturnbull commented 1 year ago

I think it's best for the model to estimate the parameter. I'm not sure if there are going to be scale issues when setting the prior. We need to do a bit of experimentation to work out the best way to do it.

jjmccollum commented 1 year ago

I nearly have this ready (along with the code and unit tests for parsing root date ranges), but I had some questions about the clock model parameters:

<operator id="tree_updown" spec="UpDownOperator" scaleFactor=".75" weight="10">
    <up idref="tree"/>
    <down idref="clock.mean"/>
</operator>
<operator id="clock.rate_Scaler" spec="ScaleOperator" scaleFactor=".75" weight="1" parameter="@clock.mean"/>
rbturnbull commented 1 year ago

If we're estimating the standard deviation then we'll need a prior and an operator.

jjmccollum commented 1 year ago

I've gone with identical log-normal priors for both clock.mean and clock.stdDev for now. Their priors probably shouldn't be exactly equal, but we can discuss that later. I also added a ScaleOperator for clock.stdDev with identical parameters to the ScaleOperator for clock.mean; I did not add an upDownOperator relating clock.stdDev to tree (referring to the tree height, I assume?) like clock.mean has, but that can be easily changed.

jjmccollum commented 1 year ago

@rbturnbull All right, the --clock command-line option (for MrBayes and BEAST outputs) is now implemented; the permitted options are strict (the default option), uncorrelated (which corresponds to the UCRelaxedClockModel in BEAST and the Independent Gamma Rate model in MrBayes), and local (BEAST only). Unit tests are in place for this code, coverage is at 100%, and the mrbayes.yml and beast.yml workflows have been updated to test all valid options.

jjmccollum commented 1 year ago

@rbturnbull Per Remco's suggestion on https://github.com/CompEvol/beast2/issues/1075, I'd like to add priors on the transcriptional rate parameters to see if this prevents ill-conditioned substitution matrices from arising. If I did this, I could even tabulate the occurrences of different transcriptional change categories (marked with the @ana tag in a rdg element) in ignored readings (e.g., defective readings and subreadings) and use their relative proportions to inform the priors.

I imagine that a LogNormal distribution with an offset of 1, a mean somewhere above 1 (maybe 2 by default, but a number could be initialized by a tabulation like the one described as above), and a standard deviation of 1 would be a good place to start. If we did this, then would we want to make the mean of the distribution a separate parameter and replace the operators for the transcriptional rate parameters with operators for the means of their prior distributions?

rbturnbull commented 1 year ago

Hi @jjmccollum - I like the idea of a log-normal distribution as the prior. I'm not sure about the offset. I think the best thing to do is to start simple and try out various options. Do you want to catch up on zoom to discuss?

jjmccollum commented 1 year ago

@rbturnbull That would be good. I won't be available after Tuesday this week, but if you're available to chat before that, then that would work for me. Thanks!

jjmccollum commented 1 year ago

@rbturnbull Okay, I implemented log-normal priors on the transcriptional rates (with M=1.0 and S=1.0 to start) in the most recent two commits today, and the BEAST runs for all three clock models with a chain length of 20,000,000 completed without matrix errors in about 3 hours and 45 minutes. So we should be good using the ComplexSubstitutionModel again. The next step will be to determine more intuitive prior distributions on the transcriptional rates in general, but perhaps we can discuss that in person in a couple days.

rbturnbull commented 1 year ago

great work @jjmccollum

jjmccollum commented 1 year ago

@rbturnbull I'm nearly done writing up the new documentation for the features in this branch, so we're just about ready to do a pull request for this branch. There is one thing I noticed that doesn't seem right, though: in the ubs_ephesians.xml input, I specify that witness 18 has a fixed date:

<witness n="18">
    <origDate when="1364"/>
</witness>

As expected, this is mapped directly in the trait element under the tree element as 18=1364 in the output BEAST 2.7 XML file, and since there is no date range, no MRCAPrior is included for this witness. Thus, I would expect the tip for this witness to have no date range in the output DensiTree, but it seems to have a date range nevertheless:

image

It may be good to figure out why this is happening before we merge in this branch.

rbturnbull commented 1 year ago

That's quite strange. I don't understand why that is happening. Do you have the .trees file in the output that you used to create the densitree? Can you please send that to me?

rbturnbull commented 1 year ago

Chrysostom also looks to have a variable tip date even though it is set to 407 in the XML.

jjmccollum commented 1 year ago

This issue may have something to do with the BDSKY model playing around with the tip dates; see https://github.com/CompEvol/beast2/issues/981.

jjmccollum commented 1 year ago

I've just merged beast-xml into main; I'm pleased to say we can now close this issue!