jjmccollum / teiphy

A Python package for converting TEI XML collations to NEXUS, BEAST 2.7 XML, and other formats
MIT License

Add support for BEAST XML output #66

Closed jjmccollum closed 1 year ago

jjmccollum commented 1 year ago

For use cases with BEAST (2) as the target phylogenetic software, conversion to NEXUS followed by a second conversion through BEAUti is presently supported, but direct conversion to a BEAST XML input file would allow for the mapping of additional features, the most notable of these being variation unit-specific substitution models and additional parameters to be incorporated into these models.

Because of the extensive nature of BEAST XML files, the conversion process will involve starting with a template file and adding new elements for witnesses, including fields for their sequences and date calibrations, and root frequencies and substitution models for each variation unit.

This feature will probably take extra effort to implement, so the work should be done on a dedicated branch.

jjmccollum commented 1 year ago

See https://github.com/CompEvol/beast2/issues/1075 for a discussion of features that may need to be added in a BEAST plugin to make this work.

jjmccollum commented 1 year ago

I've begun implementing this feature on a new beast-xml branch. I'm adding a template string like the following to common.py:

"""
BEAST XML template string
"""
beast_xml_template = """
<beast beautitemplate="Standard" beautistatus="" namespace="beast.core:beast.evolution.alignment:beast.evolution.tree.coalescent:beast.core.util:beast.evolution.nuc:beast.evolution.operators:beast.evolution.sitemodel:beast.evolution.substitutionmodel:beast.evolution.likelihood" required="" version="2.6">
    <data id="{id}" spec="Alignment" dataType="standard">
        <!-- Start sequences -->
        <userDataType id="StandardData.0" spec="beast.evolution.datatype.StandardData" nrOfStates="{nsymbols}">
            <!-- Start charstatelabels -->
        </userDataType>
    </data>
</beast>
"""

My plan is to call str.format() on the string with the appropriate variables, then parse the formatted string with lxml's etree.fromstring() function (etree.parse() expects a file or file-like object rather than a string). With the XML object in hand, I can append sequences of child elements (e.g., sequence, charstatelabel) populated from the collation data.

Ideally, the template string will be minimal, to avoid overcomplicating matters. For this reason, and to make sure I'm accounting for everything appropriately, I'll post questions about specific elements in this issue as I implement this feature.
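
A minimal sketch of the format-then-parse plan described above. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (fromstring and Element.insert behave the same way in both for this purpose). The abbreviated template, the build_beast_xml helper, and the sample sequence values are hypothetical and are not teiphy's actual code.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative template; the real template carries more attributes.
BEAST_XML_TEMPLATE = """<beast version="2.6">
    <data id="{id}" spec="Alignment" dataType="standard">
        <userDataType id="StandardData.0" spec="beast.evolution.datatype.StandardData" nrOfStates="{nsymbols}"/>
    </data>
</beast>"""


def build_beast_xml(collation_id, nsymbols, sequences):
    """Fill the template, parse it, and add one sequence element per witness.

    sequences maps a witness ID to its (already encoded) sequence string.
    """
    xml_string = BEAST_XML_TEMPLATE.format(id=collation_id, nsymbols=nsymbols)
    # fromstring() parses a string; parse() would expect a file or file-like object.
    root = ET.fromstring(xml_string)
    data = root.find("data")
    for index, (taxon, value) in enumerate(sequences.items()):
        sequence = ET.Element("sequence", {"taxon": taxon, "value": value})
        # Insert before userDataType so the sequences come first.
        data.insert(index, sequence)
    return root


root = build_beast_xml("demo", 3, {"P46": "012", "B": "210"})
print(ET.tostring(root, encoding="unicode"))
```

Parsing the filled template first and only then appending children keeps the template short, since everything that depends on the collation is generated in code.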

jjmccollum commented 1 year ago

@rbturnbull My first question concerns the children of the tree element. Here is what this block looks like in the 1Corinthians-lewismk-strict-nonhomogeneous.xml file:

<tree id="Tree.t:1Corinthians" spec="beast.evolution.tree.Tree" name="stateNode">
    <trait id="dateTrait.t:1Corinthians" spec="beast.evolution.tree.TraitSet" traitname="date" value="A=449,D=549,81=1044,&#8501;=349,L=849,P=849,&#936;=899,104=1087,365=1149,630=1200,1175=950,1241=1149,1505=1249,1506=1320,1739=950,1881=1350,2464=850,b=800,P46=212,B=349,F=849,G=849,33=849,Ambst=384,sy__p=500,co=350,sy__h=666,C=449,ar=850,d=500,6=1249,vg=400,P11=549,Cl=215,Or=254,048=449,K=849,0243=950">
        <taxa id="TaxonSet.1Corinthians2" spec="TaxonSet">
            <alignment id="1Corinthians2" spec="FilteredAlignment" filter="1-6,8-15,18-33,36-42,44-48,50-55,57-58,61-63,65-82,84-87,89-92,96,98-99,101-110,112-116,118-119,122,125-126,128-130,132-138,140-143,145-149,151-154,156,158-159,161-164,167,170-175,177-178,180-181,183-190,192,194-202,205,208-215,218-222,224,226,228-231,233-234,236-237,240-243,245-247,250-281,284-286,288-289,291-295,297-305,307,309,311-312,314,316-323,325-337,339-340,342-346,348-351,353-362,364-365,368,371-378,380,382-388,390-392,394-395,397-399,401-420,422-424,426-427,429-430,433-437,440-443,447-452,454">
                <data idref="1Corinthians"/>
                <userDataType id="morphDataType.1Corinthians2" spec="beast.evolution.datatype.StandardData" ambiguities="12 01 02" nrOfStates="2"/>
            </alignment>
        </taxa>
    </trait>
    <taxonset idref="TaxonSet.1Corinthians2"/>
</tree>

If I understand this correctly, the trait element is simply specifying the tip dates for the taxa (=witnesses). What is less clear to me is what the children of this element are doing. It looks like you're specifying a restricted taxon set consisting of the whole collation's sequence alignment data, but filtered for just the sites (=variation units) with two states (=variant readings).

Is this restriction somehow necessary for or related to the taxon dates? Do I need to replicate this structure in the template string, or can I specify the taxa of the tree in a simpler way without filtering the sites?

jjmccollum commented 1 year ago

In other words, would the following block (which links the taxonset to the alignment identified by the {id} placeholder and then links the trait to this taxonset) work in the template?

<tree id="Tree.t:{id}" spec="beast.evolution.tree.Tree" name="stateNode">
    <taxonset id="TaxonSet.{id}" spec="TaxonSet" alignment="@{id}"/>
    <trait id="dateTrait.t:{id}" spec="beast.evolution.tree.TraitSet" traitname="date" taxa="@TaxonSet.{id}" value="{date_map}"/>
</tree>
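
As a rough illustration of how the {date_map} placeholder in this block might be filled, the sketch below joins witness dates into the "taxon=year" list that the trait element's value attribute uses. The helper names and sample dates are hypothetical, and xml.etree.ElementTree is used here only to confirm that the result is well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

TREE_TEMPLATE = """<tree id="Tree.t:{id}" spec="beast.evolution.tree.Tree" name="stateNode">
    <taxonset id="TaxonSet.{id}" spec="TaxonSet" alignment="@{id}"/>
    <trait id="dateTrait.t:{id}" spec="beast.evolution.tree.TraitSet" traitname="date" taxa="@TaxonSet.{id}" value="{date_map}"/>
</tree>"""


def format_date_map(dates):
    """Join {witness: year} pairs into a comma-separated 'witness=year' string."""
    return ",".join(f"{taxon}={year}" for taxon, year in dates.items())


def build_tree_element(collation_id, dates):
    """Fill the tree template and parse it into an element."""
    # Escape characters that are not allowed inside a double-quoted attribute value.
    date_map = escape(format_date_map(dates), {'"': "&quot;"})
    return ET.fromstring(TREE_TEMPLATE.format(id=collation_id, date_map=date_map))


tree = build_tree_element("1Corinthians", {"P46": 212, "B": 349})
print(tree.find("trait").get("value"))
```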

rbturnbull commented 1 year ago

Hi @jjmccollum - I think this section was left unchanged after saving the XML from BEAUti. I think the morph-models package (https://github.com/CompEvol/morph-models/) divides up the alignment into partitions according to the number of states. My hunch is that the trait component just needs a 'taxa'/'TaxonSet' object, which requires an alignment to work out the taxa, and the easiest way is to just give it the first alignment object in the list. I assume that any kind of alignment with the right taxa would work (but it would be good to check the TaxonSet code in BEAST to see what it is doing).

jjmccollum commented 1 year ago

@rbturnbull I think this should be good. It looks like the testTipDates.xml example file in the beast2 GitHub repo does something similar:

<tree spec='beast.base.evolution.tree.ClusterTree' id='tree' clusterType='upgma'>
    <trait spec='beast.base.evolution.tree.TraitSet' traitname='date-forward' units='year'
            value='
            D4Brazi82  = 1982,
            D4ElSal83  = 1983,
            D4ElSal94  = 1994,
            D4Indon76  = 1976,
            D4Indon77  = 1977,
            D4Mexico84 = 1984,
            D4NewCal81 = 1981,
            D4Philip64 = 1964,
            D4Philip56 = 1956,
            D4Philip84 = 1984,
            D4PRico86  = 1986,
            D4SLanka78 = 1978,
            D4Tahiti79 = 1979,
            D4Tahiti85 = 1985,
            D4Thai63   = 1963,
            D4Thai78   = 1978,
            D4Thai84   = 1984
            '>
        <taxa spec='TaxonSet' alignment='@alignment'/>
    </trait>
    <input name='taxa' idref='alignment'/>
</tree>

And in testTipDates2.xml, the tree element doesn't even have a separate taxa child:

<tree estimate="true" id="tree" name="stateNode">
    <trait id="datetrait" spec="beast.base.evolution.tree.TraitSet" traitname="date" units="year">
        Lemur_catta=1,
        M_fascicularis=1
        <taxa spec='beast.base.evolution.alignment.TaxonSet' alignment='@Primates'/>
    </trait>
</tree>

rbturnbull commented 1 year ago

That looks good.

jjmccollum commented 1 year ago

@rbturnbull I suspect that this is another artifact of BEAUti, but I wanted to check if you had any idea of what it does and whether it's necessary. The following element occurs under the state element, just before the custom rate parameters:

<stateNode id="rateCategories.c:1Corinthians" spec="parameter.IntegerParameter" dimension="74">1</stateNode>

It is referenced by four other elements: the branchRateModel defined at the first character and three operator elements (with ids CategoriesRandomWalk.c:1Corinthians, CategoriesSwapOperator.c:1Corinthians, and CategoriesUniform.c:1Corinthians).

jjmccollum commented 1 year ago

@rbturnbull Okay, I'm now generating a complete BEAST XML input with teiphy, but I'm running into my first issue with BEAST. I'm using v2.7.3, and I'm getting the following error:

java.lang.IllegalArgumentException: org.xml.sax.SAXParseException; lineNumber: 50; columnNumber: 178; Invalid byte 2 of 2-byte UTF-8 sequence.

This must be an encoding problem, but I'm not sure what I'm doing wrong. According to the BEAST 2 FAQ (https://www.beast2.org/2021/05/17/beast-xml.html), UTF-8 should be the encoding of the input file:

The only part allowed before the beast element is the XML declaration (which contains some information about the XML format), which should look like this <?xml version="1.0" encoding="UTF-8" standalone="no"?>.

But my output file already contains a header virtually identical to this one: I ensure that this happens in the following line of code in to_beast:

et.ElementTree(beast_xml).write(file_addr, encoding='utf-8', xml_declaration=True, pretty_print=True)

And even if I change the XML header to match the one presented in the BEAST 2 FAQ exactly, I still get the same error.

Usually, this means that the problem is with the encoding of the file itself, but when I open it in VS Code, it does indeed appear to have UTF-8 encoding.

If I slugify the Greek reading texts (used as the state labels) to be written in ASCII rather than Unicode, then I avoid this error. So I could just write my output to ASCII format and leave out the XML header, as some of the example files I've seen do. But if BEAST 2 actually does support Unicode input, as the FAQ suggests it does, then shouldn't it be fine with the inputs teiphy is generating?

rbturnbull commented 1 year ago

Good question, @jjmccollum - can you please send me the two versions of the input file, one in Unicode and one in just ASCII?

rbturnbull commented 1 year ago

Hi @jjmccollum - regarding rateCategories.c:1Corinthians - that was part of the non-homogeneous clock model that I was experimenting with. That doesn't need to be in teiphy.

jjmccollum commented 1 year ago

@rbturnbull Thanks for clarifying! I've removed it from the template, along with the operators that modify it.

jjmccollum commented 1 year ago

@rbturnbull Also, I e-mailed you those two versions of the input file. We recently had an update to the student accounts over here, so if you haven't received an e-mail yet, please let me know.

jjmccollum commented 1 year ago

All right, I was able to resolve the issue with the SAXParseException (see https://github.com/CompEvol/beast2/issues/1076). Turns out it was a locale issue with my Java Virtual Machine; if I enter

set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8

on the command line and then open the BEAST GUI with

BEAST.xml

then the parser reads the UTF-8 input correctly.
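
For anyone launching BEAST from a script rather than from the command line, the same setting can be passed through the process environment. The sketch below only builds the environment; the utf8_java_env helper is hypothetical, and the launcher command shown in the trailing comment is an assumption that depends on how BEAST is installed.

```python
import os

UTF8_FLAG = "-Dfile.encoding=UTF-8"


def utf8_java_env(base_env=None):
    """Return a copy of the environment with the UTF-8 flag in JAVA_TOOL_OPTIONS."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("JAVA_TOOL_OPTIONS", "")
    # Keep any options that are already set and avoid adding the flag twice.
    if UTF8_FLAG not in existing.split():
        env["JAVA_TOOL_OPTIONS"] = f"{existing} {UTF8_FLAG}".strip()
    return env


print(utf8_java_env({})["JAVA_TOOL_OPTIONS"])

# Assumed usage (the "beast" command name and the input file name are placeholders):
# import subprocess
# subprocess.run(["beast", "input.xml"], env=utf8_java_env(), check=True)
```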

With that resolved, I've been able to continue migrating the XML template by debugging it against BEAST 2.7. I've now addressed the namespace errors and am dealing with more substantial errors.

Perhaps the most significant issue is that I have to figure out if it's necessary to remove constant sites from the alignment for input to BEAST. If I leave them in, then I get XML parsing errors because a transition matrix element is required in a substitution model, and its entries can't be empty—but there aren't any off-diagonal entries in a 1 x 1 matrix.

There is some documentation on how ascertainment bias correction can be set up in a BEAST XML input (https://www.beast2.org/2019/07/18/ascertainment-correction.html), but the proposed approach seems to assume that the same states occur in every site, which is less applicable for a Lewis-style model with multiple states (and in our case, the number of states and their meanings are not at all interchangeable from site to site). That said, it may be best just to force the omission of constant sites in the conversion to BEAST XML, unless you know of a better way.

In the meantime, I am currently trying to debug the following error from the parser:

java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
    at beast.base.evolution.alignment.Sequence.initProbabilities(Unknown Source)
    at beast.base.evolution.alignment.Sequence.initAndValidate(Unknown Source)
    at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
    at beast.base.parser.XMLParser.parse(Unknown Source)
    at beast.base.parser.XMLParser.parseFile(Unknown Source)
    at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
    at beastfx.app.beast.Controller$2.run(Unknown Source)
beast.base.parser.XMLParserException: 
Error 110 parsing the xml input file

validate and intialize error: Index 2 out of bounds for length 2

Error detected about here:
  <beast>
      <data id='alignment' spec='Alignment'>
          <sequence spec='Sequence'>

    at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
    at beast.base.parser.XMLParser.parse(Unknown Source)
    at beast.base.parser.XMLParser.parseFile(Unknown Source)
    at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
    at beastfx.app.beast.Controller$2.run(Unknown Source)

So it seems to be something wrong with one of the sequence elements.

jjmccollum commented 1 year ago

The error logged above seems to be arising from indexing strs[i] or pr[j] in this method of Sequence.java:

public void initProbabilities() {

    String data = dataInput.get();
    // remove spaces
    data = data.replaceAll("\\s", "");

    String str = data.trim();
    String[] strs = str.split(";");     
    for (int i=0; i<strs.length; i++) {
        String[] pr = strs[i].split(",");
        //double total = 0;
        for (int j=0; j<pr.length; j++) {               
            if (likelihoods == null) likelihoods = new double[strs.length][pr.length];
            likelihoods[i][j] = Double.parseDouble(pr[j].trim());
            //total += likelihoods[i][j]; 
        }           
    }
}

But it's not clear to me which sequence in the XML file is causing this error. Every sequence has 39 ";" character delimiters (and thus 39 + 1 = 40 substantive variation units) and 58 "," state delimiters (and thus 40 + 58 = 98 substantive readings) as expected.

jjmccollum commented 1 year ago

Oh, I see the issue now. The following memory allocation line assumes that all sites have the same number of states:

if (likelihoods == null) likelihoods = new double[strs.length][pr.length];

It looks like this is a BEAST issue. I'll go ahead and write it up on that repo.
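
To make the failure mode concrete, here is a small Python port of the parsing logic (an illustration only, not BEAST code; the function names are hypothetical). The first function sizes every row by the first site it sees, as the Java allocation does, and fails once a later site has more states. The second builds one row per site with that site's own length.

```python
def init_probabilities_rectangular(data):
    """Mimic the Java logic: allocate a rectangular array sized by the first site."""
    sites = "".join(data.split()).split(";")  # strip whitespace, split into sites
    likelihoods = None
    for i, site in enumerate(sites):
        probs = site.split(",")
        for j, prob in enumerate(probs):
            if likelihoods is None:
                # Every row gets the state count of the first site.
                likelihoods = [[0.0] * len(probs) for _ in sites]
            likelihoods[i][j] = float(prob)  # IndexError if this site has more states
    return likelihoods


def init_probabilities_jagged(data):
    """Allocate each row separately so sites may have different state counts."""
    sites = "".join(data.split()).split(";")
    return [[float(prob) for prob in site.split(",")] for site in sites]


sequence = "1,0; 0,1,0"  # a 2-state site followed by a 3-state site
print(init_probabilities_jagged(sequence))
try:
    init_probabilities_rectangular(sequence)
except IndexError as error:
    print("rectangular allocation failed:", error)
```

With a 2-state site first and a 3-state site second, the rectangular version fails at index 2 of a row of length 2, which matches the "Index 2 out of bounds for length 2" message in the trace.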

jjmccollum commented 1 year ago

All right, I've raised the issue at https://github.com/CompEvol/beast2/issues/1077.

jjmccollum commented 1 year ago

With apologies for the long string of recent commits, I now have a GitHub workflow for BEAST that works as it should. It's still failing, but that's due to the initProbabilities error detailed above (and currently being resolved in an issue on the BEAST repo). I'll have to wait for that issue to be resolved before I can proceed.

jjmccollum commented 1 year ago

Two updates:

  1. The previous issue at https://github.com/CompEvol/beast2/issues/1077 has been resolved. Unfortunately, because the beast.yml workflow sets up BEAST by downloading the latest release from GitHub, the workflow won't get around this error until the next release is made.
  2. I've made some changes to the to_beast method so that singleton sites (i.e., variation units with only one substantive reading) can be included in the output. I just add a dummy state to each of these sites and add strip="true" to the alignment element so that all of these sites will be assigned a weight of 0. The siteModel for each constant site also incorporates the dummy state into the root frequencies (where the dummy state is assigned a value of 0) and into the substModel element (which corresponds to a 2x2 rate matrix with off-diagonal entries set to the default_rate parameter).

jjmccollum commented 1 year ago

@rbturnbull Here's a question about the birth-death skyline model: Did any problem-specific details for the tradition of 1 Corinthians inform your choice of the origin parameter (which you fix at 1250)? The other three parameters of the model are estimated, so I assume that their initial values are just reasonable initial guesses. I just want to make sure that the BEAST XML output by teiphy is generalizable to other traditions.

jjmccollum commented 1 year ago

@rbturnbull All right, I've worked out a better GitHub workflow for BEAST that pulls and builds the latest source code. With that, I've been able to debug further into the process. I'm nearly there, but I'm currently running into the following error as BEAST is initializing the tree likelihoods:

Failed to load BEAGLE library: no hmsbeagle-jni in java.library.path: /opt/hostedtoolcache/Python/3.10.9/x64/lib:/usr/local/lib:/home/runner/.beast/beast/jre/lib/amd64
TreeLikelihood(morphTreeLikelihood.character1) uses BeerLikelihoodCore
  FilteredAlignment(filter1): [taxa, patterns, sites] = [38, 1, 1]
java.lang.NegativeArraySizeException: -1
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.initCore(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.initAndValidate(Unknown Source)
    at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
    at beast.base.parser.XMLParser.parse(Unknown Source)
    at beast.base.parser.XMLParser.parseFile(Unknown Source)
    at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
    at beastfx.app.beast.BeastMain.main(Unknown Source)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at beast.pkgmgmt.launcher.BeastLauncher.run(Unknown Source)
    at beast.pkgmgmt.launcher.BeastLauncher.main(Unknown Source)

Error 110 parsing the xml input file

validate and intialize error: -1

Error detected about here:
  <beast>
      <run id='mcmc' spec='MCMC'>
          <distribution id='posterior' spec='CompoundDistribution'>
              <distribution id='likelihood' spec='CompoundDistribution'>
                  <distribution id='morphTreeLikelihood.character1' spec='TreeLikelihood'>

I don't know if this is simply an error because BEAGLE is needed or if something else is throwing the NegativeArraySizeException. We do have 38 witnesses in the collation, so [taxa, patterns, sites] = [38, 1, 1] should be correct for each FilteredAlignment, right?

jjmccollum commented 1 year ago

For convenience, here is the block of code where the exception gets thrown:

    protected void setPartials(Node node, int patternCount) {
        if (node.isLeaf()) {
            Alignment data = dataInput.get();
            int states = data.getDataType().getStateCount();
            double[] partials = new double[patternCount * states];
            int k = 0;
            int taxonIndex = getTaxonIndex(node.getID(), data);
            for (int patternIndex_ = 0; patternIndex_ < patternCount; patternIndex_++) {                
                double[] tipLikelihoods = data.getTipLikelihoods(taxonIndex,patternIndex_);
                if (tipLikelihoods != null) {
                    for (int state = 0; state < states; state++) {
                        partials[k++] = tipLikelihoods[state];
                    }
                }
                else {
                    int stateCount = data.getPattern(taxonIndex, patternIndex_);
                    boolean[] stateSet = data.getStateSet(stateCount);
                    for (int state = 0; state < states; state++) {
                         partials[k++] = (stateSet[state] ? 1.0 : 0.0);                
                    }
                }
            }
            likelihoodCore.setNodePartials(node.getNr(), partials);

        } else {
            setPartials(node.getLeft(), patternCount);
            setPartials(node.getRight(), patternCount);
        }
    }

rbturnbull commented 1 year ago

> @rbturnbull Here's a question about the birth-death skyline model: Did any problem-specific details for the tradition of 1 Corinthians inform your choice of the origin parameter (which you fix at 1250)? The other three parameters of the model are estimated, so I assume that their initial values are just reasonable initial guesses. I just want to make sure that the BEAST XML output by teiphy is generalizable to other traditions.

Hi @jjmccollum - I think I was going for a rough date of AD 100 for the initial collection of the Pauline corpus, and the latest witness I was using was dated to around 1350, which gives an origin of 1350 - 100 = 1250. We could have estimated the root date, I think.

rbturnbull commented 1 year ago

> @rbturnbull All right, I've worked out a better GitHub workflow for BEAST that pulls and builds the latest source code. With that, I've been able to debug further into the process. I'm nearly there, but I'm currently running into the following error as BEAST is initializing the tree likelihoods:

I don't think you'll need BEAGLE. Do you have the XML that generated the error? If we look at character1, that might show us what's going on.

jjmccollum commented 1 year ago

@rbturnbull Yeah, I have the XML. Here's the element for character 1:

<distribution spec="TreeLikelihood" id="morphTreeLikelihood.character1" useAmbiguities="true" useTipLikelihoods="true" tree="@tree">
  <data spec="FilteredAlignment" id="filter1" data="@alignment" filter="1">
      <userDataType spec="StandardData" id="morphDataType.character1"/>
  </data>
  <siteModel spec="SiteModel" id="morphSiteModel.character1">
      <parameter spec="parameter.RealParameter" id="mutationRate.character1" name="mutationRate" value="1.0" estimate="false"/>
      <parameter spec="parameter.RealParameter" id="gammaShape.character1" name="shape" value="1.0" estimate="false"/>
      <substModel spec="GeneralSubstitutionModel" id="substModel.character1">
          <!-- Equilibrium frequencies -->
          <frequencies spec="Frequencies" id="equilibriumfreqs.character1">
              <frequencies spec="parameter.RealParameter" id="equilibriumfrequencies.character1" value="0.5 0.5" estimate="false"/>
          </frequencies>
          <parameter spec="parameter.CompoundValuable" id="rates.character1" name="rates">
              <!-- Start rate vars -->
              <var idref="default_rate"/><var spec="RPNcalculator" expression="Clar_rate Byz_rate +"><parameter idref="Clar_rate"/><parameter idref="Byz_rate"/></var><!-- End rate vars -->
          </parameter>
      </substModel>
  </siteModel>
  <!-- root frequencies -->
  <rootFrequencies spec="Frequencies" id="rootfreqs.character1">
      <frequencies spec="parameter.RealParameter" id="rootfrequencies.character1" value="0.6000000000000001 0.4" estimate="false"/>
  </rootFrequencies>
  <branchRateModel idref="strictClock"/>
</distribution>

jjmccollum commented 1 year ago

@rbturnbull Regarding the java.lang.NegativeArraySizeException, the problem is that the setPartials method is attempting to initialize an array with a size of -1. The only place where such an initialization occurs in the method is on the line

double[] partials = new double[patternCount * states];

So it seems that somehow, patternCount * states == -1. It seems plausible that the problem is coming from the states variable, initialized on the line

int states = data.getDataType().getStateCount();

The Alignment.getDataType method, in turn, gets the DataType instance associated with the FilteredAlignment for the site. The DataType.getStateCount method is declared as follows in the DataType interface:

/**
  * @return number of states for this data type. Assuming there is a finite
  *         number of states, or -1 otherwise.
  */
int getStateCount();

The problem is that the userDataType element under each site's distribution element was lacking a nrOfStates attribute. I've added this in the latest commit, and the workflow is now proceeding without this error.
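
A sketch of what the fix amounts to on the Python side, assuming the per-site elements are built programmatically. The make_user_data_type and ids_missing_state_count helpers are hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml. The second helper is a sanity check that could catch this omission before BEAST ever sees the file.

```python
import xml.etree.ElementTree as ET


def make_user_data_type(site_number, n_states):
    """Build a per-site userDataType element that always carries nrOfStates."""
    if n_states < 1:
        raise ValueError("a site needs at least one state")
    return ET.Element(
        "userDataType",
        {
            "spec": "StandardData",
            "id": f"morphDataType.character{site_number}",
            "nrOfStates": str(n_states),
        },
    )


def ids_missing_state_count(root):
    """List the ids of userDataType elements that lack a nrOfStates attribute."""
    return [
        element.get("id")
        for element in root.iter("userDataType")
        if element.get("nrOfStates") is None
    ]


data = ET.Element("data", {"spec": "FilteredAlignment", "id": "filter1"})
data.append(make_user_data_type(1, 2))
print(ids_missing_state_count(data))
```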

jjmccollum commented 1 year ago

@rbturnbull Now I'm running into the following error at site 3:

java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 2
    at beast.base.evolution.datatype.StandardData.getStatesForCode(Unknown Source)
    at beast.base.evolution.datatype.DataType$Base.isAmbiguousCode(Unknown Source)
    at beast.base.evolution.alignment.FilteredAlignment.calcPatterns(Unknown Source)
    at beast.base.evolution.alignment.FilteredAlignment.initAndValidate(Unknown Source)
    at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
    at beast.base.parser.XMLParser.parse(Unknown Source)
    at beast.base.parser.XMLParser.parseFile(Unknown Source)
    at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
    at beastfx.app.beast.BeastMain.main(Unknown Source)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at beast.pkgmgmt.launcher.BeastLauncher.run(Unknown Source)
    at beast.pkgmgmt.launcher.BeastLauncher.main(Unknown Source)

Here is the method in the StandardData class where the exception gets thrown:

public int[] getStatesForCode(int state) {
        if (state >= 0) {
            return mapCodeToStateSet[state];
        } else {
            return mapCodeToStateSet[mapCodeToStateSet.length - 1];
        }
    }

And here is the block of the FilteredAlignment.calcPatterns method that invokes this method:

    protected void calcPatterns() {
        int nrOfTaxa = counts.size();
        int nrOfSites = filter.length;
        DataType baseType = alignmentInput.get().m_dataType;
        // convert data to transposed int array
        int[][] data = new int[nrOfSites][nrOfTaxa];
        String missingChar = Character.toString(DataType.MISSING_CHAR);
        String gapChar = Character.toString(DataType.GAP_CHAR);
        for (int i = 0; i < nrOfTaxa; i++) {
            List<Integer> sites = counts.get(i);
            for (int j = 0; j < nrOfSites; j++) {
                data[j][i] = sites.get(filter[j]);
                if (convertDataType) {
                    try {
                        boolean needsBrackets = baseType.isAmbiguousCode(data[j][i]) && 
                                ! baseType.getCharacter(data[j][i]).equals(missingChar) &&
                                ! baseType.getCharacter(data[j][i]).equals(gapChar);
                        String code = needsBrackets ?
                                "{"+baseType.getCharacter(data[j][i]) + "}" :
                                    baseType.getCharacter(data[j][i]);
                        data[j][i] = m_dataType.stringToEncoding(code).get(0);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        }

Apparently, in the following 2-state site, we are trying to get states for a code whose index is 3:

<charstatelabels spec="UserDataType" characterName="B10K1V14U2" codeMap="0=0, 1=1, ?=0 1" states="2" value="ο, ος"/>
...
<distribution spec="TreeLikelihood" id="morphTreeLikelihood.character3" useAmbiguities="true" useTipLikelihoods="true" tree="@tree">
  <data spec="FilteredAlignment" id="filter3" data="@alignment" filter="3">
      <userDataType spec="StandardData" id="morphDataType.character3" nrOfStates="2"/>
  </data>
  <siteModel spec="SiteModel" id="morphSiteModel.character3">
      <parameter spec="parameter.RealParameter" id="mutationRate.character3" name="mutationRate" value="1.0" estimate="false"/>
      <parameter spec="parameter.RealParameter" id="gammaShape.character3" name="shape" value="1.0" estimate="false"/>
      <substModel spec="GeneralSubstitutionModel" id="substModel.character3">
          <!-- Equilibrium frequencies -->
          <frequencies spec="Frequencies" id="equilibriumfreqs.character3">
              <frequencies spec="parameter.RealParameter" id="equilibriumfrequencies.character3" value="0.5 0.5" estimate="false"/>
          </frequencies>
          <parameter spec="parameter.CompoundValuable" id="rates.character3" name="rates">
              <!-- Start rate vars -->
              <var spec="RPNcalculator" expression="LingConf_rate Byz_rate +">
                <parameter idref="LingConf_rate"/>
                <parameter idref="Byz_rate"/>
              </var>
              <var idref="Clar_rate"/>
              <!-- End rate vars -->
          </parameter>
      </substModel>
  </siteModel>
  <!-- root frequencies -->
  <rootFrequencies spec="Frequencies" id="rootfreqs.character3">
      <frequencies spec="parameter.RealParameter" id="rootfrequencies.character3" value="0.8 0.2" estimate="false"/>
  </rootFrequencies>
  <branchRateModel idref="strictClock"/>
</distribution>

jjmccollum commented 1 year ago

This means that for some site j and taxon i, the state data[j][i] in the FilteredAlignment with id="filter3" has a value of 3. Now I just need to figure out where it's getting that...

jjmccollum commented 1 year ago

Tracing things backwards a bit, we have

data[j][i] = sites.get(filter[j]);

For the FilteredAlignment with id="filter3", the filter array should have only one entry, which should be the (zero-based) index of the site (i.e., 2). The sites list is initialized for taxon i a couple lines earlier:

List<Integer> sites = counts.get(i);

Working back from there, counts is a member of the FilteredAlignment class; it is initialized in the initAndValidate method of the class:

counts = data.getCounts();

The Alignment.getCounts() method, in turn, is defined as follows:

/**
  * Returns a List of Integer Lists where each Integer List represents
  * the sequence corresponding to a taxon.  The taxon is identified by
  * the position of the Integer List in the outer List, which corresponds
  * to the nodeNr of the corresponding leaf node and the position of the
  * taxon name in the taxaNames list.
  *
  * @return integer representation of sequence alignment
  */
public List<List<Integer>> getCounts() {
    return counts;
}

And here is the loop in the Alignment.initializeWithSequenceList method that populates this list of lists:

            for (Sequence seq : sequences) {
                counts.add(seq.getSequence(m_dataType));
                if (taxaNames.contains(seq.getTaxon())) {
                    throw new RuntimeException("Duplicate taxon found in alignment: " + seq.getTaxon());
                }
                taxaNames.add(seq.getTaxon());
                tipLikelihoods.add(seq.getLikelihoods());
                // if seq.isUncertain() == false then the above line adds 'null'
                // to the list, indicating that this particular sequence has no tip likelihood information
                usingTipLikelihoods |= (seq.getLikelihoods() != null);
                if (seq.totalCountInput.get() != null) {
                    stateCounts.add(seq.totalCountInput.get());
                } else {
                    stateCounts.add(m_dataType.getStateCount());
                }
            }
            if (counts.size() == 0) {
                // no sequence data
                throw new RuntimeException("Sequence data expected, but none found");
            }

So if I understand correctly, the value of 3 seems to be creeping in somewhere in here.
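The lookup chain traced in this comment can be sketched in a few lines of Python. This is only an illustration of the indexing, not BEAST code: the helper names parse_filter, filtered_states, and out_of_range are hypothetical, the counts values are made up, and only a single-site filter such as "3" is handled.

```python
# Hypothetical stand-in for the counts -> sites -> data[j][i] lookup chain.
# Names mirror the Java snippets, but none of this is BEAST's implementation.

def parse_filter(filter_spec: str) -> list[int]:
    """Convert a one-based single-site filter such as "3" to zero-based indices."""
    return [int(filter_spec) - 1]


def filtered_states(counts: list[list[int]], filter_spec: str) -> list[list[int]]:
    """Return data[j][i]: the state of taxon i at filtered site j."""
    site_indices = parse_filter(filter_spec)
    return [[sites[k] for sites in counts] for k in site_indices]


def out_of_range(data: list[list[int]], n_states: int) -> list[tuple[int, int, int]]:
    """List (site j, taxon i, state) for every state that is >= n_states."""
    return [
        (j, i, state)
        for j, row in enumerate(data)
        for i, state in enumerate(row)
        if state >= n_states
    ]


# counts[i] is the integer-encoded sequence of taxon i (made-up values).
counts = [
    [0, 1, 0, 2],
    [1, 1, 3, 0],
    [0, 0, 1, 1],
]
data = filtered_states(counts, "3")      # states of all taxa at the third site
offenders = out_of_range(data, n_states=2)
```

A check like out_of_range makes it easy to see which taxon carries the unexpected state before the value reaches the data type lookup.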

jjmccollum commented 1 year ago

Here is where the list of states that is added to the counts list is retrieved:

    public List<Integer> getSequence(DataType dataType) {
        List<Integer> sequence;
        if (uncertain) {
            sequence = new ArrayList<>();
            for (int i=0; i<likelihoods.length; i++) {
                double m = likelihoods[i][0];
                int index = 0;
                for (int j=0; j<likelihoods[i].length; j++) {
                    if (likelihoods[i][j] > m ) {
                        m = likelihoods[i][j];
                        index = j;
                    }               
                }
                sequence.add(index);
            }
        }
        else {
            String data = dataInput.get();
            // remove spaces
            data = data.replaceAll("\\s", "");
            sequence = dataType.stringToEncoding(data);
        }

        if (totalCountInput.get() == null) {
            // derive default from char-map
            totalCountInput.setValue(dataType.getStateCount(), this);
        }
        return sequence;
    }

For our purposes, we enter the if (uncertain) block, since our sequence values are given as tip likelihoods. This block then finds the index of the state with the highest likelihood for each site and treats this as the numerical index for a single representative state at site i in the sequence. But assuming the Sequence.initProbabilities method is working correctly following the changes in https://github.com/CompEvol/beast2/issues/1077, the indices in this converted state list for the sequence should all be less than the number of states at their respective sites.
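As a rough Python rendering of the uncertain branch above (a sketch for checking the reasoning, with made-up likelihood rows and a hypothetical function name), the per-site index is simply the first position holding the largest likelihood, so it can never reach the length of that site's row:

```python
def most_likely_states(likelihoods: list[list[float]]) -> list[int]:
    """For each site, return the index of the first state with the highest likelihood."""
    sequence = []
    for site in likelihoods:
        best, index = site[0], 0
        for j, value in enumerate(site):
            # Strict comparison, as in the Java loop: ties keep the earlier state.
            if value > best:
                best, index = value, j
        sequence.append(index)
    return sequence


# One row per site; rows may have different lengths (made-up values).
likelihoods = [
    [0.0, 1.0],        # 2-state site, second state observed
    [1.0, 0.0, 0.0],   # 3-state site, first state observed
    [0.5, 0.5],        # ambiguous 2-state site
]
sequence = most_likely_states(likelihoods)
in_range = all(index < len(site) for index, site in zip(sequence, likelihoods))
```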

jjmccollum commented 1 year ago

@rbturnbull All right, I solved the problem! I had to add the nrOfStates attribute to the userDataType element that contains the charstatelabels elements. I'm still not exactly sure why that fixes the error I was seeing, but it probably has something to do with the totalCountInput that appears in the above code snippets.
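For illustration, here is one way the attribute could be set with Python's standard xml.etree.ElementTree. This is a sketch, not the code on the beast-xml branch: the second charstatelabels element is invented, and taking the largest states value as nrOfStates is an assumption about what the attribute should hold.

```python
import xml.etree.ElementTree as ET

# Trimmed-down fragment; the second charstatelabels element is hypothetical.
snippet = """
<data id="alignment" spec="Alignment" dataType="standard">
    <userDataType id="StandardData.0" spec="beast.evolution.datatype.StandardData">
        <charstatelabels spec="UserDataType" characterName="B10K1V14U2" codeMap="0=0, 1=1, ?=0 1" states="2" value="ο, ος"/>
        <charstatelabels spec="UserDataType" characterName="example" codeMap="0=0, 1=1, 2=2, ?=0 1 2" states="3" value="a, b, c"/>
    </userDataType>
</data>
"""

root = ET.fromstring(snippet.strip())
user_data_type = root.find("userDataType")

# Assumption: nrOfStates should cover the largest per-site state count.
max_states = max(
    int(label.get("states")) for label in user_data_type.findall("charstatelabels")
)
user_data_type.set("nrOfStates", str(max_states))
```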

But with that solved, the beast.yml workflow is now parsing the XML file all the way through! We now have a deeper and thornier problem to solve in the calculations of likelihoods:

Singular matrix encountered
java.lang.IllegalArgumentException: Singular matrix
    at beast.base.evolution.substitutionmodel.DefaultEigenSystem.luinverse(Unknown Source)
    at beast.base.evolution.substitutionmodel.DefaultEigenSystem.decomposeMatrix(Unknown Source)
    at beast.base.evolution.substitutionmodel.GeneralSubstitutionModel.getTransitionProbabilities(Unknown Source)
    at beast.base.evolution.substitutionmodel.GeneralSubstitutionModel.getTransitionProbabilities(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.traverse(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.traverse(Unknown Source)
    at beast.base.evolution.likelihood.TreeLikelihood.calculateLogP(Unknown Source)
    at beast.base.inference.CompoundDistribution.calculateLogP(Unknown Source)
    at beast.base.inference.CompoundDistribution.calculateLogP(Unknown Source)
    at beast.base.inference.State.robustlyCalcPosterior(Unknown Source)
    at beast.base.inference.MCMC.run(Unknown Source)
    at beastfx.app.beast.BeastMCMC.run(Unknown Source)
    at beastfx.app.beast.BeastMain.main(Unknown Source)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at beast.pkgmgmt.launcher.BeastLauncher.run(Unknown Source)
    at beast.pkgmgmt.launcher.BeastLauncher.main(Unknown Source)

Unfortunately, allowing for general specifications of the off-diagonal entries in transition matrices makes it possible that the resulting matrix will be singular. This may be something we have to address in https://github.com/CompEvol/beast2/issues/1075. I'm not sure if there's a simple way for teiphy to check this with arbitrary transition matrix entries ahead of time. We might be able to avoid the problem in practice if we assign random starting values to the rate parameters.

jjmccollum commented 1 year ago

Okay, I tried assigning random starting values to the rate parameters, and that didn't fix things. So we'll need to have a more involved solution to avoid a singular matrix.

jjmccollum commented 1 year ago

@rbturnbull Thankfully, things weren't as bad as I thought! When I specified the --drop-constant flag in beast.yml, BEAST ran end-to-end on the output XML file without complaining about a singular matrix. From this, I realized that the substModel I supplied for singleton sites was the culprit. Specifically, I was setting the equilibrium frequencies to 1 for the constant state and 0 for the dummy state, and this was creating the issue. I've changed the equilibrium frequencies to 0.5 and 0.5, and now beast.yml runs end-to-end even without the --drop-constant flag. So the feature appears to have been implemented successfully! (I still need to update the tests to get back to 100% coverage, but the hard part is done now.)
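One plausible reading of why frequencies of 1 and 0 break the calculation, offered as a sketch rather than a confirmed diagnosis: if the rate matrix is built by scaling each relative rate by the frequency of the target state and is then normalised by the expected substitution rate, a zero frequency makes that expected rate zero, and the normalisation would divide by zero. The helpers build_rate_matrix and expected_rate below are hypothetical and do not reproduce BEAST's code.

```python
def build_rate_matrix(rel_rates: list[list[float]], freqs: list[float]) -> list[list[float]]:
    """Assumed construction: q[i][j] = rel_rates[i][j] * freqs[j], rows summing to zero."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = rel_rates[i][j] * freqs[j]
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)
    return q


def expected_rate(q: list[list[float]], freqs: list[float]) -> float:
    """Frequency-weighted rate of leaving each state; a typical normalising constant."""
    return sum(-q[i][i] * freqs[i] for i in range(len(freqs)))


rel_rates = [[0.0, 1.0], [1.0, 0.0]]

degenerate = [1.0, 0.0]   # constant state and dummy state
uniform = [0.5, 0.5]      # the replacement values used in the fix

rate_degenerate = expected_rate(build_rate_matrix(rel_rates, degenerate), degenerate)
rate_uniform = expected_rate(build_rate_matrix(rel_rates, uniform), uniform)
```

Under these assumptions rate_degenerate is zero while rate_uniform is positive, which is consistent with the 0.5 and 0.5 setting running cleanly.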

The issue of equilibrium frequencies did raise an interesting question, though. I followed your phylopaul XML examples in using uniform distributions for equilibrium frequencies, but in practice, we could estimate the actual equilibrium frequencies using the states in the extant witnesses. (The problem is that conjectured readings would be assigned equilibrium frequencies of 0, which could produce the same problem with singular matrices.) Would there be any advantage to this, or is it best just to assign all states equal equilibrium frequencies?
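To make the zero-frequency concern concrete, here is a small sketch of estimating frequencies from observed states. The additive pseudocount is a generic smoothing device used here only for illustration; it is not something proposed in this thread, and the function name and data are hypothetical.

```python
from collections import Counter


def smoothed_frequencies(observed: list[int], n_states: int, pseudocount: float = 1.0) -> list[float]:
    """Empirical state frequencies with an additive pseudocount so none is zero."""
    counts = Counter(observed)
    total = len(observed) + pseudocount * n_states
    return [(counts.get(state, 0) + pseudocount) / total for state in range(n_states)]


# States read from extant witnesses at one site (made-up values).
# State 2 stands for a conjectured reading that no witness attests.
observed = [0, 0, 0, 1, 0, 1]

raw = smoothed_frequencies(observed, n_states=3, pseudocount=0.0)
smoothed = smoothed_frequencies(observed, n_states=3)
```

With pseudocount=0.0 the conjectured reading gets a frequency of exactly zero, which is the situation described above; any positive pseudocount keeps every state strictly positive.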

rbturnbull commented 1 year ago

Hi @jjmccollum - good questions. The equilibrium frequency for textual states is a challenge. Let's talk more about it when we meet.

jjmccollum commented 1 year ago

@rbturnbull Sounds good! On a related note, we may want to discuss whether the ComplexSubstitutionModel class (which is not constrained to be reversible) is a more suitable choice for our purposes. This class stores the equilibrium frequencies as a member, but it does not appear to use them for its transition probability calculations. In general, we do not need to assume reversibility, as we do not consider the root of the tree arbitrary (i.e., our root frequencies tend to be asymmetrical) in the first place.
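For reference, reversibility here means detailed balance: freqs[i] * q[i][j] equals freqs[j] * q[j][i] for every pair of states. The check below is a generic sketch with made-up matrices and a hypothetical function name; it says nothing about how either BEAST class is implemented.

```python
def is_reversible(q: list[list[float]], freqs: list[float], tol: float = 1e-12) -> bool:
    """Check detailed balance: freqs[i] * q[i][j] == freqs[j] * q[j][i] for all i < j."""
    n = len(freqs)
    return all(
        abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


uniform = [1 / 3, 1 / 3, 1 / 3]

# Symmetric rates: every change is as likely as its reverse.
symmetric = [
    [-2.0, 1.0, 1.0],
    [1.0, -2.0, 1.0],
    [1.0, 1.0, -2.0],
]

# Cyclic rates: changes only run 0 -> 1 -> 2 -> 0, never backwards.
cyclic = [
    [-1.0, 1.0, 0.0],
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
]

symmetric_ok = is_reversible(symmetric, uniform)
cyclic_ok = is_reversible(cyclic, uniform)
```

The cyclic matrix is the kind of directed change a reversible model cannot express, even though the uniform distribution is stationary for it.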

rbturnbull commented 1 year ago

Hi @jjmccollum - I'm just familiarising myself with the code to generate the beast output. I like the way of using templates. Have you considered using Jinja2 for handling this kind of templating? It is designed for exactly this sort of thing.

jjmccollum commented 1 year ago

@rbturnbull I used it for a similar application a while ago, but I'd forgotten about it! It would make serializing lists of elements much more straightforward.

rbturnbull commented 1 year ago

A nice byproduct is that the templates could be stored as regular XML files so syntax highlighting would work.

jjmccollum commented 1 year ago

That is a nice benefit. I don't think it would take too much effort for me to make that change on the beast-xml branch. If you think the tip date sampling won't take too long to implement, I might just wait until we've merged in that code before I proceed, unless you'd like to have the Jinja2 templates in place before you implement that feature.

rbturnbull commented 1 year ago

OK. I might give it a go in the branch I've got now for it and then we can convert to jinja.

rbturnbull commented 1 year ago

@jjmccollum - are you happy for me to convert some of the string format statements to f-strings? (https://miguendes.me/73-examples-to-help-you-master-pythons-f-strings)

jjmccollum commented 1 year ago

@rbturnbull That's fine! I need to get accustomed to using them anyway.
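For anyone following along, the conversion in question looks roughly like this. The element and values are illustrative only and are not taken from the teiphy code.

```python
# Illustrative values; not taken from the teiphy code base.
taxon = "P46"
value = "01?1"

# str.format style: placeholders are filled from the arguments.
old_style = '<sequence taxon="{}" value="{}"/>'.format(taxon, value)

# f-string style: expressions are written inline in the literal.
new_style = f'<sequence taxon="{taxon}" value="{value}"/>'

# Literal braces must be doubled in both styles.
state = 1
bracketed = f"{{{state}}}"
```

Both styles produce the same string; the f-string just keeps each value next to the attribute it fills.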

rbturnbull commented 1 year ago

Hi @jjmccollum - it's good that teiphy produces output compatible with beast 2.7.x. But if we add some extra items in the namespace at the top then we can get the code to run on beast 2.6.x as well. I'm not sure if that's worth it though. We can just say that it only supports 2.7. What do you think?

rbturnbull commented 1 year ago

Hi @jjmccollum - forget my last message. I think it'd be too hard to support both beast 2.6 and 2.7. Let's just support 2.7.

jjmccollum commented 1 year ago

@rbturnbull Apologies for my delayed response! I agree that it should be easier just to support BEAST 2.7 moving forward.

rbturnbull commented 1 year ago

Am I right that it won't run without the latest version of Beast 2 on Github? I'm having trouble compiling that from source. Hopefully I'll work it out and I can run the code properly.

jjmccollum commented 1 year ago

That's correct; I had to patch an issue in the initialization method for the Sequence class. That change is now in the latest source code, but there hasn't been a release incorporating it yet.

rbturnbull commented 1 year ago

I've been able to work out the issue with installing the code from source but it is still failing with an error reading the Sequence.

jjmccollum commented 1 year ago

What does the error look like?

rbturnbull commented 1 year ago

Error 110 parsing the xml input file

validate and intialize error: Index 2 out of bounds for length 2

Error detected about here:
  <beast>
      <data id='alignment' spec='Alignment'>
          <sequence spec='Sequence'>