Closed jjmccollum closed 1 year ago
See https://github.com/CompEvol/beast2/issues/1075 for a discussion of features that may need to be added in a BEAST plugin to make this work.
I've begun implementing this feature on a new beast-xml branch. I'm adding a template string like the following to common.py:
"""
BEAST XML template string
"""
beast_xml_template = """
<beast beautitemplate="Standard" beautistatus="" namespace="beast.core:beast.evolution.alignment:beast.evolution.tree.coalescent:beast.core.util:beast.evolution.nuc:beast.evolution.operators:beast.evolution.sitemodel:beast.evolution.substitutionmodel:beast.evolution.likelihood" required="" version="2.6">
<data id="{id}" spec="Alignment" dataType="standard">
<!-- Start sequences -->
<userDataType id="StandardData.0" spec="beast.evolution.datatype.StandardData" nrOfStates="{nsymbols}">
<!-- Start charstatelabels -->
</userDataType>
</data>
</beast>
"""
My plan is to call str.format() on the string with the appropriate variables, then pass the formatted string to LXML's etree.fromstring() function (etree.parse() expects a filename or file-like object rather than a string of XML). With the XML object in hand, I can append sequences of child elements (e.g., sequence, charstatelabel) populated from the collation data.
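As a rough sketch of that plan (not the actual teiphy code), the following uses the standard library's xml.etree.ElementTree as a stand-in for lxml. The cut-down TEMPLATE, the build_beast_xml helper, and the id/taxon/value attributes on each sequence element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the beast_xml_template string above (illustrative only).
TEMPLATE = """
<beast version="2.6">
    <data id="{id}" spec="Alignment" dataType="standard">
        <userDataType id="StandardData.0" spec="beast.evolution.datatype.StandardData" nrOfStates="{nsymbols}"/>
    </data>
</beast>
"""


def build_beast_xml(alignment_id, nsymbols, sequences):
    """Fill the template, parse it, and add one <sequence> per witness.

    `sequences` maps a witness ID to its (hypothetical) sequence string.
    """
    root = ET.fromstring(TEMPLATE.format(id=alignment_id, nsymbols=nsymbols))
    data = root.find("data")
    # Insert the sequences before <userDataType>, keeping their input order.
    position = list(data).index(data.find("userDataType"))
    for offset, (taxon, value) in enumerate(sequences.items()):
        attributes = {"id": f"seq_{taxon}", "taxon": taxon, "value": value}
        data.insert(position + offset, ET.Element("sequence", attributes))
    return root


root = build_beast_xml("demo", 3, {"P46": "012", "B": "011"})
print(ET.tostring(root, encoding="unicode"))
```

One thing to keep in mind with str.format(): any literal brace in the template has to be doubled ({{ and }}), or the call will treat it as a placeholder.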
Ideally, the template string will be minimal, to avoid overcomplicating matters. For this reason, and to make sure I'm accounting for everything appropriately, I'll post questions about specific elements in this issue as I implement this feature.
@rbturnbull My first question concerns the children of the tree element. Here is what this block looks like in the 1Corinthians-lewismk-strict-nonhomogeneous.xml file:
<tree id="Tree.t:1Corinthians" spec="beast.evolution.tree.Tree" name="stateNode">
<trait id="dateTrait.t:1Corinthians" spec="beast.evolution.tree.TraitSet" traitname="date" value="A=449,D=549,81=1044,ℵ=349,L=849,P=849,Ψ=899,104=1087,365=1149,630=1200,1175=950,1241=1149,1505=1249,1506=1320,1739=950,1881=1350,2464=850,b=800,P46=212,B=349,F=849,G=849,33=849,Ambst=384,sy__p=500,co=350,sy__h=666,C=449,ar=850,d=500,6=1249,vg=400,P11=549,Cl=215,Or=254,048=449,K=849,0243=950">
<taxa id="TaxonSet.1Corinthians2" spec="TaxonSet">
<alignment id="1Corinthians2" spec="FilteredAlignment" filter="1-6,8-15,18-33,36-42,44-48,50-55,57-58,61-63,65-82,84-87,89-92,96,98-99,101-110,112-116,118-119,122,125-126,128-130,132-138,140-143,145-149,151-154,156,158-159,161-164,167,170-175,177-178,180-181,183-190,192,194-202,205,208-215,218-222,224,226,228-231,233-234,236-237,240-243,245-247,250-281,284-286,288-289,291-295,297-305,307,309,311-312,314,316-323,325-337,339-340,342-346,348-351,353-362,364-365,368,371-378,380,382-388,390-392,394-395,397-399,401-420,422-424,426-427,429-430,433-437,440-443,447-452,454">
<data idref="1Corinthians"/>
<userDataType id="morphDataType.1Corinthians2" spec="beast.evolution.datatype.StandardData" ambiguities="12 01 02" nrOfStates="2"/>
</alignment>
</taxa>
</trait>
<taxonset idref="TaxonSet.1Corinthians2"/>
</tree>
If I understand this correctly, the trait element is simply specifying the tip dates for the taxa (=witnesses). What is less clear to me is what the children of this element are doing. It looks like you're specifying a restricted taxon set consisting of the whole collation's sequence alignment data, but filtered for just the sites (=variation units) with two states (=variant readings).
Is this restriction somehow necessary for or related to the taxon dates? Do I need to replicate this structure in the template string, or can I specify the taxa of the tree in a simpler way without filtering the sites?
In other words, would the following block (which links the taxonset to the alignment identified by the {id} placeholder and then links the trait to this taxonset) work in the template?
<tree id="Tree.t:{id}" spec="beast.evolution.tree.Tree" name="stateNode">
<taxonset id="TaxonSet.{id}" spec="TaxonSet" alignment="@{id}"/>
<trait id="dateTrait.t:{id}" spec="beast.evolution.tree.TraitSet" traitname="date" taxa="@TaxonSet.{id}" value="{date_map}"/>
</tree>
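For the {date_map} placeholder, a minimal sketch of how the value could be assembled from a dictionary of witness dates follows. The make_date_map name and the input dictionary are hypothetical. Because the result is substituted into an attribute via str.format(), the sketch also escapes XML-special characters.

```python
from xml.sax.saxutils import escape


def make_date_map(dates):
    """Join witness IDs and dates into the comma-separated 'taxon=date' list
    that the TraitSet value attribute holds in the examples above."""
    pairs = ",".join(f"{witness}={date}" for witness, date in dates.items())
    # Escape &, <, > and double quotes so the result is safe inside value="...".
    return escape(pairs, {'"': "&quot;"})


print(make_date_map({"P46": 212, "B": 349, "Cl": 215}))
```

A witness ID that itself contains "=" or "," would break this format, so such IDs would need to be renamed before reaching this step.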
Hi @jjmccollum - I think this section was left unchanged after saving the XML from BEAUti. I think the morph-models package (https://github.com/CompEvol/morph-models/) divides up the alignment into partitions according to the number of states. My hunch is that the trait component just needs a 'taxa'/'TaxonSet' object and that requires an alignment to work out the taxa and the easiest way is to just give it the first alignment object in the list. I assume that any kind of alignment with the right taxa would work (but it would be good to check the TaxonSet code in BEAST to see what it is doing).
@rbturnbull I think this should be good. It looks like the testTipDates.xml example file in the beast2 GitHub repo does something similar:
<tree spec='beast.base.evolution.tree.ClusterTree' id='tree' clusterType='upgma'>
<trait spec='beast.base.evolution.tree.TraitSet' traitname='date-forward' units='year'
value='
D4Brazi82 = 1982,
D4ElSal83 = 1983,
D4ElSal94 = 1994,
D4Indon76 = 1976,
D4Indon77 = 1977,
D4Mexico84 = 1984,
D4NewCal81 = 1981,
D4Philip64 = 1964,
D4Philip56 = 1956,
D4Philip84 = 1984,
D4PRico86 = 1986,
D4SLanka78 = 1978,
D4Tahiti79 = 1979,
D4Tahiti85 = 1985,
D4Thai63 = 1963,
D4Thai78 = 1978,
D4Thai84 = 1984
'>
<taxa spec='TaxonSet' alignment='@alignment'/>
</trait>
<input name='taxa' idref='alignment'/>
</tree>
And in testTipDates2.xml, the tree element doesn't even have a separate taxa child:
<tree estimate="true" id="tree" name="stateNode">
<trait id="datetrait" spec="beast.base.evolution.tree.TraitSet" traitname="date" units="year">
Lemur_catta=1,
M_fascicularis=1
<taxa spec='beast.base.evolution.alignment.TaxonSet' alignment='@Primates'/>
</trait>
</tree>
that looks good.
@rbturnbull I suspect that this is another artifact of BEAUti, but I wanted to check if you had any idea of what it does and whether it's necessary. The following element occurs under the state element, just before the custom rate parameters:
<stateNode id="rateCategories.c:1Corinthians" spec="parameter.IntegerParameter" dimension="74">1</stateNode>
It is referenced by four other elements: the branchRateModel defined at the first character and three operator elements (with ids CategoriesRandomWalk.c:1Corinthians, CategoriesSwapOperator.c:1Corinthians, and CategoriesUniform.c:1Corinthians).
@rbturnbull Okay, I'm now generating a complete BEAST XML input with teiphy, but I'm running into my first issue with BEAST. I'm using v2.7.3, and I'm getting the following error:
java.lang.IllegalArgumentException: org.xml.sax.SAXParseException; lineNumber: 50; columnNumber: 178; Invalid byte 2 of 2-byte UTF-8 sequence.
This must be an encoding problem, but I'm not sure what I'm doing wrong. According to the BEAST 2 FAQ (https://www.beast2.org/2021/05/17/beast-xml.html), UTF-8 should be the encoding of the input file:
The only part allowed before the beast element is the XML declaration (which contains some information about the XML format), which should look like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
But my output file already contains a header virtually identical to this one: I ensure that this happens in the following line of code in to_beast:
et.ElementTree(beast_xml).write(file_addr, encoding='utf-8', xml_declaration=True, pretty_print=True)
And even if I change the XML header to match the one presented in the BEAST 2 FAQ exactly, I still get the same error.
Usually, this means that the problem is with the encoding of the file itself, but when I open it in VS Code, it does indeed appear to have UTF-8 encoding.
If I slugify the Greek reading texts (used as the state labels) to be written in ASCII rather than Unicode, then I avoid this error. So I could just write my output to ASCII format and leave out the XML header, as some of the example files I've seen do. But if BEAST 2 actually does support Unicode input, as the FAQ suggests it does, then shouldn't it be fine with the inputs teiphy is generating?
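One way to rule out the file itself is to decode its raw bytes strictly as UTF-8. The sketch below writes a tiny document with a Greek state label using the standard library's ElementTree (standing in for the lxml call above, without pretty_print) and then checks the bytes on disk. The first_invalid_utf8_offset helper and the demo file are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


# Write a small document containing Greek text, with an XML declaration.
root = ET.Element("beast")
ET.SubElement(root, "charstatelabels", {"value": "ο, ος"})
path = os.path.join(tempfile.mkdtemp(), "demo.xml")
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

print(first_invalid_utf8_offset(path))
```

If this returns None for the real output file, the bytes are valid UTF-8 and the error is more likely to come from how the reader decodes them than from how they were written.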
good question @jjmccollum - can you please send me the two versions of the input file, one in unicode and one in just ascii?
Hi @jjmccollum - regarding rateCategories.c:1Corinthians - that was part of the non-homogeneous clock model that I was experimenting with. That doesn't need to be in teiphy.
@rbturnbull Thanks for clarifying! I've removed it from the template, along with the operators that modify it.
@rbturnbull Also, I e-mailed you those two versions of the input file. We recently had an update to the student accounts over here, so if you haven't received an e-mail yet, please let me know.
All right, I was able to resolve the issue with the SAXParseException (see https://github.com/CompEvol/beast2/issues/1076). It turns out it was a locale issue with my Java Virtual Machine; if I enter
set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8
on the command line and then open the BEAST GUI with
BEAST.xml
then the parser reads the UTF-8 input correctly.
With that resolved, I've been able to continue migrating the XML template by debugging it against BEAST 2.7. I've now addressed the namespace errors and am dealing with more substantial errors.
Perhaps the most significant issue is that I have to figure out if it's necessary to remove constant sites from the alignment for input to BEAST. If I leave them in, then I get XML parsing errors because a transition matrix element is required in a substitution model, and its entries can't be empty—but there aren't any off-diagonal entries in a 1 x 1 matrix.
There is some documentation on how ascertainment bias correction can be set up in a BEAST XML input (https://www.beast2.org/2019/07/18/ascertainment-correction.html), but the proposed approach seems to assume that the same states occur in every site, which is less applicable for a Lewis-style model with multiple states (and in our case, the number of states and their meanings are not at all interchangeable from site to site). That said, it may be best just to force the omission of constant sites in the conversion to BEAST XML, unless you know of a better way.
In the meantime, I am currently trying to debug the following error from the parser:
java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
at beast.base.evolution.alignment.Sequence.initProbabilities(Unknown Source)
at beast.base.evolution.alignment.Sequence.initAndValidate(Unknown Source)
at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
at beast.base.parser.XMLParser.parse(Unknown Source)
at beast.base.parser.XMLParser.parseFile(Unknown Source)
at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
at beastfx.app.beast.Controller$2.run(Unknown Source)
beast.base.parser.XMLParserException:
Error 110 parsing the xml input file
validate and intialize error: Index 2 out of bounds for length 2
Error detected about here:
<beast>
<data id='alignment' spec='Alignment'>
<sequence spec='Sequence'>
at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
at beast.base.parser.XMLParser.parse(Unknown Source)
at beast.base.parser.XMLParser.parseFile(Unknown Source)
at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
at beastfx.app.beast.Controller$2.run(Unknown Source)
So it seems that something is wrong with one of the sequence elements.
The error logged above seems to be arising from indexing strs[i] or pr[j] in this method of Sequence.java:
public void initProbabilities() {
String data = dataInput.get();
// remove spaces
data = data.replaceAll("\\s", "");
String str = data.trim();
String[] strs = str.split(";");
for (int i=0; i<strs.length; i++) {
String[] pr = strs[i].split(",");
//double total = 0;
for (int j=0; j<pr.length; j++) {
if (likelihoods == null) likelihoods = new double[strs.length][pr.length];
likelihoods[i][j] = Double.parseDouble(pr[j].trim());
//total += likelihoods[i][j];
}
}
}
But it's not clear to me which sequence in the XML file is causing this error. Every sequence has 39 ";" character delimiters (and thus 39 + 1 = 40 substantive variation units) and 58 "," state delimiters (and thus 40 + 58 = 98 substantive readings) as expected.
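The delimiter arithmetic can be checked mechanically. This small sketch assumes a sequence value laid out like the ones described here, with ";" between variation units, "," between the likelihoods within a unit, and no trailing delimiter. The count_units_and_readings name and the sample string are hypothetical.

```python
def count_units_and_readings(value):
    """Count variation units and total readings in a tip-likelihood string."""
    units = "".join(value.split()).split(";")
    readings = sum(len(unit.split(",")) for unit in units)
    return len(units), readings


# Two ";" delimiters -> three units; three "," delimiters -> 3 + 3 = 6 readings.
print(count_units_and_readings("1,0; 0,0,1; 1"))
```

Applied to a real sequence, 39 ";" and 58 "," delimiters would give (40, 98), matching the counts above.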
Oh, I see the issue now. The following memory allocation line assumes that all sites have the same number of states:
if (likelihoods == null) likelihoods = new double[strs.length][pr.length];
It looks like this is a BEAST issue. I'll go ahead and write it up on that repo.
All right, I've raised the issue at https://github.com/CompEvol/beast2/issues/1077.
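To make the failure mode concrete, here is a Python port of the allocation pattern in the initProbabilities snippet above, next to a ragged alternative. It is only an illustration of the problem; the ragged version is one possible shape for a fix, not necessarily what BEAST adopts.

```python
def parse_rectangular(value):
    """Mirror the Java snippet: allocate once, sized from the first site."""
    sites = "".join(value.split()).split(";")
    likelihoods = None
    for i, site in enumerate(sites):
        probs = site.split(",")
        for j, prob in enumerate(probs):
            if likelihoods is None:
                # Every row gets the width of the first site encountered.
                likelihoods = [[0.0] * len(probs) for _ in sites]
            likelihoods[i][j] = float(prob)  # IndexError if this site is wider
    return likelihoods


def parse_ragged(value):
    """Give each site a row of its own length."""
    sites = "".join(value.split()).split(";")
    return [[float(prob) for prob in site.split(",")] for site in sites]


print(parse_ragged("1,0; 0,0,1"))
try:
    parse_rectangular("1,0; 0,0,1")
except IndexError as error:
    print("rectangular allocation failed:", error)
```

With a two-state first site followed by a three-state site, the rectangular version fails at index 2 of a length-2 row, which is the same pattern as the "Index 2 out of bounds for length 2" message.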
With apologies for the long string of recent commits, I now have a GitHub workflow for BEAST that works as it should. It's still failing, but that's due to the initProbabilities error detailed above (and currently being resolved in an issue on the BEAST repo). I'll have to wait for that issue to be resolved before I can proceed.
Two updates:
Since the beast.yml workflow sets up BEAST by downloading the latest release from GitHub, the workflow won't get around this error until the next release is made.
I've updated the to_beast method so that singleton sites (i.e., variation units with only one substantive reading) can be included in the output. I just add a dummy state to each of these sites and add strip="true" to the alignment element so that all of these sites will be assigned a weight of 0. The siteModel for each constant site also incorporates the dummy state into the root frequencies (where the dummy state is assigned a value of 0) and into the substModel element (which corresponds to a 2x2 rate matrix with off-diagonal entries set to the default_rate parameter).
@rbturnbull Here's a question about the birth-death skyline model: Did any problem-specific details for the tradition of 1 Corinthians inform your choice of the origin parameter (which you fix at 1250)? The other three parameters of the model are estimated, so I assume that their initial values are just reasonable initial guesses. I just want to make sure that the BEAST XML output by teiphy is generalizable to other traditions.
@rbturnbull All right, I've worked out a better GitHub workflow for BEAST that pulls and builds the latest source code. With that, I've been able to debug further into the process. I'm nearly there, but I'm currently running into the following error as BEAST is initializing the tree likelihoods:
Failed to load BEAGLE library: no hmsbeagle-jni in java.library.path: /opt/hostedtoolcache/Python/3.10.9/x64/lib:/usr/local/lib:/home/runner/.beast/beast/jre/lib/amd64
TreeLikelihood(morphTreeLikelihood.character1) uses BeerLikelihoodCore
FilteredAlignment(filter1): [taxa, patterns, sites] = [38, 1, 1]
java.lang.NegativeArraySizeException: -1
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.setPartials(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.initCore(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.initAndValidate(Unknown Source)
at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
at beast.base.parser.XMLParser.parse(Unknown Source)
at beast.base.parser.XMLParser.parseFile(Unknown Source)
at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
at beastfx.app.beast.BeastMain.main(Unknown Source)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at beast.pkgmgmt.launcher.BeastLauncher.run(Unknown Source)
at beast.pkgmgmt.launcher.BeastLauncher.main(Unknown Source)
Error 110 parsing the xml input file
validate and intialize error: -1
Error detected about here:
<beast>
<run id='mcmc' spec='MCMC'>
<distribution id='posterior' spec='CompoundDistribution'>
<distribution id='likelihood' spec='CompoundDistribution'>
<distribution id='morphTreeLikelihood.character1' spec='TreeLikelihood'>
I don't know if this is simply an error because BEAGLE is needed or if something else is throwing the NegativeArraySizeException. We do have 38 witnesses in the collation, so [taxa, patterns, sites] = [38, 1, 1] should be correct for each FilteredAlignment, right?
For convenience, here is the block of code where the exception gets thrown:
protected void setPartials(Node node, int patternCount) {
if (node.isLeaf()) {
Alignment data = dataInput.get();
int states = data.getDataType().getStateCount();
double[] partials = new double[patternCount * states];
int k = 0;
int taxonIndex = getTaxonIndex(node.getID(), data);
for (int patternIndex_ = 0; patternIndex_ < patternCount; patternIndex_++) {
double[] tipLikelihoods = data.getTipLikelihoods(taxonIndex,patternIndex_);
if (tipLikelihoods != null) {
for (int state = 0; state < states; state++) {
partials[k++] = tipLikelihoods[state];
}
}
else {
int stateCount = data.getPattern(taxonIndex, patternIndex_);
boolean[] stateSet = data.getStateSet(stateCount);
for (int state = 0; state < states; state++) {
partials[k++] = (stateSet[state] ? 1.0 : 0.0);
}
}
}
likelihoodCore.setNodePartials(node.getNr(), partials);
} else {
setPartials(node.getLeft(), patternCount);
setPartials(node.getRight(), patternCount);
}
}
@rbturnbull Here's a question about the birth-death skyline model: Did any problem-specific details for the tradition of 1 Corinthians inform your choice of the origin parameter (which you fix at 1250)? The other three parameters of the model are estimated, so I assume that their initial values are just reasonable initial guesses. I just want to make sure that the BEAST XML output by teiphy is generalizable to other traditions.
Hi @jjmccollum - I think I was going for a rough date of AD 100 for the initial collection of the Pauline corpus, and the latest witness I was using was dated to around 1350. We could have estimated the root date, I think.
I don't think you'll need BEAGLE. Do you have the XML that you used which generated the error? If we look at character1 then that might show us what's going on
@rbturnbull Yeah, I have the XML. Here's the element for character 1:
<distribution spec="TreeLikelihood" id="morphTreeLikelihood.character1" useAmbiguities="true" useTipLikelihoods="true" tree="@tree">
<data spec="FilteredAlignment" id="filter1" data="@alignment" filter="1">
<userDataType spec="StandardData" id="morphDataType.character1"/>
</data>
<siteModel spec="SiteModel" id="morphSiteModel.character1">
<parameter spec="parameter.RealParameter" id="mutationRate.character1" name="mutationRate" value="1.0" estimate="false"/>
<parameter spec="parameter.RealParameter" id="gammaShape.character1" name="shape" value="1.0" estimate="false"/>
<substModel spec="GeneralSubstitutionModel" id="substModel.character1">
<!-- Equilibrium frequencies -->
<frequencies spec="Frequencies" id="equilibriumfreqs.character1">
<frequencies spec="parameter.RealParameter" id="equilibriumfrequencies.character1" value="0.5 0.5" estimate="false"/>
</frequencies>
<parameter spec="parameter.CompoundValuable" id="rates.character1" name="rates">
<!-- Start rate vars -->
<var idref="default_rate"/><var spec="RPNcalculator" expression="Clar_rate Byz_rate +"><parameter idref="Clar_rate"/><parameter idref="Byz_rate"/></var><!-- End rate vars -->
</parameter>
</substModel>
</siteModel>
<!-- root frequencies -->
<rootFrequencies spec="Frequencies" id="rootfreqs.character1">
<frequencies spec="parameter.RealParameter" id="rootfrequencies.character1" value="0.6000000000000001 0.4" estimate="false"/>
</rootFrequencies>
<branchRateModel idref="strictClock"/>
</distribution>
@rbturnbull Regarding the java.lang.NegativeArraySizeException, the problem is that the setPartials method is attempting to initialize an array with a size of -1. The only place where such an initialization occurs in the method is on the line
double[] partials = new double[patternCount * states];
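So it seems that somehow, patternCount * states == -1. Since the log reports a single pattern for each FilteredAlignment, patternCount should be 1, which would mean states == -1. So the problem is most likely coming from the states variable, initialized on the line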
int states = data.getDataType().getStateCount();
The Alignment.getDataType method, in turn, gets the DataType instance associated with the FilteredAlignment for the site. The DataType.getStateCount method is declared as follows in the DataType interface:
/**
* @return number of states for this data type. Assuming there is a finite
* number of states, or -1 otherwise.
*/
int getStateCount();
The problem is that the userDataType element under each site's distribution element was lacking a nrOfStates attribute. I've added this in the latest commit, and the workflow is now proceeding without this error.
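A check like the following could catch this before BEAST is ever launched. It is a sketch using the standard library's ElementTree; the user_data_types_missing_state_count helper and the sample XML are hypothetical, and the sample only imitates the structure of the real output.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<beast>
    <distribution id="morphTreeLikelihood.character1" spec="TreeLikelihood">
        <data id="filter1" spec="FilteredAlignment" filter="1">
            <userDataType id="morphDataType.character1" spec="StandardData"/>
        </data>
    </distribution>
    <distribution id="morphTreeLikelihood.character3" spec="TreeLikelihood">
        <data id="filter3" spec="FilteredAlignment" filter="3">
            <userDataType id="morphDataType.character3" spec="StandardData" nrOfStates="2"/>
        </data>
    </distribution>
</beast>
"""


def user_data_types_missing_state_count(xml_text):
    """List the ids of userDataType elements that lack a nrOfStates attribute."""
    root = ET.fromstring(xml_text)
    return [
        element.get("id")
        for element in root.iter("userDataType")
        if element.get("nrOfStates") is None
    ]


print(user_data_types_missing_state_count(SAMPLE))
```

This could run as a unit test on the to_beast output, so that a missing attribute shows up as a test failure rather than as a NegativeArraySizeException from the parser.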
@rbturnbull Now I'm running into the following error at site 3:
java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 2
at beast.base.evolution.datatype.StandardData.getStatesForCode(Unknown Source)
at beast.base.evolution.datatype.DataType$Base.isAmbiguousCode(Unknown Source)
at beast.base.evolution.alignment.FilteredAlignment.calcPatterns(Unknown Source)
at beast.base.evolution.alignment.FilteredAlignment.initAndValidate(Unknown Source)
at beast.base.parser.XMLParser.initBEASTObjects(Unknown Source)
at beast.base.parser.XMLParser.parse(Unknown Source)
at beast.base.parser.XMLParser.parseFile(Unknown Source)
at beastfx.app.beast.BeastMCMC.parseArgs(Unknown Source)
at beastfx.app.beast.BeastMain.main(Unknown Source)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at beast.pkgmgmt.launcher.BeastLauncher.run(Unknown Source)
at beast.pkgmgmt.launcher.BeastLauncher.main(Unknown Source)
Here is the method in the StandardData class where the exception gets thrown:
public int[] getStatesForCode(int state) {
if (state >= 0) {
return mapCodeToStateSet[state];
} else {
return mapCodeToStateSet[mapCodeToStateSet.length - 1];
}
}
And here is the block of the FilteredAlignment.calcPatterns method that invokes this method:
protected void calcPatterns() {
int nrOfTaxa = counts.size();
int nrOfSites = filter.length;
DataType baseType = alignmentInput.get().m_dataType;
// convert data to transposed int array
int[][] data = new int[nrOfSites][nrOfTaxa];
String missingChar = Character.toString(DataType.MISSING_CHAR);
String gapChar = Character.toString(DataType.GAP_CHAR);
for (int i = 0; i < nrOfTaxa; i++) {
List<Integer> sites = counts.get(i);
for (int j = 0; j < nrOfSites; j++) {
data[j][i] = sites.get(filter[j]);
if (convertDataType) {
try {
boolean needsBrackets = baseType.isAmbiguousCode(data[j][i]) &&
! baseType.getCharacter(data[j][i]).equals(missingChar) &&
! baseType.getCharacter(data[j][i]).equals(gapChar);
String code = needsBrackets ?
"{"+baseType.getCharacter(data[j][i]) + "}" :
baseType.getCharacter(data[j][i]);
data[j][i] = m_dataType.stringToEncoding(code).get(0);
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
Apparently, in the following 2-state site, we are trying to get states for a code whose index is 3:
<charstatelabels spec="UserDataType" characterName="B10K1V14U2" codeMap="0=0, 1=1, ?=0 1" states="2" value="ο, ος"/>
...
<distribution spec="TreeLikelihood" id="morphTreeLikelihood.character3" useAmbiguities="true" useTipLikelihoods="true" tree="@tree">
<data spec="FilteredAlignment" id="filter3" data="@alignment" filter="3">
<userDataType spec="StandardData" id="morphDataType.character3" nrOfStates="2"/>
</data>
<siteModel spec="SiteModel" id="morphSiteModel.character3">
<parameter spec="parameter.RealParameter" id="mutationRate.character3" name="mutationRate" value="1.0" estimate="false"/>
<parameter spec="parameter.RealParameter" id="gammaShape.character3" name="shape" value="1.0" estimate="false"/>
<substModel spec="GeneralSubstitutionModel" id="substModel.character3">
<!-- Equilibrium frequencies -->
<frequencies spec="Frequencies" id="equilibriumfreqs.character3">
<frequencies spec="parameter.RealParameter" id="equilibriumfrequencies.character3" value="0.5 0.5" estimate="false"/>
</frequencies>
<parameter spec="parameter.CompoundValuable" id="rates.character3" name="rates">
<!-- Start rate vars -->
<var spec="RPNcalculator" expression="LingConf_rate Byz_rate +">
<parameter idref="LingConf_rate"/>
<parameter idref="Byz_rate"/>
</var>
<var idref="Clar_rate"/>
<!-- End rate vars -->
</parameter>
</substModel>
</siteModel>
<!-- root frequencies -->
<rootFrequencies spec="Frequencies" id="rootfreqs.character3">
<frequencies spec="parameter.RealParameter" id="rootfrequencies.character3" value="0.8 0.2" estimate="false"/>
</rootFrequencies>
<branchRateModel idref="strictClock"/>
</distribution>
This means that for some site j and taxon i, the state data[j][i] in the FilteredAlignment with id="filter3" has a value of 3. Now I just need to figure out where it's getting that...
Tracing things backwards a bit, we have
data[j][i] = sites.get(filter[j]);
For the FilteredAlignment with id="filter3", the filter array should have only one entry, which should be the (zero-based) index of the site (i.e., 2). The sites list is initialized for taxon i a couple of lines earlier:
List<Integer> sites = counts.get(i);
Working back from there, counts is a member of the FilteredAlignment class; it is initialized in the initAndValidate method of the class:
counts = data.getCounts();
The Alignment.getCounts() method, in turn, is defined as follows:
/**
* Returns a List of Integer Lists where each Integer List represents
* the sequence corresponding to a taxon. The taxon is identified by
* the position of the Integer List in the outer List, which corresponds
* to the nodeNr of the corresponding leaf node and the position of the
* taxon name in the taxaNames list.
*
* @return integer representation of sequence alignment
*/
public List<List<Integer>> getCounts() {
return counts;
}
And here is the loop in the Alignment.initializeWithSequenceList method that populates this list of lists:
for (Sequence seq : sequences) {
counts.add(seq.getSequence(m_dataType));
if (taxaNames.contains(seq.getTaxon())) {
throw new RuntimeException("Duplicate taxon found in alignment: " + seq.getTaxon());
}
taxaNames.add(seq.getTaxon());
tipLikelihoods.add(seq.getLikelihoods());
// if seq.isUncertain() == false then the above line adds 'null'
// to the list, indicating that this particular sequence has no tip likelihood information
usingTipLikelihoods |= (seq.getLikelihoods() != null);
if (seq.totalCountInput.get() != null) {
stateCounts.add(seq.totalCountInput.get());
} else {
stateCounts.add(m_dataType.getStateCount());
}
}
if (counts.size() == 0) {
// no sequence data
throw new RuntimeException("Sequence data expected, but none found");
}
So if I understand correctly, the value of 3 seems to be creeping in somewhere in here.
Here is where the list of states that is added to the counts list is retrieved:
public List<Integer> getSequence(DataType dataType) {
List<Integer> sequence;
if (uncertain) {
sequence = new ArrayList<>();
for (int i=0; i<likelihoods.length; i++) {
double m = likelihoods[i][0];
int index = 0;
for (int j=0; j<likelihoods[i].length; j++) {
if (likelihoods[i][j] > m ) {
m = likelihoods[i][j];
index = j;
}
}
sequence.add(index);
}
}
else {
String data = dataInput.get();
// remove spaces
data = data.replaceAll("\\s", "");
sequence = dataType.stringToEncoding(data);
}
if (totalCountInput.get() == null) {
// derive default from char-map
totalCountInput.setValue(dataType.getStateCount(), this);
}
return sequence;
}
For our purposes, we enter the if (uncertain) block, since our sequence values are given as tip likelihoods. This block then finds the index of the state with the highest likelihood for each site and treats this as the numerical index for a single representative state at site i in the sequence. But assuming the Sequence.initProbabilities method is working correctly following the changes in https://github.com/CompEvol/beast2/issues/1077, the indices in this converted state list for the sequence should all be smaller than the number of states at their respective sites.
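That expectation can be tested outside BEAST with a Python rendering of the uncertain branch of getSequence. The two function names and the toy inputs are hypothetical; state_counts stands for the number of states declared for each site.

```python
def most_likely_states(likelihoods):
    """For each site, return the first index holding the maximum likelihood,
    as the uncertain branch of Sequence.getSequence does."""
    return [max(range(len(row)), key=row.__getitem__) for row in likelihoods]


def out_of_range_sites(likelihoods, state_counts):
    """Return the zero-based sites whose chosen state index is not below the
    number of states declared for that site."""
    states = most_likely_states(likelihoods)
    return [
        site
        for site, (state, count) in enumerate(zip(states, state_counts))
        if state >= count
    ]


print(most_likely_states([[0.2, 0.8], [0.5, 0.5], [0.0, 0.0, 1.0]]))
print(out_of_range_sites([[1.0, 0.0], [0.0, 0.0, 0.0, 1.0]], [2, 2]))
```

In the second call, the site at index 1 carries four likelihoods but is declared as a 2-state site, so its chosen index of 3 is flagged. Running this over the generated sequences would show whether any site really produces an index as large as the one in the exception.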
@rbturnbull All right, I solved the problem! I had to add the nrOfStates attribute to the userDataType element that contains the charstatelabels elements. I'm still not exactly sure why that fixes the error I was seeing, but it probably has something to do with the totalCountInput that appears in the above code snippets.
But with that solved, the beast.yml workflow is now parsing the XML file all the way through! We now have a deeper and thornier problem to solve in the calculations of likelihoods:
Singular matrix encountered
java.lang.IllegalArgumentException: Singular matrix
at beast.base.evolution.substitutionmodel.DefaultEigenSystem.luinverse(Unknown Source)
at beast.base.evolution.substitutionmodel.DefaultEigenSystem.decomposeMatrix(Unknown Source)
at beast.base.evolution.substitutionmodel.GeneralSubstitutionModel.getTransitionProbabilities(Unknown Source)
at beast.base.evolution.substitutionmodel.GeneralSubstitutionModel.getTransitionProbabilities(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.traverse(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.traverse(Unknown Source)
at beast.base.evolution.likelihood.TreeLikelihood.calculateLogP(Unknown Source)
at beast.base.inference.CompoundDistribution.calculateLogP(Unknown Source)
at beast.base.inference.CompoundDistribution.calculateLogP(Unknown Source)
at beast.base.inference.State.robustlyCalcPosterior(Unknown Source)
at beast.base.inference.MCMC.run(Unknown Source)
at beastfx.app.beast.BeastMCMC.run(Unknown Source)
at beastfx.app.beast.BeastMain.main(Unknown Source)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at beast.pkgmgmt.launcher.BeastLauncher.run(Unknown Source)
at beast.pkgmgmt.launcher.BeastLauncher.main(Unknown Source)
Unfortunately, allowing for general specifications of the off-diagonal entries in transition matrices makes it possible that the resulting matrix will be singular. This may be something we have to address in https://github.com/CompEvol/beast2/issues/1075. I'm not sure if there's a simple way for teiphy to check this with arbitrary transition matrix entries ahead of time. We might be able to avoid the problem in practice if we assign random starting values to the rate parameters.
Okay, I tried assigning random starting values to the rate parameters, and that didn't fix things. So we'll need to have a more involved solution to avoid a singular matrix.
@rbturnbull Thankfully, things weren't as bad as I thought! When I specified the --drop-constant flag in beast.yml, BEAST ran end-to-end on the output XML file without complaining about a singular matrix. From this, I realized that the substModel I supplied for singleton sites was the culprit. Specifically, I was setting the equilibrium frequencies to 1 for the constant state and 0 for the dummy state, and this was creating the issue. I've changed the equilibrium frequencies to 0.5 and 0.5, and now beast.yml runs end-to-end even without the --drop-constant flag. So the feature appears to have been implemented successfully! (I still need to update the tests to get back to 100% coverage, but the hard part is done now.)
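As a rough illustration of the fix described above, the following Python sketch guards against zero equilibrium frequencies before they are written to the XML. The helper names check_frequencies and uniform_frequencies are hypothetical and not part of teiphy; the only assumption taken from the thread is that a frequency of 0 is what should be avoided.

```python
import math


def check_frequencies(frequencies, tolerance=1e-9):
    """Return the frequencies unchanged if they are strictly positive and sum to 1.

    Raises ValueError otherwise, so a bad substitution model is caught
    before BEAST ever sees the XML.
    """
    if any(f <= 0.0 for f in frequencies):
        raise ValueError(f"non-positive equilibrium frequency in {frequencies}")
    if not math.isclose(sum(frequencies), 1.0, abs_tol=tolerance):
        raise ValueError(f"equilibrium frequencies {frequencies} do not sum to 1")
    return frequencies


def uniform_frequencies(nstates):
    """Equal frequencies for every state, e.g. 0.5 and 0.5 for a singleton site
    padded with one dummy state."""
    return [1.0 / nstates] * nstates


# The corrected setup for a constant site with one dummy state passes the check.
singleton = check_frequencies(uniform_frequencies(2))
```

Calling check_frequencies([1.0, 0.0]) raises a ValueError, which corresponds to the original setup for singleton sites.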
The issue of equilibrium frequencies did raise an interesting question, though. I followed your phylopaul XML examples in using uniform distributions for equilibrium frequencies, but in practice, we could estimate the actual equilibrium frequencies using the states in the extant witnesses. (The problem is that conjectured readings would be assigned equilibrium frequencies of 0, which could produce the same problem with singular matrices.) Would there be any advantage to this, or is it best just to assign all states equal equilibrium frequencies?
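For reference, here is a small Python sketch of what estimating frequencies from the extant witnesses could look like. The function name empirical_frequencies and the sample readings are hypothetical. The pseudocount parameter is purely an assumption added for illustration (one possible way to keep unattested readings above 0); it is not something proposed in this thread.

```python
from collections import Counter


def empirical_frequencies(states, observed, pseudocount=0.0):
    """Estimate state frequencies at one variation unit from observed readings.

    `states` lists every reading ID at the unit, including unattested ones;
    `observed` holds one reading ID per extant witness. Readings outside
    `states` are ignored.
    """
    counts = Counter(reading for reading in observed if reading in states)
    total = sum(counts.values()) + pseudocount * len(states)
    if total == 0:
        raise ValueError("no observed readings and no pseudocount")
    return {state: (counts[state] + pseudocount) / total for state in states}


# Reading "3" is a conjecture with no support among the extant witnesses.
states = ["1", "2", "3"]
observed = ["1", "1", "2", "1"]

raw = empirical_frequencies(states, observed)
smoothed = empirical_frequencies(states, observed, pseudocount=1.0)
```

With no pseudocount, the conjectured reading gets a frequency of exactly 0, which is the case flagged above; with any positive pseudocount, every state keeps a nonzero frequency.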
Hi @jjmccollum - good questions. The equilibrium frequency for textual states is a challenge. Let's talk more about it when we meet.
@rbturnbull Sounds good! On a related note, we may want to discuss whether the ComplexSubstitutionModel class (which is not constrained to be reversible) is a more suitable choice for our purposes. This class stores the equilibrium frequencies as a member, but it does not appear to use them for its transition probability calculations. In general, we do not need to assume reversibility, as we do not consider the root of the tree arbitrary (i.e., our root frequencies tend to be asymmetrical) in the first place.
Hi @jjmccollum - I'm just familiarising myself with the code to generate the beast output. I like the way of using templates. Have you considered using Jinja2 for handling this kind of templating? It is designed for this kind of thing.
@rbturnbull I used it for a similar application a while ago, but I'd forgotten about it! It would make serializing lists of elements much more straightforward.
A nice byproduct is that the templates could be stored as regular XML files so syntax highlighting would work.
That is a nice benefit. I don't think it would take too much effort for me to make that change on the beast-xml branch. If you think the tip date sampling won't take too long to implement, I might just wait until we've merged in that code before I proceed, unless you'd like to have the Jinja2 templates in place before you implement that feature.
OK. I might give it a go in the branch I've got now for it and then we can convert to jinja.
@jjmccollum - are you happy for me to convert some of the string format statements to f-strings? (https://miguendes.me/73-examples-to-help-you-master-pythons-f-strings)
@rbturnbull That's fine! I need to get accustomed to using them anyway.
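For anyone following along, this is roughly what that conversion looks like on one line of the template. The variable nsymbols and its value are placeholders for illustration.

```python
nsymbols = 4  # placeholder value for illustration

# str.format: the template is defined once and filled in later.
template = (
    '<userDataType id="StandardData.0" '
    'spec="beast.evolution.datatype.StandardData" nrOfStates="{nsymbols}">'
)
formatted = template.format(nsymbols=nsymbols)

# f-string: the value is interpolated at the point where the string is written.
interpolated = (
    '<userDataType id="StandardData.0" '
    f'spec="beast.evolution.datatype.StandardData" nrOfStates="{nsymbols}">'
)

assert formatted == interpolated
```

One caveat: an f-string is evaluated where it is defined, so a module-level template that is filled in later still needs str.format (or a templating library); f-strings fit best where the values are already in scope.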
Hi @jjmccollum - it's good that teiphy produces output compatible with beast 2.7.x. But if we add some extra items in the namespace at the top then we can get the code to run on beast 2.6.x as well. I'm not sure if that's worth it though. We can just say that it only supports 2.7. What do you think?
Hi @jjmccollum - forget my last message. I think it'd be too hard to support both beast 2.6 and 2.7. Let's just support 2.7.
@rbturnbull Apologies for my delayed response! I agree that it should be easier just to support BEAST 2.7 moving forward.
Am I right that it won't run without the latest version of Beast 2 on Github? I'm having trouble compiling that from source. Hopefully I'll work it out and I can run the code properly.
That's correct; I had to patch an issue in the initialization method for the Sequence class. That change is now in the latest source code, but there hasn't been a release incorporating it yet.
I've been able to work out the issue with installing the code from source but it is still failing with an error reading the Sequence.
What does the error look like?
Error 110 parsing the xml input file
validate and intialize error: Index 2 out of bounds for length 2
Error detected about here:
<beast>
<data id='alignment' spec='Alignment'>
<sequence spec='Sequence'>
For use cases with BEAST (2) as the target phylogenetic software, conversion to NEXUS followed by a second conversion through BEAUti is presently supported, but direct conversion to a BEAST XML input file would allow for the mapping of additional features, the most notable of these being variation unit-specific substitution models and additional parameters to be incorporated into these models.
Because of the extensive nature of BEAST XML files, the conversion process will involve starting with a template file and adding new elements for witnesses, including fields for their sequences and date calibrations, and root frequencies and substitution models for each variation unit.
This feature will probably take extra effort to implement, so this effort should be undertaken on a dedicated branch.
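The template-then-append approach described above might be sketched as follows. This uses the standard library's xml.etree.ElementTree rather than lxml, a heavily trimmed stand-in for the real template, and a hypothetical function name build_alignment; the sequence attributes shown are illustrative and would need to match whatever the final template expects.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the real template; only the alignment block is kept.
TEMPLATE = """<beast>
  <data id="{id}" spec="Alignment" dataType="standard">
  </data>
</beast>"""


def build_alignment(alignment_id, sequences):
    """Fill in the template, then append one sequence element per witness.

    `sequences` maps a witness ID to its sequence string.
    """
    root = ET.fromstring(TEMPLATE.format(id=alignment_id))
    data = root.find("data")
    for taxon, value in sequences.items():
        ET.SubElement(
            data,
            "sequence",
            {"id": f"seq_{taxon}", "spec": "Sequence", "taxon": taxon, "value": value},
        )
    return root


# Hypothetical witnesses and sequences, for illustration only.
root = build_alignment("1Corinthians", {"A": "01", "B": "10"})
xml_text = ET.tostring(root, encoding="unicode")
```

Date calibrations, root frequencies, and per-unit substitution models would be appended to their own parent elements in the same way.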