Closed jjmccollum closed 2 years ago
I've got some code for generating files for STEMMA that might help
Joey, I've added you to the dcodex_carlson repo, where I have some code to generate files in Steven's STEMMA format. The code is probably very messy; it was written just to get something working once, and it is certainly not ready for release. Parts of it might help with this project, but I'm not sure.
Thanks, Rob! I'll focus on incorporating a conversion method for this format next.
All right, I have a rudimentary version of the `to_stemma` method up and running. Before we close this issue, it may be good for us (and @stemmatic) to look over and try to use the output so we can be sure that it is formatted as it should be. To replicate my output, the command I used was:
poetry run teiphy --format stemma -t"reconstructed" -t"defective" -t"orthographic" -t"subreading" -m"lac" -m"overlap" -s"*" -s"T" -s"/1" -s"/2" -s"/3" example\ubs_ephesians.xml ubs_ephesians
We'll also have to check if PAUP can actually run the output of the `to_nexus` method. If the `StatesFormat=Frequency` setting with vectors of state frequencies is not readable, then I'll need to add an optional argument to the method that will add new symbols for ambiguous states to the file's `Equate` block. (We should still allow the method to use `StatesFormat=Frequency` by default, as having this level of precision about different states is preferred, and even if PAUP does not support this format, other software might.)
Okay, the `StatesFormat=StatesPresent` NEXUS setting is now supported with the `states_present` optional input for `to_nexus` (with a corresponding `--states-present` option in `main.py`). PAUP* should work with outputs generated with this setting (although we'll still want to run a test, of course).
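To make the difference between the two settings concrete, here is a minimal sketch of how a single ambiguous cell might be written under each one. It is not the code in `to_nexus`; the function name `encode_cell` and its input (a mapping from single-character state symbols to frequencies) are assumptions made for illustration. The sketch follows the general NEXUS conventions of a parenthesized `symbol:frequency` vector for `StatesFormat=Frequency` and a braced set of symbols for an uncertain state under `StatesFormat=StatesPresent`.

```python
def encode_cell(freqs, states_present=False):
    """Encode one matrix cell from a {state_symbol: frequency} mapping.

    Hypothetical helper: assumes single-character state symbols.
    """
    # Keep only the states that have some support, in a stable order.
    present = sorted(s for s, f in freqs.items() if f > 0)
    if not present:
        return "?"  # nothing attested: treat the cell as missing data
    if states_present:
        if len(present) == 1:
            return present[0]  # unambiguous: a bare symbol
        # Ambiguous: list the possible states, dropping the frequencies.
        return "{" + "".join(present) + "}"
    # Frequency format: keep the full vector of state frequencies.
    return "(" + " ".join(f"{s}:{freqs[s]:g}" for s in present) + ")"


cell = {"0": 0.25, "1": 0.75}
print(encode_cell(cell))                       # frequency vector
print(encode_cell(cell, states_present=True))  # set of possible states
```

The second form loses the relative weights of the states, which is why the frequency vector is worth keeping as the default where the target program can read it.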
We should test at least:
We should also hunt for other examples of a critical apparatus in TEI online and use those. I'm sure there will be edge cases which cause trouble.
@rbturnbull Good news: I just tested the output of `to_nexus` with PAUP 4.0, and after tweaking the code of `to_nexus` a bit to reformat some labels, the output NEXUS file works with PAUP! I'll try with one of the other programs next.
Okay, I've tested the NEXUS output with all four programs suggested above, and I can get some version of output to work with each of them. (I have to remove fragmentary witnesses for the collation to pass validation in IQ-TREE, but this isn't a formatting problem; I can get the collation to work with BEAST2's BEAUti import if I change alphabetical state symbols to numeric ones, but this is a more serious problem on BEAST2's end.) I thought it might be worthwhile to note here which NEXUS blocks throw errors in which programs because the programs do not recognize them:
- `StatesFormat`
- `StatesFormat`, `CharLabels`, `Equate`
- `StatesFormat` (`iqtree.exe` prints an error message saying, "Sorry, only STATESFORMAT=STATESPRESENT supported at this time"); the `Equate` block is allowed, but all equate symbols for ambiguous states are just treated as missing
- `StatesFormat` (BEAUti seems to misinterpret state frequency vectors as taxon labels, so it can be assumed to require `StatesFormat=StatesPresent`); in general, BEAUti seems to have a problem with non-numeric symbols, as it throws a `java.lang.NumberFormatException` when it encounters a symbol like `a` or `b`.
Most of these formats can be accommodated (although, again, the state symbol limitation with BEAST2 is a bit more concerning), but would it be best to add other parameters to `to_nexus` to produce more program-specific NEXUS outputs?
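One way to picture program-specific output is a small table of per-program settings plus a symbol remapping step for BEAUti. The sketch below is only an illustration of that idea: the names `PROFILES`, `options_for`, and `to_numeric` are hypothetical, and the two profiles encode nothing beyond the IQ-TREE and BEAUti behavior reported in this thread.

```python
# Hypothetical per-program settings, based only on the test results above.
PROFILES = {
    "iqtree": {"states_present": True, "numeric_symbols": False},
    "beast2": {"states_present": True, "numeric_symbols": True},
}
# Default: keep frequency vectors and the original state symbols.
DEFAULT = {"states_present": False, "numeric_symbols": False}


def options_for(program=None):
    """Return a copy of the settings for a program, or the defaults."""
    return dict(PROFILES.get(program, DEFAULT))


def to_numeric(symbols):
    """Map state symbols to digits in order of first appearance."""
    mapping = {}
    for symbol in symbols:
        if symbol not in mapping:
            if len(mapping) >= 10:
                # Single digits run out after ten distinct states.
                raise ValueError("more than 10 states; cannot use single digits")
            mapping[symbol] = str(len(mapping))
    return mapping


print(options_for("beast2"))
print(to_numeric(["a", "b", "a", "c"]))
```

The `ValueError` marks the real limitation: a variation unit with more than ten readings cannot be expressed with single numeric symbols at all.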
I think it might be good to talk about some of these issues over Zoom. Do you have time this week?
Other than Monday morning and Friday, I should be available anytime this week!
Also, I just added code and unit tests for conversion to Hennig86 format, which is used for TNT.
All right, I've added some extra code to read date ranges from witnesses in the TEI XML collation (if they are specified) and generate a chron file detailing these for STEMMA output. I've also added a `to_distance_matrix` method (and accompanying unit tests) for generating a NumPy array representing a distance matrix between collated witnesses. I think that this was everything that I had in mind, so I will finally close this issue.
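For readers who want a feel for what a distance matrix between witnesses contains, here is a minimal sketch. It is not the `to_distance_matrix` implementation: it assumes that distance means the number of variation units where two witnesses are both extant and disagree, it uses plain nested lists rather than a NumPy array to stay dependency-free, and the function name `distance_matrix` and the input layout are made up for illustration.

```python
def distance_matrix(readings):
    """Count disagreements between every pair of witnesses.

    readings: {witness_id: [state symbol or None for each variation unit]},
    where None marks a unit at which the witness is lacunose.
    """
    witnesses = list(readings)
    n = len(witnesses)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pairs = zip(readings[witnesses[i]], readings[witnesses[j]])
            # Only compare units where both witnesses have a reading.
            d = sum(1 for a, b in pairs
                    if a is not None and b is not None and a != b)
            matrix[i][j] = matrix[j][i] = d  # the matrix is symmetric
    return witnesses, matrix


example = {
    "A": ["0", "1", None],
    "B": ["0", "0", "1"],
    "C": ["1", "0", "1"],
}
labels, dist = distance_matrix(example)
for label, row in zip(labels, dist):
    print(label, row)
```

The nested list can be handed to `numpy.array` directly if an array is needed.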
For use within Python, a `to_numpy` export method would be useful; this would allow us to convert a TEI XML collation to a ready-made collation matrix input to machine learning packages (e.g., `nimfa` for non-negative matrix factorization). It could also serve as a stepping stone for `to_csv` and `to_xlsx` methods via `pandas`.
As for STEMMA, I'll have to look more closely at the structure of the files in https://github.com/stemmatic/mss to see if TEI XML is rich enough to make the conversion straightforward; it may be more of a challenge.
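The collation matrix export proposed above could look roughly like the following sketch. It assumes one row per reading and one column per witness, with a 1 where the witness attests the reading, which gives the kind of non-negative matrix that factorization packages expect. The function names, the row labels, and the input layout are all hypothetical, and the CSV step uses the standard library `csv` module instead of `pandas` only to keep the example dependency-free.

```python
import csv
import io


def to_rows(readings):
    """Build a readings-by-witnesses 0/1 matrix.

    readings: {witness_id: [state symbol or None for each variation unit]}.
    """
    witnesses = list(readings)
    n_units = len(next(iter(readings.values())))
    labels, rows = [], []
    for u in range(n_units):
        # Distinct readings attested at this unit (lacunae are skipped).
        states = sorted({readings[w][u] for w in witnesses
                         if readings[w][u] is not None})
        for s in states:
            labels.append(f"U{u + 1}R{s}")  # hypothetical row label
            rows.append([1 if readings[w][u] == s else 0 for w in witnesses])
    return witnesses, labels, rows


def to_csv_text(readings):
    """Serialize the matrix as CSV with witness IDs as the header row."""
    witnesses, labels, rows = to_rows(readings)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + witnesses)
    for label, row in zip(labels, rows):
        writer.writerow([label] + row)
    return buffer.getvalue()


print(to_csv_text({"A": ["0", None], "B": ["1", "0"]}))
```

A lacunose witness simply gets 0 in every row of the affected unit, so fragmentary witnesses stay in the matrix without needing a separate missing-data symbol.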