Add export methods for other output formats

jjmccollum commented 2 years ago

For use within Python, a to_numpy export method would be useful; this would allow us to convert a TEI XML collation to a ready-made collation matrix input to machine learning packages (e.g., nimfa for non-negative matrix factorization). It could also serve as a stepping stone for to_csv and to_xlsx methods via pandas.
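As a rough illustration of the idea (not teiphy's actual API), the sketch below turns a toy collation into a non-negative matrix with one row per reading and one column per witness, then passes it through pandas to get CSV text. The `collation` dictionary, the witness sigla, and the `collation_to_matrix` helper are all hypothetical names invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy collation: for each variation unit, the reading each witness
# attests (None marks a lacuna). None of these names come from teiphy.
witnesses = ["A", "B", "C"]
collation = {
    "unit1": {"A": "1", "B": "2", "C": "1"},
    "unit2": {"A": "1", "B": "1", "C": None},
}


def collation_to_matrix(collation, witnesses):
    """Return (row_labels, matrix) with one row per reading and one column per witness."""
    row_labels = []
    rows = []
    for unit_id, attestations in collation.items():
        # Distinct readings attested in this unit, ignoring lacunae.
        readings = sorted({r for r in attestations.values() if r is not None})
        for reading in readings:
            row_labels.append(f"{unit_id}_{reading}")
            rows.append([1.0 if attestations.get(w) == reading else 0.0 for w in witnesses])
    return row_labels, np.array(rows)


row_labels, matrix = collation_to_matrix(collation, witnesses)

# pandas supplies the CSV step; DataFrame.to_excel would cover the XLSX case.
df = pd.DataFrame(matrix, index=row_labels, columns=witnesses)
csv_text = df.to_csv()
```

Because every entry is 0 or 1, a matrix like this could be handed directly to a non-negative matrix factorization routine.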

As for STEMMA, I'll have to look more closely at the structure of the files in https://github.com/stemmatic/mss to see if TEI XML is rich enough to make the conversion straightforward; it may be more of a challenge.

rbturnbull commented 2 years ago

I've got some code for generating files for STEMMA that might help.

rbturnbull commented 2 years ago

Joey, I've added you to the dcodex_carlson repo, where I have some code to generate files in Steven's STEMMA format. The code is probably very messy: it was written just to get something working once, and it is certainly not ready for release. Parts of it might help with this project, but I'm not sure.

jjmccollum commented 2 years ago

Thanks, Rob! I'll focus on incorporating a conversion method for this format next.

jjmccollum commented 2 years ago

All right, I have a rudimentary version of the to_stemma method up and running. Before we close this issue, it may be good for us (and @stemmatic) to look over/attempt to use the output so we can be sure that it is formatted as it should be. For the purpose of replicating my output, the command I used was

poetry run teiphy --format stemma -t"reconstructed" -t"defective" -t"orthographic" -t"subreading" -m"lac" -m"overlap" -s"*" -s"T" -s"/1" -s"/2" -s"/3" example\ubs_ephesians.xml ubs_ephesians

jjmccollum commented 2 years ago

We'll also have to check if PAUP can actually run the output of the to_nexus method. If the StatesFormat=Frequency setting with vectors of state frequencies is not readable, then I'll need to add an optional argument to the method that will add new symbols for ambiguous states to the file's Equate block. (We should still allow the method to use StatesFormat=Frequency by default, as having this level of precision about different states is preferred, and even if PAUP does not support this format, other software might.)

jjmccollum commented 2 years ago

Okay, the StatesFormat=StatesPresent NEXUS setting is now supported with the states_present optional input for to_nexus (with a corresponding --states-present option in main.py). PAUP* should work with outputs generated with this setting (although we'll still want to run a test, of course).
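To make the difference between the two settings concrete, here is a minimal sketch of how a single matrix cell might be written under each one. It is not the code in to_nexus: the helper names, the choice of spare symbols, and the exact cell and Equate notation are assumptions based on a general reading of the NEXUS format and should be checked against the specification.

```python
def frequency_cell(freqs):
    """Write one cell as a state-frequency vector, e.g. (0:0.5 1:0.5)."""
    return "(" + " ".join(f"{s}:{f:g}" for s, f in sorted(freqs.items())) + ")"


def states_present_cell(freqs, equates, spare_symbols):
    """Write one cell as a single symbol, allocating an Equate symbol when ambiguous."""
    present = tuple(sorted(s for s, f in freqs.items() if f > 0))
    if not present:
        return "?"  # no state attested: treat as missing data
    if len(present) == 1:
        return present[0]
    if present not in equates:
        # First time this set of states is seen: reserve a new symbol for it.
        equates[present] = next(spare_symbols)
    return equates[present]


def equate_setting(equates):
    """Render the collected ambiguity symbols as an Equate setting (assumed notation)."""
    pairs = " ".join(f"{sym}={{{''.join(states)}}}" for states, sym in equates.items())
    return f'Equate="{pairs}"'


# Hypothetical cells for one witness: each maps state symbols to frequencies.
cells = [{"0": 1.0}, {"0": 0.5, "1": 0.5}, {"0": 0.5, "1": 0.5}, {"1": 0.5, "2": 0.5}]

equates = {}
spare_symbols = iter("ABCDEFGH")  # assumed not to clash with the state symbols
frequency_row = [frequency_cell(c) for c in cells]
present_row = [states_present_cell(c, equates, spare_symbols) for c in cells]
```

The trade-off is the one described above: the frequency form keeps the weight of each possible state, while the StatesPresent form keeps only which states are possible.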

rbturnbull commented 2 years ago

We should test at least:

PAUP*
BEAST2
MrBayes
IQ-TREE

rbturnbull commented 2 years ago

We should also hunt for other examples of a critical apparatus in TEI online and use those. I'm sure there will be edge cases which cause trouble.

jjmccollum commented 2 years ago

@rbturnbull Good news: I just tested the output of to_nexus with PAUP* 4.0, and after tweaking the code of to_nexus a bit to reformat some labels, the output NEXUS file works with PAUP*! I'll try with one of the other programs next.

jjmccollum commented 2 years ago

Okay, I've tested the NEXUS output with all four programs suggested above, and I can get some version of the output to work with each of them. (I have to remove fragmentary witnesses for the collation to pass validation in IQ-TREE, but this isn't a formatting problem; I can get the collation to work with BEAST2's BEAUti import if I change alphabetical state symbols to numeric ones, but this is a more serious problem on BEAST2's end.) I thought it might be worthwhile to note here which NEXUS blocks and settings throw errors in which programs because the programs do not recognize them:

PAUP*: StatesFormat
MrBayes: StatesFormat, CharLabels, Equate
IQ-TREE: StatesFormat (iqtree.exe prints the error message "Sorry, only STATESFORMAT=STATESPRESENT supported at this time"); the Equate block is allowed, but all equate symbols for ambiguous states are just treated as missing
BEAST2: StatesFormat (BEAUti seems to misinterpret state frequency vectors as taxon labels, so it can be assumed to require StatesFormat=StatesPresent); in general, BEAUti seems to have a problem with non-numeric symbols, as it throws a java.lang.NumberFormatException when it encounters a symbol like a or b.

Most of these formats can be accommodated (although, again, the state symbol limitation with BEAST2 is a bit more concerning), but would it be best to add other parameters to to_nexus to produce more program-specific NEXUS outputs?
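One way to picture program-specific output is a small table of per-program profiles built from the compatibility notes above. The sketch below is hypothetical throughout: `NexusProfile`, `PROFILES`, and `to_numeric_symbols` are invented names, not parameters of to_nexus, and the profile contents simply restate the errors reported in this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NexusProfile:
    """Hypothetical per-program settings; not part of teiphy."""
    unrecognized: frozenset  # NEXUS elements reported above as triggering errors
    numeric_symbols_only: bool = False


PROFILES = {
    "paup": NexusProfile(frozenset({"StatesFormat"})),
    "mrbayes": NexusProfile(frozenset({"StatesFormat", "CharLabels", "Equate"})),
    # IQ-TREE accepts Equate but treats the equated symbols as missing data.
    "iqtree": NexusProfile(frozenset({"StatesFormat"})),
    # BEAUti raised java.lang.NumberFormatException on symbols like a or b.
    "beast2": NexusProfile(frozenset({"StatesFormat"}), numeric_symbols_only=True),
}


def to_numeric_symbols(symbols):
    """Map each distinct state symbol to a digit, in order of first appearance."""
    mapping = {}
    for symbol in symbols:
        if symbol not in mapping:
            if len(mapping) >= 10:
                raise ValueError("more than ten states cannot be written as single digits")
            mapping[symbol] = str(len(mapping))
    return mapping


beast_mapping = to_numeric_symbols("abca")
```

The ten-state ceiling in `to_numeric_symbols` shows why the BEAST2 restriction is the more worrying one: a variation unit with more than ten readings has no single-digit encoding.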

rbturnbull commented 2 years ago

I think it might be good to talk about some of these issues over Zoom. Do you have time this week?

jjmccollum commented 2 years ago

Other than Monday morning and Friday, I should be available anytime this week!

jjmccollum commented 2 years ago

Also, I just added code and unit tests for conversion to Hennig86 format, which is used for TNT.

jjmccollum commented 2 years ago

All right, I've added some extra code to read date ranges from witnesses in the TEI XML collation (if they are specified) and generate a chron file detailing these for STEMMA output. I've also added a to_distance_matrix method (and accompanying unit tests) for generating a NumPy array representing a distance matrix between collated witnesses. I think that this was everything that I had in mind, so I will finally close this issue.
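For readers unfamiliar with the idea, a distance matrix of this kind can be sketched as follows. This is not the to_distance_matrix implementation: it assumes that the distance between two witnesses is the number of variation units where both are extant and disagree, and it uses a hypothetical integer encoding in which -1 marks missing data.

```python
import numpy as np


def distance_matrix(readings):
    """Count disagreements between each pair of witnesses.

    readings: 2D integer array, witnesses x variation units; -1 marks missing data.
    """
    readings = np.asarray(readings)
    n = readings.shape[0]
    dist = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Only compare units where both witnesses have a reading.
            both_extant = (readings[i] >= 0) & (readings[j] >= 0)
            disagreements = int(np.sum(both_extant & (readings[i] != readings[j])))
            dist[i, j] = dist[j, i] = disagreements
    return dist


# Hypothetical readings for three witnesses over three variation units.
example = [
    [0, 1, 0],
    [0, 0, -1],
    [1, 1, 0],
]
dist = distance_matrix(example)
```

The result is symmetric with a zero diagonal, which is the shape most clustering and tree-building tools expect.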

jjmccollum / teiphy

Add export methods for other output formats #4