jjmccollum / teiphy

A Python package for converting TEI XML collations to NEXUS, BEAST 2.7 XML, and other formats
MIT License

:abc: a few edits to the paper #34

Closed: rbturnbull closed this issue 2 years ago

rbturnbull commented 2 years ago

Hi @jjmccollum - I made a few edits to the paper. Happy to discuss them if you like. I'm thinking it would be good to show the result of a phylogenetic analysis of the Ephesians example to demonstrate that the code works. At the moment, we are just describing the TEI XML input, but there's not much to show of teiphy's output. What do you think about showing a tree from PAUP using maximum parsimony?
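For readers of the thread who haven't seen the input format: the kind of TEI XML collation being discussed encodes each variation unit as an `<app>` element whose `<rdg>` children record which witnesses attest each reading. A minimal sketch (not teiphy's actual implementation; the readings and sigla below are illustrative placeholders) that tallies variation units and witnesses from such a collation:

```python
# Sketch: count variation units (<app>) and collect witness sigla
# from a tiny inline TEI collation. Standard library only.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

collation = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <app xml:id="B10K1V1U2">
      <rdg n="1" wit="#P46 #01">reading one</rdg>
      <rdg n="2" wit="#03">reading two</rdg>
    </app>
  </body></text>
</TEI>"""

root = ET.fromstring(collation)
apps = root.findall(f".//{{{TEI_NS}}}app")

witnesses = set()
for app in apps:
    for rdg in app.findall(f"{{{TEI_NS}}}rdg"):
        # @wit is a space-separated list of witness pointers
        witnesses.update(rdg.get("wit", "").split())

print(len(apps))          # number of variation units
print(sorted(witnesses))  # distinct witness sigla
```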

jjmccollum commented 2 years ago

@rbturnbull This looks good! I successfully produced a maximum-parsimony stemma with PAUP* back when I was testing whether the outputs worked with it, so I should be able to include one of those easily. There were many trees with the best-found score and several flat portions in the tree I looked at, but these things are probably due to the small number of variation units in the UBS collation. I can try again with the subreadings included in the output to distinguish the witnesses a bit more. I'll merge these changes and fix some errors of my own that I missed the first time.

rbturnbull commented 2 years ago

Maybe we produce a consensus tree and show that. Let's look at it after you run it with the subreadings. Maybe we could just include a subset of witnesses for the demo? I think we should make it clear that this is just a demonstration with a small number of variation units and that you will be producing a more detailed phylogenetic analysis in the future.
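For reference, the search plus consensus-tree workflow we're discussing could be driven by a PAUP* command file along these lines (a sketch assuming PAUP*'s standard command syntax; the file names are placeholders, not files in this repo):

```
#NEXUS
begin paup;
    execute ubs_ephesians.nexus;  [NEXUS collation exported by teiphy]
    set criterion=parsimony;
    hsearch addseq=random nreps=10 swap=tbr;  [heuristic MP search]
    contree all / strict=yes majrule=yes treefile=consensus.tre;
end;
```

With many equally parsimonious trees, the majority-rule consensus is usually more informative than the strict consensus, though as noted below it can still come out largely unresolved.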

jjmccollum commented 2 years ago

@rbturnbull I included subreadings and used only the continuous-text Greek manuscripts and lectionaries (for a total of 167 witnesses). PAUP* appears to be finding over 500,000 distinct topologies with the best-found cost of 377, and the consensus tree of all of them is almost entirely flat. The issue may simply be that the ratio of witnesses to variation units still allows for too many equally parsimonious explanations of the data. I could use an even smaller subset of the witnesses; do you have any intuition of how small a subset I should aim to use?