jjmccollum / teiphy

A Python package for converting TEI XML collations to NEXUS, BEAST 2.7 XML, and other formats
MIT License
11 stars 3 forks source link

Failure with no user-oriented error message #64

Closed cmsmcq closed 1 year ago

cmsmcq commented 1 year ago

Describe the bug When I run teiphy on a sample document specifying the output file temp.nex, I get what appears to be a Python stack trace (highlighting lines main.py:111, collation.py:1113, and collation.py:325), followed by the message "ValueError: max() arg is an empty sequence".

If I specify temp.phy as the output, the stack trace hits main.py:111, collation.py:1126, and collation.py:625.

If I specify temp.fasta as the output, the stack trace hits main.py:111, collation.py:1129, and collation.py:695.

If I specify temp.csv or temp.tsv, I get no stack trace and a three-character output file consisting of two double quotes and a newline charater.

To Reproduce This happens whenever I try to run teiphy on my TEI document.

Expected behavior My initial vague expectation (this is my first attempt to use teiphy) was that I would get output.

Failing that, I was hoping for a message to suggest what has gone wrong. Maybe my input is not as expected, maybe my setup is not as expected.

Error message I'll attach a log file.

Environment

Additional context I am not a regular Python user, so I may easily have failed to do something that any rational Python user would know needs to be done. I have not used teiphy before, so I don't know that my input file satisfies the expectations of the software. I would consult the documentation to find out what those expectations are, but I do not know where to find it. (Aha. Clicking on the link on the repo splash page takes me to an online version of the documentation. I don't see any description there of expectations regarding the TEI input document.)

In other words, the error here is almost certainly mine, or something odd in my environment; I am reporting it as an issue only because I don't know where else to get help. (And my problems may suggest ways in which the software and documentation might be bullet-proofed against cheerful idiot users like me.)

I have placed the input file in input.zip, since Github seems not to like the idea of attaching an XML document to an issue.

input.zip installation-20221222.log run.log

rbturnbull commented 1 year ago

Dear @cmsmcq - thanks for your interest in teiphy!

The problem is that the code assumes that there is a listWit (https://guidelines.tei-c.de/en/html/ref-listWit.html) in the sourceDesc and there wasn't enough error checking to detect the problem. @jjmccollum and I will fix that but in the meantime, here is your TEI file with the listWit added. Hopefully this should work in the current version.

(I've added a .txt suffix because github doesn't allow uploading .xml) quentin.tei.listWit.xml.txt

rbturnbull commented 1 year ago

Hi @jjmccollum - if there is no listWit, should teiphy raise an error or should it look for witnesses from the different readings? Also, I don't think we check if there are different witnesses in the readings than in listWit. Should there be an error if a reading has a witness that hasn't been defined or should it just add it to the list. My feeling is that it might be good to just add witnesses as it goes. That means that you wouldn't need to add a listWit if you didn't need to give any extra info about the witnesses. What do you think?

cmsmcq commented 1 year ago

Thank you, @rbturnbull - I confirm that if I include a listWit element containing all the sigla as a child of sourceDesc, the program runs without error and produces output that looks plausible to someone who like me has no experience with Nexus data.

And for the record: if the listWit includes only a subset of the witnesses, I get output for those witnesses. If the listWit includes a witness not included in any app elements, I get output for the listed witnesses, with question marks for the ghost. I suspect that for some users, raising warnings might be helpful in these cases, but for the experimentation I currently have in view, the current behavior seems fine to me.

For what it's worth, I will note in passing that after filing the bug report and looking more carefully at the example from Ephesians pointed to from the documentation, I did add a listWit element, which had no effect on the behavior, probably because I added it not as a child of sourceDesc but as a child of note within bibl within sourceDesc. I am reluctant to have listWit as a child of sourceDesc in this case because the different witnesses in this case were not the source of the electronic text: all information about the witnesses to this artificial tradition comes from the book described in the sourceDesc/bibl element.

So if it's not too inconvenient, I would urge the developers to find and read the listWit no matter where it occurs in the teiHeader. Or even better, no matter where it occurs in the TEI document, since my first instinct for this case is to put it in the front matter or back matter of the text element.

And if there is none, then you do have one user who would find it intuitive if the default behavior in the absence of a listWit element were to emit output describing every witness found in a wit attribute (in XPath terms, distinct-values($input /descendant::* /@wit /tokenize(., '\s+')[normalize-space()]) -- but my notions of what's intuitive are not always shared by other users, so I offer them only for what they are worth.

rbturnbull commented 1 year ago

Thanks very much @cmsmcq. It's great to have your feedback. That all makes a lot of sense. I'll talk about these issues when @jjmccollum is available after Christmas.

jjmccollum commented 1 year ago

All right, I'm back from vacation! @cmsmcq Thanks for bringing this up, and @rbturnbull, thanks for identifying the issue and offering some suggestions!

Based on the TEI Guidelines for the listWit element, it can indeed be contained by various elements other than sourceDesc, so it may be best to modify the Collation class's parse_list_wit method to use an XPath expression like //tei:listWit to find a witness list, take the first one it finds, and parse its witness subelements.

Regarding the preferred behavior if there is no listWit element, I like @cmsmcq's suggestion of throwing an exception and reporting all distinct sigla that occur in the @wit attributes of the rdg and witDetail elements. It would also be worthwhile to include a validation loop that emits warning messages (without throwing an exception) if any sigla occur that are not found in the listWit (after any siglum suffixes specified to be ignored have been stripped) occur in any @wit strings.

As for @rbturnbull's suggestion of populating the witness list from the @wit strings directly, I've tried this before in the open-cbgm library, and I've found that it runs into problems when sigla with removable suffixes (like "T", "V", "f", and "r") occur in the collation. Either these extended sigla will be erroneously added as distinct witnesses, or witness sigla whose base forms end in those suffixes (e.g., "MT" for Majority Text) will have part of their IDs erroneously removed. I therefore consider it safest to require the presence of a listWit that explicitly identifies the in the XML file.

I can start implementing these changes in a listwit-validation branch if all of this sounds good.

rbturnbull commented 1 year ago

that all sounds great @jjmccollum

jjmccollum commented 1 year ago

@rbturnbull I've just pushed some changes to the listwit-validation branch that should resolve this issue. You're welcome to look over it and make any further changes as you see fit (code coverage is still 100%, but I haven't modified the documentation). If everything looks good to you, then I can merge the branch into main and update the release to v0.1.4.

rbturnbull commented 1 year ago

thanks @jjmccollum - I'll have a look tomorrow

jjmccollum commented 1 year ago

@rbturnbull Just wanted to follow up with you to see if you had any further suggestions for changes on the listwit-validation branch. If not, I'm happy merging it and closing this issue.

rbturnbull commented 1 year ago

Hi @jjmccollum - I'm so sorry for the delay. The changes look great. Terrific work! My only suggestion is to perhaps subclass Exception so that it is more specific than just a generic 'Exception' for when there isn't a listWit element. Or you can find a more informative built-in exception class. I think it's usually a good practice to be specific so that users can catch the exception they want and not just any and all exceptions.