openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
699 stars 36 forks source link

[REVIEW]: teiphy: A Python Package for Converting TEI XML Collations to NEXUS and Other Formats #4879

Closed editorialbot closed 1 year ago

editorialbot commented 1 year ago

Submitting author: !--author-handle-->@jjmccollum<!--end-author-handle-- (Joey McCollum) Repository: https://github.com/jjmccollum/teiphy Branch with paper.md (empty if default branch): Version: v0.1.3 Editor: !--editor-->@majensen<!--end-editor-- Reviewers: @tresoldi, @SimonGreenhill Archive: 10.5281/zenodo.7455638

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/e0a813f4cdf56e9f6ae5d555ce6ed93b"><img src="https://joss.theoj.org/papers/e0a813f4cdf56e9f6ae5d555ce6ed93b/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/e0a813f4cdf56e9f6ae5d555ce6ed93b/status.svg)](https://joss.theoj.org/papers/e0a813f4cdf56e9f6ae5d555ce6ed93b)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@tresoldi & @SimonGreenhill, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @majensen know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Checklists

📝 Checklist for @SimonGreenhill

📝 Checklist for @tresoldi

editorialbot commented 1 year ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.05 s (918.9 files/s, 121045.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          15            269            476           1559
XML                              4              0             30           1373
SVG                              4              0              0            985
reStructuredText                 9            220            189            438
TeX                              1             41              0            226
Markdown                         4             74              0            223
YAML                             7              7             16            213
TOML                             1              7              0             45
DOS Batch                        1              8              1             26
make                             1              4              7              9
Bourne Shell                     2              1              1              7
-------------------------------------------------------------------------------
SUM:                            49            631            720           5104
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 1969

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.16995/dscn.291 is OK
- 10.1038/s41586-020-2649-2 is OK

MISSING DOIs

- None

INVALID DOIs

- None
SimonGreenhill commented 1 year ago

Review checklist for @SimonGreenhill

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

SimonGreenhill commented 1 year ago

Evaluation

This looks like a nicely written tool that will be very useful. The paper explains the use-case nicely, and the code is clear, very well-documented (both in the docs and in the code), and follows modern python best-practices.

The main issue I see is that the example in the paper is dealt with only cursorily rather than completely. This has the side effect of hiding the power of this program. And, while reading the paper, I got interested in the UBS dataset and wanted to know what the analysis told me, so some brief explanation of what the phylogeny shows would round the paper out nicely.

Other than that, I only have very minor recommendations which I enumerate below.

Nice work!

Comments on the Paper

Comments on the code

    "Programming Language :: Python :: 3",
    "Operating System :: OS Independent",
    "Programming Language :: Python :: 3.5",  # etc -- which versions do you support?
jjmccollum commented 1 year ago

@SimonGreenhill Thanks for your thorough and insightful comments! I'll open a branch or two for revisions to the paper and additions to the code. I don't think it should be too hard to add support for phy and fasta, and it would definitely be within the scope of the project to do so.

rbturnbull commented 1 year ago

Thanks for all your feedback @SimonGreenhill - that's all really helpful

jjmccollum commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK

MISSING DOIs

- None

INVALID DOIs

- None
jjmccollum commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

jjmccollum commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK

MISSING DOIs

- None

INVALID DOIs

- None
jjmccollum commented 1 year ago

@SimonGreenhill We've made your suggested revisions to the paper and the code, and both should now be accessible in the GitHub repo. I've also had editorialbot regenerate the paper PDF here for easy access. If you have any further comments, they are appreciated!

SimonGreenhill commented 1 year ago

Wonderful, looks all good to me, @jjmccollum. Nice work!

danielskatz commented 1 year ago

@majensen - what's the status of this submission?

majensen commented 1 year ago

Apologies @jjmccollum @danielskatz have been AWOL. @tresoldi can you update us on your review please?

tresoldi commented 1 year ago

Hello, so sorry for the delay. I will finish by tomorrow noon (31/11, European time).

tresoldi commented 1 year ago

Review checklist for @tresoldi

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

tresoldi commented 1 year ago

Evaluation

The paper presents a high quality software that, as described by the authors, fills a very important void in phylogenetic studies for textual traditions. I can see their work becoming part of a "best practice pipeline" for such analyses, and I will surely start to recommend it.

Not addressing the topics already pointed by @SimonGreenhill :

Comments on the software

Comments on the paper

And, once more, great work @jjmccollum !

jjmccollum commented 1 year ago

@tresoldi Thank you for these comments! These changes seem straightforward and shouldn't take too long to implement (though I might need @rbturnbull's help on the --ascertainment option for NEXUS outputs). I'll open a branch or two for revisions to the paper and additions to the code.

Regarding the citation of PAUP*, I wasn't able to find a paper describing it, but it looks like the website just suggests citing the software as follows:

Swofford, D. L. 2003. PAUP. Phylogenetic Analysis Using Parsimony (and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.

If you had a paper in mind that would be more appropriate to cite, just let us know; otherwise, I will cite the software above.

jjmccollum commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

jjmccollum commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK

MISSING DOIs

- None

INVALID DOIs

- None
tresoldi commented 1 year ago

@jjmccollum for ascertainment, only adding a uniform character with zeros should be enough. I have seen a small group of people working in computational linguistics experimenting with having a missing value (?) in the ascertainment character when the value is missing (so that the ascertainment character value has both zeros and questions marks), but it is not the common practice and would over-complicate things. Maybe for a future version, if somebody asks for it.

As for PAUP* that is the reference I see from time to time, that is perfect.

jjmccollum commented 1 year ago

@tresoldi Oh, that makes things much easier! Thanks for clarifying that. I should be able to add that feature for most or all output formats (although I don't know if every phylogenetic program we test in our workflows will use the ascertainment character).

rbturnbull commented 1 year ago

Hi @jjmccollum - there are helpful explanations about correcting for ascertainment bias with a dummy character in the Lewis 2001 paper and in this paper: Felsenstein, J. (1992), PHYLOGENIES FROM RESTRICTION SITES: A MAXIMUM-LIKELIHOOD APPROACH. Evolution, 46: 159-173. https://doi.org/10.1111/j.1558-5646.1992.tb01991.x

I'm happy to work on this.

rbturnbull commented 1 year ago

Hi @tresoldi - I've been reading papers on correcting for ascertainment bias and looking through implementations in phylogenetic software programs. I now think that it is actually out of scope for the teiphy tool because there are different ways of doing it that are bound up with the phylogenetic method that you are using. I think it's best for teiphy to just convert the TEI XML data and represent it in nexus, CSV etc and to leave it up to the user to correct for ascertainment bias in the phylogenetic software package that they use. Are you happy with that?

tresoldi commented 1 year ago

Hi @rbturnbull . The short answer is that yes, I am happy with it.

Longer answer. I was considering the issue this afternoon, and you are right that different models and tools (besides research decisions) would require different approaches for ascertainment bias (even more considering how correction has been used mostly for better dating in biology/linguistics, which works differently for textual traditions).

Having an option for the most common correction (a uniform character with absence) still sounds like a good addition, but not one upon which the paper acceptance would depend. I would be happier with a comment (in the documentation) explicitly saying that teiphy makes no such correction which left to the user/software, but this is not "mandatory".

As you mention, users can easily perform it from other formats, especially tabular ones. The output for such task could, however, be improved: it might be my fault, but teiphy seems to only emit tabular files with probabilities (e.g., 0.333 for the three values of B10K1V1U24-26 for manuscript 01), given that the --states-present option only applies to nexus output.

While parsing the latter is easy, I would like to ask you to investigate if, for users' convenience, it is possible to have a tabular output similar to the matrix of the nexus one when --states-present is set. It would be even better if you could provide the results in long table format, something like (optionally using the nexus character value indexes, along with question marks, instead of the actual reading):

Taxon,Character,Value
UBS,B10K1V1U24-26,εν εφεσω
P46,B10K1V1U24-26,om.
01,B10K1V1U24-26,
02,B10K1V1U24-26,εν εφεσω
03,B10K1V1U24-26,
04,B10K1V1U24-26,om.
UBS,B10K1V6U22-26,εν τω ηγαπημενω
P46,B10K1V6U22-26,εν τω ηγαπημενω
01,B10K1V6U22-26,εν τω ηγαπημενω
02,B10K1V6U22-26,εν τω ηγαπημενω
03,B10K1V6U22-26,εν τω ηγαπημενω
04,B10K1V6U22-26,εν τω ηγαπημενω υιω αυτου

Once more, this is not a request but a suggestion, as it goes beyond your initial scope and does not diminish the validity and importance of your submission. I would be happy to open it as an issue in the teiphy repository once the paper is published, where we could discuss in more detail.

jjmccollum commented 1 year ago

@tresoldi I think I could implement a --long-table option for tabular outputs (NumPy, Pandas DataFrame, CSV, TSV) like the one detailed above. To make things easier, I might just include both the reading symbol and the long form of the reading in consecutive columns:

Taxon,Character,State,Value
UBS,B10K1V1U24-26,0,εν εφεσω
P46,B10K1V1U24-26,1,om.
01,B10K1V1U24-26,1,om.
02,B10K1V1U24-26,0,εν εφεσω
03,B10K1V1U24-26,1,om.
04,B10K1V1U24-26,?,?
UBS,B10K1V6U22-26,0,εν τω ηγαπημενω
P46,B10K1V6U22-26,0,εν τω ηγαπημενω
01,B10K1V6U22-26,0,εν τω ηγαπημενω
02,B10K1V6U22-26,0,εν τω ηγαπημενω
03,B10K1V6U22-26,0,εν τω ηγαπημενω
04,B10K1V6U22-26,1,εν τω ηγαπημενω υιω αυτου

For this, I would probably have to coerce all ambiguous readings to look like missing data (i.e., ?), because duplicating rows for the same taxon and character seems more potentially problematic. How does this sound to you?

rbturnbull commented 1 year ago

Hi @tresoldi - thanks for your comment. That all sounds great. I've added a section in the docs about ascertainment bias here: https://jjmccollum.github.io/teiphy/advanced.html#ascertainment-bias

jjmccollum commented 1 year ago

@tresoldi Okay, I just finished implementing the --long-table option as requested! I'm using four columns instead of three, as I described in my last comment, and because of how I loop through the entries in the collation, the rows are ordered by taxon first and character second, but both of these features can be easily adjusted in an editor like Excel.

If all of this looks satisfactory to you, then I'm happy with passing this back up to @majensen! If the paper proceeds to publication, then I will plan to bump the repository (and PyPI entry) to a new release with these features at the same time.

tresoldi commented 1 year ago

@jjmccollum @rbturnbull wonderful, thank you! No further issues from my side. :)

tresoldi commented 1 year ago

@majensen I recommend the paper for publication, it is a great addition to the TEI ecosystem and to digital stemmatology in general.

majensen commented 1 year ago

@tresoldi @SimonGreenhill thanks so much for your thoughtful and detailed reviews. @jjmccollum I will run through the paper with my editorial hat on, and will correct, rebuke, and encourage. After this, I will provide some instructions about creating an open archive prior to publication.

jjmccollum commented 1 year ago

@majensen Since we're getting close to the holiday break, I just wanted to let you know in advance that I will be on vacation (and without access to a computer) from December 21 to January 3. So if you send me your feedback during this time and notice that I am taking a while to respond, that is the reason. Thanks!

majensen commented 1 year ago

@jjmccollum thanks for the heads up. I've been delinquent - I will get this done this week and hope we can wrap up well before you take off.

majensen commented 1 year ago

@editorialbot generate pdf

majensen commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK
- 10.1080/106351501753462876 is OK

MISSING DOIs

- None

INVALID DOIs

- None
majensen commented 1 year ago

@jjmccollum the paper reads very well and I enjoyed it. It occurred to me that STEMMA's ability to detect contamination is exactly analogous to recombination. There is a large literature on "reticulate evolution", which is just accounting for the fact that recombination pulls in different histories for parts of the same gene. I suppose that using multiple sources to create a given manuscript is essentially this problem. The application I see in this work would be to check out algorithms that can detect and display these mixtures and see if TEIphy's encoding could generate a reticulation in the history of the three contaminated witnesses. I could go on because I find this pretty fascinating, but I won't. Check this ref I found https://www.sciencedirect.com/science/article/pii/S2666166722000715 which is methods-oriented and also contains a number of interesting references.

majensen commented 1 year ago

HOWEVER, don't do that for this paper. I think we are ready to go. Can you make an open archive of the codebase using Zenodo, FigShare or similar (see this tutorial https://github.com/openjournals/joss-reviews/issues/1839#issuecomment-560048729). If you would report the DOI of the resulting archive here that would be great.

rbturnbull commented 1 year ago

Thanks @majensen - this is a big area of research for us. We need to keep exploring ways of modelling contamination/reticulation. Thanks for the link for the article!

rbturnbull commented 1 year ago

@jjmccollum do you want me to create an archive of the project and put it on the University of Melbourne's FigShare platform? Or you can do it with Zenodo or FigShare yourself.

jjmccollum commented 1 year ago

Thanks so much, @majensen! @rbturnbull, I'll create a new release of the repo to connect to Zenodo now; this release should also include the changes we made in response to the other reviewers' comments.

rbturnbull commented 1 year ago

Terrific. It'd be good to release to PyPI as well.