Closed editorialbot closed 1 year ago
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Software report:
github.com/AlDanial/cloc v 1.88 T=0.05 s (918.9 files/s, 121045.4 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 15 269 476 1559
XML 4 0 30 1373
SVG 4 0 0 985
reStructuredText 9 220 189 438
TeX 1 41 0 226
Markdown 4 74 0 223
YAML 7 7 16 213
TOML 1 7 0 45
DOS Batch 1 8 1 26
make 1 4 7 9
Bourne Shell 2 1 1 7
-------------------------------------------------------------------------------
SUM: 49 631 720 5104
-------------------------------------------------------------------------------
gitinspector failed to run statistical information for the repository
Wordcount for paper.md is 1969
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.16995/dscn.291 is OK
- 10.1038/s41586-020-2649-2 is OK
MISSING DOIs
- None
INVALID DOIs
- None
This looks like a nicely written tool that will be very useful. The paper explains the use case nicely, and the code is clear, very well documented (both in the docs and in the code), and follows modern Python best practices.
The main issue I see is that the example in the paper is dealt with only cursorily rather than completely. This has the side effect of hiding the power of this program. And, while reading the paper, I got interested in the UBS dataset and wanted to know what the analysis told me, so some brief explanation of what the phylogeny shows would round the paper out nicely.
Other than that, I only have very minor recommendations which I enumerate below.
Nice work!
[ ] I would like more detail on the example, preferably with worked-through command-line steps. This is already in the example section of the docs, so please add it to the paper too (especially the step of generating a NEXUS file, showing and explaining the command-line arguments).
[ ] Figure 1, and discussion: can you please describe what this tree shows us (e.g., do these groupings make sense, and do they tell us something interesting about this document)?
[ ] L16: please define what the acronym "TEI XML" means.
[ ] L37 "since then, phylogenetic methods have quickly evolved (Felsenstein, 2004),". Please reword -- this reads as if phylogenetic methods were driven by textual analyses, whereas these methods have been in continuous development since the 70s and 80s, and not for textual analyses.
[ ] L52 "great chasm .. fixed". Please edit - it's a bit over the top, and/or the verb "fixed" is awkward. Perhaps "exists" or "hinders the use of TEI XML in phylogenetic contexts".
[ ] L60: TNT's claims of 'remarkable performance' may have been true in 2008, but now there are plenty of genomic-scale methods out there that can outcompete TNT (something like RAxML-NG probably blows it out of the water, for example). A better argument for TNT is that it has a comprehensive set of morphological state models that may be appropriate for digital humanities data. You later go on to use tools like IQ-TREE and MrBayes, so it'd be worth broadening this section to mention these tools up front too. Don't let people think your program is innately tied to "cladistic" tools and approaches (and I mean "cladistic" here in the pejorative sense).
[ ] L72: Might be worth noting that this provides an immediate pathway into all the machine learning tools available in Python (e.g., scikit-learn, TensorFlow, etc.).
[ ] L75-76: Can you briefly explain what 'lacunae' and 'retroversions' are? (This also strengthens the case for your program.) Another term, 'corrector', is used later in the manuscript without explanation as well.
[ ] L79: what do you mean by 'soft decision'?
[ ] It would be useful to have a version number, and expose this via --version
or similar.
[ ] pyproject.toml could include some more useful classifiers to help people find the package on PyPI, e.g.:
"Programming Language :: Python :: 3",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3.5", # etc -- which versions do you support?
[ ] have you thought about adding support for PHYLIP (phy) or FASTA formats? These are pretty common in bioinformatics these days and are the native formats of tools like RAxML. This may be out of the scope of your program, so feel free to disagree, but the formats themselves are pretty trivial to write (e.g., FASTA is just ">taxon\nsequence\n").
[ ] please add documentation for how to run tests into readme.rst
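Since the checklist notes that FASTA is trivial to write, here is a minimal sketch of such a writer. This is a hypothetical helper for illustration only, not part of teiphy; the function name and interface are made up:

```python
def write_fasta(sequences, path):
    """Write a mapping of taxon names to sequences as a minimal FASTA file.

    Each record is a '>' header line bearing the taxon name, followed by
    the sequence itself on the next line.
    """
    with open(path, "w", encoding="utf-8") as f:
        for taxon, seq in sequences.items():
            f.write(f">{taxon}\n{seq}\n")

# Example: one state symbol per variation unit, one record per witness
write_fasta({"UBS": "00", "P46": "10", "01": "10"}, "collation.fasta")
```

A PHYLIP writer would be similar, with a header line giving the taxon and character counts followed by one padded name-plus-sequence row per taxon.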
@SimonGreenhill Thanks for your thorough and insightful comments! I'll open a branch or two for revisions to the paper and additions to the code. I don't think it should be too hard to add support for phy and fasta, and it would definitely be within the scope of the project to do so.
Thanks for all your feedback @SimonGreenhill - that's all really helpful
@editorialbot check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK
MISSING DOIs
- None
INVALID DOIs
- None
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK
MISSING DOIs
- None
INVALID DOIs
- None
@SimonGreenhill We've made your suggested revisions to the paper and the code, and both should now be accessible in the GitHub repo. I've also had editorialbot regenerate the paper PDF here for easy access. If you have any further comments, they are appreciated!
Wonderful, looks all good to me, @jjmccollum. Nice work!
@majensen - what's the status of this submission?
Apologies @jjmccollum @danielskatz have been AWOL. @tresoldi can you update us on your review please?
Hello, so sorry for the delay. I will finish by tomorrow noon (31/11, European time).
The paper presents high-quality software that, as described by the authors, fills a very important void in phylogenetic studies of textual traditions. I can see their work becoming part of a "best practice pipeline" for such analyses, and I will surely start to recommend it.
Not addressing the topics already raised by @SimonGreenhill: an --ascertainment flag would be very welcome for the NEXUS output format in general (or even for all formats). Not only does some software expect the user to perform the correction themselves by adding the states to the alignment, but in some cases we might prefer to have it in the model even when the software offers it as an option. And, once more, great work @jjmccollum!
@tresoldi Thank you for these comments! These changes seem straightforward and shouldn't take too long to implement (though I might need @rbturnbull's help on the --ascertainment option for NEXUS outputs). I'll open a branch or two for revisions to the paper and additions to the code.
Regarding the citation of PAUP*, I wasn't able to find a paper describing it, but it looks like the website just suggests citing the software as follows:
Swofford, D. L. 2003. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
If you had a paper in mind that would be more appropriate to cite, just let us know; otherwise, I will cite the software above.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@editorialbot check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK
MISSING DOIs
- None
INVALID DOIs
- None
@jjmccollum for ascertainment, only adding a uniform character with zeros should be enough. I have seen a small group of people working in computational linguistics experimenting with having a missing value (?) in the ascertainment character when the value is missing (so that the ascertainment character has both zeros and question marks), but it is not the common practice and would overcomplicate things. Maybe for a future version, if somebody asks for it.
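To make the suggested correction concrete: the dummy character described here is just a uniform column of zeros prepended to every taxon's row. A hypothetical sketch, assuming a simple taxon-to-states mapping (this is not teiphy code):

```python
def add_ascertainment_character(matrix):
    """Prepend a uniform all-zeros dummy character to each taxon's states.

    `matrix` maps taxon names to lists of state symbols; the returned copy
    carries '0' as the first (ascertainment) character for every taxon.
    """
    return {taxon: ["0"] + list(states) for taxon, states in matrix.items()}

# Every witness now shares the same state for the dummy character
corrected = add_ascertainment_character({"UBS": ["0", "0"], "P46": ["1", "0"]})
```

Because the dummy character is invariant, it carries no phylogenetic signal itself; it only lets the model account for the fact that constant characters were never sampled.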
As for PAUP*, that is the reference I see from time to time; that is perfect.
@tresoldi Oh, that makes things much easier! Thanks for clarifying that. I should be able to add that feature for most or all output formats (although I don't know if every phylogenetic program we test in our workflows will use the ascertainment character).
Hi @jjmccollum - there are helpful explanations about correcting for ascertainment bias with a dummy character in the Lewis 2001 paper and in this paper: Felsenstein, J. (1992), Phylogenies from restriction sites: A maximum-likelihood approach. Evolution, 46: 159-173. https://doi.org/10.1111/j.1558-5646.1992.tb01991.x
I'm happy to work on this.
Hi @tresoldi - I've been reading papers on correcting for ascertainment bias and looking through implementations in phylogenetic software programs. I now think that it is actually out of scope for the teiphy tool, because there are different ways of doing it that are bound up with the phylogenetic method you are using. I think it's best for teiphy to just convert the TEI XML data and represent it in NEXUS, CSV, etc., and to leave it up to the user to correct for ascertainment bias in the phylogenetic software package they use. Are you happy with that?
Hi @rbturnbull . The short answer is that yes, I am happy with it.
Longer answer. I was considering the issue this afternoon, and you are right that different models and tools (besides research decisions) would require different approaches for ascertainment bias (even more so considering how correction has been used mostly for better dating in biology/linguistics, which works differently for textual traditions).
Having an option for the most common correction (a uniform character with absence) still sounds like a good addition, but not one upon which the paper's acceptance would depend. I would be happier with a comment (in the documentation) explicitly saying that teiphy makes no such correction, which is left to the user/software, but this is not "mandatory".
As you mention, users can easily perform it from other formats, especially tabular ones. The output for such a task could, however, be improved: it might be my fault, but teiphy seems to only emit tabular files with probabilities (e.g., 0.333 for the three values of B10K1V1U24-26 for manuscript 01), given that the --states-present option only applies to NEXUS output.
While parsing the latter is easy, I would like to ask you to investigate whether, for users' convenience, it is possible to have a tabular output similar to the matrix of the NEXUS one when --states-present is set. It would be even better if you could provide the results in long table format, something like (optionally using the NEXUS character value indexes, along with question marks, instead of the actual reading):
Taxon,Character,Value
UBS,B10K1V1U24-26,εν εφεσω
P46,B10K1V1U24-26,om.
01,B10K1V1U24-26,
02,B10K1V1U24-26,εν εφεσω
03,B10K1V1U24-26,
04,B10K1V1U24-26,om.
UBS,B10K1V6U22-26,εν τω ηγαπημενω
P46,B10K1V6U22-26,εν τω ηγαπημενω
01,B10K1V6U22-26,εν τω ηγαπημενω
02,B10K1V6U22-26,εν τω ηγαπημενω
03,B10K1V6U22-26,εν τω ηγαπημενω
04,B10K1V6U22-26,εν τω ηγαπημενω υιω αυτου
Once more, this is not a request but a suggestion, as it goes beyond your initial scope and does not diminish the validity and importance of your submission. I would be happy to open it as an issue in the teiphy repository once the paper is published, where we could discuss in more detail.
@tresoldi I think I could implement a --long-table option for tabular outputs (NumPy, Pandas DataFrame, CSV, TSV) like the one detailed above. To make things easier, I might just include both the reading symbol and the long form of the reading in consecutive columns:
Taxon,Character,State,Value
UBS,B10K1V1U24-26,0,εν εφεσω
P46,B10K1V1U24-26,1,om.
01,B10K1V1U24-26,1,om.
02,B10K1V1U24-26,0,εν εφεσω
03,B10K1V1U24-26,1,om.
04,B10K1V1U24-26,?,?
UBS,B10K1V6U22-26,0,εν τω ηγαπημενω
P46,B10K1V6U22-26,0,εν τω ηγαπημενω
01,B10K1V6U22-26,0,εν τω ηγαπημενω
02,B10K1V6U22-26,0,εν τω ηγαπημενω
03,B10K1V6U22-26,0,εν τω ηγαπημενω
04,B10K1V6U22-26,1,εν τω ηγαπημενω υιω αυτου
For this, I would probably have to coerce all ambiguous readings to look like missing data (i.e., ?), because duplicating rows for the same taxon and character seems potentially more problematic. How does this sound to you?
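In outline, the wide-to-long reshaping being discussed can be done with pandas. This is a hypothetical sketch with made-up sample data and column names, not teiphy's actual implementation:

```python
import pandas as pd

# Wide matrix: rows are variation units (characters), columns are witnesses (taxa)
wide = pd.DataFrame(
    {"UBS": ["0", "0"], "P46": ["1", "0"]},
    index=["B10K1V1U24-26", "B10K1V6U22-26"],
)

# Melt into long format: one (Taxon, Character, State) row per cell,
# ordered by taxon first and character second
long_table = (
    wide.rename_axis("Character")
    .reset_index()
    .melt(id_vars="Character", var_name="Taxon", value_name="State")
    [["Taxon", "Character", "State"]]
)
```

Sorting by character first instead is a one-line `sort_values` away, which is presumably why the row ordering is easy to adjust downstream.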
Hi @tresoldi - thanks for your comment. That all sounds great. I've added a section in the docs about ascertainment bias here: https://jjmccollum.github.io/teiphy/advanced.html#ascertainment-bias
@tresoldi Okay, I just finished implementing the --long-table option as requested! I'm using four columns instead of three, as I described in my last comment, and because of how I loop through the entries in the collation, the rows are ordered by taxon first and character second, but both of these features can be easily adjusted in an editor like Excel.
If all of this looks satisfactory to you, then I'm happy with passing this back up to @majensen! If the paper proceeds to publication, then I will plan to bump the repository (and PyPI entry) to a new release with these features at the same time.
@jjmccollum @rbturnbull wonderful, thank you! No further issues from my side. :)
@majensen I recommend the paper for publication, it is a great addition to the TEI ecosystem and to digital stemmatology in general.
@tresoldi @SimonGreenhill thanks so much for your thoughtful and detailed reviews. @jjmccollum I will run through the paper with my editorial hat on, and will correct, rebuke, and encourage. After this, I will provide some instructions about creating an open archive prior to publication.
@majensen Since we're getting close to the holiday break, I just wanted to let you know in advance that I will be on vacation (and without access to a computer) from December 21 to January 3. So if you send me your feedback during this time and notice that I am taking a while to respond, that is the reason. Thanks!
@jjmccollum thanks for the heads up. I've been delinquent - I will get this done this week and hope we can wrap up well before you take off.
@editorialbot generate pdf
@editorialbot check references
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/sysbio/46.4.590 is OK
- 10.1111/cla.12160 is OK
- 10.1093/molbev/msaa015 is OK
- 10.1093/bioinformatics/btu033 is OK
- 10.1093/sysbio/sys029 is OK
- 10.1371/journal.pcbi.1006650 is OK
- 10.16995/dscn.291 is OK
- 10.1007/978-94-011-0325-1_2 is OK
- 10.1038/29667 is OK
- 10.1023/B:CHUM.0000009290.14571.59 is OK
- 10.1628/978-3-16-153324-2 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.5281/zenodo.3509134 is OK
- 10.25080/Majora-92bf1922-00a is OK
- 10.1080/106351501753462876 is OK
MISSING DOIs
- None
INVALID DOIs
- None
@jjmccollum the paper reads very well and I enjoyed it. It occurred to me that STEMMA's ability to detect contamination is exactly analogous to recombination. There is a large literature on "reticulate evolution", which is just accounting for the fact that recombination pulls in different histories for parts of the same gene. I suppose that using multiple sources to create a given manuscript is essentially this problem. The application I see in this work would be to check out algorithms that can detect and display these mixtures and see if teiphy's encoding could generate a reticulation in the history of the three contaminated witnesses. I could go on because I find this pretty fascinating, but I won't. Check this ref I found https://www.sciencedirect.com/science/article/pii/S2666166722000715 which is methods-oriented and also contains a number of interesting references.
HOWEVER, don't do that for this paper. I think we are ready to go. Can you make an open archive of the codebase using Zenodo, FigShare or similar (see this tutorial https://github.com/openjournals/joss-reviews/issues/1839#issuecomment-560048729). If you would report the DOI of the resulting archive here that would be great.
Thanks @majensen - this is a big area of research for us. We need to keep exploring ways of modelling contamination/reticulation. Thanks for the link to the article!
@jjmccollum do you want me to create an archive of the project and put it on the University of Melbourne's FigShare platform? Or you can do it with Zenodo or FigShare yourself.
Thanks so much, @majensen! @rbturnbull, I'll create a new release of the repo to connect to Zenodo now; this release should also include the changes we made in response to the other reviewers' comments.
Terrific. It'd be good to release to PyPI as well.
Submitting author: @jjmccollum (Joey McCollum) Repository: https://github.com/jjmccollum/teiphy Branch with paper.md (empty if default branch): Version: v0.1.3 Editor: @majensen Reviewers: @tresoldi, @SimonGreenhill Archive: 10.5281/zenodo.7455638
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@tresoldi & @SimonGreenhill, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @majensen know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @SimonGreenhill
📝 Checklist for @tresoldi