whedon commented 4 years ago

Submitting author: @oschwengers (Oliver Schwengers) Repository: https://github.com/oschwengers/referenceseeker Version: v1.6 Editor: @will-rowe Reviewer: @standage, @luizirber Archive: 10.5281/zenodo.3635630

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1"><img src="https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1/status.svg)](https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@standage & @luizirber, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @will-rowe know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @standage

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@oschwengers) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @luizirber

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@oschwengers) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[ ] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @standage, @luizirber it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

Attempting to check references...

whedon commented 4 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 4 years ago


OK DOIs

- 10.1186/1471-2105-14-60 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pcbi.1005944 is OK
- 10.1007/s10482-017-0844-4 is OK
- 10.1099/ijs.0.64483-0 is OK
- 10.1093/nar/gkx1068 is OK
- 10.1186/s13059-016-0997-x is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

oschwengers commented 4 years ago

@will-rowe Thanks a lot for accepting the ReferenceSeeker editor role.

@luizirber @standage Thank you very much for agreeing to review my work and for the first checkmarks.

If there's anything I can do that may further help you in the reviewing process, please let me know.

oschwengers commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

oschwengers commented 4 years ago

I briefly mentioned the new support for bidirectional computations of ANI and conserved DNA values in the manuscript.

standage commented 4 years ago

Hi @oschwengers, thanks for the update. As of late last week I am almost done with my review. I hope to post it within the next day or two.

standage commented 4 years ago

The authors describe ReferenceSeeker, a Python package designed to select the most appropriate reference genome given a draft genome assembly. ReferenceSeeker implements a two-stage selection procedure: in the first, candidate reference genomes (CRGs) are identified using rapid k-mer based operations; in the second, it performs more computationally intensive alignment-based operations on a small number of CRGs to identify the reference genome (RG) most closely related to the input. The harmonic mean of average nucleotide identity (ANI) and conserved DNA (conDNA) statistics is the final selection and ranking criterion. These two statistics strike a balance between sequence similarity and reciprocal coverage as a measure of distance.

ReferenceSeeker addresses a fundamental preliminary step common to many comparative genomics studies, the results of which have a substantial impact on interpretation of downstream results. The tool furnished by the authors is well motivated, well described, and well implemented.

The software package is straightforward to install. I was able to install it successfully on an airgapped HPC system without using root privileges, with a reasonable amount of effort. The typical user will have an even easier time installing ReferenceSeeker with the suggested conda install method.
ReferenceSeeker requires pre-computed databases. The authors provide hyperlinks to stable archived snapshots of these required databases, but (critically) also provide the scripts necessary to build those scripts from scratch in the future.
The software package has an easy-to-use and self-documenting interface. Invoking referenceseeker --help lists the expected inputs and how to adjust similarity thresholds and configure other aspects of the tool's behavior.
In my hands, the tool performs as advertised. I randomly selected five private samples (microbial pathogens) for testing.
- In three cases, the top ReferenceSeeker result identified the correct species and strain reference.
- In one case, the top ReferenceSeeker results identified the correct species reference. The actual strain was unknown, but the ReferenceSeeker results appear quite reasonable compared to alternative approaches we've previously used for classification.
- In one case, the top ReferenceSeeker result identified the correct genus but the species call was incorrect. To be fair, this was a Shigella genome, which we have had a notoriously difficult time classifying correctly. The correct strain was within the top 10 results, which included various Shigella and E. coli strains. This is comparable to the performance of many leading tools on this particularly problematic sample.

This tool is a valuable contribution to the field, and I am enthusiastic to continue using it myself. However, I have two concerns that should be addressed before I can endorse ReferenceSeeker for publication at JOSS.

Major concern: precise definitions of statistics used

The terms ANI and "conserved DNA" are used throughout the paper and technical documentation, but no precise definitions are provided. The paper containing definitions for these statistics is cited throughout the manuscript, but never as a reference for defining conserved DNA.

The first reference appears in the Summary paragraph, but refers only to ANI.
The second reference appears in the Determination of RG section, but refers to species boundaries, and is written as if "conDNA" has already been defined.

Many readers will be familiar with ANI and conserved DNA, but many will not, and multiple formulations of these statistics exist. The authors need to make it unambiguously clear how these statistics are computed by ReferenceSeeker. Please ensure that the connection between conDNA and (Goris et al., 2007) is clear. I would even go so far as to suggest that explicit descriptions of how the statistics are calculated should be included, presumably in the Determination of RG section. Something like the following might work.

ReferenceSeeker uses both ANI and percentage of conserved DNA (conDNA) to derive a highly specific measure of microbial genome relationships (Goris et al., 2007). ANI is calculated as the mean identity of all nucmer matches that <...fill in remaining details...>. The conDNA statistic is calculated as the sum of <...fill in remaining details...>. The Mash distances used for preliminary selection of CRGs correlate well with ANI—these statistics capture nucleotide-level sequence similarity. However, Mash distances do not correlate well with the conDNA statistic, which captures reciprocal coverage of the query and reference sequences. The final RG selection criterion incorporates both ANI and conDNA to ensure that both sequence similarity and reciprocal coverage are considered when ranking RGs.

This may require editing/revising of the entire section.

Major concern: no automated test suite

Unless I missed something, there doesn't appear to be any automated test suite, or even a description of manual steps for testing the software. Tests serve at least two important purposes.

They allow users to independently verify that the software has been installed correctly and is running as expected on their system.
They allow developers to evaluate whether any proposed changes to the code (bug fixes, new features, refactoring, removing vestigial code, etc.) will have any unintended consequences.

There can be no doubt that the authors have been responsible in testing ReferenceSeeker: it would not function effectively if they had not. But it seems appropriate for publication in this venue, and as common practice, to encode some of these tests in an automated test suite. At the very minimum, users should have instructions they can manually run to test the software, but an automated test suite is strongly encouraged. (The pytest framework is the de facto standard for testing in the Python community.)

With an automated test suite, it would also be possible to go a step further and set up a continuous integration environment using a service like GitHub Actions, Travis CI, Drone.io, or CircleCI. CI services can be configured to run a test suite on a schedule (such as once per week or once per month) and/or when any updates are made to the code (such as from a push to master or a pull request from the community). This would alert the authors any time, for example, an update to a package dependency breaks ReferenceSeeker's behavior, helping to ensure ReferenceSeeker remains usable for more than just a short time following publication. Note: this should not be taken as a strict requirement for acceptance, but a stretch goal for increasing ReferenceSeeker's value over time.

Related threads:

Suggestion: ability for users to include private data

In addition to these major concerns, I also have a minor concern regarding the ability to include private data sets. Imagine for example a food-borne illness lab with numerous well-annotated Salmonella genomes beyond what's available in RefSeq, or an infectious disease lab with numerous well-annotated Yersinia pestis strains not publicly available. It would be valuable to scientists working in such labs to be able to include their private data along with the RefSeq genomes during the database build process, or to add them later. Note: this should also not be taken as a requirement for acceptance, but a suggestion for added value.

Related threads:

https://github.com/oschwengers/referenceseeker/issues/6

Remaining checklist items

[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided? (to be checked off once ANI and covDNA are fully described)

oschwengers commented 4 years ago

@standage Thank you so much for this excitingly comprehensive & constructive feedback! I'm happy that our tool showed reasonable and expected results in your benchmark. Of course, E. coli / Shigella is always a tough one. Not blaming someone else but there might just be still too many miss-classified genomes in the databases. Support for the incorporation of private / better-annotated data is definitely a good idea to tackle this.

ANI / conDNA: I adopted the ANI & conDNA calculations from Goris et al., 2007. But I totally agree that an exact explanation or formula is more than appropriate, especially since the formula is somehow "hidden" in prosa in the original article. I'll add this to the manuscript.

Automated tests / CI: I agree that at least a tiny test for the users or even an automated one (with CI) would be a true benefit. I'll create a little mock db as suggested and use this in a user test / CI.

Just recently, I discussed the support for private data with @nick60 but postponed it in first place. I guess, I'll just bring it forward and see what I can come up with. This would indeed be a very useful feature!

I'm more than happy to implement all these fruitful comments and suggestions and again, thank you very much!

standage commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

standage commented 4 years ago

Hi @will-rowe, my main concerns have been addressed, and I endorse this paper for publication at JOSS! This concludes my review.

Hi @oschwengers, thanks for sharing your work. Feel free to reach out in the future if needed.

oschwengers commented 4 years ago

Hi @standage , once again thank you very much for your very constructive and helpful review. I adopted the manuscript as requested and implemented all other suggestions:

a tiny test case to check local software installations: oschwengers/referenceseeker/issues/4
added unit tests and setup CI via GitHub actions: oschwengers/referenceseeker/issues/5
custom database creations: oschwengers/referenceseeker/issues/6

A last check mark on software paper - summary is left over - is there still anything missing you would like to see revised or added?

standage commented 4 years ago

is there still anything missing you would like to see revised or added?

Nope, I just missed that box. It's checked now. 👍

oschwengers commented 4 years ago

@will-rowe I just released v1.5 containing all recent changes & updates related to @standage 's fruitful comments. v1.5 is already deployed to PyPI and BioConda. Is it possible to point this submission/review to this new release instead of v1.4?

oschwengers commented 4 years ago

Just fixed 2 typos and added a short note on the new database init/import features.

oschwengers commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

oschwengers commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

will-rowe commented 4 years ago

Hi @oschwengers - thanks for addressing the reviewer comments so quickly.

I've just checked the docs and can't find any way to change the version during review. Also, @luizirber may have already done his review for the previous version? Once we have @luizirber's review and his comments are addressed, we will then ask you to tag a new release. I think the best we can do for now is to make Luiz aware that you have already addressed Daniel's comments.

luizirber commented 4 years ago

This looks great, and @standage already did a stellar job in his review, so just some things I missed:

Functionality

Installation: Test section has a typo in commands (test instead of tests), this works:
- referenceseeker referenceseeker/tests/db referenceseeker/tests/Salmonella_enterica_CFSAN000189.fasta
- ./referenceseeker/bin/referenceseeker referenceseeker/tests/db referenceseeker/tests/Salmonella_enterica_CFSAN000189.fasta

Documentation

Community guidelines: I didn't find instructions for contributing. Maybe add a CONTRIBUTING file, and instructions to run tests (I dug into the GitHub Actions to be sure it was pytest that is running them). A stretch goal would be to add issue and PR templates, requesting some info that is useful for you to reproduce problems (for issues) and a checklist for new contributions (in PRs).

oschwengers commented 4 years ago

Hi @luizirber , thanks a lot for your review and comments.

I renamed the test directory to tests as a I thought this might better fit some conventions. I must have missed those paths. Thanks for pointing out - they're fixed now: https://github.com/oschwengers/referenceseeker/commit/96d282e9a4155a82fdabbd4ee58432506ebf04a7

I also added a CONTRIBUTING.md file (https://github.com/oschwengers/referenceseeker/commit/5c3949ed533b4dead927faf3a91d22f78b8d1204) which contains the requested information on how to execute the PyTest tests as well as two initial issue templates for bug reports (https://github.com/oschwengers/referenceseeker/commit/18163c8d580871b200f39c98172fbd6335d3e64c) and feature requests (https://github.com/oschwengers/referenceseeker/commit/9303bc86bb7fb201c9c9d7bb4ecb81c18700fee0).

I hope, everything is addressed properly. Otherwise, please let me know.

@standage , @luizirber Again, thank you very much for your very committed reviews and extreamly constructive comments and ideas!

@will-rowe , please let me know what the next steps in the process are and what I can do.

will-rowe commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

will-rowe commented 4 years ago

Thanks @luizirber! I've just re-read the paper and it looks good. I think this is a great contribution.

Next steps are for you to tag a release @oschwengers, archive it and then paste the version number and archive DOI into this thread.

will-rowe commented 4 years ago

@whedon check references

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.1186/1471-2105-14-60 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pcbi.1005944 is OK
- 10.1007/s10482-017-0844-4 is OK
- 10.1099/ijs.0.64483-0 is OK
- 10.1093/nar/gkx1068 is OK
- 10.1186/s13059-016-0997-x is OK

MISSING DOIs

- None

INVALID DOIs

- None

oschwengers commented 4 years ago

Hi @will-rowe, thank you for the quick reply.

I just tagged a new release v1.6 and archived it at Zenodo with the following DOI: 10.5281/zenodo.3635630

will-rowe commented 4 years ago

@whedon set 10.5281/zenodo.3635630 archive

whedon commented 4 years ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@whedon commands

will-rowe commented 4 years ago

@whedon set 10.5281/zenodo.3635630 as archive

whedon commented 4 years ago

OK. 10.5281/zenodo.3635630 is the archive.

will-rowe commented 4 years ago

@whedon set v1.6 as version

whedon commented 4 years ago

OK. v1.6 is the version.

will-rowe commented 4 years ago

Hey @openjournals/joss-eics - we're good to go with this one.

Thank you @standage and @luizirber for your excellent reviews, and thank you @oschwengers for this great submission.

kthyng commented 4 years ago

Great! I'll take over from here.

kthyng commented 4 years ago

@whedon accept

whedon commented 4 years ago

Attempting dry run of processing paper acceptance...

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.1186/1471-2105-14-60 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pcbi.1005944 is OK
- 10.1007/s10482-017-0844-4 is OK
- 10.1099/ijs.0.64483-0 is OK
- 10.1093/nar/gkx1068 is OK
- 10.1186/s13059-016-0997-x is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 4 years ago

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/1265

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/1265, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true

kthyng commented 4 years ago

Hi @oschwengers! Things are looking good, except can you please update the metadata in your Zenodo archive so that the title and authors exactly match your paper?

oschwengers commented 4 years ago

Hi @kthyng ! Title, author and affiliations are updated now. Unfortunately, Zenodo's affiliation form is not the easiest to fill out as there is only one short text box. Please, let me know if you need something changed.

kthyng commented 4 years ago

@oschwengers Great!

kthyng commented 4 years ago

@whedon accept deposit=true

whedon commented 4 years ago

Doing it live! Attempting automated processing of paper acceptance...

whedon commented 4 years ago

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

openjournals / joss-reviews

[REVIEW]: ReferenceSeeker: rapid determination of appropriate reference genomes #1994

Status

Reviewer instructions & questions

Review checklist for @standage

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @luizirber

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Major concern: precise definitions of statistics used

Major concern: no automated test suite

Suggestion: ability for users to include private data

Remaining checklist items

Functionality

Documentation