openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
703 stars 36 forks source link

[REVIEW]: ReferenceSeeker: rapid determination of appropriate reference genomes #1994

Closed whedon closed 4 years ago

whedon commented 4 years ago

Submitting author: @oschwengers (Oliver Schwengers) Repository: https://github.com/oschwengers/referenceseeker Version: v1.6 Editor: @will-rowe Reviewer: @standage, @luizirber Archive: 10.5281/zenodo.3635630

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1"><img src="https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1/status.svg)](https://joss.theoj.org/papers/b04f2a93b8e5fb300cfa1c43189293b1)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@standage & @luizirber, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @will-rowe know.

Please try and complete your review in the next two weeks

Review checklist for @standage

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @luizirber

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @standage, @luizirber it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 4 years ago
Attempting to check references...
whedon commented 4 years ago
Attempting PDF compilation. Reticulating splines etc...
whedon commented 4 years ago

OK DOIs

- 10.1186/1471-2105-14-60 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pcbi.1005944 is OK
- 10.1007/s10482-017-0844-4 is OK
- 10.1099/ijs.0.64483-0 is OK
- 10.1093/nar/gkx1068 is OK
- 10.1186/s13059-016-0997-x is OK

MISSING DOIs

- None

INVALID DOIs

- None
whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

oschwengers commented 4 years ago

@will-rowe Thanks a lot for accepting the ReferenceSeeker editor role.

@luizirber @standage Thank you very much for agreeing to review my work and for the first checkmarks.

If there's anything I can do that may further help you in the reviewing process, please let me know.

oschwengers commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

oschwengers commented 4 years ago

I briefly mentioned the new support for bidirectional computations of ANI and conserved DNA values in the manuscript.

standage commented 4 years ago

Hi @oschwengers, thanks for the update. As of late last week I am almost done with my review. I hope to post it within the next day or two.

standage commented 4 years ago

The authors describe ReferenceSeeker, a Python package designed to select the most appropriate reference genome given a draft genome assembly. ReferenceSeeker implements a two-stage selection procedure: in the first, candidate reference genomes (CRGs) are identified using rapid k-mer based operations; in the second, it performs more computationally intensive alignment-based operations on a small number of CRGs to identify the reference genome (RG) most closely related to the input. The harmonic mean of average nucleotide identity (ANI) and conserved DNA (conDNA) statistics is the final selection and ranking criterion. These two statistics strike a balance between sequence similarity and reciprocal coverage as a measure of distance.

ReferenceSeeker addresses a fundamental preliminary step common to many comparative genomics studies, the results of which have a substantial impact on interpretation of downstream results. The tool furnished by the authors is well motivated, well described, and well implemented.

This tool is a valuable contribution to the field, and I am enthusiastic to continue using it myself. However, I have two concerns that should be addressed before I can endorse ReferenceSeeker for publication at JOSS.

Major concern: precise definitions of statistics used

The terms ANI and "conserved DNA" are used throughout the paper and technical documentation, but no precise definitions are provided. The paper containing definitions for these statistics is cited throughout the manuscript, but never as a reference for defining conserved DNA.

Many readers will be familiar with ANI and conserved DNA, but many will not, and multiple formulations of these statistics exist. The authors need to make it unambiguously clear how these statistics are computed by ReferenceSeeker. Please ensure that the connection between conDNA and (Goris et al., 2007) is clear. I would even go so far as to suggest that explicit descriptions of how the statistics are calculated should be included, presumably in the Determination of RG section. Something like the following might work.

ReferenceSeeker uses both ANI and percentage of conserved DNA (conDNA) to derive a highly specific measure of microbial genome relationships (Goris et al., 2007). ANI is calculated as the mean identity of all nucmer matches that <...fill in remaining details...>. The conDNA statistic is calculated as the sum of <...fill in remaining details...>. The Mash distances used for preliminary selection of CRGs correlate well with ANI—these statistics capture nucleotide-level sequence similarity. However, Mash distances do not correlate well with the conDNA statistic, which captures reciprocal coverage of the query and reference sequences. The final RG selection criterion incorporates both ANI and conDNA to ensure that both sequence similarity and reciprocal coverage are considered when ranking RGs.

This may require editing/revising of the entire section.

Major concern: no automated test suite

Unless I missed something, there doesn't appear to be any automated test suite, or even a description of manual steps for testing the software. Tests serve at least two important purposes.

  1. They allow users to independently verify that the software has been installed correctly and is running as expected on their system.
  2. They allow developers to evaluate whether any proposed changes to the code (bug fixes, new features, refactoring, removing vestigial code, etc.) will have any unintended consequences.

There can be no doubt that the authors have been responsible in testing ReferenceSeeker: it would not function effectively if they had not. But it seems appropriate for publication in this venue, and as common practice, to encode some of these tests in an automated test suite. At the very minimum, users should have instructions they can manually run to test the software, but an automated test suite is strongly encouraged. (The pytest framework is the de facto standard for testing in the Python community.)

With an automated test suite, it would also be possible to go a step further and set up a continuous integration environment using a service like GitHub Actions, Travis CI, Drone.io, or CircleCI. CI services can be configured to run a test suite on a schedule (such as once per week or once per month) and/or when any updates are made to the code (such as from a push to master or a pull request from the community). This would alert the authors any time, for example, an update to a package dependency breaks ReferenceSeeker's behavior, helping to ensure ReferenceSeeker remains usable for more than just a short time following publication. Note: this should not be taken as a strict requirement for acceptance, but a stretch goal for increasing ReferenceSeeker's value over time.

Related threads:

Suggestion: ability for users to include private data

In addition to these major concerns, I also have a minor concern regarding the ability to include private data sets. Imagine for example a food-borne illness lab with numerous well-annotated Salmonella genomes beyond what's available in RefSeq, or an infectious disease lab with numerous well-annotated Yersinia pestis strains not publicly available. It would be valuable to scientists working in such labs to be able to include their private data along with the RefSeq genomes during the database build process, or to add them later. Note: this should also not be taken as a requirement for acceptance, but a suggestion for added value.

Related threads:

Remaining checklist items

oschwengers commented 4 years ago

@standage Thank you so much for this excitingly comprehensive & constructive feedback! I'm happy that our tool showed reasonable and expected results in your benchmark. Of course, E. coli / Shigella is always a tough one. Not blaming someone else but there might just be still too many miss-classified genomes in the databases. Support for the incorporation of private / better-annotated data is definitely a good idea to tackle this.

ANI / conDNA: I adopted the ANI & conDNA calculations from Goris et al., 2007. But I totally agree that an exact explanation or formula is more than appropriate, especially since the formula is somehow "hidden" in prosa in the original article. I'll add this to the manuscript.

Automated tests / CI: I agree that at least a tiny test for the users or even an automated one (with CI) would be a true benefit. I'll create a little mock db as suggested and use this in a user test / CI.

Just recently, I discussed the support for private data with @nick60 but postponed it in first place. I guess, I'll just bring it forward and see what I can come up with. This would indeed be a very useful feature!

I'm more than happy to implement all these fruitful comments and suggestions and again, thank you very much!

standage commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

standage commented 4 years ago

Hi @will-rowe, my main concerns have been addressed, and I endorse this paper for publication at JOSS! This concludes my review.

Hi @oschwengers, thanks for sharing your work. Feel free to reach out in the future if needed.

oschwengers commented 4 years ago

Hi @standage , once again thank you very much for your very constructive and helpful review. I adopted the manuscript as requested and implemented all other suggestions:

A last check mark on software paper - summary is left over - is there still anything missing you would like to see revised or added?

standage commented 4 years ago

is there still anything missing you would like to see revised or added?

Nope, I just missed that box. It's checked now. 👍

oschwengers commented 4 years ago

@will-rowe I just released v1.5 containing all recent changes & updates related to @standage 's fruitful comments. v1.5 is already deployed to PyPI and BioConda. Is it possible to point this submission/review to this new release instead of v1.4?

oschwengers commented 4 years ago

Just fixed 2 typos and added a short note on the new database init/import features.

oschwengers commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

oschwengers commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

will-rowe commented 4 years ago

Hi @oschwengers - thanks for addressing the reviewer comments so quickly.

I've just checked the docs and can't find any way to change the version during review. Also, @luizirber may have already done his review for the previous version? Once we have @luizirber's review and his comments are addressed, we will then ask you to tag a new release. I think the best we can do for now is to make Luiz aware that you have already addressed Daniel's comments.

luizirber commented 4 years ago

This looks great, and @standage already did a stellar job in his review, so just some things I missed:

Functionality

Documentation

oschwengers commented 4 years ago

Hi @luizirber , thanks a lot for your review and comments.

I renamed the test directory to tests as a I thought this might better fit some conventions. I must have missed those paths. Thanks for pointing out - they're fixed now: https://github.com/oschwengers/referenceseeker/commit/96d282e9a4155a82fdabbd4ee58432506ebf04a7

I also added a CONTRIBUTING.md file (https://github.com/oschwengers/referenceseeker/commit/5c3949ed533b4dead927faf3a91d22f78b8d1204) which contains the requested information on how to execute the PyTest tests as well as two initial issue templates for bug reports (https://github.com/oschwengers/referenceseeker/commit/18163c8d580871b200f39c98172fbd6335d3e64c) and feature requests (https://github.com/oschwengers/referenceseeker/commit/9303bc86bb7fb201c9c9d7bb4ecb81c18700fee0).

I hope, everything is addressed properly. Otherwise, please let me know.

@standage , @luizirber Again, thank you very much for your very committed reviews and extreamly constructive comments and ideas!

@will-rowe , please let me know what the next steps in the process are and what I can do.

will-rowe commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

will-rowe commented 4 years ago

Thanks @luizirber! I've just re-read the paper and it looks good. I think this is a great contribution.

Next steps are for you to tag a release @oschwengers, archive it and then paste the version number and archive DOI into this thread.

will-rowe commented 4 years ago

@whedon check references

whedon commented 4 years ago
Reference check summary:

OK DOIs

- 10.1186/1471-2105-14-60 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pcbi.1005944 is OK
- 10.1007/s10482-017-0844-4 is OK
- 10.1099/ijs.0.64483-0 is OK
- 10.1093/nar/gkx1068 is OK
- 10.1186/s13059-016-0997-x is OK

MISSING DOIs

- None

INVALID DOIs

- None
oschwengers commented 4 years ago

Hi @will-rowe, thank you for the quick reply.

I just tagged a new release v1.6 and archived it at Zenodo with the following DOI: 10.5281/zenodo.3635630

will-rowe commented 4 years ago

@whedon set 10.5281/zenodo.3635630 archive

whedon commented 4 years ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@whedon commands
will-rowe commented 4 years ago

@whedon set 10.5281/zenodo.3635630 as archive

whedon commented 4 years ago

OK. 10.5281/zenodo.3635630 is the archive.

will-rowe commented 4 years ago

@whedon set v1.6 as version

whedon commented 4 years ago

OK. v1.6 is the version.

will-rowe commented 4 years ago

Hey @openjournals/joss-eics - we're good to go with this one.

Thank you @standage and @luizirber for your excellent reviews, and thank you @oschwengers for this great submission.

kthyng commented 4 years ago

Great! I'll take over from here.

kthyng commented 4 years ago

@whedon accept

whedon commented 4 years ago
Attempting dry run of processing paper acceptance...
whedon commented 4 years ago
Reference check summary:

OK DOIs

- 10.1186/1471-2105-14-60 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pcbi.1005944 is OK
- 10.1007/s10482-017-0844-4 is OK
- 10.1099/ijs.0.64483-0 is OK
- 10.1093/nar/gkx1068 is OK
- 10.1186/s13059-016-0997-x is OK

MISSING DOIs

- None

INVALID DOIs

- None
whedon commented 4 years ago

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/1265

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/1265, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true
kthyng commented 4 years ago

Hi @oschwengers! Things are looking good, except can you please update the metadata in your Zenodo archive so that the title and authors exactly match your paper?

oschwengers commented 4 years ago

Hi @kthyng ! Title, author and affiliations are updated now. Unfortunately, Zenodo's affiliation form is not the easiest to fill out as there is only one short text box. Please, let me know if you need something changed.

kthyng commented 4 years ago

@oschwengers Great!

kthyng commented 4 years ago

@whedon accept deposit=true

whedon commented 4 years ago
Doing it live! Attempting automated processing of paper acceptance...
whedon commented 4 years ago

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦