[REVIEW]: Mashpit: sketching out genomic epidemiology

openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[REVIEW]: Mashpit: sketching out genomic epidemiology #7306

Open editorialbot opened 1 month ago

editorialbot commented 1 month ago

Submitting author: @tongzhouxu
Editor: @csoneson
Reviewers: @hkaspersen, @mberacochea
Archive: Pending

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/760af75d515b1bc3d2fc87085fe79b92"><img src="https://joss.theoj.org/papers/760af75d515b1bc3d2fc87085fe79b92/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/760af75d515b1bc3d2fc87085fe79b92/status.svg)](https://joss.theoj.org/papers/760af75d515b1bc3d2fc87085fe79b92)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@hkaspersen & @mberacochea, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @csoneson know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Checklists

📝 Checklist for @hkaspersen

📝 Checklist for @mberacochea

editorialbot commented 1 month ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf

editorialbot commented 1 month ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

✅ OK DOIs

- 10.1093/bioinformatics/bty407 is OK
- 10.2807/1560-7917.es.2017.22.23.30544 is OK
- 10.1186/s13059-016-0997-x is OK
- 10.21105/joss.00027 is OK
- 10.1186/1471-2105-10-421 is OK
- 10.1128/aem.01746-19 is OK
- 10.1101/gr.251678.119 is OK
- 10.3389/fmicb.2017.00375 is OK

🟡 SKIP DOIs

- None

❌ MISSING DOIs

- None

❌ INVALID DOIs

- None

editorialbot commented 1 month ago

Software report:

github.com/AlDanial/cloc v 1.90  T=0.02 s (1110.0 files/s, 102938.1 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           8            138            158           1050
Markdown                         3             90              0            298
HTML                             4             11              2            155
YAML                             4             11             11             97
TeX                              1              9              0             88
JavaScript                       2              0              7              2
CSS                              1              0              5              1
-------------------------------------------------------------------------------
SUM:                            23            259            183           1691
-------------------------------------------------------------------------------

Commit count by author:

   148  Tongzhou Xu
    14  tongzhouxu
     7  dependabot[bot]
     3  Lee Katz
     1  Lee Katz - Aspen
     1  Lee Katz gzu2

editorialbot commented 1 month ago

Paper file info:

📄 Wordcount for paper.md is 1226

✅ The paper includes a Statement of need section

editorialbot commented 1 month ago

License info:

🟡 License found: GNU General Public License v2.0 (Check here for OSI approval)

csoneson commented 1 month ago

👋🏼 @tongzhouxu, @hkaspersen, @mberacochea - this is the review thread for the submission. All of our communications will happen here from now on.

As a reviewer, the first step is to create a checklist for your review by entering

@editorialbot generate my checklist

as the top of a new comment in this thread. These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues directly in the software repository. If you do so, please mention this thread so that a link is created (and I can keep an eye on what is happening). Please also feel free to comment and ask questions in this thread. It is often easier to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.

Please feel free to ping me (@csoneson) if you have any questions or concerns. Thanks!

editorialbot commented 1 month ago

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

hkaspersen commented 1 month ago

Review checklist for @hkaspersen

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the https://github.com/tongzhouxu/mashpit?
[x] License: Does the repository contain a plain-text LICENSE or COPYING file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@tongzhouxu) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
[x] Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
[x] Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
[x] Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[ ] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
[ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

mberacochea commented 1 month ago

Review checklist for @mberacochea

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the https://github.com/tongzhouxu/mashpit?
[x] License: Does the repository contain a plain-text LICENSE or COPYING file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@tongzhouxu) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
[x] Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
[ ] Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
[x] Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
[ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[ ] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[ ] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

hkaspersen commented 1 month ago

I have now completed my review of this software and manuscript. The authors present an interesting and useful tool that will solve a lot of issues regarding sensitive data and analysis speed. Below are some specific comments I would like the authors to address.

Functionality: The functionality of the software (v. 0.9.7) was tested on our HPC cluster, using Miniconda for installation by following the guidelines from the GitHub page. No problems were encountered during installation. The software was tested by sketching the Salmonella database as described in the example commands, followed by querying a local Salmonella genome against the database, using default settings.

[ ] The following warning was encountered: "FutureWarning: Minimal version of pyarrow will soon be increased to 14.0.1. You are using 9.0.0. Please consider upgrading." Unless version 9.0.0 of pyarrow is crucial, please consider updating this dependency.
[ ] The time used to prepare the Salmonella database was quite long, which is understandable. However, it would be useful for users to know how long one can expect this preparation to take, so they can plan accordingly.
[ ] Output: The png tree figure is clipped due to the number of genomes included (default 200). Make sure the figure is within the limits of the plot, or state in the documentation that "no tree figure will be generated when including more than x results".
[ ] I cannot reproduce the performance evaluation of Mashpit from the description provided in the manuscript, due to missing results description (see comment under "Manuscript")
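One possible way to avoid the clipped tree figure is to scale the figure height with the number of genomes in the result. The following is only a minimal sketch of that idea: the function name `tree_figure_size` and the default values (0.2 inches per leaf, 4 inch minimum height, 8 inch width) are hypothetical and are not taken from Mashpit.

```python
def tree_figure_size(n_leaves, inches_per_leaf=0.2, min_height=4.0, width=8.0):
    """Return a (width, height) pair in inches for a tree with n_leaves tips.

    Hypothetical helper: the height grows linearly with the number of leaves
    so that each leaf label gets its own row, but never drops below min_height.
    """
    if n_leaves < 0:
        raise ValueError("n_leaves must be non-negative")
    height = max(min_height, n_leaves * inches_per_leaf)
    return (width, height)


if __name__ == "__main__":
    # Compare a small result set with the default of 200 genomes.
    for n in (10, 200):
        print(n, tree_figure_size(n))
```

The returned pair could be passed on as the figure size of whatever plotting library produces the PNG, for example the `figsize` argument of a matplotlib figure, if that is what is used.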

Documentation: The documentation provided on GitHub is a bit lacking and could be expanded.

[ ] Please provide a longer description of the software and its uses in the GitHub documentation. This makes it easier for users to understand what the software does and why it is useful.
[ ] No automated tests or test data are provided. Please consider adding a small dataset for users to test the software with. No GitHub Actions workflow with CI tests was identified in the GitHub repository.

Manuscript: The manuscript is well-written and provides context on why the software is useful, and what problem it solves.

Summary:
- Line 28 - 34: A brief description is provided on the performance of Mashpit compared to BLAST, but no actual comparison is presented. Please consider adding data on how much faster and/or more accurate Mashpit is compared to BLAST. Is there any other software worth comparing to?
Mashpit design:
- [ ] Line 65 - 68: The authors state that they have evaluated the performance of Mashpit on four different foodborne pathogens. However, the results from these tests are not presented clearly. I assume this is the data presented in Figure 1? If so, please refer to the figure here. Because the figure is clipped in the PDF, it is difficult to grasp all the details in it. Additionally, it would be very beneficial for the reader to know the exact results from these tests, so please add them to the text.
Minor comments:
- [ ] Line 43: Please italicise Salmonella
- [ ] Line 68: Please correct Campylobactor to Campylobacter
- [ ] Line 78: Add 'h' to 'Mashpit'
- [ ] Line 80: Correct 'calclated'
- [ ] Figure 1: Only part of the figure is visible. Make sure the full figure is visible.

tongzhouxu commented 1 month ago

Hi @hkaspersen, thank you so much for your review and your suggestions. We greatly appreciate the time and effort you invested in evaluating our work. We will revise the code and manuscript accordingly.

csoneson commented 2 weeks ago

👋🏻 Just wanted to check in on the progress of the reviews here. @hkaspersen - thanks for your initial comments! @mberacochea - could you let us know how things are going on your side, or if you have any questions. Thanks everyone!

mberacochea commented 2 weeks ago

Hi @csoneson! I've been traveling for the past few weeks, and I will sort out my review in the next few days.

mberacochea commented 1 week ago

I've finished my review :).

The authors present a command-line utility that addresses a problem in a convenient and performant way. The code is well-structured, and the installation and usage instructions work as expected. The repository includes a set of unit tests covering a significant portion of the API, automated through GitHub Actions. The tool uses Sourmash to compare large numbers of genomes very quickly, and it reduces the friction of the whole workflow, from downloading references to querying with the user's sample FASTA file.

Source code - docs and functionality

I've submitted a series of issues with my suggestions to improve the source code, those are:

These issues cover aspects of the quality of the source code, functionality and documentation. I'll update my checklist after the authors review my tickets in the repo.

Paper

The paper is well written and it allows the reader to understand the purpose and scope of the application.

Some notes on particular lines:

Lines 11 and 12. PulseNet should be cited on line 11, and maybe mention that it is hosted by NCBI.
Line 43, Salmonella should be italicised
Line 53, is sourmash an interface for Mash, or an implementation of the FracMinHash algorithm instead?
Line 80, typo "calclated"
Figure 1, it's clipped and not readable.
The web server is not mentioned in the paper.
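As background to the line 53 question: classic MinHash, as used in Mash, keeps a fixed number of the smallest k-mer hashes, whereas FracMinHash keeps every k-mer hash that falls below a fixed fraction of the hash space. The toy sketch below illustrates the FracMinHash idea only. It is not sourmash's code: the blake2b hash, the function names, and the parameter defaults are assumptions made for illustration, and canonical (reverse-complement) k-mers are left out for brevity.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used in this sketch


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is a stand-in hash here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=1000):
    """Keep all k-mer hashes below MAX_HASH / scaled (roughly 1/scaled of them)."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash64, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Jaccard similarity of two hash sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    full = frac_minhash("ACGTACGTAC", k=4, scaled=1)    # keeps every hash
    half = frac_minhash("ACGTACGTAC", k=4, scaled=2)    # keeps about half
    print(len(full), len(half), jaccard(full, full))
```

Because the retained set depends only on the hash threshold, a sketch built with a larger `scaled` value is always a subset of one built with a smaller value over the same sequence.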

A more general note about Mashpit: If I understood correctly, the accuracy of placing a genome in the correct SNP cluster isn’t very high (70% for Salmonella when considering that the "right" SNP cluster should be within the top 25). This seems quite relevant, as it’s likely important to users (this is my assumption). I would suggest expanding or rephrasing this part to make it clearer to users what the best use case for Mashpit is (it’s already mentioned in the discussion, but I feel it’s a bit lost there). I would also suggest splitting Figure 1 into two figures, one to explain the "compute performance" (in terms of resources) and the other to show the tool's accuracy.
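To make the accuracy figure concrete, the "within the top 25" criterion can be read as a top-k accuracy: the fraction of query genomes whose known SNP cluster appears among the first k hits. The sketch below shows that calculation under this reading. The function name, the data layout, and the cluster and query IDs are hypothetical and are not taken from the manuscript or from Mashpit.

```python
def top_k_accuracy(ranked_hits, true_clusters, k=25):
    """Fraction of queries whose true SNP cluster is among their top-k hits.

    ranked_hits:   dict mapping query ID -> list of cluster IDs, best hit first
    true_clusters: dict mapping query ID -> the known (expected) cluster ID
    """
    if not true_clusters:
        raise ValueError("no queries to score")
    correct = 0
    for query, truth in true_clusters.items():
        top = ranked_hits.get(query, [])[:k]  # a missing query counts as a miss
        if truth in top:
            correct += 1
    return correct / len(true_clusters)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    ranked = {"q1": ["A", "B", "C"], "q2": ["D", "E"], "q3": ["F"]}
    truth = {"q1": "C", "q2": "D", "q3": "Z"}
    for k in (1, 3):
        print(k, top_k_accuracy(ranked, truth, k=k))
```

Reporting this value for several values of k (for example 1, 5 and 25) would show readers how quickly the correct cluster rises to the top of the result list.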

tongzhouxu commented 1 week ago

Hi @mberacochea, thank you for taking the time to review Mashpit. We will modify the code according to your feedback.

Best, Tongzhou