[REVIEW]: khmer release v2.1: software for biological sequence analysis

whedon commented 7 years ago

Submitting author: @standage (Daniel S. Standage) Repository: https://github.com/dib-lab/khmer Version: v2.1 Editor: @biorelated Reviewer: @sjackman Archive: 10.5281/zenodo.822478

Status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/14540a3b572265f63247f7d8982ea6e2"><img src="http://joss.theoj.org/papers/14540a3b572265f63247f7d8982ea6e2/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/14540a3b572265f63247f7d8982ea6e2/status.svg)](http://joss.theoj.org/papers/14540a3b572265f63247f7d8982ea6e2)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer questions

Conflict of interest

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: Does the release version given match the GitHub release (v2.1)?
[x] Authorship: Has the submitting author (@standage) made major contributions to the software?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: Have any performance claims of the software been confirmed?

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g. papers, datasets, software)?

whedon commented 7 years ago

Hello human, I'm @whedon. I'm here to help you with some common editorial tasks for JOSS. @sjackman it looks like you're currently assigned as the reviewer for this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As as reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all JOSS reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

arfon commented 7 years ago

👋 @sjackman - please carry out your review in this issue by updating the checklist above and giving feedback in this issue. The reviewer guidelines are available here: http://joss.theoj.org/about#reviewer_guidelines

Any questions/concerns please let @biorelated know.

sjackman commented 7 years ago

@standage Could the DOIs be made into links using http://dx.doi.org ? https://github.com/dib-lab/khmer/blob/master/papers/paper-v2.1.md#references e.g. doi:10.12688/f1000research.6924.1

sjackman commented 7 years ago

I tried to verify the reduced memory usage of --small-count and ran into an issue, reported at https://github.com/dib-lab/khmer/issues/1722

standage commented 7 years ago

Could the DOIs be made into links using http://dx.doi.org?

Sure, I'll take care of that.

sjackman commented 7 years ago

Thanks, Daniel.

sjackman commented 7 years ago

I tried to run examples/stamps/do.sh and encountered the error unrecognized arguments: --savetable, reported at https://github.com/dib-lab/khmer/issues/1725

standage commented 7 years ago

Links fixed in https://github.com/dib-lab/khmer/pull/1727.

sjackman commented 7 years ago

support for k > 32 using the non-reversible hash function MurmurHash3

Is there an upper limit on the value of k?

Consider italicizing each occurrence of k, e.g. k-mer and k > 32.

standage commented 7 years ago

Is there an upper limit on the value of k?

Nope, MurmurHash3 will handle arbitrarily large inputs. Of course, the larger k gets the more likely collisions in k-mer hash space are. But this is offset by two factors: 1) k-mers observed in genomes are sparse with respect to the entire k-mer space, especially as k gets larger; 2) collisions in k-mer hash space are orders of magnitude less frequent than collisions in the probabilistic data structures we use.

Consider italicizing each occurrence of k, e.g. k-mer and k > 32.

Done. Thanks for catching that!

sjackman commented 7 years ago

---
title: 'Review of khmer release v2.1: software for biological sequence analysis'
author: Shaun Jackman
date: 2017-06-15
---

The authors describe version 2.1 of their software khmer for various analyses of sequencing data. khmer implements primarily two data structures, the Bloom filter to store k-mer presence/absence and Count-Min Sketch to store k-mer counts. This release adds new features including variable-coverage trimming, supporting k > 32, reduced memory utilization for the Count-Min Sketch by using 4 bit counters rather than 8, and assembling and compacting a de Bruijn graph.

The software installed easily using pip install khmer. I successfully ran the example stamps analysis of a synthetic metagenomic data set. I verified the 4-bit counter reduces the memory usage of khmer by half. I successfully ran the de Bruijn graph compaction script extract-compact-dbg.py.

The 4-bit counter will be useful in reducing the memory requirements of assembling large genomes and metagenomes. Values of k larger than 32 will be useful in assembling 150-bp reads, where values of k around 100 are common. The de Bruijn graph assembly and compaction scripts are a useful contribution to the field of de Bruin graph assembly using compact data structures.

Although likely out of scope for this paper, I would like to see

the assembly modules optionally output the assembled graph in the Graphical Fragment Assembly (GFA) file format, or provide a converter from the default GML output format to GFA for interoperability with other assembly tools and visualization with Bandage.
a description of the de Bruin graph assembly and compaction algorithms.
a comparison of the assembly feature of khmer to other compact data structure assemblers like Minia, BCALM, and ABySS 2.0.

My best regards, Shaun Jackman

sjackman commented 7 years ago

@biorelated Hi, George. I have completed my review of khmer 2.1.

george-githinji commented 7 years ago

Thank you very much @sjackman for taking time to review. You have raised a couple of points that you would like to see although they might be out of scope of this manuscript. @standage could you confirm whether these are items that you would like to add to the current software package (and manuscript) or it is something that you would like to implement in a future release.

standage commented 7 years ago

Thanks for your comments and your generosity with time, @sjackman and @biorelated!

Support for GFA output is something we've been discussing recently (see https://github.com/dib-lab/khmer/issues/1467), but this didn't get prioritized for the 2.1 release. I'm hoping it's something we can include by version 3.0.

The graph compaction code is convenient but also intentionally unsophisticated; it’s not interesting enough for a research paper. We’ll document it as we expand the Python interface, probably for 3.0.

A description of khmer's assembly algorithm and comparisons to other tools, while potentially insightful from a scientific perspective, is indeed outside the scope of this software paper. At least one member of our research group is actively developing novel assembly methods and applications, and publication of these developments should provide an opportunity to describe khmer's implicit de Bruijn graph implementation and graph traversal algorithms in depth.

Best wishes,
Daniel Standage

george-githinji commented 7 years ago

Thank you Daniel (@standage) for the quick response and thank you again Shaun (@sjackman) for your review. @arfon please could we get this ready for publication? Thank you George

arfon commented 7 years ago

@arfon please could we get this ready for publication?

👍

@standage - Could you move the references you currently have in the paper-v2.1.md file into a paper.bib file and cite them directly please? (You can read how to do that here)

standage commented 7 years ago

Could you move the references you currently have in the paper-v2.1.md file into a paper.bib file and cite them directly please?

Done. See latest commit in maint/2.1 branch.

arfon commented 7 years ago

@standage - there's still some problems here sorry.

Please link to the paper.bib file in the YAML header: bibliography: paper.bib (see the example paper here: https://raw.githubusercontent.com/arfon/fidgit/master/paper/paper.md)
Please add a # References section to the paper-v2.1.md file (this is where the pandoc compilation adds your BibTeX references automatically)

Please make sure you're citing references in your paper that you actually have in your BibTeX. Currently a number are missing:

pandoc-citeproc: reference crusoe2015 not found
pandoc-citeproc: reference zhang2015 not found
pandoc-citeproc: reference pell2012 not found
pandoc-citeproc: reference zhang2014 not found

standage commented 7 years ago

Updated. Thanks @arfon.

The text already cited those references as is. Perhaps the message appeared because I had not properly included the .bib file and/or included a header for the References section?

Hopefully this addresses it. Let me know if anything is still amiss.

arfon commented 7 years ago

The text already cited those references as is. Perhaps the message appeared because I had not properly included the .bib file and/or included a header for the References section?

The issue was still there I'm afraid - case sensitive references not being picked up. I've fixed up the issues in https://github.com/dib-lab/khmer/pull/1739

When you've merged https://github.com/dib-lab/khmer/pull/1739 could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.

standage commented 7 years ago

Sorry about the case sensitivity issues, and thanks for your PR. The archive is now posted to Zenodo at doi:10.5281/zenodo.822478.

arfon commented 7 years ago

@whedon set 10.5281/zenodo.822478 as archive

whedon commented 7 years ago

OK. 10.5281/zenodo.822478 is the archive.

arfon commented 7 years ago

@sjackman many thanks for your review here and to @biorelated for editing this submission ✨

@standage - your paper is now accepted into JOSS and your DOI is http://dx.doi.org/10.21105/joss.00272 :zap: 🚀 💥

standage commented 7 years ago

Great news. Thanks to @arfon, @biorelated, and @sjackman for you contributions to the review and publishing process!

openjournals / joss-reviews