openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[REVIEW]: Metagenomic classification with KrakenUniq on low-memory computers #4908

Closed: editorialbot closed this 1 year ago

editorialbot commented 2 years ago

Submitting author: @salzberg (Steven Salzberg)
Repository: https://github.com/fbreitwieser/krakenuniq
Branch with paper.md (empty if default branch):
Version: v1.0.3
Editor: @lpantano
Reviewers: @Jessime, @MaximLippeveld
Archive: 10.5281/zenodo.7459217

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/0558bc421e6167f70639f053a4562d85"><img src="https://joss.theoj.org/papers/0558bc421e6167f70639f053a4562d85/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/0558bc421e6167f70639f053a4562d85/status.svg)](https://joss.theoj.org/papers/0558bc421e6167f70639f053a4562d85)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@Jessime & @MaximLippeveld, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @lpantano know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest.

Checklists

📝 Checklist for @MaximLippeveld

📝 Checklist for @Jessime

editorialbot commented 2 years ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 2 years ago

Software report:

github.com/AlDanial/cloc v 1.88  T=0.08 s (1172.6 files/s, 217821.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C++                             30            929           1036           4845
C/C++ Header                    32            734           1018           3742
Perl                            13            530            586           2819
Bourne Shell                    11            142            129            707
Markdown                         7            149              0            547
CMake                            2             38             29            154
make                             2             47             45             83
TeX                              1              4              0             54
YAML                             1              1              4             18
-------------------------------------------------------------------------------
SUM:                            99           2574           2847          12969
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 2 years ago

Wordcount for paper.md is 1372

editorialbot commented 2 years ago

Failed to discover a valid open source license

editorialbot commented 2 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1186/gb-2014-15-3-r46 is OK
- 10.1186/s13059-018-1568-0 is OK
- 10.1186/s13059-019-1891-0 is OK
- 10.1371/journal.pcbi.1006277 is OK

MISSING DOIs

- None

INVALID DOIs

- None

editorialbot commented 2 years ago

📄 Download article proof · View article proof on GitHub

MaximLippeveld commented 2 years ago

Review checklist for @MaximLippeveld

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

MaximLippeveld commented 2 years ago

Hi @lpantano

The submitter only has 5 commits on the repository. I assume this is not enough to meet the "Contribution and authorship" check.

I am also not sure about the scholarly effort. The paper describes only one added feature to an existing software. While I am convinced that the added feature is useful, I'm not sure if one feature is substantial enough to warrant a paper.

Could you weigh in on this?

salzberg commented 2 years ago

> Hi @lpantano
>
> The submitter only has 5 commits on the repository. I assume this is not enough to meet the "Contribution and authorship" check.
>
> I am also not sure about the scholarly effort. The paper describes only one added feature to an existing software. While I am convinced that the added feature is useful, I'm not sure if one feature is substantial enough to warrant a paper.
>
> Could you weigh in on this?

I am one of the authors. The new feature required substantial re-engineering of the code, and it provides a genuinely useful new capability. This software uses a custom database that is huge, typically around 400 GB, and in the original version the entire database must be in RAM. That limits users to servers with more than 400 GB of RAM, which many simply cannot obtain. My own colleagues in the School of Medicine were recently scrambling (unsuccessfully) to find funds to buy a new, large-memory server so they could run this software (KrakenUniq) locally. They are using it for medical data, so they can't simply run things on an external cloud service like Amazon.

The new "feature" is an elegant way to read in the database in chunks: process all of the input data against one chunk, store the partial results in temporary files, and then read in the next chunk. The user can divide the database into as many chunks as they like, and the runtime penalty is surprisingly modest.

This work was done by a postdoc, Christopher Pockrandt (1st author), who was not part of the original papers describing this system. How else can he get credit if this isn't considered "substantial enough to warrant a paper"? I hope you'll agree that we need to make it possible for scientists to publish good software engineering work; otherwise, there is little incentive to maintain research software. -Steven Salzberg

MaximLippeveld commented 2 years ago

Hi Steven,

Thanks for your clarifications. I definitely agree we need to publish good software engineering work; I'm reviewing for JOSS after all 🙂

I do feel that the complexity of the proposed solution, as you describe it now, is not reflected in the paper. Maybe the first paragraph of the section "Database chunking" could be elaborated on? For example, is there anything specific about merging the k-mer lookups from the different chunks? How does the chunked reading from disk work? Adding pseudocode to describe the algorithm could help as well.

salzberg commented 2 years ago

I'll let Christopher know and see if we can add some more details about the temporary data structures and how we merge them.

Jessime commented 2 years ago

Review checklist for @Jessime

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Jessime commented 2 years ago

Hi, everyone!

First, a quick caveat: this is my first review, GitHub-based or otherwise. I'm looking forward to getting into it, and I hope you'll bear with me as I figure out all of the specifics.

I'm holding out hope that I'll be able to get up and running with a docker pull, but in the meantime, here's a list of questions/concerns/suggestions based on a read-through of the paper, documentation, and git history.

alekseyzimin commented 1 year ago

Thank you @MaximLippeveld and @Jessime for your reviews!

@MaximLippeveld , we added more detail as to how the modification works at the beginning of "Database chunking" section: "Under this new algorithm, KrakenUniq loads one chunk of the database into memory at a time. KrakenUniq performs a binary search on the minimizer table to find the largest minimizer such that the chunk of the minimizer table and the corresponding chunk of the k-mer table together use not more than the specified amount of memory. It then loads the chunk of the minimizer table and the corresponding chunk of the k-mer table and iterates over all of the reads provided as input. The code looks up all k-mers in the reads in the currently-loaded chunk of the database. The taxonomic IDs for the k-mers that matched a k-mer in the chunk (or placeholders for k-mers without a hit) are stored in a temporary file on disk for each read. With every chunk iteration, taxonomic IDs of newly found k-mers in the reads are appended to the temporary file. This process is repeated until the entire database has been processed. At this point the temporary file contains taxonomic IDs for all k-mers in the reads that matched any part of the database. KrakenUniq then reads the temporary file, collects k-mer hits for every read, and uses them to classify each read. Classification results will be identical to running in the default mode; i.e., database chunking does not alter the output."
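For readers who want pseudocode as well (as suggested above), here is a minimal, self-contained C++ sketch of the chunking idea. It is illustrative only: a toy map stands in for the minimizer and k-mer tables, per-read hits are buffered in memory rather than in a temporary file on disk, and none of the names correspond to KrakenUniq's actual internals.

// toy_chunking.cpp -- illustrative sketch only; names and data structures
// do NOT correspond to KrakenUniq's real implementation.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Db = std::map<std::string, uint32_t>;  // toy database: k-mer -> taxonomic ID

int main() {
    const std::size_t k = 3;
    std::vector<std::string> reads = {"ACGTAC", "TTTACG"};
    Db db = {{"ACG", 9606}, {"CGT", 9606}, {"TAC", 562}, {"TTT", 562}};

    // Fixed-size chunks for simplicity; KrakenUniq instead binary-searches
    // the minimizer table for the largest slice fitting the memory budget.
    const std::size_t chunk_size = 2;

    // Per-read hit lists; the real code appends these to a temporary file
    // on disk so that memory use stays bounded by the chunk size.
    std::vector<std::vector<uint32_t>> hits(reads.size());

    for (auto it = db.begin(); it != db.end(); ) {
        Db chunk;  // "load" the next chunk of the database
        for (std::size_t n = 0; n < chunk_size && it != db.end(); ++n, ++it)
            chunk.insert(*it);
        // One full pass over all reads per chunk.
        for (std::size_t r = 0; r < reads.size(); ++r)
            for (std::size_t i = 0; i + k <= reads[r].size(); ++i) {
                auto h = chunk.find(reads[r].substr(i, k));
                if (h != chunk.end()) hits[r].push_back(h->second);
            }
    }

    // After the last chunk, every read's hit list is complete, so
    // classification (LCA + unique k-mer counting in the real code) proceeds.
    for (std::size_t r = 0; r < reads.size(); ++r) {
        std::cout << reads[r] << ":";
        for (uint32_t t : hits[r]) std::cout << " " << t;
        std::cout << "\n";
    }
    return 0;
}

Compiled with any C++11 compiler, this prints each read followed by the taxonomic IDs of its k-mer hits; the accumulated hits are the same as a single full-database pass would produce, only the order in which they are discovered differs.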

@Jessime, here are our responses to your review:

License: we renamed LICENSE.GPL-3.0 to LICENSE.

Contribution and Authorship: several students and postdocs have contributed to KrakenUniq sequentially. Once a student or postdoc graduates, they sometimes stop working on the project, and maintenance and improvements have to be picked up by another person. That explains the git history.

Installation: At this time KrakenUniq is available through GitHub and Bioconda. We will definitely consider adding a Docker image once we have the manpower to do it.

Performance: The default behavior (without any preload) is the worst one can do. We modified the paragraph describing the performance as follows: "Running times can vary significantly depending on the type of storage (e.g., databases on network storage can take longer to load) and the size of the read dataset (i.e., classifying a small number of reads does not justify preloading the entire database, especially on fast storage). The speed of loading the database is not impacted by the --preload-size option because the database is still read in a linear way. Saving intermediate files from the chunks is done in the same way as in the original code. The only difference is that classification results from all individual chunks are now concatenated into a single file, which is read once all chunks are finished. Table 1 shows that in a typical use case, even when the database does fit in RAM, loading the entire database (--preload option) is far faster than memory mapping (14 minutes versus 48 hours). Loading the database in chunks adds overhead because of the need to iterate over the reads multiple times, but it is still comparable to preloading the entire database and is highly recommended when not enough main memory is available. For example, limiting the database to 8 GB, which means it can be loaded even on a standard laptop computer, increased the running time only about 3.4-fold, even though the database was broken into 49 chunks. For large read datasets we expect that setting the --preload or --preload-size flag will always be faster than the default behavior of memory mapping. The format of the databases used by the new algorithm has not changed, hence all previously built databases for Kraken and KrakenUniq can be used."

Example usage: We would like to keep the --preload flag for compatibility with legacy scripts. We added "overrides --preload flag" to the usage:

--preload-size SIZE    Loads DB into memory in chunks of SIZE, e.g. 500M or 7G (if RAM is small); overrides --preload flag
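For illustration, an invocation using the new option could look like the following (the database directory and read file names are placeholders, not files shipped with the repository; the other flags are the ones used in the Table 1 benchmark):

krakenuniq --db DBDIR --preload-size 8G --threads 8 --report-file report.tsv --output classified.tsv reads.fastq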

Functionality documentation: We added the usage statement to the README.md, similar to how it was done for the fd example.

Community guidelines: We added CONTRIBUTING.md and CODE_OF_CONDUCT.md to the repository.

MaximLippeveld commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

📄 Download article proof · View article proof on GitHub

MaximLippeveld commented 1 year ago

Hi, thanks for the updates. The "database chunking" section is now very clear for me!

A couple more remarks:

alekseyzimin commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

📄 Download article proof · View article proof on GitHub

alekseyzimin commented 1 year ago

@MaximLippeveld, thank you for your suggestions!

Installation: we modified the README.md to include -c bioconda.
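For reference, the Bioconda install is then a one-liner (assuming the package continues to be published under the name krakenuniq):

conda install -c bioconda krakenuniq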

Statement of need: we added the following paragraph: "Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database into RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, at a cost of only a 3- to 4-fold increase in CPU time rather than 100-fold or more."

Reproducibility: we added the command line used to generate Table 1 to the table caption:

"Table 1: Running times for classifying 9.4 million reads (from a human eye metagenome, SRR12486990) with 8 threads using KrakenUniq in different modes. The database size was 392 GB, and it consisted of all complete bacterial, archeal, and viral genomes in RefSeq from 2020, 46 selected eukaryotic human pathogens [@lu2018removing]), as well the human genome and a collection of common vector sequences. The database is available for download at https://benlangmead.github.io/aws-indexes/k2 under the name EuPathDB46. The command lines used to measure the runtimes were krakenuniq --db krakendb-2020-08-16-all_pluseupath --threads 24 --report-file report --output classify SRR12486990.fastq with no additional options for default, and with addition of the preload option shown in the table for various preload sizes. In the database chunking experiments (using –preload-size) KrakenUniq loaded the database into RAM in 49, 25, 13 and 7 chunks (respectively)."

References: We added references to GenBank and minimizers.

alekseyzimin commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

📄 Download article proof · View article proof on GitHub

MaximLippeveld commented 1 year ago

Great!

Could you also add instructions on downloading SRR12486990.fastq?

salzberg commented 1 year ago

> Great!
>
> Could you also add instructions on downloading SRR12486990.fastq?

Yes, just did this. That's an NCBI accession, and we've put the link in the table caption, pointing to https://www.ncbi.nlm.nih.gov/sra/SRR12486990.
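For anyone reproducing the benchmark, one common way to fetch that accession is via the SRA Toolkit (assuming it is installed; any equivalent download route works):

prefetch SRR12486990
fasterq-dump SRR12486990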

alekseyzimin commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

📄 Download article proof · View article proof on GitHub

MaximLippeveld commented 1 year ago

Thanks for the updates. I was able to run the command successfully in 44 minutes using --preload-size 16G and --threads 10 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz. Due to computational constraints, I am unfortunately not able to reproduce all of the timing results.

@lpantano My review is completed!

alekseyzimin commented 1 year ago

@lpantano , what are the next steps?

lpantano commented 1 year ago

@Jessime, can you give an estimation for your review after the updates?

alekseyzimin commented 1 year ago

@Jessime, thank you for your review. Are you satisfied with the changes that we made in response? Could you please respond when you have time?

Jessime commented 1 year ago

Hey everyone, things are looking good. I just have one more point on the installation. Namely, adding the following Dockerfile will allow users who don't have access to conda to use your tool:

FROM alpine:latest

RUN apk add --no-cache \
  bash \
  perl \
  make \
  g++ \
  bzip2-dev \
  zlib-dev

WORKDIR /app
COPY . . 
RUN ./install_krakenuniq.sh /usr/local/bin

CMD ["bash"]

It's then a matter of:

docker build -t krakenuniq . 
docker run --rm -it krakenuniq krakenuniq --help

to install and verify things look good. Would it be possible to add the Dockerfile and the instructions?

alekseyzimin commented 1 year ago

@Jessime Added! Thank you for your suggestion!!!

lpantano commented 1 year ago

@Jessime, thank you for your comments. Can you update your checklist please, if now all is good, I can go over the last steps. Let me know. Thanks!

Jessime commented 1 year ago

Everything is good! I've updated the checklist and signed off on everything.

alekseyzimin commented 1 year ago

Thank you @Jessime !

lpantano commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/nar/gks1195 is OK
- 10.1093/bioinformatics/bth408 is OK
- 10.1186/gb-2014-15-3-r46 is OK
- 10.1186/s13059-018-1568-0 is OK
- 10.1186/s13059-019-1891-0 is OK
- 10.1371/journal.pcbi.1006277 is OK

MISSING DOIs

- None

INVALID DOIs

- None

lpantano commented 1 year ago

@alekseyzimin, at this point could you:

- Make a tagged release of your software, and list the version tag of the archived version here.
- Archive the reviewed software in Zenodo or a similar service (e.g., figshare, an institutional repository).
- Check that the archival deposit (e.g., in Zenodo) has the correct metadata. This includes the title (should match the paper title) and author list (make sure the list is correct and people who only made a small fix are not on it). You may also add the authors' ORCIDs.
- List the DOI of the archived version here.

I can then move forward with recommending acceptance of the submission.

alekseyzimin commented 1 year ago

Hello @lpantano, thank you for letting us know about the next steps. We performed the requested updates; please see below.

- Make a tagged release of your software, and list the version tag of the archived version here. Response: released v1.0.3
- Archive the reviewed software in Zenodo or a similar service (e.g., figshare, an institutional repository), and check that the archival deposit has the correct metadata, including the title (should match the paper title) and author list. Please list the DOI of the archived version here. Response: 10.5281/zenodo.7459217

lpantano commented 1 year ago

@editorialbot set v1.0.3 as version

editorialbot commented 1 year ago

Done! version is now v1.0.3

lpantano commented 1 year ago

@editorialbot set 10.5281/zenodo.7459217 as archive

editorialbot commented 1 year ago

Done! Archive is now 10.5281/zenodo.7459217

lpantano commented 1 year ago

@editorialbot recommend-accept

editorialbot commented 1 year ago
Attempting dry run of processing paper acceptance...

editorialbot commented 1 year ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/nar/gks1195 is OK
- 10.1093/bioinformatics/bth408 is OK
- 10.1186/gb-2014-15-3-r46 is OK
- 10.1186/s13059-018-1568-0 is OK
- 10.1186/s13059-019-1891-0 is OK
- 10.1371/journal.pcbi.1006277 is OK

MISSING DOIs

- None

INVALID DOIs

- None

editorialbot commented 1 year ago

👋 @openjournals/bcm-eics, this paper is ready to be accepted and published.

Check final proof: 📄 Download article

If the paper PDF and the deposit XML files look good in https://github.com/openjournals/joss-papers/pull/3839, then you can now move forward with accepting the submission by compiling again with the command @editorialbot accept

alekseyzimin commented 1 year ago

@editorialbot accept

editorialbot commented 1 year ago

I'm sorry @alekseyzimin, I'm afraid I can't do that. That's something only eics are allowed to do.

alekseyzimin commented 1 year ago

@lpantano everything looks good, please proceed with the publication. Thank you!