Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Software report:
```
github.com/AlDanial/cloc v 1.88  T=0.08 s (1172.6 files/s, 217821.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C++                             30            929           1036           4845
C/C++ Header                    32            734           1018           3742
Perl                            13            530            586           2819
Bourne Shell                    11            142            129            707
Markdown                         7            149              0            547
CMake                            2             38             29            154
make                             2             47             45             83
TeX                              1              4              0             54
YAML                             1              1              4             18
-------------------------------------------------------------------------------
SUM:                            99           2574           2847          12969
-------------------------------------------------------------------------------
```
gitinspector failed to run statistical information for the repository
Wordcount for paper.md is 1372
Failed to discover a valid open source license
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1186/gb-2014-15-3-r46 is OK
- 10.1186/s13059-018-1568-0 is OK
- 10.1186/s13059-019-1891-0 is OK
- 10.1371/journal.pcbi.1006277 is OK
MISSING DOIs
- None
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Hi @lpantano
The submitter only has 5 commits on the repository. I assume this is not enough to meet the "Contribution and authorship" check.
I am also not sure about the scholarly effort. The paper describes only one added feature to an existing software. While I am convinced that the added feature is useful, I'm not sure if one feature is substantial enough to warrant a paper.
Could you weigh in on this?
I am one of the authors. The new feature required substantial re-engineering of the code, and it provides a really useful new capability. This software uses a custom database that is huge, typically around 400 GB, and the database must be in RAM for the original version to work. That limits users to finding a server with more than 400 GB of RAM, which many users simply cannot do. My own colleagues in the School of Medicine were recently scrambling (unsuccessfully) to find funds to buy a new, large-memory server so they could run this software (KrakenUniq) locally. They are using it for medical data, so they can't simply run things on an external cloud server like Amazon.

The new "feature" is an elegant way to read in the database in chunks, process all of the other input data, store the partial results in temporary files, and then read in the next chunk. The user can divide the data into as many chunks as they like, and the runtime penalty is surprisingly modest.

This work was done by a postdoc, Christopher Pockrandt (1st author), who was not part of the original papers describing this system. How else can he get credit if this isn't considered "substantial enough to warrant a paper"? I hope you'll agree that we need to make it possible for scientists to publish good software engineering work - otherwise, there is little incentive to maintain research software. -Steven Salzberg
Hi Steven,
Thanks for your clarifications. I definitely agree we need to publish good software engineering work; I'm reviewing for JOSS after all :slightly_smiling_face:
I do feel that the complexity of the proposed solution, as you describe it now, is not reflected in the paper. Maybe the first paragraph of the "Database chunking" section could be elaborated on? For example, is there anything specific about merging the k-mer lookups from the different chunks? How does the chunked reading from disk work? Adding pseudocode to describe the algorithm could help as well.
I'll let Christopher know and see if we can add some more details about the temporary data structure and how we merge them.
Hi, everyone!
First, I want to quickly caveat that this is my first review, GitHub-based or otherwise. I'm looking forward to getting into it, and I hope you'll bear with me as I figure out all of the specifics.
I'm holding out hope that I'll be able to get up and running with a docker pull, but in the meantime, here's a list of questions/concerns/suggestions based on a read-through of the paper/documentation/git history.
License: I'd suggest renaming the license file to LICENSE, or perhaps LICENSE.md. Does JOSS have a convention here?
Installation: I'd like a Dockerfile to be included in the repo, and if at all possible, corresponding Docker and/or Singularity images to be made available. Virtually every organization using KrakenUniq at scale will wrap this repo in an image. Having it provided by the repo maintainers will remove a significant amount of unnecessary re-work. Note: Conda is also a reasonable package manager, but is more niche outside of academia.
Example usage: It's unclear how the --preload-size and --preload flags interact. Will --preload-size work without --preload, or does it imply --preload? In fact, I believe the cleaner API is to not have a new flag: --preload can just be passed either with a size (--preload 8GB) or without one, the latter implying preloading everything. Keeping a tight list of flags a user needs to remember makes for a better user experience.
Functionality documentation: Options is often the most important section of a README; fd is a great example. In other words, this function should be in the README.
Community guidelines: A CONTRIBUTING.md file would get this box checked. There are many guidelines available, like this one.
Thank you @MaximLippeveld and @Jessime for your reviews!
@MaximLippeveld , we added more detail as to how the modification works at the beginning of "Database chunking" section: "Under this new algorithm, KrakenUniq loads one chunk of the database into memory at a time. KrakenUniq performs a binary search on the minimizer table to find the largest minimizer such that the chunk of the minimizer table and the corresponding chunk of the k-mer table together use not more than the specified amount of memory. It then loads the chunk of the minimizer table and the corresponding chunk of the k-mer table and iterates over all of the reads provided as input. The code looks up all k-mers in the reads in the currently-loaded chunk of the database. The taxonomic IDs for the k-mers that matched a k-mer in the chunk (or placeholders for k-mers without a hit) are stored in a temporary file on disk for each read. With every chunk iteration, taxonomic IDs of newly found k-mers in the reads are appended to the temporary file. This process is repeated until the entire database has been processed. At this point the temporary file contains taxonomic IDs for all k-mers in the reads that matched any part of the database. KrakenUniq then reads the temporary file, collects k-mer hits for every read, and uses them to classify each read. Classification results will be identical to running in the default mode; i.e., database chunking does not alter the output."
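The chunking procedure quoted above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not KrakenUniq's actual C++ implementation: the binary search over the minimizer table is omitted, the in-memory `hits` dict stands in for the temporary file, and all names and data shapes here are invented.

```python
from collections import defaultdict

def classify_with_chunks(reads, db_chunks, classify):
    """Sketch of database chunking.

    reads:     {read_id: [k-mer, ...]}
    db_chunks: list of {k-mer: taxonomic_id} dicts, one per database chunk
    classify:  function mapping a read's list of k-mer hits to a label
    """
    # Per-read hit lists; stands in for KrakenUniq's temporary file on disk.
    hits = defaultdict(list)
    for chunk in db_chunks:                  # one chunk "in memory" at a time
        for read_id, kmers in reads.items():  # iterate over all input reads
            for kmer in kmers:
                if kmer in chunk:             # k-mer lookup in current chunk
                    hits[read_id].append(chunk[kmer])
    # After all chunks are processed, collect hits and classify each read.
    return {read_id: classify(hits[read_id]) for read_id in reads}
```

Because every read's hits are accumulated across all chunks before classification, the result is independent of how the database is split, matching the statement that chunking does not alter the output.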
@Jessime, here are our responses to your review:
License: we renamed the LICENSE.GPL-3.0 to LICENSE.
Contribution and Authorship: several students and postdocs have contributed to KrakenUniq sequentially. Once a student or postdoc graduates, they sometimes stop working on the project, and maintenance and improvements have to be picked up by another person. That explains the git history.
Installation: At this time KrakenUniq is available through github and BioConda. We will definitely consider adding a docker image once we have manpower to do it.
Performance: The default behavior (without any preload) is the worst one can do. We modified the paragraph describing the performance as follows:
"Running times can vary significantly depending on the type of storage (e.g., databases
on network storage can take longer to load) and the size of the read dataset (i.e., classifying a small
number of reads does not justify preloading the entire database, especially of fast storage). The speed of loading
the database is not impacted by the --preload-size
option because the database is still read in a linear way.
Saving intermediate files from the chunks is done in the same way as in the original code.
The only difference is that now classification results from all individual chunks are concatenated into a single file, which is read once all chunks are finished. Table 1 shows that in a typical use case, even when the database does fit in RAM, loading the entire database (--preload
option) is far faster than memory mapping (14 minutes versus 48 hours). Loading the database by chunks adds overhead because of the need to
iterate over the reads multiple times, but is still comparable to pre-loading the entire database and
highly recommended when not enough main memory is available. For example, limiting the database
to 8G, which means it can be loaded even on a standard laptop computer, increased the running
time only about 3.4-fold, even though the database was broken into 49 chunks. For large read datasets we expect that setting the --preload
or --preload-size
flag will always be faster than the default behavior of memory mapping. The format of the
databases used by the new algorithm has not changed, hence all previously built databases for Kraken
and KrakenUniq can be used."
Example usage: We would like to keep the --preload flag for compatibility with legacy scripts. We added "overrides --preload flag" to the usage: --preload-size SIZE Loads DB into memory in chunks of SIZE, e.g. 500M or 7G (if RAM is small), overrides --preload flag
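The flag semantics described in that usage line can be made concrete with a small sketch. This is a hypothetical illustration of the precedence rule (a --preload-size value such as "500M" or "7G" overrides a bare --preload), not KrakenUniq's actual argument parser; function names and the byte-count resolution are my own.

```python
# Multipliers for the human-readable size suffixes mentioned in the usage line.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(text):
    """Parse a size string like '500M' or '7G' into a byte count."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)  # plain byte count, no suffix

def preload_budget(preload, preload_size, db_bytes):
    """Resolve the two flags into a per-chunk memory budget (in bytes).

    --preload-size overrides --preload; a bare --preload loads everything;
    neither flag means fall back to memory mapping (budget 0).
    """
    if preload_size is not None:
        return min(parse_size(preload_size), db_bytes)
    return db_bytes if preload else 0
```

Keeping the override explicit in the help text, as the authors did, avoids the ambiguity the reviewer pointed out about how the two flags interact.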
Functionality documentation: We added the usage statement to the README.md, in a similar way to how it was done for the fd example.
Community guidelines: We added CONTRIBUTING.md and CODE_OF_CONDUCT.md to the repository.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Hi, thanks for the updates. The "database chunking" section is now very clear for me!
A couple more remarks:
Installation: conda install -c bioconda krakenuniq works. The repo's README doesn't include -c bioconda at the moment.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@MaximLippeveld, thank you for your suggestions!
Installation: we modified the README.md to include -c bioconda
Statement of need: we added the following paragraph: "Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database in RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, with a running-time increase of only 3- to 4-fold, down from 100-fold or more."
Reproducibility: we added the command line used to generate Table 1 to the table caption:
"Table 1: Running times for classifying 9.4 million reads (from a human eye metagenome, SRR12486990)
with 8 threads using KrakenUniq in different modes. The database size was 392 GB, and it consisted
of all complete bacterial, archeal, and viral genomes in RefSeq from 2020, 46 selected eukaryotic
human pathogens [@lu2018removing]), as well the human genome and a collection of common vector sequences. The database is available for download at https://benlangmead.github.io/aws-indexes/k2 under the name EuPathDB46. The command lines used to measure the runtimes were krakenuniq --db krakendb-2020-08-16-all_pluseupath --threads 24 --report-file report --output classify SRR12486990.fastq
with no additional options for default, and with addition of the preload option shown in the table for various preload sizes. In the database chunking experiments (using –preload-size
) KrakenUniq loaded the database into RAM in 49, 25, 13 and 7 chunks (respectively)."
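As a back-of-the-envelope check of the caption's numbers (my own arithmetic, not taken from the paper's code), the reported chunk counts of 49, 25, 13 and 7 are consistent with a 392 GB database under assumed --preload-size values of 8, 16, 32 and 64 GB respectively:

```python
import math

def chunk_count(db_gb, preload_gb):
    """Minimum number of chunks needed to cover the whole database."""
    return math.ceil(db_gb / preload_gb)

# 392 GB database at the four assumed preload sizes.
counts = [chunk_count(392, size) for size in (8, 16, 32, 64)]
print(counts)  # -> [49, 25, 13, 7]
```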
References: We added references to GenBank and minimizers.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Great!
Could you also add instructions on downloading SRR12486990.fastq?
Yes, just did this. That's an NCBI accession, and we've put the link in the Table caption, pointing to https://www.ncbi.nlm.nih.gov/sra/SRR12486990.
@editorialbot generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Thanks for the updates. I was able to run the command successfully in 44 min using --preload-size 16G and --threads 10 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz. Due to computational constraints, I am not able to reproduce all timing results, unfortunately.
@lpantano My review is completed!
@lpantano , what are the next steps?
@Jessime, can you give an estimate for when you'll complete your review of the updates?
@Jessime, thank you for your review. Are you satisfied with the changes that we made in response? Could you please respond when you have time.
Hey everyone, things are looking good. I just have one more point on the Installation.
Namely, adding the following Dockerfile will allow users who don't have access to conda to use your tool:
```dockerfile
FROM alpine:latest
RUN apk add --no-cache \
    bash \
    perl \
    make \
    g++ \
    bzip2-dev \
    zlib-dev
WORKDIR /app
COPY . .
RUN ./install_krakenuniq.sh /usr/local/bin
CMD ["bash"]
```
It's then a matter of:
```shell
docker build -t krakenuniq .
docker run --rm -it krakenuniq krakenuniq --help
```
to install and verify things look good. Would it be possible to add the Dockerfile and the instructions?
@Jessime Added! Thank you for your suggestion!!!
@Jessime, thank you for your comments. Can you update your checklist please, if now all is good, I can go over the last steps. Let me know. Thanks!
Everything is good! I've updated the checklist and signed off on everything.
Thank you @Jessime !
@editorialbot check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/nar/gks1195 is OK
- 10.1093/bioinformatics/bth408 is OK
- 10.1186/gb-2014-15-3-r46 is OK
- 10.1186/s13059-018-1568-0 is OK
- 10.1186/s13059-019-1891-0 is OK
- 10.1371/journal.pcbi.1006277 is OK
MISSING DOIs
- None
INVALID DOIs
- None
@alekseyzimin, at this point could you:
I can then move forward with recommending acceptance of the submission.
Hello @lpantano, thank you for letting us know about the next steps. We performed the requested updates, please see below.
- Make a tagged release of your software, and list the version tag of the archived version here. Response: released v1.0.3
- Archive the reviewed software in Zenodo or a similar service (e.g., figshare, an institutional repository). Check the archival deposit (e.g., in Zenodo) has the correct metadata. This includes the title (should match the paper title) and author list (make sure the list is correct and people who only made a small fix are not on it). You may also add the authors' ORCID. Please list the DOI of the archived version here. Response: 10.5281/zenodo.7459217
@editorialbot set v1.0.3 as version
Done! version is now v1.0.3
@editorialbot set 10.5281/zenodo.7459217 as archive
Done! Archive is now 10.5281/zenodo.7459217
@editorialbot recommend-accept
Attempting dry run of processing paper acceptance...
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/nar/gks1195 is OK
- 10.1093/bioinformatics/bth408 is OK
- 10.1186/gb-2014-15-3-r46 is OK
- 10.1186/s13059-018-1568-0 is OK
- 10.1186/s13059-019-1891-0 is OK
- 10.1371/journal.pcbi.1006277 is OK
MISSING DOIs
- None
INVALID DOIs
- None
:wave: @openjournals/bcm-eics, this paper is ready to be accepted and published.
Check final proof :point_right::page_facing_up: Download article
If the paper PDF and the deposit XML files look good in https://github.com/openjournals/joss-papers/pull/3839, then you can now move forward with accepting the submission by compiling again with the command @editorialbot accept
@editorialbot accept
I'm sorry @alekseyzimin, I'm afraid I can't do that. That's something only eics are allowed to do.
@lpantano everything looks good, please proceed with the publication. Thank you!
Submitting author: @salzberg (Steven Salzberg)
Repository: https://github.com/fbreitwieser/krakenuniq
Branch with paper.md (empty if default branch):
Version: v1.0.3
Editor: @lpantano
Reviewers: @Jessime, @MaximLippeveld
Archive: 10.5281/zenodo.7459217
Status
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@Jessime & @MaximLippeveld, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @lpantano know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @MaximLippeveld
📝 Checklist for @Jessime