openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal

[REVIEW]: Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. #4296

Closed: editorialbot closed this issue 2 years ago

editorialbot commented 2 years ago

Submitting author: @althonos (Martin Larralde)
Repository: https://github.com/althonos/pyrodigal/
Branch with paper.md (empty if default branch): paper
Version: v1.0.0
Editor: @mikldk
Reviewers: @Gab0, @standage
Archive: 10.5281/zenodo.6472583

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e"><img src="https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e/status.svg)](https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@Gab0 & @standage, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mikldk know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest.

Checklists

📝 Checklist for @standage

📝 Checklist for @Gab0

editorialbot commented 2 years ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 2 years ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.16 s (479.7 files/s, 188929.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
JSON                             3              0              0          11512
SVG                              4              1             83           7787
Cython                          19            706            741           3927
Python                          20            324            231           1700
YAML                             3             22             67            627
Markdown                         4            221              0            543
C                                4             25             39            215
C/C++ Header                     5             38             17            167
TeX                              1              9              0            123
reStructuredText                 9            109            132             93
DOS Batch                        1              8              1             26
CSS                              1              4              7             16
make                             1              4              7              9
-------------------------------------------------------------------------------
SUM:                            75           1471           1325          26745
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 2 years ago

Wordcount for paper.md is 1246

editorialbot commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1038/s41592-018-0046-7 is OK
- 10.1145/1250734.1250746 is OK
- 10.1093/nar/gkz1002 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1101/2021.05.03.442509 is OK
- 10.1093/nargab/lqaa109 is OK
- 10.1101/2021.07.21.453177 is OK
- 10.1093/bioinformatics/btz714 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mikldk commented 2 years ago

@Gab0, @standage: Thanks for agreeing to review. Please carry out your review in this issue by first creating a checklist (@editorialbot generate my checklist) and then updating it as the review proceeds. The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. If possible create issues (and cross-reference) in the submission's repository to avoid too specific discussions in this review thread.

If you have any questions or concerns please let me know.

standage commented 2 years ago

Review checklist for @standage

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Gab0 commented 2 years ago

Review checklist for @Gab0

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

standage commented 2 years ago

The author describes Pyrodigal, a software package implementing a Python interface to the widely-adopted gene prediction program Prodigal. The Prodigal source code is bundled with Pyrodigal and compiled as a Python extension, which Pyrodigal accesses directly via a Python interface rather than indirect system calls to a separate executable. In addition to Python bindings, the author describes performance optimizations made to Prodigal’s core scoring procedure.

Prodigal is the de facto standard for prokaryotic gene prediction in the genome informatics community, and Python is the community’s de facto general-purpose scripting language. Exposing Prodigal’s internals comprehensively and efficiently to Python programs is a valuable contribution, as underscored by Pyrodigal's usage statistics and its integration into various published workflows. I am happy to endorse this submission for publication at JOSS.

standage commented 2 years ago

Hi @mikldk, no concerns from me on this submission so I think this concludes my review!

Gab0 commented 2 years ago

Overview

This project consists of Cython bindings to the widely used prokaryotic ORF finder software Prodigal.

The premise is interesting, allowing Prodigal functions to be called directly from Python code as an alternative to binary calls using subprocess.call or similar. This enables a more flexible pipeline design.
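To make the two integration styles concrete, here is a minimal sketch (not code from the submission; the library calls assume the pyrodigal v1.x API, and the file paths are placeholders):

```python
import shlex

# Style 1: shell out to the Prodigal binary, then re-parse its output files.
def prodigal_command(fasta_path):
    """Build the argument list for a subprocess.run() call to the external binary."""
    return shlex.split(f"prodigal -i {fasta_path} -a proteins.faa -o genes.gbk")

# Style 2: call the bundled Prodigal routines directly, in-process, no temporary files.
def pyrodigal_genes(sequence):
    """Predict genes directly (assumes pyrodigal >= 1.0 is installed)."""
    import pyrodigal
    orf_finder = pyrodigal.OrfFinder(meta=True)
    return orf_finder.find_genes(sequence)
```

With the library style, predicted genes are Python objects whose coordinates and translations can be inspected directly, instead of being re-parsed from GFF or FASTA output.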

The performance claims regarding CPU time, RAM usage, and storage use are relevant when analyzing large sets of genomes, a scenario that has become increasingly common in recent years.

The paper includes all the required information. It also includes interesting details about performance optimizations done on top of the Prodigal codebase.

Setup

The README.md describes the following installation methods:

- PyPI
- Bioconda

Of these, I tested the PyPI method and it works nicely; I did not test Bioconda.

There seem to be two undocumented installation methods:

- The AUR package named `python-pyrodigal`
- Building from the git source

Both methods work nicely. The local build works after pulling all the submodules into the local repository.

Tests

I wrote a quick comparison test for Pyrodigal against the Prodigal binary. The workflow is a basic run using default parameters. The Pyrodigal binary runs much faster (1.8x here) than the original Prodigal executable, reflecting the performance improvements described in the paper.

[ ] However, in this run, calling the Pyrodigal routines through the library took more time than the binary: 15.5 s against 2.5 s. Maybe I misunderstood the sample code shown in the README.md and made incorrect library calls.

I would like to ask the author a few questions about these benchmark results:

  1. Do both snippets shown below have equivalent functionality?

a) pyrodigal -i <genome.fna> -o genes -a proteins.faa

b)

import pyrodigal
from Bio import SeqIO

orf_finder = pyrodigal.OrfFinder(meta=True)

record = SeqIO.read(GENOMEfna, "fasta")  # GENOMEfna: path to the input genome FASTA
genome_input = bytes(record.seq)  # find_genes() accepts bytes or str

predictions = orf_finder.find_genes(genome_input)
with open("proteins.faa", 'w', encoding="utf-8") as f:
    for i, gene in enumerate(predictions):
        f.write(f">{record.id}_{i+1}\n")
        f.write(gene.translate() + '\n')
  2. If they are equivalent, is this performance difference expected? Maybe it is related to something in my setup. If applicable, an explanation of the performance differences between the library and the binary would be a good addition to the documentation.

Suggestions

Summary

  1. The documentation and the paper are well written and contain all the relevant information.

  2. Setup works nicely.

  3. Expected performance gains are observable when comparing the Pyrodigal executable against the original Prodigal.

  4. I have a few questions about library performance in the Tests section. These correspond to the remaining checkboxes in my review.

The project is solid overall. My review status is minor revisions.

althonos commented 2 years ago

Hi @Gab0 , thanks for the review! A few answers / updates:

There seems to be two undocumented installation methods:

- The AUR package named `python-pyrodigal`
- Building from the git source

I added the AUR installation method recently, which is why it was not yet documented; after your review I updated the documentation to mention it, so it will be rendered on the website with the next release. Installing from source is not covered in the README.md (because it is less convenient than PyPI / Bioconda), but there is already a section about it in the [Install page](https://pyrodigal.readthedocs.io/en/stable/install.html) of the documentation.

I wrote a quick comparison test for Pyrodigal against the Prodigal binary. The workflow is a basic run using default parameters. The Pyrodigal binary runs much faster (1.8x here) than the original Prodigal executable, reflecting the performance improvements described in the paper.

However, in this run, calling Pyrodigal routines using the library took more time than with the binary. Execution times are 15.5s against 2.5s. Maybe I did not understand the sample code shown at the README.md and wrote wrong library calls.

I would like to ask the author a few questions about these benchmark results: Do both snippets shown below have equivalent functionality?

If they are equivalent, is this performance difference expected? Maybe it is related to something in my setup. If applicable, explanations about performance differences between library and binary would be a good addition to the documentation.

No, the two snippets are not equivalent: in the command-line invocation you ran Pyrodigal/Prodigal in single mode (which trains and predicts on the same sequence), but in the script you used meta mode (as shown by the first line, orf_finder = pyrodigal.OrfFinder(meta=True)), which predicts genes using the best-scoring set of pre-trained parameters. Meta mode is much slower: it can compare up to 50 different pre-trained parameter sets to find the best one, and although a filter on the GC% of the input often narrows this down to around 10 bins, that is still a significant increase in runtime.
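The distinction above can be sketched as follows (a hedged example, assuming the pyrodigal v1.x `OrfFinder` API with its documented `train` and `find_genes` methods):

```python
def single_mode(sequence):
    """Single mode: train on the input genome, then predict.

    This matches what the `prodigal -i genome.fna` invocation does by default.
    Assumes pyrodigal >= 1.0 is installed.
    """
    import pyrodigal
    orf_finder = pyrodigal.OrfFinder()   # note: no meta=True
    orf_finder.train(sequence)           # learn codon statistics from this genome
    return orf_finder.find_genes(sequence)

def meta_mode(sequence):
    """Meta mode: score the input against pre-trained parameter bins (slower)."""
    import pyrodigal
    orf_finder = pyrodigal.OrfFinder(meta=True)
    return orf_finder.find_genes(sequence)
```

Benchmarking `single_mode` rather than `meta_mode` should therefore give a comparison that is equivalent to the command-line run.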

The package could be improved by including complete example workflows, possibly including sample data. The code snippets provided in the README.md are basic demonstrations that require modification to be executed. They also include import statements that won't work with current package versions (e.g. import Bio.SeqIO, import multiprocessing.pool).

Pyrodigal is not meant to perform I/O, which is why I included examples with Biopython and scikit-bio, two general-purpose bioinformatics libraries that can load data from FASTA or GenBank files. multiprocessing.pool is a standard-library module, so it will work in any Python setup. I can, however, update the examples with paths to real data so that they are immediately testable.
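For readers who would rather avoid a Biopython or scikit-bio dependency altogether, a minimal standard-library-only FASTA reader (an illustrative sketch, not part of pyrodigal) is enough to feed sequences to `find_genes`:

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {record_id: sequence}, stdlib only.

    Record IDs are taken as the first whitespace-separated token after '>'.
    """
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # flush the previous record
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)  # flush the last record
    return records
```

A sequence loaded this way can then be passed straight to `pyrodigal.OrfFinder(...).find_genes(...)`.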

Automatic formatters such as autopep8 could be applied to the source code, at least for the pure Python files, since I don't know whether they are effective for Cython code. That should improve code readability by enforcing line-length limits and fixing other minor formatting issues.

I have formatted the Python sources with black, but to my knowledge there are no source-code autoformatting tools for Cython. In addition, I'd rather not reformat the Cython code: some code blocks have been ported from the C code of Prodigal, and although they don't look very nice, keeping a 1-to-1 match between the Prodigal code and the Pyrodigal code for these sections improves readability when comparing the two, and makes the code more future-proof if the Prodigal code changes in the future (because diffing will be easier).

Gab0 commented 2 years ago

@althonos Cool, that covers each remaining topic.

Ok, so all my concerns have been addressed. My suggestions about working examples and documentation updates can be implemented in future releases if the author feels they will improve the package.

My review is completed, and I endorse this submission for publication at JOSS. (@mikldk)

althonos commented 2 years ago

Thanks for your reviews @Gab0 and @standage !

mikldk commented 2 years ago

@Gab0, @standage: Thank you very much for your effort with the very constructive reviews.

@althonos:

althonos commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

althonos commented 2 years ago

Hi @mikldk ,

I have now made a new stable release (v1.0.0) and finalized the paper. The DOI for the v1.0.0 version on Zenodo is 10.5281/zenodo.6472583.

althonos commented 2 years ago

Please verify that the archive deposit has the correct metadata (title and author list), or edit these if that is not the case.

Does this mean the archive needs to be named the same as the JOSS paper title?

EDIT: Guess I do need to; I've updated the Zenodo record accordingly.

mikldk commented 2 years ago

@editorialbot set 10.5281/zenodo.6472583 as archive

editorialbot commented 2 years ago

Done! Archive is now 10.5281/zenodo.6472583

mikldk commented 2 years ago

@editorialbot set v1.0.0 as version

editorialbot commented 2 years ago

Done! version is now v1.0.0

mikldk commented 2 years ago

@editorialbot check references

editorialbot commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1038/s41592-018-0046-7 is OK
- 10.1145/1250734.1250746 is OK
- 10.1093/nar/gkz1002 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1101/2021.05.03.442509 is OK
- 10.1093/nargab/lqaa109 is OK
- 10.1101/2021.07.21.453177 is OK
- 10.1093/bioinformatics/btz714 is OK

MISSING DOIs

- None

INVALID DOIs

- None
mikldk commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mikldk commented 2 years ago

@althonos Thanks. A few comments:

Please change and recompile (no need to make a new release, just update Zenodo meta data and the paper).

althonos commented 2 years ago

@mikldk

Thanks, done.

althonos commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mikldk commented 2 years ago

@editorialbot recommend-accept

editorialbot commented 2 years ago
Attempting dry run of processing paper acceptance...
editorialbot commented 2 years ago

:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/3171

If the paper PDF and the deposit XML files look good in https://github.com/openjournals/joss-papers/pull/3171, then you can now move forward with accepting the submission by compiling again with the command @editorialbot accept

danielskatz commented 2 years ago

@editorialbot accept

looks good!

editorialbot commented 2 years ago
Doing it live! Attempting automated processing of paper acceptance...
editorialbot commented 2 years ago

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

editorialbot commented 2 years ago

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited :point_right: https://github.com/openjournals/joss-papers/pull/3172
  2. Wait a couple of minutes, then verify that the paper DOI resolves https://doi.org/10.21105/joss.04296
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? Notify your editorial technical team...

althonos commented 2 years ago

Thank you all! This was really a pleasant process and I'll definitely consider JOSS again when I have something else suitable for publication.

danielskatz commented 2 years ago

Congratulations to @althonos (Martin Larralde)!!

And thanks to @Gab0 and @standage for reviewing, and to @mikldk for editing! We couldn't do this without you

editorialbot commented 2 years ago

:tada::tada::tada: Congratulations on your paper acceptance! :tada::tada::tada:

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](https://joss.theoj.org/papers/10.21105/joss.04296/status.svg)](https://doi.org/10.21105/joss.04296)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.04296">
  <img src="https://joss.theoj.org/papers/10.21105/joss.04296/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: https://joss.theoj.org/papers/10.21105/joss.04296/status.svg
   :target: https://doi.org/10.21105/joss.04296

This is how it will look in your documentation: a small "DOI" badge linking to the paper.

We need your help!

The Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us, please consider doing either one (or both) of the following: