[REVIEW]: EUKulele: Taxonomic annotation of the unsung eukaryotic microbes

whedon commented 4 years ago

Submitting author: @akrinos (Arianna Krinos)
Repository: https://github.com/AlexanderLabWHOI/EUKulele
Version: v1.0.2b
Editor: @will-rowe
Reviewer: @johanneswerner, @jcmcnch
Archive: 10.5281/zenodo.4422091

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c"><img src="https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c/status.svg)](https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@johanneswerner & @jcmcnch, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @will-rowe know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @johanneswerner

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@akrinos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @jcmcnch

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@akrinos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @johanneswerner, @jcmcnch it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 4 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.7287/peerj.preprints.27295v1 is OK
- 10.1038/s41564-018-0176-9 is OK
- 10.1038/s41467-017-02342-1 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.5281/zenodo.1476236 is OK
- 10.1051/0004-6361/201629272 is OK
- 10.1051/0004-6361/201322068 is OK

MISSING DOIs

- 10.1142/s0219720012500151 may be a valid DOI for title: Metagenomic taxonomic classification using extreme learning machines
- 10.1038/ncomms11257 may be a valid DOI for title: Fast and sensitive taxonomic classification for metagenomics with Kaiju
- 10.1111/1755-0998.13147 may be a valid DOI for title: A metagenomic assessment of microbial eukaryotic diversity in the global ocean
- 10.1038/ismej.2015.30 may be a valid DOI for title: Metatranscriptomic census of active protists in soils
- 10.1038/nrmicro.2016.160 may be a valid DOI for title: Probing the evolution, ecology and physiology of marine protists using transcriptomics
- 10.1093/database/baaa051 may be a valid DOI for title: SAGER: a database of Symbiodiniaceae and Algal Genomic Resource
- 10.1111/jpy.12529 may be a valid DOI for title: Robust Dinoflagellata phylogeny inferred from public transcriptome databases
- 10.1016/j.tim.2018.10.009 may be a valid DOI for title: Are we overestimating protistan diversity in nature?
- 10.1093/nar/gks1160 may be a valid DOI for title: The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote small sub-unit rRNA sequences with curated taxonomy
- 10.1016/j.tree.2014.03.006 may be a valid DOI for title: The others: our biased perspective of eukaryotic genomes
- 10.1371/journal.pbio.2005849 may be a valid DOI for title: EukRef: Phylogenetic curation of ribosomal RNA to enhance understanding of eukaryotic diversity and distribution
- 10.1093/gigascience/giy158 may be a valid DOI for title: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes
- 10.1007/978-3-319-61510-3_4 may be a valid DOI for title: Functional analysis in metagenomics using MEGAN 6
- 10.1007/978-1-4939-3369-3_13 may be a valid DOI for title: MG-RAST, a metagenomics service for analysis of microbial community structure and function
- 10.1016/j.gpb.2015.08.003 may be a valid DOI for title: The Tara Oceans project: new opportunities and greater challenges ahead
- 10.1038/sdata.2017.203 may be a valid DOI for title: The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans
- 10.1093/bioinformatics/btw445 may be a valid DOI for title: SWORD—a highly efficient protein database search
- 10.1038/nmeth.3176 may be a valid DOI for title: Fast and sensitive protein alignment using DIAMOND
- 10.1101/2020.06.30.180687 may be a valid DOI for title: EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life
- 10.1371/journal.pone.0016342 may be a valid DOI for title: How and why DNA barcodes underestimate the diversity of microbial eukaryotes
- 10.1038/ncomms12860 may be a valid DOI for title: Adaptive radiation by waves of gene transfer leads to fine-scale resource partitioning in marine microbes
- 10.1111/gcb.12983 may be a valid DOI for title: Bridging the gap between omics and earth system science to better understand how environmental change impacts marine microbes
- 10.1098/rstb.2015.0331 may be a valid DOI for title: Censusing marine eukaryotic diversity in the twenty-first century
- 10.1007/978-3-030-38281-0_12 may be a valid DOI for title: Eukaryotic Pangenomes
- 10.1038/nature12221 may be a valid DOI for title: Pan genome of the phytoplankton Emiliania underpins its global distribution
- 10.1128/aem.01541-09 may be a valid DOI for title: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities

INVALID DOIs

- None

whedon commented 4 years ago

:wave: @jcmcnch, please update us on how your review is going.

whedon commented 4 years ago

:wave: @johanneswerner, please update us on how your review is going.

johanneswerner commented 3 years ago

Very interesting software package for the analysis of eukaryotes in metagenomes and metatranscriptomes. I like the focus of this tool and the well-written article and documentation, especially the very comprehensive documentation including all explanations and citations.

I have a few comments that might still be addressed.

installation
- [x] I don't know if this can be improved, but the installation via conda (as described here), takes a lot of time.
documentation
- [x] the links :ref:documentation and :ref:Parameters are not working in running-eukulele.rst
- [x] databaseandconfig.rst: there are four databases, not three
minimal working example:
- [x] the minimal working example returns errors with BUSCO

EUKulele --config curr_config.yaml 

Running EUKulele with entries from the provided configuration file.
No BUSCO file specified/found; using argument-specified organisms and taxonomy for BUSCO analysis.
Setting things up...
Found database folder for reference_DIR in current directory; will not re-download.
Creating a diamond reference from database files...
Aligning to reference database...
['samples_MAGs/sample_2.faa', 'samples_MAGs/sample_1.faa', 'samples_MAGs/sample_0.faa']
Aligning sample sample_2...
Aligning sample sample_1...
Aligning sample sample_0...
Diamond process exited for sample sample_2.
Diamond process exited for sample sample_1.
Diamond process exited for sample sample_0.
Performing taxonomic estimation steps...
Performing taxonomic visualization steps...
Performing taxonomic assignment steps...
Performing BUSCO steps...
Configuring BUSCO...
Running busco with 2 simultaneous jobs...
BUSCO error log:
Traceback (most recent call last):

  File "/home/ubuntu/miniconda3/envs/EUKulele/bin/busco_configurator.py", line 15, in <module>

    for line in open(sys.argv[1]):

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.local/bin/../config/config.ini'

sed: can't read test_out_23July/busco/config_sample_1.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_1.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_1.ini: No such file or directory

ERROR:  Config file test_out_23July/busco/config_sample_2.ini cannot be found

ERROR:  BUSCO analysis failed !

ERROR:  Check the logs, read the user guide, and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

BUSCO output log:
python3 busco_configurator.py /home/ubuntu/.local/bin/../config/config.ini test_out_23July/busco/config_sample_1.ini

INFO:   ***** Start a BUSCO v4.1.4 analysis, current time: 11/17/2020 10:19:54 *****

INFO:   Configuring BUSCO with test_out_23July/busco/config_sample_2.ini

BUSCO error log:
ERROR:  Config file test_out_23July/busco/config_sample_1.ini cannot be found

ERROR:  BUSCO analysis failed !

ERROR:  Check the logs, read the user guide, and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

BUSCO output log:
INFO:   ***** Start a BUSCO v4.1.4 analysis, current time: 11/17/2020 10:19:54 *****

INFO:   Configuring BUSCO with test_out_23July/busco/config_sample_1.ini

BUSCO error log:
Traceback (most recent call last):

  File "/home/ubuntu/miniconda3/envs/EUKulele/bin/busco_configurator.py", line 15, in <module>

    for line in open(sys.argv[1]):

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.local/bin/../config/config.ini'

sed: can't read test_out_23July/busco/config_sample_0.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_0.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_0.ini: No such file or directory

ERROR:  Config file test_out_23July/busco/config_sample_0.ini cannot be found

ERROR:  BUSCO analysis failed !

ERROR:  Check the logs, read the user guide, and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

BUSCO output log:
python3 busco_configurator.py /home/ubuntu/.local/bin/../config/config.ini test_out_23July/busco/config_sample_0.ini

INFO:   ***** Start a BUSCO v4.1.4 analysis, current time: 11/17/2020 10:19:55 *****

INFO:   Configuring BUSCO with test_out_23July/busco/config_sample_0.ini

[] is what is in BUSCO directory
BUSCO initial run did not complete successfully.
Please check the BUSCO run log files in the log/ folder.

tests
- [x] pytest tests/ returns one failed test

______________________________________________________ ERROR collecting tests/setupanddownload/test_database.py ______________________________________________________
ImportError while importing test module '/data/EUKulele/tests/setupanddownload/test_database.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/miniconda3/envs/EUKulele/lib/python3.6/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/setupanddownload/test_database.py:8: in <module>
    from EUKuleleconfig import *
E   ModuleNotFoundError: No module named 'EUKuleleconfig'

code quality
- [x] running pylint (pylint $(git ls-files '*.py')) on the repository returns a score of 3.07. I would try to get the pylint score to >=8 (and most warnings are easy to fix).
minor comments
- [x] https://github.com/AlexanderLabWHOI/EUKulele/issues/24
- [x] https://github.com/AlexanderLabWHOI/EUKulele/pull/26
a few comments about the manuscript
- [ ] How would someone deal with a sample that contains both prokaryotes and eukaryotes? Is it also possible to analyze prokaryotes or is this not desired (if it is possible, how)?
- [x] Figure 1: try to avoid arrows that are overlapping the text
- [x] The manuscript lists three databases, however in the documentation four databases are listed (EukZoo is missing in the manuscript, see also in databaseandconfig.rst)
- [x] benchmarking: how long does EUKulele run depending on the size of a dataset/number of sequences? can it be estimated how long a dataset of a certain size will run?
- [x] majority of the references do not have dois linked
- [ ] state of the field: EUKulele is compared to the tools MEGAN and MG-RAST which are as far as I know mostly used for the analyses of prokaryotes (and I believe mostly for metagenomics) - maybe there are better tools to compare EUKulele with

will-rowe commented 3 years ago

Great - thanks @johanneswerner!

Can you please respond to these comments when you get the chance @akrinos.

@jcmcnch - can you let us know how you are getting on please?

akrinos commented 3 years ago

@johanneswerner Thank you so much for the very helpful review!

I will respond to what I have responses for thus far and update as additional comments are addressed.

Installation: The conda installation indeed is quite slow - I am hoping to go through the process of adding it to the bioconda channel after publishing the paper, and am hopeful that that will provide a speedup over my user channel.
Documentation: Thank you for your helpful edits; I have merged those in. Thank you also for pointing out the link issues, which should now be fixed (some of the labels are still a bit awkward and I will fix those, but I believe the links themselves have been converted).
Minimal working example: I will need to explore your BUSCO error. I have not been working in BUSCO 4.1.4 previously. It may be something where I need to specify that the older version of BUSCO
Tests: this test is defunct and no longer run by Travis in that form; I have removed it from the tests folder
Code quality: Working through the formatting issues! Will let you know once I have the numbers up
I have merged both of the pull requests associated with the suggestions, as well as fixed the documentation inconsistency recommended by another user of the repository

Other Questions

One of our databases, PhyloDB, actually includes prokaryotes, so that is one option, but if prokaryotes were your group of interest, you would probably want to include your own database that is more complete. Beyond that, though, EUKulele should work fine on such a sample, although it has specific things built in (e.g. the databases we've chosen) that are tailored towards eukaryotes.
I will modify the flowchart to fix this and update this thread when that is done
EukZoo was a recent addition; it is not tested on Travis yet, so apologies for the inconsistencies in where it is included!
Benchmarking is tricky because it's heavily dependent on available memory. However, at a given memory allocation, I can update you with some numbers on estimated runtimes
Working on linking the DOIs! I saw that in the initial check on the repository

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

PDF failed to compile for issue #2817 with the following error:

Error reading bibliography file paper.bib: (line 461, column 3): unexpected "b" expecting space, ",", white space or "}" Looks like we failed to compile the PDF

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

akrinos commented 3 years ago

As an update to the above:

DOIs are now included for all references for which DOIs are available
The score returned by pylint $(git ls-files '*.py') is now 8.26/10
The flowchart arrows have been moved such that they are no longer covering text on the documentation landing page, as well as in the paper
All four default databases are now referenced in the paper
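
For anyone who wants to track this number between revisions, the score can be read from the summary line that pylint prints at the end of a run. The following is a minimal sketch, not part of EUKulele: the helper names `parse_pylint_score` and `meets_threshold` are hypothetical, and the default threshold of 8.0 is simply the target suggested in the review.

```python
import re

# Matches the summary line pylint prints at the end of a text report,
# e.g. "Your code has been rated at 8.26/10".
_SCORE_RE = re.compile(r"rated at (-?\d+(?:\.\d+)?)/10")


def parse_pylint_score(report: str) -> float:
    """Return the score found in a pylint text report (hypothetical helper)."""
    match = _SCORE_RE.search(report)
    if match is None:
        raise ValueError("no pylint score found in report")
    return float(match.group(1))


def meets_threshold(report: str, threshold: float = 8.0) -> bool:
    """True if the reported score is at least `threshold` (assumed target)."""
    return parse_pylint_score(report) >= threshold
```

The report text would come from capturing the output of the `pylint $(git ls-files '*.py')` command quoted above.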

We are working on benchmarking and addressing BUSCO-related issues. Thank you for your patience!

akrinos commented 3 years ago

@johanneswerner I was able to reproduce the error you have been getting from BUSCO when trying to run the sample_EUKulele example folder.

It is indeed related to BUSCO version 4.1.4, which stores the BUSCO configuration file in a different location than earlier versions did. I have implemented a patch, now deployed to the conda build of EUKulele, that searches for the configuration file in a different way. This solves the issue with the initial BUSCO run, but the storage location of the final BUSCO sequences will still be different due to the version change. For now, EUKulele can be run with BUSCO version 4.0.6 and Biopython 1.77. I will sort out the final versioning issues in the pip install version so that both versions work; for now, at least the BUSCO run itself functions.
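
A lookup of this kind could be sketched as follows. This is not the actual patch: the search order, the function names `candidate_config_paths` and `find_busco_config`, and the use of the `BUSCO_CONFIG_FILE` environment variable are assumptions made for illustration. Only the `../config/config.ini` location relative to the executable is taken from the error log earlier in this thread.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_config_paths(busco_exe: Optional[str]) -> List[Path]:
    """List places a BUSCO config.ini might live (assumed search order)."""
    candidates: List[Path] = []
    # Assumption: an explicit environment variable takes priority.
    env_path = os.environ.get("BUSCO_CONFIG_FILE")
    if env_path:
        candidates.append(Path(env_path))
    # The location seen in the error log: <bin dir>/../config/config.ini
    if busco_exe:
        bin_dir = Path(busco_exe).resolve().parent
        candidates.append(bin_dir.parent / "config" / "config.ini")
    return candidates


def find_busco_config(extra: Iterable[Path] = ()) -> Optional[Path]:
    """Return the first candidate that exists on disk, or None."""
    busco_exe = shutil.which("busco")
    for path in [*candidate_config_paths(busco_exe), *extra]:
        if path.is_file():
            return path
    return None
```

Returning `None` rather than raising lets the caller decide whether a missing configuration file should stop the run or only skip the BUSCO step.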

Thanks again for your patience!

akrinos commented 3 years ago

@johanneswerner we have provided a graphic below showing how long a full EUKulele run takes, in minutes, using the DIAMOND alignment tool on sequence files of various sizes. Note that this is for metatranscriptomic (MET) sequences with or without the TransDecoder tool for translation, colored by the two database selections (MMETSP and PhyloDB).

We have also uploaded pip- and conda-installable revisions of EUKulele that address the issue you encountered with the latest BUSCO version. In the recently uploaded version 1.0.1, these issues should be resolved, and you should be able to fully execute the small test example that you previously reported failing above.

Thank you!

johanneswerner commented 3 years ago

Dear @akrinos

thank you for your updates. I think I checked the above checkboxes that are taken care of (if I forgot something, please let me know).

Unfortunately, I still encountered errors with the minimum example (run.log) and some of the tests also throw errors on my virtual instance (tests.log). Could you please have a look at them?

Thank you very much for your effort, especially the benchmarking is very interesting.

akrinos commented 3 years ago

Hmm, it looks like you're still getting the same error, which is most likely the cause of the failed tests as well (although I haven't looked carefully at each failure). Did you reinstall via conda @johanneswerner? From the error, it looks like it is defaulting to the BUSCO install that you have locally, rather than a BUSCO install via conda. Could you please try running EUKulele --version? The problem is also in the included scripts run_busco.sh and concatenate_busco.sh, so printing the result of cat $(which run_busco.sh) and cat $(which concatenate_busco.sh) to a file would also help me verify that the fix I added is present in the files that your install is pulling. One problem I had was needing to remove prior installs.
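
The same check can be done in one step by asking where each executable resolves on the PATH and whether that location is inside the active environment. A minimal sketch, assuming the scripts are installed under the names used above; the function name `resolve_tools` is hypothetical, and comparing against `sys.prefix` is only a rough proxy for "installed in this conda environment".

```python
import shutil
import sys
from pathlib import Path

TOOLS = ("EUKulele", "busco", "run_busco.sh", "concatenate_busco.sh")


def resolve_tools(names=TOOLS):
    """Map each tool name to (resolved path or None, is inside sys.prefix)."""
    env_prefix = Path(sys.prefix).resolve()
    report = {}
    for name in names:
        found = shutil.which(name)
        if found is None:
            # Not on PATH at all.
            report[name] = (None, False)
            continue
        path = Path(found).resolve()
        # True when the executable lives under the active environment.
        report[name] = (str(path), env_prefix in path.parents)
    return report


if __name__ == "__main__":
    for tool, (path, in_env) in resolve_tools().items():
        print(f"{tool}: {path} (inside active environment: {in_env})")
```

A tool that resolves outside the active environment, such as under `~/.local/bin`, would point to a leftover local install taking precedence over the conda one.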

If this continues to be an issue, I suppose we should move to another thread per the guidelines. Thanks for your persistence!

johanneswerner commented 3 years ago

My apologies @akrinos, I pulled the git repository for the tests, but I forgot to reinstall via conda. Thank you for looking into it. :-)

Test dataset runs accordingly after reinstallation with conda, and the tests also pass. I marked the respective checkboxes above.

akrinos commented 3 years ago

Thank you @johanneswerner! Did you still have one failed test as above (checkbox in initial review)? With regard to the last two remaining checkboxes: for the analysis of prokaryotes, as mentioned in 729306579, we have one default database that includes prokaryotes, and users can generally curate their own datasets that include prokaryotes; we have simply tailored the tool to eukaryotes. As for other software to compare ours to, one other tool I found was CCMetagen, published earlier this year. This tool identifies eukaryotes in metagenomic samples, but is not for metatranscriptomes and only uses the NCBI database. It might be useful to point out how our approach is different from this one, which also compares itself to MEGAN. If it helps, I could include both of these explanations in either the text or the documentation, whichever seems more helpful. I think other than that, everything from your review has been addressed.

Thanks again!

will-rowe commented 3 years ago

Thank you for your comprehensive review @johanneswerner - this is shaping up nicely.

Pinging @jcmcnch - are you still able to review this submission? Please let us know either way ASAP

jcmcnch commented 3 years ago

Hi @will-rowe @akrinos sorry for not getting back to you both sooner with this. I have been busy until recently and had unsubscribed from notifications (because I was getting about a dozen notifications from JOSS daily from unrelated reviews - perhaps something can be done by JOSS to prevent this). I am back on the case now, and will provide my comments ASAP, by the end of this week at the latest.

jcmcnch commented 3 years ago

@akrinos, I just ran the test suite by providing the yaml file as you describe, and it seems to work fine. The errors mentioned above by Johannes seem to be fixed, except that BUSCO generates no output. Is this expected? I checked the logs as recommended by the text printed to the screen, but they were all empty. Here's the output I got:

(EUKulele) jesse@kraken:~/EUKulele-review/sample_EUKulele$ EUKulele --config curr_config.yaml
Running EUKulele with entries from the provided configuration file.
No BUSCO file specified/found; using argument-specified organisms and taxonomy for BUSCO analysis.
Setting things up...
Found database folder for reference_DIR in current directory; will not re-download.
Creating a diamond reference from database files...
Aligning to reference database...
Aligning sample sample_2...
Aligning sample sample_0...
Aligning sample sample_1...
Diamond process exited for sample sample_2.
Diamond process exited for sample sample_1.
Diamond process exited for sample sample_0.
Performing taxonomic estimation steps...
Performing taxonomic visualization steps...
Performing taxonomic assignment steps...
Performing BUSCO steps...
Configuring BUSCO...
Running busco with 1 simultaneous jobs...
[] is what is in BUSCO directory
BUSCO run either did not complete successfully, or returned no matches for sample sample_2 . Check busco_run log for details.
BUSCO run either did not complete successfully, or returned no matches for sample sample_0 . Check busco_run log for details.
BUSCO run either did not complete successfully, or returned no matches for sample sample_1 . Check busco_run log for details.
No BUSCO matches found for any sample.  Check BUSCO run log for details. Exiting...
EUKulele run complete

akrinos commented 3 years ago

Hi @jcmcnch - sorry it has taken me a bit to get back to you. I have been trying to reproduce this, and haven't been able to. You are using the sample_EUKulele directory, right? This is what I expect to be printed:

Running busco with 1 simultaneous jobs...
['logs', 'short_summary.specific.eukaryota_odb10.sample_1.txt', 'run_eukaryota_odb10'] is what is in BUSCO directory
At least one BUSCO present in sample sample_1 but 250 missing.
At least one BUSCO present in sample sample_0 but 241 missing.
At least one BUSCO present in sample sample_2 but 245 missing.

Could you tell me (1) what EUKulele --version returns and (2) the contents of your sample directory (it should be samples_MAGs if you're using the tutorial) and BUSCO directory in the output folder via ls? Thanks!!

jcmcnch commented 3 years ago

Hi @akrinos and other coauthors, again sorry for the delay in replying. I've had some time to properly "test drive" EUKulele, and now feel comfortable summarizing my impressions as part of the review. I've noticed from your interactions with Johannes that this process seems quite interactive, so I do hope we can discuss further in this thread. As I mentioned to Will at the beginning of this review process, I'm somewhere closer to the naive end user and less of a software developer, so I will concentrate more on how I see your tool being used. These comments come from someone who is very interested in, but less knowledgeable about, these "unsung" EUKs, so it's kind of an outsider's view.

Overall comments:

From the perspective of microbial oceanography, there seems to be a really strong cultural divide between people studying PROKs and people studying EUKs, despite the fact that the organisms in question interact in a larger system. So any effort to try and bridge this divide is really worthwhile scientifically, and I think your tool and approach are a promising way to begin this effort. From my own perspective, I had known about the MMETSP but was less confident about finding and downloading the data. With EUKulele, having the database automatically downloaded is already very helpful, and on top of that, knowing that I'm getting a high-quality curated version of the MMETSP from experts in the field is really reassuring. I also really appreciate all the work that has gone into making EUKulele useful with multiple databases and bioinformatic methods, and into providing visualizations.

My main feedback falls into two areas - 1) clarity of writing, code, and visualizations and 2) caveats applying your approach to mixed PROK/EUK metatranscriptomes. For 1), I will provide detailed comments further below, but a more general comment is that some broader concepts can be clarified for the benefit of those less familiar with eukaryotic work. For example, it took me a bit of time to understand what you mean by transcriptome. From the PROK side of the fence, transcriptomes are most often just something you map to your MAGs/contigs to get at expression but I recognize this is something quite different for EUKs - it's basically your metagenome. Clarifying this subtle distinction in the text might help people less familiar with your field understand this. I did see your warning in the readthedocs documentation about using EUK metagenomes which alludes to this issue but I think it could be further clarified and explained in a more prominent location. Otherwise, I think your already extensive documentation could be improved by some re-organization and re-focusing which I'll try to detail in the sections below. Also, I noted that metagenomes/MAGs were used somewhat interchangeably so this can also be clarified to explain what you mean.

Point 2) is a bit more getting at the real-world usage of the software, and a potential pitfall I see when your workflow is employed by a naive end user. It's related to Johannes' question about mixed PROK/EUK communities. To test this, I downloaded a transcriptome assembly from this paper, which can be found here if you want to play with it yourself. Data were downloaded from IMG. From the phyloDB results generated by EUKulele, this is clearly a mixed PROK/EUK transcriptome:

[Figure: supergroup_transcripts]

My main concern is this is not reflected in the MMETSP results. Things that are clearly bacterial contigs (e.g. scaffold_10004_c1 which is a roseobacter) are annotated as EUKs (in this case, as a diatom). The tabular output (i.e. output/taxonomy_estimation/*taxonomy.out) would not give a user an idea that this is the case - the column for "max_pid" says 86.5% in this contig's case (100% for the phyloDB output file), so I wouldn't have assumed there was an issue unless I knew a priori that this sample would be a mixed EUK/PROK transcriptome assembly. How would you address this with your pipeline or in the paper/documentation? Do you mostly work with poly-A tailed transcriptomes in your own work where this would be less of an issue? Or is there a way you could think to address this? Could there be a pre-filtering step to split PROK and EUK? Or could another column be provided to the user to identify these potential issues?
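To make the "another column" idea concrete, here is a minimal sketch of how contigs could be flagged when a eukaryote-only database and a mixed database disagree. It assumes each taxonomy table has already been reduced to a mapping from contig name to its top-level label; the function name, the mapping layout, and the label strings are hypothetical and are not EUKulele's actual output format.

```python
def flag_domain_conflicts(euk_only_calls, mixed_db_calls,
                          prok_labels=("Bacteria", "Archaea")):
    """Return (contig, euk_label, mixed_label) for every contig that the
    eukaryote-only database annotated but the mixed database assigns to
    a prokaryotic group. All labels here are illustrative assumptions."""
    flagged = []
    for contig, euk_label in sorted(euk_only_calls.items()):
        mixed_label = mixed_db_calls.get(contig)
        if mixed_label in prok_labels:
            flagged.append((contig, euk_label, mixed_label))
    return flagged


# Illustrative input only: one contig mirrors the roseobacter case above.
mmetsp_calls = {"scaffold_10004_c1": "diatom", "scaffold_2_c1": "dinoflagellate"}
phylodb_calls = {"scaffold_10004_c1": "Bacteria", "scaffold_2_c1": "Eukaryota"}
conflicts = flag_domain_conflicts(mmetsp_calls, phylodb_calls)
```

A flag like this would not fix the annotation, but it would tell a user which rows of the tabular output deserve a second look.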

More specific comments

Installation and testing:

This proceeded well after updating conda, but as I mentioned in the bug report you may consider providing mamba as an alternative install method since it's much faster in general.

[ ] Optional: provide mamba as alternative install method

The test suite worked without errors, except the BUSCO error I raised above. FYI I didn't install with pip as mentioned in the readthedocs section, but ran it as recommended:

EUKulele --config curr_config.yaml

In response to your question @akrinos :

Version:

(EUKulele) jesse@kraken:~/EUKulele-review/sample_EUKulele$ EUKulele --version
Running EUKulele with command line arguments, as no valid configuration file was provided.
The current EUKulele version is 1.0.1

Contents of directory:

(EUKulele) jesse@kraken:~/EUKulele-review/sample_EUKulele$ ls
busco_286409437.log   busco_327173586.log  EUKulele-env.yaml  path_test.txt    samples_MAGs
busco_2930947595.log  busco_downloads      free.csv           reference_DIR    tax-cutoffs.yaml
busco_3040169547.log  curr_config.yaml     output-test.txt    references_bins  test_out_23July

Software usage:

Things were quite smooth. Database downloads proceeded as expected, and ran without errors on real-world MT data described above. I tested it with phyloDB and MMETSP using default settings on the MBARI bloom metatranscriptome assembly mentioned above.

Specific suggestions:

[ ] I noticed dropbox is being used for the databases, which doesn't seem like a good long-term storage location. Suggest moving to a repo such as OSF/Zenodo where you can provide some additional documentation.
[ ] Related to this, databases should have some README or info about the version, preprocessing, citation info, etc so this can be easily accessed and cited whenever someone uses your tool. I know this is mentioned in the readthedocs (i.e. Sarah's github repo) but it would be helpful to provide this as a plaintext README as well.
[ ] The help file (output of EUKulele --help) is a little messy. I suggest reorganizing it to group like parameters together (e.g. CPU and RAM usage), checking for typos, and removing subroutines if they are not being used
[ ] For visualizations, there are some things that could be improved, like making sure legends do not overlap with barplots, removing or improving uninformative labels ('OfInterest')
[ ] Taxonomic output visualization files should have a numerical prefix that reflects the depth of each particular taxonomic level to make it easier to understand and scroll through
[ ] Could busco log files be automatically moved to the output folder?
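The numerical-prefix suggestion above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes the caller supplies the taxonomic levels already ordered from broadest to most specific.

```python
def add_depth_prefixes(ordered_levels):
    """Prefix each level name with its 1-based, zero-padded depth so that
    files sort from the broadest to the most specific taxonomic level."""
    return [f"{depth:02d}_{level}"
            for depth, level in enumerate(ordered_levels, start=1)]


prefixed = add_depth_prefixes(["supergroup", "class", "genus"])
```

Zero-padding keeps the order stable in a file browser even if a database defines ten or more levels.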

Writing (paper and readthedocs):

[ ] Please explain more explicitly why your pipeline is reproducible and flexible. You mention these two words and I can guess what you mean but I think it should be stated more clearly.
[ ] As mentioned above, the "warning note" about pitfalls using metagenomes should be more prominent in the documentation and explained further. I assume what you are referring to here is the difficulty of calling ORFs in genomic data when exons are interrupted by introns. Again, although this is obvious to you EUK folk it may help to explain in detail for the benefit of PROK people who tend to forget about these major wrinkles.
[ ] I think the ordering of the readthedocs could be improved. Perhaps put "About" first, "Installing" second, and consider merging "Trying out EUKulele" with "EUKulele Quick Start" and putting both after "Installing"
[ ] Consider putting the configuration file information in a section on its own and explain more what a yaml file is to the uninitiated and how it could be used to customize all parameters, not just taxonomic cutoffs.
[ ] I agree with Johannes that more comparison with tools such as MEGAN and CCMetagen would be useful to orient the reader and place your work in a broader context
[ ] There is some redundancy in the text on readthedocs. You could try to pare this down to make it easier for the reader, and mirror some of the bigger picture motivation and context here as well. For example, the first paragraph in "About" is more about functionality than the bigger picture and may be a bit redundant with the "Functionality" paragraph

Other than that, great job, thanks for sharing this powerful tool and your expert knowledge with the whole community. Looking forward to discussing more about the PROK/EUK mixed transcriptome assembly issue and how you see this being potentially addressed.

Jesse

will-rowe commented 3 years ago

Thanks for another comprehensive review @jcmcnch - there is plenty for you to mull over @akrinos! I'm particularly interested in the second of their points re. (mis)reporting of prokaryotic annotations. @jcmcnch has given several helpful suggestions to address this; at a minimum I'd like a comment in the documentation. In my view, it would make sense for the user to first bin their MAGs (or alternative input) into EUK and PROK before running EUKulele - but this functionality would be great to have in your software.

Please let us know your responses to the second review. We are definitely well on track here. I also have a couple of things for you to address in the paper:

Please give full author affiliations/addresses.
The summary is quite long and has some large whitespace (not really your fault, just awkward figure placement by the renderer). In my opinion, the summary should be more of an abstract/single paragraph and followed up quickly by your statement of need. I feel this makes the paper easier to read and I can get to why your software is amazing more quickly. That being said, it is your paper and it fits our publication criteria.
As a follow-on point, it would be great to have a bit more detail on the specifics of your implementation. I would suggest taking what you already have (currently within your summary section) and fleshing it out under a new "implementation" section (or similar). Take a look at some recent JOSS papers for inspiration.

Only the first of these points is a requirement from me!

Cheers,

Will

akrinos commented 3 years ago

Thank you both for your very helpful comments! We are working on addressing the eukaryote/prokaryote mislabeling issue by generating a default database that contains the MMETSP taxonomy (the taxonomy we feel is better to use as a default in most cases for eukaryotic organisms) and also prokaryotic sequences and a domain level to distinguish between the two. We will include the MMETSP alone as an option, as we are a bit biased towards poly-A-selected samples for which you would likely only need to quickly check with a database like PhyloDB for whether contigs were preferentially mapping to bacteria, after which point the MMETSP would be the database of choice.

We will certainly adjust the organization of the summary section somewhat! For the affiliations, is adding the city/state/country names sufficient beyond what we have? It looks like that is what is included in the JOSS papers I looked at. More comments to come later today in regard to addressing more of the housekeeping issues with the repository as well. Thanks again!

will-rowe commented 3 years ago

Thanks for the speedy response @akrinos! Yes - just add the city and country please.

Keep us posted on when you are ready for us to take another look at the submission.

akrinos commented 3 years ago

@jcmcnch thanks so much again for your very thorough and informative review! I have responded to some of the points you raised below, and will complete the process of addressing them and provide a new release tomorrow.

In response to the major issue with the use of MMETSP resulting in prokaryotic sequences being labeled as eukaryotic, I have added a new database to the default options, a combination of MarRef and the MMETSP. This is to enable prokaryotes to be identified, but also to use our preferred eukaryotic sequences. Here is a comparison of the output for each database, MMETSP, PhyloDB, and MarRef, using the sample dataset that you provided:

This also involves adding a separate "Domain" level to the MMETSP and MarRef. For now, I have made the software flexible, such that it will accept a number of labeling options from your database (Domain, Supergroup, Kingdom, Phylum, Class, Order, Family, Genus, Species, with the potential for more to be added), and arrange them based on expected taxonomic ordering. In the future, I plan to relabel the top level of PhyloDB to be "Domain" as well, since it makes more sense to use that label for the highest taxonomic level in PhyloDB, rather than using the MMETSP's "Supergroup" name.
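As a rough illustration of arranging database labels by expected taxonomic ordering, the sketch below sorts whatever labels a database provides against the level names listed above. The function name and the handling of unrecognized labels (kept at the end, in their original order) are assumptions for illustration, not a description of how EUKulele does it.

```python
# Level names taken from the list above, broadest first.
EXPECTED_ORDER = ["domain", "supergroup", "kingdom", "phylum", "class",
                  "order", "family", "genus", "species"]


def arrange_levels(labels):
    """Sort recognized labels by expected taxonomic rank (case-insensitive);
    append unrecognized labels afterwards in their original order."""
    rank = {name: i for i, name in enumerate(EXPECTED_ORDER)}
    known = [label for label in labels if label.lower() in rank]
    unknown = [label for label in labels if label.lower() not in rank]
    return sorted(known, key=lambda label: rank[label.lower()]) + unknown


arranged = arrange_levels(["Genus", "Domain", "Class"])
```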

So, the issue persists if you use the MMETSP on prokaryotic sequences - there tend to be spurious matches at the supergroup level, since the percent identity cutoff is quite low at this broad level, but we think it's important to retain the MMETSP option if you wish to only consider eukaryotic matches. However, we will add an additional warning to the documentation, and MarRef-MMETSP has become the default, to avoid this occurring naively. Note that when using the MMETSP reference, many more are "unclassified" at more specific taxonomic levels than when using PhyloDB.

Installation and testing

We have included a note about using mamba to install EUKulele on the installation page
As far as the BUSCO issue, one more question - you installed with conda?

Software usage

We have added the newest database, which contains both eukaryotic and prokaryotic sequences, to Zenodo. The reason we didn't do this before is/was twofold (1) we need to discuss the matter with some of the owners of the databases (e.g. PhyloDB, which is currently hosted on Google Drive and has not been formally published) and (2) using Zenodo requires that an additional dependency, zenodo-get, be included, which somewhat complicates the conda install. However the newly-added database should work fine with Zenodo as of now.
In progress
Subroutine is something we intend to be able to be used, so I've left that in - I will be tweaking the help more shortly and it will be updated in the new release
I removed the legend title and made it so that the legend is outside of the plot
This is a great suggestion! Numerical prefixes will also be included in the next release
Do you mean to have the BUSCO log files specifically outside of the log folder?

Thanks again, and more soon!

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

akrinos commented 3 years ago

Again, thank you @jcmcnch and @will-rowe for your help!

As a follow-on: Writing (paper and readthedocs)

I have added a sentence of explanation in the about section of the readthedocs that further elaborates on reproducibility and flexibility
I have added a note to the section of "Running EUKulele" that contains this warning, explicitly saying that it is particularly important due to the presence of introns in eukaryotes. I have also added a link to this page of the documentation on the EUKulele landing page, so that users arriving at the documentation are immediately aware of where the caveats are
Though we recommend that users read the documentation, "EUKulele Quick Start" is meant as a quick landing page that could be used solely to invoke EUKulele, if a user was not interested in further invocation details. So we consider the Quick Start to be a separate resource from any tutorials or from the installation and about pages
I have added a couple of lines to the database and configuration section about YAML files and how you would create one for the purposes of the taxonomic cutoffs. Since providing a full YAML config is not required to run EUKulele, I think that further specifics about YAML files beyond the beginner level are a little confusing to include in the main documentation. However, since users would need to create/modify a YAML file for the taxonomic cutoffs, I've clarified the documentation for that
I have added reference to both of these tools in the paper and expanded that section a bit
I have made some additional formatting changes to the readthedocs, but I expect that this will be a continually evolving process as we receive feedback from other users

As far as the issue of providing metadata for each of the databases, for now I am writing a file with each invocation of EUKulele that contains

the name of the database
the link from which the database was downloaded (if it was downloaded during that invocation)
the date/time of invocation

This provides at least a record of what was used, so that users can then go to the documentation to learn more about the databases and cite them appropriately. In the future we will endeavor to provide something more direct.
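A per-invocation record with the three fields listed above could look something like the following. This is a minimal sketch only: the function names, the JSON layout, and the field names are assumptions, not the file format EUKulele actually writes.

```python
import json
from datetime import datetime


def build_database_record(database_name, download_url=None, now=None):
    """Collect the database name, the download link (None if nothing was
    downloaded during this invocation), and the invocation time."""
    timestamp = now if now is not None else datetime.now()
    return {
        "database": database_name,
        "download_url": download_url,
        "invoked_at": timestamp.isoformat(timespec="seconds"),
    }


def write_database_record(path, record):
    """Write the record as indented JSON so it stays human-readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)


record = build_database_record("mmetsp", now=datetime(2021, 1, 5, 12, 0, 0))
```

Keeping the record as plain JSON means it can be read without the tool installed and attached to a methods section as-is.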

I have just pushed a new release to both PyPI and conda containing all of the relevant code changes. Please let me know what you think of the various changes!

jcmcnch commented 3 years ago

Hi @akrinos , thanks for your reply, this all looks great. I have just re-downloaded EUKulele using conda (and yes, with the BUSCO issue it was from a conda install but I will try again to make sure I get the same behaviour as before). I do have a few quick questions though:

Is there an easy way I can confirm I've got the latest and greatest? After doing a clean install, EUKulele --version returns 1.0.1, which is the same as before, so I'm not sure I'm getting the changes you've implemented
It looks like the readthedocs (latest) does not yet include the updates you mention above (e.g. mamba, warnings, usage instructions etc). Please clarify where I can find this info.
I can see the PDF above has been altered to include a lot more explanation and clarification. Is this the final version you want us to treat as a "response to reviewers"? BTW there are a few typos (e.g. prokaryotic spelled wrong, a couple references either missing or misspelled).

akrinos commented 3 years ago

Hi @jcmcnch, thanks so much for the feedback!

Unfortunately, I accidentally didn't update the version number on the conda side, but did on the pip side. I will fix that! But also, the easiest way to check is either to run conda update, or to check the Anaconda Cloud page. At the time that I'm writing this, our page indicates that 1.0.1 was updated ~2 days ago, so that's probably the easiest way to check the version correspondence...but this is my fault for not relabeling as 1.0.2 on conda.
I accidentally left out the landing page warning (I edited an old file by mistake!) and have just committed the mamba section. The updates on YAML and the introns were just 1-3 lines or so.
We would love to have that information included, however we know it is a little long for JOSS, so we are happy to also cut if needed. If it can be considered, that would be great, as we felt before that it would benefit from more description of the implementation. Sorry about that typo! I re-spell-checked the manuscript and that (prokaryotic) seemed to be the only spelling error...the missing citation was an unfortunate carry-over from renaming the citation. Thanks for checking both and please let me know if there are others I missed!

Thanks again!

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

will-rowe commented 3 years ago

Hi all. I hope everyone who had a break for new year had a good one.

Things are looking good here. @jcmcnch has now ticked all the required review boxes. If we could ask @johanneswerner to do the same, we can then mark this as provisionally accepted and start the ball rolling for publication.

johanneswerner commented 3 years ago

@will-rowe I checked off the missing boxes - those had already been taken care of.

will-rowe commented 3 years ago

Perfect - thanks @johanneswerner and sorry for the box ticking exercise!

will-rowe commented 3 years ago

@whedon generate pdf

will-rowe commented 3 years ago

@whedon check references

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.7287/peerj.preprints.27295v1 is OK
- 10.1038/s41564-018-0176-9 is OK
- 10.1038/s41467-017-02342-1 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.5281/zenodo.1476236 is OK
- 10.1142/S0219720012500151 is OK
- 10.1038/ncomms11257 is OK
- 10.1111/1755-0998.13147 is OK
- 10.1038/ismej.2015.30 is OK
- 10.1038/nrmicro.2016.160 is OK
- 10.1093/database/baaa051 is OK
- 10.1111/jpy.12529 is OK
- 10.1038/s41564-019-0502-x is OK
- 10.1016/j.tim.2018.10.009 is OK
- 10.1093/nar/gks1160 is OK
- 10.1016/j.tree.2014.03.006 is OK
- 10.1371/journal.pbio.2005849 is OK
- 10.1093/gigascience/giy158 is OK
- 10.1093/bioinformatics/btv351 is OK
- 10.17226/4901 is OK
- 10.1007/978-3-319-60156-4_18 is OK
- 10.1101/gr.229202 is OK
- 10.1016/j.gpb.2015.08.003 is OK
- 10.1038/sdata.2017.203 is OK
- 10.1093/bioinformatics/btw445 is OK
- 10.1038/nmeth.3176 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.1371/journal.pbio.1001889 is OK
- 10.1371/journal.pone.0016342 is OK
- 10.1016/j.cub.2017.01.017 is OK
- 10.1038/ncomms12860 is OK
- 10.1111/gcb.12983 is OK
- 10.1098/rstb.2015.0331 is OK
- 10.1007/978-3-030-38281-0_12 is OK
- 10.1038/nature12221 is OK
- 10.1038/nmeth.4197 is OK
- 10.1128/AEM.01541-09 is OK

MISSING DOIs

- 10.1007/978-3-319-61510-3_4 may be a valid DOI for title: Functional analysis in metagenomics using MEGAN 6

INVALID DOIs

- https://doi.org/10.1093/nar/gkx1036 is INVALID because of 'https://doi.org/' prefix
- https://doi.org/10.1186/s13059-020-02014-2 is INVALID because of 'https://doi.org/' prefix
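The two invalid entries above differ from valid ones only by the resolver prefix, so they can be repaired mechanically before re-running the check. The helper below is a hypothetical sketch for cleaning DOI strings in a bibliography; it is not part of whedon or EUKulele.

```python
def normalize_doi(value):
    """Strip a leading resolver URL or 'doi:' scheme, leaving the bare DOI."""
    value = value.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            return value[len(prefix):]
    return value


fixed = [normalize_doi(d) for d in (
    "https://doi.org/10.1093/nar/gkx1036",
    "https://doi.org/10.1186/s13059-020-02014-2",
)]
```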

will-rowe commented 3 years ago

Hi @akrinos

Can you please check/fix those references. Looks like the MEGAN one might not need the _4

Once you have done this, please can you tag a new release and then archive it (with zenodo or similar). Then report back here with the DOI and version.

akrinos commented 3 years ago

@whedon check references

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/nar/gkx1036 is OK
- 10.1186/s13059-020-02014-2 is OK
- 10.7287/peerj.preprints.27295v1 is OK
- 10.1038/s41564-018-0176-9 is OK
- 10.1038/s41467-017-02342-1 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.5281/zenodo.1476236 is OK
- 10.1142/S0219720012500151 is OK
- 10.1038/ncomms11257 is OK
- 10.1111/1755-0998.13147 is OK
- 10.1038/ismej.2015.30 is OK
- 10.1038/nrmicro.2016.160 is OK
- 10.1093/database/baaa051 is OK
- 10.1111/jpy.12529 is OK
- 10.1038/s41564-019-0502-x is OK
- 10.1016/j.tim.2018.10.009 is OK
- 10.1093/nar/gks1160 is OK
- 10.1016/j.tree.2014.03.006 is OK
- 10.1371/journal.pbio.2005849 is OK
- 10.1093/gigascience/giy158 is OK
- 10.1007/978-3-319-61510-3_4 is OK
- 10.1093/bioinformatics/btv351 is OK
- 10.17226/4901 is OK
- 10.1007/978-3-319-60156-4_18 is OK
- 10.1101/gr.229202 is OK
- 10.1016/j.gpb.2015.08.003 is OK
- 10.1038/sdata.2017.203 is OK
- 10.1093/bioinformatics/btw445 is OK
- 10.1038/nmeth.3176 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.1371/journal.pbio.1001889 is OK
- 10.1371/journal.pone.0016342 is OK
- 10.1016/j.cub.2017.01.017 is OK
- 10.1038/ncomms12860 is OK
- 10.1111/gcb.12983 is OK
- 10.1098/rstb.2015.0331 is OK
- 10.1007/978-3-030-38281-0_12 is OK
- 10.1038/nature12221 is OK
- 10.1038/nmeth.4197 is OK
- 10.1128/AEM.01541-09 is OK

MISSING DOIs

- None

INVALID DOIs

- None

akrinos commented 3 years ago

Hi @will-rowe (and thank you @johanneswerner!) - I have fixed the DOI issues listed above, and published a release to Zenodo here, with DOI 10.5281/zenodo.4419894 for version 1.0.2, which is also fully updated on Anaconda Cloud and PyPI. I left in the _4 for MEGAN, as that is the most specific DOI for the paper. Thank you so much for your help!

will-rowe commented 3 years ago

Good work - thanks @akrinos. I'm afraid one more thing is needed from my end - can you make sure the zenodo release has an author list that matches the author list in your paper?

akrinos commented 3 years ago

Hi @will-rowe, thanks and no problem! I couldn't figure out how to edit the author list before. I ended up having to modify it to be release 1.0.2b on Zenodo here; hopefully that's okay!

will-rowe commented 3 years ago

@whedon set 10.5281/zenodo.4422091 as archive

whedon commented 3 years ago

OK. 10.5281/zenodo.4422091 is the archive.
