openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal

[REVIEW]: EUKulele: Taxonomic annotation of the unsung eukaryotic microbes #2817

Closed whedon closed 3 years ago

whedon commented 3 years ago

Submitting author: @akrinos (Arianna Krinos) Repository: https://github.com/AlexanderLabWHOI/EUKulele Version: v1.0.2b Editor: @will-rowe Reviewer: @johanneswerner, @jcmcnch Archive: 10.5281/zenodo.4422091

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c"><img src="https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c/status.svg)](https://joss.theoj.org/papers/b6b7999944beedba3e3a4d391fd3180c)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@johanneswerner & @jcmcnch, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @will-rowe know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest.

Review checklist for @johanneswerner

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @jcmcnch

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @johanneswerner, @jcmcnch it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' at https://github.com/openjournals/joss-reviews
  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.7287/peerj.preprints.27295v1 is OK
- 10.1038/s41564-018-0176-9 is OK
- 10.1038/s41467-017-02342-1 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.5281/zenodo.1476236 is OK
- 10.1051/0004-6361/201629272 is OK
- 10.1051/0004-6361/201322068 is OK

MISSING DOIs

- 10.1142/s0219720012500151 may be a valid DOI for title: Metagenomic taxonomic classification using extreme learning machines
- 10.1038/ncomms11257 may be a valid DOI for title: Fast and sensitive taxonomic classification for metagenomics with Kaiju
- 10.1111/1755-0998.13147 may be a valid DOI for title: A metagenomic assessment of microbial eukaryotic diversity in the global ocean
- 10.1038/ismej.2015.30 may be a valid DOI for title: Metatranscriptomic census of active protists in soils
- 10.1038/nrmicro.2016.160 may be a valid DOI for title: Probing the evolution, ecology and physiology of marine protists using transcriptomics
- 10.1093/database/baaa051 may be a valid DOI for title: SAGER: a database of Symbiodiniaceae and Algal Genomic Resource
- 10.1111/jpy.12529 may be a valid DOI for title: Robust Dinoflagellata phylogeny inferred from public transcriptome databases
- 10.1093/database/baaa051 may be a valid DOI for title: SAGER: a database of Symbiodiniaceae and Algal Genomic Resource
- 10.1016/j.tim.2018.10.009 may be a valid DOI for title: Are we overestimating protistan diversity in nature?
- 10.1093/nar/gks1160 may be a valid DOI for title: The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote small sub-unit rRNA sequences with curated taxonomy
- 10.1016/j.tree.2014.03.006 may be a valid DOI for title: The others: our biased perspective of eukaryotic genomes
- 10.1371/journal.pbio.2005849 may be a valid DOI for title: EukRef: Phylogenetic curation of ribosomal RNA to enhance understanding of eukaryotic diversity and distribution
- 10.1093/gigascience/giy158 may be a valid DOI for title: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes
- 10.1007/978-3-319-61510-3_4 may be a valid DOI for title: Functional analysis in metagenomics using MEGAN 6
- 10.1007/978-1-4939-3369-3_13 may be a valid DOI for title: MG-RAST, a metagenomics service for analysis of microbial community structure and function
- 10.1016/j.gpb.2015.08.003 may be a valid DOI for title: The Tara Oceans project: new opportunities and greater challenges ahead
- 10.1038/sdata.2017.203 may be a valid DOI for title: The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans
- 10.1093/bioinformatics/btw445 may be a valid DOI for title: SWORD—a highly efficient protein database search
- 10.1038/nmeth.3176 may be a valid DOI for title: Fast and sensitive protein alignment using DIAMOND
- 10.1101/2020.06.30.180687 may be a valid DOI for title: EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life
- 10.1371/journal.pone.0016342 may be a valid DOI for title: How and why DNA barcodes underestimate the diversity of microbial eukaryotes
- 10.1038/ncomms12860 may be a valid DOI for title: Adaptive radiation by waves of gene transfer leads to fine-scale resource partitioning in marine microbes
- 10.1111/gcb.12983 may be a valid DOI for title: Bridging the gap between omics and earth system science to better understand how environmental change impacts marine microbes
- 10.1098/rstb.2015.0331 may be a valid DOI for title: Censusing marine eukaryotic diversity in the twenty-first century
- 10.1007/978-3-030-38281-0_12 may be a valid DOI for title: Eukaryotic Pangenomes
- 10.1038/nature12221 may be a valid DOI for title: Pan genome of the phytoplankton Emiliania underpins its global distribution
- 10.1128/aem.01541-09 may be a valid DOI for title: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities

INVALID DOIs

- None
whedon commented 3 years ago

:wave: @jcmcnch, please update us on how your review is going.

whedon commented 3 years ago

:wave: @johanneswerner, please update us on how your review is going.

johanneswerner commented 3 years ago

Very interesting software package for the analysis of eukaryotes in metagenomes and metatranscriptomes. I like the focus of this tool and the well-written article and documentation, especially the very comprehensive documentation including all explanations and citations.

I have a few comments that might still be addressed.

EUKulele --config curr_config.yaml 

Running EUKulele with entries from the provided configuration file.
No BUSCO file specified/found; using argument-specified organisms and taxonomy for BUSCO analysis.
Setting things up...
Found database folder for reference_DIR in current directory; will not re-download.
Creating a diamond reference from database files...
Aligning to reference database...
['samples_MAGs/sample_2.faa', 'samples_MAGs/sample_1.faa', 'samples_MAGs/sample_0.faa']
Aligning sample sample_2...
Aligning sample sample_1...
Aligning sample sample_0...
Diamond process exited for sample sample_2.
Diamond process exited for sample sample_1.
Diamond process exited for sample sample_0.
Performing taxonomic estimation steps...
Performing taxonomic visualization steps...
Performing taxonomic assignment steps...
Performing BUSCO steps...
Configuring BUSCO...
Running busco with 2 simultaneous jobs...
BUSCO error log:
Traceback (most recent call last):

  File "/home/ubuntu/miniconda3/envs/EUKulele/bin/busco_configurator.py", line 15, in <module>

    for line in open(sys.argv[1]):

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.local/bin/../config/config.ini'

sed: can't read test_out_23July/busco/config_sample_1.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_1.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_1.ini: No such file or directory

ERROR:  Config file test_out_23July/busco/config_sample_2.ini cannot be found

ERROR:  BUSCO analysis failed !

ERROR:  Check the logs, read the user guide, and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

BUSCO output log:
python3 busco_configurator.py /home/ubuntu/.local/bin/../config/config.ini test_out_23July/busco/config_sample_1.ini

INFO:   ***** Start a BUSCO v4.1.4 analysis, current time: 11/17/2020 10:19:54 *****

INFO:   Configuring BUSCO with test_out_23July/busco/config_sample_2.ini

BUSCO error log:
ERROR:  Config file test_out_23July/busco/config_sample_1.ini cannot be found

ERROR:  BUSCO analysis failed !

ERROR:  Check the logs, read the user guide, and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

BUSCO output log:
INFO:   ***** Start a BUSCO v4.1.4 analysis, current time: 11/17/2020 10:19:54 *****

INFO:   Configuring BUSCO with test_out_23July/busco/config_sample_1.ini

BUSCO error log:
Traceback (most recent call last):

  File "/home/ubuntu/miniconda3/envs/EUKulele/bin/busco_configurator.py", line 15, in <module>

    for line in open(sys.argv[1]):

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.local/bin/../config/config.ini'

sed: can't read test_out_23July/busco/config_sample_0.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_0.ini: No such file or directory

sed: can't read test_out_23July/busco/config_sample_0.ini: No such file or directory

ERROR:  Config file test_out_23July/busco/config_sample_0.ini cannot be found

ERROR:  BUSCO analysis failed !

ERROR:  Check the logs, read the user guide, and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues

BUSCO output log:
python3 busco_configurator.py /home/ubuntu/.local/bin/../config/config.ini test_out_23July/busco/config_sample_0.ini

INFO:   ***** Start a BUSCO v4.1.4 analysis, current time: 11/17/2020 10:19:55 *****

INFO:   Configuring BUSCO with test_out_23July/busco/config_sample_0.ini

[] is what is in BUSCO directory
BUSCO initial run did not complete successfully.
Please check the BUSCO run log files in the log/ folder.
______________________________________________________ ERROR collecting tests/setupanddownload/test_database.py ______________________________________________________
ImportError while importing test module '/data/EUKulele/tests/setupanddownload/test_database.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/miniconda3/envs/EUKulele/lib/python3.6/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/setupanddownload/test_database.py:8: in <module>
    from EUKuleleconfig import *
E   ModuleNotFoundError: No module named 'EUKuleleconfig'
will-rowe commented 3 years ago

Great - thanks @johanneswerner!

Can you please respond to these comments when you get the chance @akrinos.

@jcmcnch - can you let us know how you are getting on please?

akrinos commented 3 years ago

@johanneswerner Thank you so much for the very helpful review!

I will respond to what I have responses for thus far and update as additional comments are addressed.

Other Questions

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

PDF failed to compile for issue #2817 with the following error:

Error reading bibliography file paper.bib: (line 461, column 3): unexpected "b" expecting space, ",", white space or "}" Looks like we failed to compile the PDF

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

akrinos commented 3 years ago

As an update to the above:

We are working on benchmarking and addressing BUSCO-related issues. Thank you for your patience!

akrinos commented 3 years ago

@johanneswerner I was able to reproduce the error you have been getting from BUSCO when trying to run the sample_EUKulele example folder.

It is indeed related to BUSCO version 4.1.4, which stores the BUSCO configuration file in a different location than earlier versions did. I have implemented a patch, now deployed to the conda build of EUKulele, that searches for the configuration file in a different way. This fixes the initial BUSCO run, but the storage location of the final BUSCO sequences will still differ because of the version change. For now, EUKulele can be run with BUSCO version 4.0.6 and Biopython 1.77. I will sort out the remaining versioning issues in the pip install version so that both BUSCO versions work; for now, at least the BUSCO run itself functions.
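A lookup of the kind described above, trying more than one location for the BUSCO configuration file, might look roughly like the following. This is only a sketch and not the patch that was deployed: the function names are hypothetical, the executable name `busco` is an assumption, and the only location taken from this thread is the `../config/config.ini` path relative to the executable's directory that appears in the error log.

```python
import os
import shutil
from typing import Iterable, List, Optional


def find_busco_config(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that exists as a file, else None."""
    for path in candidates:
        if path and os.path.isfile(path):
            return os.path.abspath(path)
    return None


def default_candidates(executable: str = "busco",
                       extra: Iterable[str] = ()) -> List[str]:
    """Build a list of places to look for config.ini (hypothetical helper).

    `executable` is an assumed command name; `extra` lets the caller add
    further locations, for example ones used by other BUSCO versions.
    """
    candidates = []
    exe = shutil.which(executable)
    if exe:
        bin_dir = os.path.dirname(exe)
        # Layout seen in the error log: <bin>/../config/config.ini
        candidates.append(os.path.join(bin_dir, "..", "config", "config.ini"))
    candidates.extend(extra)
    return candidates


if __name__ == "__main__":
    found = find_busco_config(default_candidates())
    print(found if found else "No config.ini found in the candidate locations")
```

Passing the candidate list in explicitly keeps the search order visible and makes it easy to add a new location when a BUSCO release moves the file again.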

Thanks again for your patience!

akrinos commented 3 years ago

@johanneswerner we have provided a graphic below showing how long a full EUKulele run takes, in minutes, when using the DIAMOND alignment tool on sequence files of various sizes. Note that this is for metatranscriptomic (MET) sequences, with or without the TransDecoder tool for translation, and colored by database selection (MMETSP or PhyloDB).

We have also uploaded pip- and conda-installable revisions of EUKulele that address the issue you encountered with the latest BUSCO version. In the recently uploaded version 1.0.1, these issues should be resolved, and you should be able to fully execute the small test example that you reported on above.

Thank you!

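[Figure: full EUKulele run time in minutes versus sequence file size, using DIAMOND, for MET sequences with and without TransDecoder, colored by database (MMETSP, PhyloDB)]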

johanneswerner commented 3 years ago

Dear @akrinos

thank you for your updates. I think I checked the above checkboxes that are taken care of (if I forgot something, please let me know).

Unfortunately, I still encountered errors with the minimum example (run.log) and some of the tests also throw errors on my virtual instance (tests.log). Could you please have a look at them?

Thank you very much for your effort, especially the benchmarking is very interesting.

akrinos commented 3 years ago

Hmm, it looks like you're still getting the same error, which is most likely the cause of the failed tests as well (although I haven't looked carefully at each failure). Did you reinstall via conda, @johanneswerner? From the error, it looks like it is defaulting to the BUSCO install that you have locally rather than a BUSCO install via conda. Could you please try running EUKulele --version? The problem is also in the included scripts run_busco.sh and concatenate_busco.sh, so printing the result of cat $(which run_busco.sh) and cat $(which concatenate_busco.sh) to a file would also help me verify that the fix I added is present in the files your install is pulling. One problem I had was needing to remove prior installs.

If this continues to be an issue, I suppose we should move to another thread per the guidelines. Thanks for your persistence!
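For anyone hitting the same problem, the checks requested above can be gathered in one go. The following is a minimal sketch, not part of EUKulele: the helper name `locate` is hypothetical, and the only commands it relies on are the ones named in this thread (`EUKulele --version`, `run_busco.sh`, `concatenate_busco.sh`).

```python
import shutil
import subprocess
from typing import Dict, Iterable, Optional


def locate(names: Iterable[str],
           path: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Map each command name to the file that would be run, or None."""
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    found = locate(["EUKulele", "run_busco.sh", "concatenate_busco.sh"])
    for name, location in found.items():
        # A path outside the conda environment suggests a stale local install.
        print(f"{name}: {location or 'not found on PATH'}")

    if found["EUKulele"]:
        result = subprocess.run([found["EUKulele"], "--version"],
                                capture_output=True, text=True)
        print(result.stdout.strip() or result.stderr.strip())

    # Equivalent of `cat $(which run_busco.sh)` for both helper scripts.
    for script in ("run_busco.sh", "concatenate_busco.sh"):
        if found[script]:
            print(f"----- {found[script]} -----")
            with open(found[script]) as handle:
                print(handle.read())
```

Redirecting the script's output to a file gives a single attachment that shows which install each command resolves to.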

johanneswerner commented 3 years ago

My apologies @akrinos, I pulled the git repository for the tests, but I forgot to reinstall via conda. Thank you for looking into it. :-)

Test dataset runs accordingly after reinstallation with conda, and the tests also pass. I marked the respective checkboxes above.

akrinos commented 3 years ago

Thank you @johanneswerner! Did you still have one failed test as above (checkbox in initial review)? With regard to the last two remaining checkboxes: for the analysis of prokaryotes, as mentioned in 729306579, we have one default database that includes prokaryotes, and users can generally curate their own datasets including prokaryotes; we have just tailored the tool to eukaryotes. As for other software to compare ours to, one other tool I found was CCMetagen, published earlier this year. This tool identifies eukaryotes in metagenomic samples, but is not for metatranscriptomes and only uses the NCBI database. It might be useful to point out how our approach is different from this one, which also compares itself to MEGAN. If it helps, I could include both of these explanations in either the text or the documentation, whichever seems more helpful. I think other than that, everything from your review has been addressed.

Thanks again!

will-rowe commented 3 years ago

Thank you for your comprehensive review @johanneswerner - this is shaping up nicely.

Pinging @jcmcnch - are you still able to review this submission? Please let us know either way ASAP

jcmcnch commented 3 years ago

Hi @will-rowe @akrinos sorry for not getting back to you both sooner with this. I have been busy until recently and had unsubscribed from notifications (because I was getting about a dozen notifications from JOSS daily from unrelated reviews - perhaps something can be done by JOSS to prevent this). I am back on the case now, and will provide my comments ASAP, by the end of this week at the latest.

jcmcnch commented 3 years ago

@akrinos, I just ran the test suite and it seems to work fine just by providing the yaml file as you describe. The errors mentioned above by Johannes seem to be fixed, except that BUSCO generates no output. Is this expected? I checked the logs as recommended by the text printed to the screen, but they were all empty. Here's the output I got:

(EUKulele) jesse@kraken:~/EUKulele-review/sample_EUKulele$ EUKulele --config curr_config.yaml
Running EUKulele with entries from the provided configuration file.
No BUSCO file specified/found; using argument-specified organisms and taxonomy for BUSCO analysis.
Setting things up...
Found database folder for reference_DIR in current directory; will not re-download.
Creating a diamond reference from database files...
Aligning to reference database...
Aligning sample sample_2...
Aligning sample sample_0...
Aligning sample sample_1...
Diamond process exited for sample sample_2.
Diamond process exited for sample sample_1.
Diamond process exited for sample sample_0.
Performing taxonomic estimation steps...
Performing taxonomic visualization steps...
Performing taxonomic assignment steps...
Performing BUSCO steps...
Configuring BUSCO...
Running busco with 1 simultaneous jobs...
[] is what is in BUSCO directory
BUSCO run either did not complete successfully, or returned no matches for sample sample_2 . Check busco_run log for details.
BUSCO run either did not complete successfully, or returned no matches for sample sample_0 . Check busco_run log for details.
BUSCO run either did not complete successfully, or returned no matches for sample sample_1 . Check busco_run log for details.
No BUSCO matches found for any sample.  Check BUSCO run log for details. Exiting...
EUKulele run complete
akrinos commented 3 years ago

Hi @jcmcnch - sorry it has taken me a bit to get back to you. I have been trying to reproduce this, and haven't been able to. You are using the sample_EUKulele directory, right? This is what I expect to be printed:

Running busco with 1 simultaneous jobs...
['logs', 'short_summary.specific.eukaryota_odb10.sample_1.txt', 'run_eukaryota_odb10'] is what is in BUSCO directory
At least one BUSCO present in sample sample_1 but 250 missing.
At least one BUSCO present in sample sample_0 but 241 missing.
At least one BUSCO present in sample sample_2 but 245 missing.

Could you tell me (1) what EUKulele --version returns and (2) the contents of your sample directory (it should be samples_MAGs if you're using the tutorial) and BUSCO directory in the output folder via ls? Thanks!!

jcmcnch commented 3 years ago

Hi @akrinos and other coauthors, again sorry for the delay in replying. I've had some time to properly "test drive" EUKulele, and now feel comfortable summarizing my impressions as part of the review. I've noticed from your interactions with Johannes that this process seems quite interactive so I do hope we can discuss further in this thread. As I mentioned to Will at the beginning of this review process, I'm somewhere closer to the naive end user and less of a software developer, so I will concentrate more on how I see your tool being used. These comments come from someone who is very interested in, but less knowledgeable about, these "unsung" EUKs, so it's kind of an outsider's view.

Overall comments:

From the perspective of microbial oceanography, there seems to be a really strong cultural divide between people studying PROKs and people studying EUKs, despite the fact that the organisms in question interact in a larger system. So any effort to try and bridge this divide is really worthwhile scientifically and I think your tool and approach is a promising way to begin this effort. From my own perspective I had known about the MMETSP but was less confident to find/download the data. With EUKulele, having the database automatically downloaded is already very helpful and on top of that knowing that I'm getting a high-quality curated version of the MMETSP from experts in the field is really reassuring. I also really appreciate all the work that has gone into making EUKulele useful with multiple databases, bioinformatic methods, and providing visualizations.

My main feedback falls into two areas - 1) clarity of writing, code, and visualizations and 2) caveats applying your approach to mixed PROK/EUK metatranscriptomes. For 1), I will provide detailed comments further below, but a more general comment is that some broader concepts can be clarified for the benefit of those less familiar with eukaryotic work. For example, it took me a bit of time to understand what you mean by transcriptome. From the PROK side of the fence, transcriptomes are most often just something you map to your MAGs/contigs to get at expression but I recognize this is something quite different for EUKs - it's basically your metagenome. Clarifying this subtle distinction in the text might help people less familiar with your field understand this. I did see your warning in the readthedocs documentation about using EUK metagenomes which alludes to this issue but I think it could be further clarified and explained in a more prominent location. Otherwise, I think your already extensive documentation could be improved by some re-organization and re-focusing which I'll try to detail in the sections below. Also, I noted that metagenomes/MAGs were used somewhat interchangeably so this can also be clarified to explain what you mean.

Point 2) is a bit more getting at the real-world usage of the software, and a potential pitfall I see when your workflow is employed by a naive end user. It's related to Johannes' question about mixed PROK/EUK communities. To test this, I downloaded a transcriptome assembly from this paper, which can be found here if you want to play with it yourself. Data were downloaded from IMG. From the phyloDB results generated by EUKulele, this is clearly a mixed PROK/EUK transcriptome:

[Figure: supergroup_transcripts, supergroup-level breakdown of transcripts from the phyloDB results]

My main concern is this is not reflected in the MMETSP results. Things that are clearly bacterial contigs (e.g. scaffold_10004_c1 which is a roseobacter) are annotated as EUKs (in this case, as a diatom). The tabular output (i.e. output/taxonomy_estimation/*taxonomy.out) would not give a user an idea that this is the case - the column for "max_pid" says 86.5% in this contig's case (100% for the phyloDB output file), so I wouldn't have assumed there was an issue unless I knew a priori that this sample would be a mixed EUK/PROK transcriptome assembly. How would you address this with your pipeline or in the paper/documentation? Do you mostly work with poly-A tailed transcriptomes in your own work where this would be less of an issue? Or is there a way you could think to address this? Could there be a pre-filtering step to split PROK and EUK? Or could another column be provided to the user to identify these potential issues?
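To make the "extra column" idea concrete, one possibility is to compare each contig's best hit across two database runs and flag contigs where a second database gives a better match in a different top-level group. This is only a sketch of the idea, not something EUKulele does: the function name, the `top_level` field, and the group labels are hypothetical, and only `max_pid` and the example contig with its two percent identities come from the outputs described above.

```python
from typing import Dict, List

# Each hit is {"max_pid": float, "top_level": str}; "top_level" is a
# hypothetical field holding the broadest taxonomic label of the best hit.
Hit = Dict[str, object]


def flag_conflicts(primary: Dict[str, Hit],
                   check: Dict[str, Hit],
                   margin: float = 0.0) -> List[str]:
    """Return contigs whose hit in `check` belongs to a different top-level
    group and has a percent identity higher than in `primary` by > margin."""
    flagged = []
    for contig, hit in primary.items():
        other = check.get(contig)
        if other is None:
            continue  # contig has no hit in the second database
        different_group = other["top_level"] != hit["top_level"]
        gain = float(other["max_pid"]) - float(hit["max_pid"])
        if different_group and gain > margin:
            flagged.append(contig)
    return sorted(flagged)


# Values for the contig discussed above; the group labels are illustrative.
mmetsp = {"scaffold_10004_c1": {"max_pid": 86.5, "top_level": "Eukaryota"}}
phylodb = {"scaffold_10004_c1": {"max_pid": 100.0, "top_level": "Bacteria"}}
print(flag_conflicts(mmetsp, phylodb))  # prints ['scaffold_10004_c1']
```

A flag like this would not decide which annotation is right, but it would tell a user which rows of the tabular output deserve a second look.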

More specific comments

Installation and testing:

This proceeded well after updating conda, but as I mentioned in the bug report you may consider providing mamba as an alternative install method since it's much faster in general.

The test suite worked without errors, except the BUSCO error I raised above. FYI I didn't install with pip as mentioned in the readthedocs section, but ran it as recommended:

EUKulele --config curr_config.yaml

In response to your question @akrinos :

Version:

(EUKulele) jesse@kraken:~/EUKulele-review/sample_EUKulele$ EUKulele --version
Running EUKulele with command line arguments, as no valid configuration file was provided.
The current EUKulele version is 1.0.1

Contents of directory:

(EUKulele) jesse@kraken:~/EUKulele-review/sample_EUKulele$ ls
busco_286409437.log   busco_327173586.log  EUKulele-env.yaml  path_test.txt    samples_MAGs
busco_2930947595.log  busco_downloads      free.csv           reference_DIR    tax-cutoffs.yaml
busco_3040169547.log  curr_config.yaml     output-test.txt    references_bins  test_out_23July

Software usage:

Things were quite smooth. Database downloads proceeded as expected, and EUKulele ran without errors on the real-world MT data described above. I tested it with phyloDB and MMETSP using default settings on the MBARI bloom metatranscriptome assembly mentioned above.

Specific suggestions:

Writing (paper and readthedocs):

Other than that, great job, thanks for sharing this powerful tool and your expert knowledge with the whole community. Looking forward to discussing more about the PROK/EUK mixed transcriptome assembly issue and how you see this being potentially addressed.

Jesse

will-rowe commented 3 years ago

Thanks for another comprehensive review @jcmcnch - there is plenty for you to mull over @akrinos! I'm particularly interested in the second of their points re. (mis)reporting of prokaryotic annotations. @jcmcnch has given several helpful suggestions to address this; at a minimum I'd like a comment in the documentation. In my view, it would make sense for the user to first bin their MAGs (or alternative input) into EUK and PROK before running EUKulele - but this functionality would be great to have in your software.

Please let us know your responses to the second review. We are definitely well on track here. I also have a couple of things for you to address in the paper:

Only the first of these points is a requirement from me!

Cheers,

Will

akrinos commented 3 years ago

Thank you both for your very helpful comments! We are working on addressing the eukaryote/prokaryote mislabeling issue by generating a default database that contains the MMETSP taxonomy (the taxonomy we feel is better to use as a default in most cases for eukaryotic organisms) as well as prokaryotic sequences, with a domain level to distinguish between the two. We will include the MMETSP alone as an option, as we are a bit biased towards poly-A-selected samples. For those, you would likely only need a quick check against a database like PhyloDB to see whether contigs were preferentially mapping to bacteria, after which the MMETSP would be the database of choice.

We will certainly adjust the organization of the summary section somewhat! For the affiliations, is adding the city/state/country names sufficient beyond what we have? It looks like that is what is included in the JOSS papers I looked at. More comments to come later today in regard to addressing more of the housekeeping issues with the repository as well. Thanks again!

will-rowe commented 3 years ago

Thanks for the speedy response @akrinos! Yes - just add the city and country please.

Keep us posted on when you are ready for us to take another look at the submission.

akrinos commented 3 years ago

@jcmcnch thanks so much again for your very thorough and informative review! I have responded to some of the points you raised below, and will complete the process of addressing them and provide a new release tomorrow.

In response to the major issue with the use of MMETSP resulting in prokaryotic sequences being labeled as eukaryotic, I have added a new database to the default options, a combination of MarRef and the MMETSP. This is to enable prokaryotes to be identified, but also to use our preferred eukaryotic sequences. Here is a comparison of the output for each database, MMETSP, PhyloDB, and MarRef, using the sample dataset that you provided:

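[Figure: comparison of EUKulele output for the MMETSP, PhyloDB, and MarRef databases on the sample dataset]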

This also involves adding a separate "Domain" level to the MMETSP and MarRef. For now, I have made the software flexible, such that it will accept a number of labeling options from your database (Domain, Supergroup, Kingdom, Phylum, Class, Order, Family, Genus, Species, with the potential for more to be added), and arrange them based on expected taxonomic ordering. In the future, I plan to relabel the top level of PhyloDB to be "Domain" as well, since it makes more sense to use that label for the highest taxonomic level in PhyloDB, rather than using the MMETSP's "Supergroup" name.
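The arrangement step described above could be sketched as follows. This is an illustration rather than EUKulele's actual code: the function name is hypothetical, and it assumes that the order in which the levels are listed above (Domain first, Species last) is the expected taxonomic ordering.

```python
from typing import Iterable, List

# Broadest to most specific, in the order the levels are listed above.
CANONICAL_ORDER = ["Domain", "Supergroup", "Kingdom", "Phylum", "Class",
                   "Order", "Family", "Genus", "Species"]


def arrange_levels(labels: Iterable[str]) -> List[str]:
    """Return the database's level names sorted from broadest to most
    specific, ignoring case and duplicates; reject unknown names."""
    rank = {name.lower(): i for i, name in enumerate(CANONICAL_ORDER)}
    cleaned = [label.strip().lower() for label in labels]
    unknown = [label for label in cleaned if label not in rank]
    if unknown:
        raise ValueError(f"Unrecognized taxonomic levels: {unknown}")
    present = {rank[label] for label in cleaned}
    return [CANONICAL_ORDER[i] for i in sorted(present)]


print(arrange_levels(["species", "Genus", "domain", "Class"]))
# prints ['Domain', 'Class', 'Genus', 'Species']
```

Raising on unknown names, instead of silently dropping them, makes a mislabeled database column visible before any annotation is run.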

So, the issue persists if you use the MMETSP on prokaryotic sequences: there tend to be spurious matches at the supergroup level, since the percent identity cutoff is quite low at this broad level. We think it's important to retain the MMETSP option for users who wish to consider only eukaryotic matches. However, we will add an additional warning to the documentation, and MarRef-MMETSP has become the default, so that naive users do not run into this by accident. Note that when using the MMETSP reference, many more contigs are "unclassified" at more specific taxonomic levels than when using PhyloDB.
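A toy version of cutoff-based assignment shows why a low cutoff at the broadest level lets weak matches through. This is a sketch under stated assumptions and not EUKulele's implementation: the function name is hypothetical and the cutoff values are placeholders chosen for illustration, not the tool's defaults.

```python
from typing import List, Optional, Tuple

# (level, minimum percent identity), ordered from broadest to most specific.
# These numbers are hypothetical placeholders.
EXAMPLE_CUTOFFS: List[Tuple[str, float]] = [
    ("supergroup", 20.0),
    ("class", 30.0),
    ("order", 50.0),
    ("family", 65.0),
    ("genus", 80.0),
    ("species", 95.0),
]


def most_specific_level(pid: float,
                        cutoffs: List[Tuple[str, float]]) -> Optional[str]:
    """Return the most specific level whose cutoff `pid` meets, or None.

    Assumes cutoffs increase from the broadest to the most specific level.
    """
    assigned = None
    for level, minimum in cutoffs:
        if pid < minimum:
            break
        assigned = level
    return assigned


for pid in (86.5, 25.0, 10.0):
    print(pid, most_specific_level(pid, EXAMPLE_CUTOFFS))
```

With these placeholder values, a 25% identity match still receives a supergroup label, which is the kind of spurious broad-level assignment a prokaryotic contig can pick up from a eukaryote-only database.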

Installation and testing

Software usage

Thanks again, and more soon!

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

akrinos commented 3 years ago

Again, thank you @jcmcnch and @will-rowe for your help!

As a follow-on: Writing (paper and readthedocs)

As far as the issue of providing metadata for each of the databases, for now I am writing a file with each invocation of EUKulele that contains

I have just pushed a new release to both PyPI and conda containing all of the relevant code changes. Please let me know what you think of the various changes!

jcmcnch commented 3 years ago

Hi @akrinos, thanks for your reply, this all looks great. I have just re-downloaded EUKulele using conda (and yes, the BUSCO issue occurred with a conda install, but I will try again to make sure I get the same behaviour as before). I do have a few quick questions though:

  1. Is there an easy way I can confirm I've got the latest and greatest? After doing a clean install, EUKulele --version returns 1.0.1, which is the same as before, so I'm not sure I'm getting the changes you've implemented.
  2. It looks like the readthedocs (latest) does not yet include the updates you mention above (e.g. mamba, warnings, usage instructions etc). Please clarify where I can find this info.
  3. I can see the PDF above has been altered to include a lot more explanation and clarification. Is this the final version you want us to treat as a "response to reviewers"? BTW there are a few typos (e.g. prokaryotic spelled wrong, a couple of references either missing or misspelled).

akrinos commented 3 years ago

Hi @jcmcnch, thanks so much for the feedback!

  1. Unfortunately, I updated the version number on the pip side but forgot to do so on the conda side. I will fix that! In the meantime, the easiest way to check is either to run conda update or to look at the Anaconda Cloud page. As I write this, our page indicates that 1.0.1 was updated ~2 days ago, so that is probably the simplest way to confirm the version correspondence... but this is my fault for not relabeling as 1.0.2 on conda.
  2. I accidentally left out the landing page warning (I edited an old file by mistake!) and have just committed the mamba section. The updates on YAML and the introns were only 1-3 lines or so.
  3. We would love to have that information included; however, we know it is a little long for JOSS, so we are also happy to cut it if needed. If it can be considered, that would be great, as we felt before that the paper would benefit from more description of the implementation. Sorry about that typo! I re-spell-checked the manuscript and that (prokaryotic) seemed to be the only spelling error... the missing citation was an unfortunate carry-over from renaming the citation. Thanks for checking both and please let me know if there are others I missed!
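For anyone who wants to check the installed version from Python rather than from the package index page, a minimal sketch follows. It assumes the distribution is registered under the name `EUKulele` (an assumption, not confirmed in this thread), and `installed_version` and `is_at_least` are hypothetical helpers. The comparison only handles plain dotted numbers such as 1.0.1 and 1.0.2, not suffixed labels like 1.0.2b.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def is_at_least(installed, required):
    """Compare plain dotted numeric versions, e.g. '1.0.1' against '1.0.2'."""
    def parse(text):
        return tuple(int(part) for part in text.split("."))
    return parse(installed) >= parse(required)


if __name__ == "__main__":
    # "EUKulele" as the distribution name is an assumption for illustration.
    found = installed_version("EUKulele")
    if found is None:
        print("EUKulele does not appear to be installed in this environment")
    else:
        print(f"Installed version reported by package metadata: {found}")
```

Note that this reads the package metadata, so it would have reported 1.0.1 for the mislabeled conda build as well; it confirms the label, not the contents.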

Thanks again!

akrinos commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

will-rowe commented 3 years ago

Hi all. I hope everyone who had a break for new year had a good one.

Things are looking good here. @jcmcnch has now ticked all the required review boxes. If we could ask @johanneswerner to do the same, we can then mark this as provisionally accepted and start the ball rolling for publication.

johanneswerner commented 3 years ago

@will-rowe I checked off the missing boxes - those were taken care of before already.

will-rowe commented 3 years ago

Perfect - thanks @johanneswerner and sorry for the box ticking exercise!

will-rowe commented 3 years ago

@whedon generate pdf

will-rowe commented 3 years ago

@whedon check references

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.7287/peerj.preprints.27295v1 is OK
- 10.1038/s41564-018-0176-9 is OK
- 10.1038/s41467-017-02342-1 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.5281/zenodo.1476236 is OK
- 10.1142/S0219720012500151 is OK
- 10.1038/ncomms11257 is OK
- 10.1111/1755-0998.13147 is OK
- 10.1038/ismej.2015.30 is OK
- 10.1038/nrmicro.2016.160 is OK
- 10.1093/database/baaa051 is OK
- 10.1111/jpy.12529 is OK
- 10.1038/s41564-019-0502-x is OK
- 10.1016/j.tim.2018.10.009 is OK
- 10.1093/nar/gks1160 is OK
- 10.1016/j.tree.2014.03.006 is OK
- 10.1371/journal.pbio.2005849 is OK
- 10.1093/gigascience/giy158 is OK
- 10.1093/bioinformatics/btv351 is OK
- 10.17226/4901 is OK
- 10.1007/978-3-319-60156-4_18 is OK
- 10.1101/gr.229202 is OK
- 10.1016/j.gpb.2015.08.003 is OK
- 10.1038/sdata.2017.203 is OK
- 10.1093/bioinformatics/btw445 is OK
- 10.1038/nmeth.3176 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.1371/journal.pbio.1001889 is OK
- 10.1371/journal.pone.0016342 is OK
- 10.1016/j.cub.2017.01.017 is OK
- 10.1038/ncomms12860 is OK
- 10.1111/gcb.12983 is OK
- 10.1098/rstb.2015.0331 is OK
- 10.1007/978-3-030-38281-0_12 is OK
- 10.1038/nature12221 is OK
- 10.1038/nmeth.4197 is OK
- 10.1128/AEM.01541-09 is OK

MISSING DOIs

- 10.1007/978-3-319-61510-3_4 may be a valid DOI for title: Functional analysis in metagenomics using MEGAN 6

INVALID DOIs

- https://doi.org/10.1093/nar/gkx1036 is INVALID because of 'https://doi.org/' prefix
- https://doi.org/10.1186/s13059-020-02014-2 is INVALID because of 'https://doi.org/' prefix

will-rowe commented 3 years ago

Hi @akrinos

Can you please check/fix those references. Looks like the MEGAN one might not need the _4

Once you have done this, please can you tag a new release and then archive it (with zenodo or similar). Then report back here with the DOI and version.
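The two INVALID entries above are flagged only because the DOI field includes the resolver URL. A minimal sketch of stripping that prefix before putting the value in the bibliography is shown here. `bare_doi` and the prefix list are hypothetical, and this is not how whedon itself validates DOIs.

```python
# Resolver prefixes that sometimes get pasted into a DOI field (illustrative list).
DOI_URL_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def bare_doi(value):
    """Return the DOI without a leading resolver URL; other values pass through."""
    doi = value.strip()
    lowered = doi.lower()
    for prefix in DOI_URL_PREFIXES:
        if lowered.startswith(prefix):
            return doi[len(prefix):]
    return doi


if __name__ == "__main__":
    for entry in (
        "https://doi.org/10.1093/nar/gkx1036",
        "https://doi.org/10.1186/s13059-020-02014-2",
    ):
        print(bare_doi(entry))
```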

akrinos commented 3 years ago

@whedon check references

whedon commented 3 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/nar/gkx1036 is OK
- 10.1186/s13059-020-02014-2 is OK
- 10.7287/peerj.preprints.27295v1 is OK
- 10.1038/s41564-018-0176-9 is OK
- 10.1038/s41467-017-02342-1 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.5281/zenodo.1476236 is OK
- 10.1142/S0219720012500151 is OK
- 10.1038/ncomms11257 is OK
- 10.1111/1755-0998.13147 is OK
- 10.1038/ismej.2015.30 is OK
- 10.1038/nrmicro.2016.160 is OK
- 10.1093/database/baaa051 is OK
- 10.1111/jpy.12529 is OK
- 10.1038/s41564-019-0502-x is OK
- 10.1016/j.tim.2018.10.009 is OK
- 10.1093/nar/gks1160 is OK
- 10.1016/j.tree.2014.03.006 is OK
- 10.1371/journal.pbio.2005849 is OK
- 10.1093/gigascience/giy158 is OK
- 10.1007/978-3-319-61510-3_4 is OK
- 10.1093/bioinformatics/btv351 is OK
- 10.17226/4901 is OK
- 10.1007/978-3-319-60156-4_18 is OK
- 10.1101/gr.229202 is OK
- 10.1016/j.gpb.2015.08.003 is OK
- 10.1038/sdata.2017.203 is OK
- 10.1093/bioinformatics/btw445 is OK
- 10.1038/nmeth.3176 is OK
- 10.1101/2020.06.30.180687 is OK
- 10.1371/journal.pbio.1001889 is OK
- 10.1371/journal.pone.0016342 is OK
- 10.1016/j.cub.2017.01.017 is OK
- 10.1038/ncomms12860 is OK
- 10.1111/gcb.12983 is OK
- 10.1098/rstb.2015.0331 is OK
- 10.1007/978-3-030-38281-0_12 is OK
- 10.1038/nature12221 is OK
- 10.1038/nmeth.4197 is OK
- 10.1128/AEM.01541-09 is OK

MISSING DOIs

- None

INVALID DOIs

- None

akrinos commented 3 years ago

Hi @will-rowe (and thank you @johanneswerner!) - I have fixed the DOI issues listed above, and published a release to Zenodo here, with DOI 10.5281/zenodo.4419894 for version 1.0.2, which is also fully updated on Anaconda Cloud and PyPI. I left in the _4 for MEGAN, as that is the most specific DOI for the paper. Thank you so much for your help!

will-rowe commented 3 years ago

Good work - thanks @akrinos. I'm afraid one more thing is needed from my end - can you make sure the zenodo release has an author list that matches the author list in your paper?

akrinos commented 3 years ago

Hi @will-rowe, thanks and no problem! I couldn't figure out how to edit the author list before. I ended up having to modify it to be release 1.0.2b on Zenodo here; hopefully that's okay!

will-rowe commented 3 years ago

@whedon set 10.5281/zenodo.4422091 as archive

whedon commented 3 years ago

OK. 10.5281/zenodo.4422091 is the archive.