taxpasta: TAXonomic Profile Aggregation and STAndardisation

Midnighter commented 1 year ago

Submitting Author: Moritz E. Beber (@Midnighter)
All current maintainers: (@Midnighter, @sofstam, @jfy133)
Package Name: taxpasta
One-Line Description of Package: TAXonomic Profile Aggregation and STAndardisation
Repository Link: https://github.com/taxprofiler/taxpasta
Version submitted: 0.2.1
Editor: @ctb
Reviewer 1: @snacktavish
Reviewer 2: @bluegenes
Archive: https://github.com/taxprofiler/taxpasta/releases/tag/0.4.0
JOSS DOI:
Version accepted: 0.4.0
Date accepted (month/day/year): 07/05/2023

Code of Conduct & Commitment to Maintain Package

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package after should it be accepted.
[x] I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.

Description

The main purpose of taxpasta is to standardise taxonomic profiles created by a range of bioinformatics tools. We call those tools taxonomic profilers. They each come with their own particular, tabular output format. Across the profilers, relative abundances can be reported in read counts, fractions, or percentages, as well as any number of additional columns with extra information. We therefore decided to take the lessons learnt to heart and provide our own solution to deal with this pasticcio. With taxpasta you can ingest all of those formats and, at a minimum, output taxonomy identifiers and their integer counts.

Taxpasta can not only standardise profiles but also merge them across samples for the same profiler into a single table. In future, we also intend to offer methods for forming a consensus for the same sample analyzed by different profilers.
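To make the merging idea concrete, here is a minimal Python sketch. It is not taxpasta's actual implementation: it assumes each standardised profile is simply a mapping from taxonomy identifier to integer count, and the function name `merge_profiles`, the sample names, and the choice to fill missing taxa with zero are all illustrative assumptions.

```python
from typing import Dict, List


def merge_profiles(profiles: Dict[str, Dict[int, int]]) -> List[List[int]]:
    """Combine per-sample {taxonomy_id: count} mappings into one wide table.

    Returns rows of [taxonomy_id, count_sample_1, count_sample_2, ...] with
    samples in insertion order. Taxa absent from a sample get a count of 0
    (an assumption made for this sketch).
    """
    # Collect every taxonomy identifier seen in any sample, in sorted order.
    all_taxa = sorted({tax_id for counts in profiles.values() for tax_id in counts})
    return [
        [tax_id] + [counts.get(tax_id, 0) for counts in profiles.values()]
        for tax_id in all_taxa
    ]


# Two hypothetical standardised profiles produced by the same profiler.
samples = {
    "sample_a": {562: 120, 1280: 30},
    "sample_b": {562: 80, 287: 15},
}
table = merge_profiles(samples)
for row in table:
    print(row)
```

Each row holds one taxon and one column per sample, which is the general shape of a merged table across samples.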

Scope

Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data processing/munging
- [ ] Data deposition
- [ ] Data validation and testing
- [ ] Data visualization*
- [ ] Workflow automation
- [ ] Citation management and bibliometrics
- [ ] Scientific software wrappers
- [ ] Database interoperability

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo

Community Partnerships

If your package is associated with an existing community please check below:

[ ] Pangeo
- [ ] My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook

* Please fill out a pre-submission inquiry before submitting a data visualization package.

For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Who is the target audience and what are scientific applications of this package?
Taxpasta is a tool for anyone working with taxonomic profiles from metagenomic sequencing experiments. Mostly that means ecologists, bioinformaticians, statisticians. Taxpasta's main application is to standardise profiles from a range of different tools. Having a singular format facilitates downstream analyses. Taxpasta is used, for example, in the upcoming taxprofiler pipeline implemented in nextflow. There, it also serves to combine the profiles of many samples into a single file.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
The BIOM format was created with the intention of standardizing a storage format for microbiome analyses. However, creating this format was entirely left to the user. Taxpasta conveniently knows how to read profiles from a range of tools and can also produce BIOM output.

Some of the taxonomic profilers also come with scripts to convert their output into another format but none of them support such a wide range of tools as taxpasta does.
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] uses an OSI approved license.
[x] contains a README with instructions for installing the development version (development version is described in CONTRIBUTING.rst).
[x] includes documentation with examples for all functions.
[x] contains a tutorial with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

[x] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

[x] I have read the author guide.
[x] I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

[x] Last but not least please fill out our pre-review survey. This helps us track submission and improve our peer review process. We will also ask our reviewers and editors to fill this out.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The [editor template can be found here][Editor Template].

The [review template can be found here][Review Template].

lwasser commented 1 year ago

hey there 👋 @Midnighter welcome to pyopensci and thank you for this submission! i just wanted to say hello so you know that we see this submission and will be getting back to you shortly! i also am looking into forms for the submission template as well - many thanks for that suggestion! More soon!

NickleDave commented 1 year ago

Welcome @Midnighter @sofstam @jfy133! I'm adding initial checks for this package.

Glad to see this as someone who develops a similar tool (for a different domain!).

Looks like most everything is there.

I do want to request one thing before we proceed with a review though: can you please add at least one full tutorial?

Something like the "Examples" section in the Pynteny docs here (which also has a CLI interface): https://robaina.github.io/Pynteny/examples/example_cli/

(These docs were in place before we began the Pynteny review, which is why I'm providing them as an example of what we need to see before we start.)

I would suggest a sort of walkthrough of the main use case for taxpasta. This might be a simplified version of something you're using it for in research already.

I have provided some other feedback below but adding an initial tutorial is the only thing that's necessary at this time.

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements below.

[x] Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
- @Midnighter given the domain you're targeting, maybe it would be good to publish a distribution package on bioconda and/or conda-forge?
- [x] The package imports properly into a standard Python environment import package-name.
[x] Fit The package meets criteria for fit and overlap.
- yes this is clearly a data munging package
[x] Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
- [x] User-facing documentation that overviews how to install and start using the package.
- I would strongly suggest adding a more general description of the package right at the top of the index page, that even a non-specialist (like me) can understand. What do people use taxonomic profiles for in your field? Why would it help me to be able to convert between formats? Links to Wikipedia pages, review articles, etc., would be helpful here. Ideally with a brief example of a table, like those shown in the command docs ("taxpasta is an interoperability tool for working with taxonomic profiles, that look something like this:").
- There are install instructions but no real examples on the landing/index page. At a minimum the output of `taxpasta -h` could be shown and additionally a brief snippet. If the main usage is the CLI, then perhaps demos of usage could be shown with a tool like asciinema.
  - Also the output of taxpasta -h was truncated for me. See below. Not sure it would be clear to a new user what consensus and merge commands do because of the ellipsis. From the docs I guess this is because your cli package imports text from somewhere else? Is there a way to not truncate? Maybe just give a briefer summary?
```
$ taxpasta -h
Usage: taxpasta [OPTIONS] COMMAND [ARGS]...

  TAXonomic Profile Aggregation and STAndardisation

Commands:
  consensus    Form a consensus for the same sample but from different...
  merge        Standardise and merge two or more taxonomic profiles into...
  standardise  Standardise a taxonomic profile (alias: 'standardize').
```


  - [x] Short tutorials that help a user understand how to use the package and what it can do for them.
    - There are some snippet-like examples in documentation for commands of the command-line interface but I do not find more extensive tutorials
  - [x] API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format. *We suggest using the [Numpy](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) docstring format*.
- [x] Core GitHub repository Files
  - [x] **README** The package has a `README.md` file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
  - [x] **Contributing File** The package has a `CONTRIBUTING.md` file that details how to install and contribute to the package.
    - @Midnighter is there a link to the CONTRIBUTING.rst in the README? I'm not finding it. If not, I'd suggest adding it.
  - [x] **Code of Conduct** The package has a `Code of Conduct` file.
    - inside the .github directory 
  - [x] **License** The package has an [OSI approved license](https://opensource.org/licenses).
NOTE: We prefer that you have development instructions in your documentation too.
- [x] **Issue Submission Documentation** All of the information is filled out in the `YAML` header of the issue (located at the top of the issue template).
- [x] **Automated tests** Package has a testing suite and is tested via GitHub actions or another Continuous Integration service.
- [x] **Repository** The repository link resolves correctly.
- [x] **Package overlap** The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
- [ ] **Archive** (JOSS only, may be post-review): The repository DOI resolves correctly.
- [ ] **Version** (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

---
- [ ] [Initial onboarding survey was filled out ](https://forms.gle/F9mou7S3jhe8DMJ16)
We appreciate each maintainer of the package filling out this survey individually. :raised_hands:
Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:
---

*******

## Editor comments

This is not required in checks but I would suggest adding example data to taxpasta.  
Since you are working with tabular data that I'd guess can be fairly lightweight, it should be relatively painless to add and it will really help people understand package usage. (Unless I'm wrong and the taxonomic profiles are like 100 MB each. In that case, see next paragraph.)
Again see the Pynteny example where they work with example data (not suggesting you need to add a download command but just showing you why built-in data is useful: https://robaina.github.io/Pynteny/examples/example_cli/#download-pgap-profile-hmm-database)

For my own library I just add the files directly to the package and then access with `importlib.resources`, see this issue: https://github.com/vocalpy/crowsetta/issues/90
Some libraries have adopted the [`pooch` library](https://github.com/fatiando/pooch) for data, e.g. scikit-image:  https://scikit-image.org/docs/stable/api/skimage.data.html. Might be worth snooping their issues to help you figure out how to implement if you decide to adopt it.
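For illustration, a minimal sketch of the `importlib.resources` approach follows. To stay self-contained it builds a throwaway package in a temporary directory; the package name `demo_profiles`, the file `example_profile.tsv`, and its contents are hypothetical and not part of taxpasta. It relies on `importlib.resources.files`, available from Python 3.9.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Build a throwaway package on disk so the sketch is self-contained.
# `demo_profiles` and `example_profile.tsv` are hypothetical names.
tmp_root = Path(tempfile.mkdtemp())
pkg_dir = tmp_root / "demo_profiles"
(pkg_dir / "data").mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "data" / "example_profile.tsv").write_text(
    "taxonomy_id\tcount\n562\t120\n1280\t30\n"
)
sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()


def load_example_profile() -> str:
    """Return the text of the example table bundled with the package."""
    data_file = resources.files("demo_profiles") / "data" / "example_profile.tsv"
    return data_file.read_text(encoding="utf-8")


text = load_example_profile()
print(text.splitlines()[0])
```

In a real project the data file would ship inside the installed package, so only the `load_example_profile` part would be needed.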

NickleDave commented 1 year ago

Also @sofstam @jfy133 could you both please fill out the pre-review survey?
I have a reply for @Midnighter but not you I think. https://forms.gle/F9mou7S3jhe8DMJ16 It's just a brief (5-10m) survey to help us understand how we're doing as an org. Thank you! :pray:

Midnighter commented 1 year ago

Hi @NickleDave,

Thank you for your review and comments. Adding a tutorial and improving the docs mostly makes sense to me. I will work on adding those.

The tables are indeed small, still I'm a little hesitant to distribute them with the package. However, pandas should be able to just load a table from a URL so that's maybe a way to go.
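As a rough sketch of the URL idea: `pandas.read_csv` accepts URLs, including `file://` ones, so the example below writes a tiny made-up table to a temporary file and loads it through its URL to stay offline. The file name, column names, and values are assumptions for illustration; in practice the URL would point at a hosted example profile.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny, made-up profile to disk; in practice the URL would point at a
# file hosted online (for example in the project's repository).
tmp_file = Path(tempfile.mkdtemp()) / "example_profile.tsv"
tmp_file.write_text("taxonomy_id\tcount\n562\t120\n1280\t30\n")

# `pandas.read_csv` accepts URLs; a file:// URL keeps this sketch offline.
url = tmp_file.as_uri()
profile = pd.read_csv(url, sep="\t")
print(profile)
```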

Please note that taxpasta is already on bioconda https://bioconda.github.io/recipes/taxpasta/README.html#package-package%20'taxpasta' We need to document this, of course.
I will also publish the package on Zenodo.

jfy133 commented 1 year ago

> Also @sofstam @jfy133 could you both please fill out the pre-review survey? I have a reply for @Midnighter but not you I think. https://forms.gle/F9mou7S3jhe8DMJ16 It's just a brief (5-10m) survey to help us understand how we're doing as an org. Thank you! :pray:

oops, sorry - from the second page most of the questions seemed to be about the package and I assume Moritz had already provided all that info.

I've filled it out, but left pretty much all the optional questions empty as either you have that info from Moritz, and/or submitting via pyOpenSci is Moritz' initiative so I'm not that familiar/involved with python etc.

One comment though: I noticed that the options in your 'What background best describes you cultural identity?' question are extremely N. American focused. Unfortunately I don't have a good solution for you, but you may not be getting a particularly good overview of this - e.g. Asian spans half the world with many equally large sub-divisions as the Pacific Islanders, and also grouping 'Black' and African-American is also arguably unfair as they also can have a large difference in backgrounds/problems etc. For example see what official surveys from UK use: https://www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups

Midnighter commented 1 year ago

@NickleDave regarding your comments on `taxpasta -h`: the truncation is automatically created by typer. It looks better if you also install rich. I'll look into shortening the summaries such that they are not truncated by typer in the default view.

NickleDave commented 1 year ago

> Thank you for your review and comments. Adding a tutorial and improving the docs mostly makes sense to me. I will work on adding those.

Great, thank you @Midnighter. Your other comments re: tables, bioconda, Zenodo, and the truncation all sound good. Would it be worth making rich a dependency so you don't have to special case typer behavior?

> oops, sorry - from the second page most of the questions seemed to be about the package and I assume Moritz had already provided all that info.
>
> I've filled it out, but left pretty much all the optional questions empty as either you have that info from Moritz, and/or submitting via pyOpenSci is Moritz' initiative so I'm not that familiar/involved with python etc.
>
> One comment though: I noticed that the options in your 'What background best describes you cultural identity?' question are extremely N. American focused.

Thank you @jfy133 I hear you and I will relay this to our executive director @lwasser. Your feedback is helpful; this is definitely a version 0.1 of the survey and I'm sure we can improve it to not be US-centric, and to encompass people who are core contributors to a project even if they are not primarily Python developers.

Midnighter commented 1 year ago

> Would it be worth making rich a dependency so you don't have to special case typer behavior?

At the moment, it's an extra dependency that can be installed with taxpasta[rich]. (Come to think of it, we should document all of those.) The reason that it's not default is that where taxpasta originated, as a CLI tool for a nextflow pipeline, fancy terminal output is really not needed. We should definitely recommend installing that for interactive work, though.
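As a generic illustration of how an optional extra such as `taxpasta[rich]` can be detected at runtime, here is a small standard-library sketch. It is not taken from taxpasta or typer; the helper name `has_optional_dependency` and the printed messages are assumptions.

```python
from importlib.util import find_spec


def has_optional_dependency(module_name: str) -> bool:
    """Return True if the top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


if has_optional_dependency("rich"):
    print("rich is installed: help output can be rendered with formatting.")
else:
    print("rich is missing: falling back to plain-text help output.")
```

A check like this lets a pipeline environment stay minimal while interactive users opt in to nicer output.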

NickleDave commented 1 year ago

> where taxpasta originated, as a CLI tool for a nextflow pipeline, fancy terminal output is really not needed.

Understood!

Btw @Midnighter could I ask you to link this review issue on any issues you raise to make changes before review? By adding a link to the issue on the taxpasta repo, so that GitHub cross references them.

Just to help us track. Thank you :pray:

Midnighter commented 1 year ago

Dear @NickleDave,

We think that we've implemented everything that you've noticed. Please take another look and let us know your thoughts.

- There's now a detailed tutorial.
- We've shortened the command summaries so hopefully they won't be truncated (except on extremely small windows).
- There's a short usage section right in the readme and on the docs' main page.
- Extras are documented.
- A number of other improvements to the documentation.

NickleDave commented 1 year ago

Hi @Midnighter I did check out these additions -- looks great!

Thank you for addressing all of those comments.
I will move ahead with finding an editor, will reply back here introducing them ASAP.

NickleDave commented 1 year ago

Hi again @Midnighter very happy to say that @ctb has very kindly agreed to take on the editor role for this review. Thanks for your patience while we found someone who knows your domain and the software community around it.

I will let @ctb take it from here and start tagging in reviewers.

Midnighter commented 1 year ago

I guess we'll have to add support for sourmash then 😆 Good to hear and I look forward to your review @ctb. Thank you for your time.

ctb commented 1 year ago

> I guess we'll have to add support for sourmash then 😆 Good to hear and I look forward to your review @ctb. Thank you for your time.

welcome! ironically we have invested time in generating the same report formats as you, for other consumers of taxonomy, so I think we are well positioned to engage productively ;).

lwasser commented 1 year ago

> Also @sofstam @jfy133 could you both please fill out the pre-review survey? I have a reply for @Midnighter but not you I think. https://forms.gle/F9mou7S3jhe8DMJ16 It's just a brief (5-10m) survey to help us understand how we're doing as an org. Thank you! :pray:
>
> oops, sorry - from the second page most of the questions seemed to be about the package and I assume Moritz had already provided all that info.
>
> I've filled it out, but left pretty much all the optional questions empty as either you have that info from Moritz, and/or submitting via pyOpenSci is Moritz' initiative so I'm not that familiar/involved with python etc.
>
> One comment though: I noticed that the options in your 'What background best describes you cultural identity?' question are extremely N. American focused. Unfortunately I don't have a good solution for you, but you may not be getting a particularly good overview of this - e.g. Asian spans half the world with many equally large sub-divisions as the Pacific Islanders, and also grouping 'Black' and African-American is also arguably unfair as they also can have a large difference in backgrounds/problems etc. For example see what official surveys from UK use: https://www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups

hey y'all just wanted to note that i'm working on the survey questions being a LOT more inclusive. Many thanks for your patience there and THANK YOU for calling this to our attention! It's very much American-centric.

ctb commented 1 year ago

(finally got around to asking second reviewer; response soon :))

ctb commented 1 year ago

Editor comments

:wave: Hi @bluegenes and @snacktavish! Thank you for volunteering to review for pyOpenSci!

The following resources will help you complete your review:

- Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
- Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: April 21st, 2023.

Reviewers: @bluegenes @snacktavish Due date: April 21, 2023.

ctb commented 1 year ago

ok, trying tagging in @bluegenes and @snacktavish again. Or do we need to invite them to this repo also?

NickleDave commented 1 year ago

Thank you @ctb, I think we're good to go from here. Just the editor needs repo access to be able to edit metadata in the authors' original post--sorry again for not adding you before.

ctb commented 1 year ago

no worries, mostly concerned about whether they're being notified!

ctb commented 1 year ago

(they are! or at least tessa is)

Midnighter commented 1 year ago

I have a question about the version submitted. We've already made new releases since then and I'm planning another round of smaller fixes. It'd make sense to me that the latest version will be reviewed rather than what we started with. Should I just edit the original post or is that against your procedure?

NickleDave commented 1 year ago

Hi @Midnighter thank you for checking, I know it's not clear.

We want to track what the version was when you submitted.
So please don't edit that.

We of course also want to make sure reviewers are reviewing the latest version, e.g. if you have made changes related to editor checks.

@bluegenes @snacktavish please review the version @Midnighter replies to us with and asks you to review.

@Midnighter I know you probably need to take development time wherever you can find it, but if there's any way you can either finish those small fixes in the next couple days, or hold off for now, that would be appreciated so we're not reviewing something that's changing while we review it.

Midnighter commented 1 year ago

Thank you for the answer. I'll see that I can make the changes either today or tomorrow and then give you a go ahead.

NickleDave commented 1 year ago

Thank you! I know that @snacktavish is still off-line during spring break this week so if you want to reply with the version to review Monday that would be fine.

Midnighter commented 1 year ago

I've just released version 0.3.0 of taxpasta. Please use that for review. I look forward to it.

snacktavish commented 1 year ago

Hi all, Sorry for the delay - spring break followed by illness has created a series of setbacks, but I should be able to look at this later this week.

ctb commented 1 year ago

no worries @snacktavish - and thanks for reviewing!

ctb commented 1 year ago

hi folks, how are the reviews coming along?

bluegenes commented 1 year ago

hi @ctb - good, just need to find a bit more time to finish up. Will have it for y'all by friday!

snacktavish commented 1 year ago

Work in progress! But review may be a little late :grimacing:

snacktavish commented 1 year ago

OK - I am revealing my slow start, but as far as I can tell, the tutorial is not rendering correctly https://taxpasta.readthedocs.io/en/latest/tutorials/getting-started.md, and it's not really follow-able at https://github.com/taxprofiler/taxpasta/blob/dev/docs/tutorials/getting-started.md Is there a better link I should be using?

Thanks, Emily Jane

jfy133 commented 1 year ago

Hi @snacktavish !

No worries :).

I think we've made a link in the website with the wrong extension:

https://taxpasta.readthedocs.io/en/latest/tutorials/getting-started/

Should work. I'll try to find all instances of that mistake and fix them.

However, I'm not sure why it doesn't render at all on the dev branch; I'll look into that now. But I don't think there is much difference in that file compared to the master branch.

Cheers,

bluegenes commented 1 year ago

Hi folks, here's my review.

Great tool - having tried to provide some standardized formats for our profiler, I definitely appreciate the utility provided here!

Package Review

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
- I see Contributing info in the documentation, but it would be good to link to this page from the README
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [ ] A repostatus.org badge,
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).
  - I see a pypi version badge, but not conda. table displays badges nicely.

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  - yes, command usage examples exist for both merge and standardise, but fully functional examples with file download links would be much better
  - in the Usage section, use-cases just links to the about section higher in the README.
  - in the Merge section, test-data links to a single file example. Two files are needed for the `merge` command.
[x] Link to your documentation website.
[ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[ ] Citation information
- I don't see any citation info

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 4-5

Review Comments

The taxpasta software tackles an important issue in metagenomic taxonomic profiling by ingesting the output from a number of popular profilers and producing standardized outputs that can be used in downstream workflows.

Tutorial comments

I find the back-and-forth between CLI and R/python distracting in the tutorial -- It would be clearer to see the entire command line + execution for taxpasta, with brief explanations around each command together, since the tool is CLI-based and the other sections are just demonstrating the justification for the tool.

In a "Getting Started" tutorial, including code for the failures (including partial solutions) of standard reading methods might confuse folks who are newer to coding/bioinfo. People may end up confusing which lines they should run on their own data and which they shouldn't.

I think the R/python parts are more appropriate for a follow-up section or tutorial, entitled "Why taxpasta" or similar to indicate that you're digging into the rationale and issues that prompted building taxpasta. That way, beginners who just want to run the cli are comfortable, and intermediate/etc can dig in afterwards. Note, I only tested the python code parts (not R), but those worked great.

If kept as is, the tutorial at least needs a bolded line calling attention to the switch between command line and R/Python sections, e.g.

The following section uses R/Python.

and similar when you switch back.

I think you're already aware, but several links are broken from the Getting Started tutorial:

The adding taxa names how-to tutorial could usefully be directly added to the main Getting Started cli tutorial if you switch up the bash/python structure (I do see it linked at the bottom, though!).

Commands and Documentation

In the README, you include a full command for standardise and state that an example file can be found in test data. Can you instead include small test files with the package or add a curl command to download the exact file, so that command is actually executable? Same for the next command in the README (add curl to dl a second file to allow merge). Same comment for documentation pages for these.
There were several screens of terminal output upon command failure (specifically, running the standardise README command without downloading the file). Might be confusing for new folks (even though it's very pretty! :) and the only line that matters is the very last one, FileNotFoundError: [Errno 2] No such file or directory: 'MOCK_002_Illumina_Hiseq_3000_se_metaphlan3-db.metaphlan3_profile.txt'. Might be out of scope, just wanted to mention.
It would be helpful to add a general note in the README that users can see additional tool help via taxpasta <command> -h, e.g. taxpasta merge -h.
It would be nice to note somewhere that setting the output format explicitly overrides use of the file extension for determining the output format (I noticed I could write an arrow format with a .tsv extension)
Definitely out of scope, but wanted to suggest an extension: have you considered adding a filter functionality, e.g. by % of total dataset reported in the file? There's some evidence that filtering e.g. Kraken2 output might help reduce false positives (e.g. here )

Midnighter commented 1 year ago

Dear @bluegenes,

Thank you very much for your kind review. I'll try to respond to all your points now and note where we will take action.

A repostatus.org badge,

We're adding a repo status badge

If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.

Agreed that this would be a good addition. I think we will write a JOSS paper after seeing the second review and copy this kind of information from there into the documentation.

I don't see any citation info

Technically, the Zenodo DOI already makes this citeable but we'll add more info if taxpasta gets accepted here and at JOSS.

JOSS's requirements

As mentioned above, we plan to write the paper after the initial review.

Tutorial comments

Thank you for the detailed observations on the tutorial. We agree with your points and @jfy133 will work on splitting this up and improving the getting started tutorial. He already fixed the URLs that were wrong.

In the README, you include a full command for standardise and state that an example file can be found in test data. Can you instead include small test files with the package or add a curl command to download the exact file, so that command is actually executable? Same for the next command in the README (add curl to dl a second file to allow merge). Same comment for documentation pages for these.

Good point, we will do so.

There were several screens of terminal output upon command failure (specifically, running the standardise README command without downloading the file). Might be confusing for new folks (even though it's very pretty! :) and the only line that matters is the very last one, FileNotFoundError: [Errno 2] No such file or directory: 'MOCK_002_Illumina_Hiseq_3000_se_metaphlan3-db.metaphlan3_profile.txt'. Might be out of scope, just wanted to mention.

We will consider this when looking at the tutorial and the readme for your other points above.

It would be helpful to add a general note in the README that users can see additional tool help via taxpasta <command> -h, e.g. taxpasta merge -h.

We thought that we had covered this in the Usage section of the readme. Do you think that it needs to be stated there that this also works for sub commands?

It would be nice to note somewhere that setting the output format explicitly overrides use of the file extension for determining the output format (I noticed I could write an arrow format with a .tsv extension)

Good call.
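To make the precedence rule in the quoted comment concrete (an explicitly requested format wins over whatever the file extension suggests), here is a minimal Python sketch. The function name `resolve_output_format` and the extension table are hypothetical and only illustrate the rule; they are not taken from taxpasta's code.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to format name (illustration only).
EXTENSION_FORMATS = {".tsv": "TSV", ".csv": "CSV", ".arrow": "arrow"}


def resolve_output_format(path: str, explicit: Optional[str] = None) -> str:
    """Return the explicit format if one is given, else infer it from the extension."""
    if explicit is not None:
        # An explicit choice always takes precedence over the extension.
        return explicit
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"Cannot infer an output format from '{path}'.")
    return EXTENSION_FORMATS[suffix]
```

With this rule, `resolve_output_format("out.tsv")` gives `"TSV"`, while `resolve_output_format("out.tsv", "arrow")` gives `"arrow"`, which is the situation the reviewer ran into.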

Definitely out of scope, but wanted to suggest an extension: have you considered adding a filter functionality, e.g. by % of total dataset reported in the file? There's some evidence that filtering e.g. Kraken2 output might help reduce false positives (e.g. here )

We agree that, although this is useful in general, it is out of scope for taxpasta and should happen downstream. We rather follow the UNIX philosophy here.

Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".

Thus, we can see a case for two new packages taxpasta-stats and taxpasta-viz (or something like that) in the future that take advantage of these standardised tables but haven't done any scoping work or thought about all the desired capabilities yet.
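As a sketch of what such downstream filtering could look like, assuming a standardised profile has already been loaded into a mapping from taxonomy ID to integer count: the helper `filter_by_fraction` and the example numbers are hypothetical and not part of taxpasta.

```python
from typing import Dict


def filter_by_fraction(counts: Dict[int, int], min_fraction: float) -> Dict[int, int]:
    """Keep only taxa whose share of the total count is at least min_fraction."""
    total = sum(counts.values())
    if total == 0:
        # Nothing to filter in an empty profile.
        return {}
    return {
        taxonomy_id: count
        for taxonomy_id, count in counts.items()
        if count / total >= min_fraction
    }


# Made-up counts for illustration: taxon 152 holds 0.5% of all reads.
profile = {0: 900, 9606: 95, 152: 5}
filtered = filter_by_fraction(profile, min_fraction=0.01)
```

Because it only consumes the standardised table, a filter like this can live in a separate tool without changing the standardisation step.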

Midnighter commented 1 year ago

A bit of a tangent but since you're all experts @ctb, @bluegenes, and @snacktavish I wanted to ask your opinions on the following issue: Currently, taxpasta converts all relative abundances to integers. This design decision was based on my experience with some statistical tools requiring this. I think, it was vegan::adonis but I can't even remember properly. The other decent option is to convert everything to fractions. A third option is to provide a flag that lets users choose between either output. Do you have any opinions on this?
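To illustrate the trade-off between the two outputs, here is a small Python sketch. The helpers `to_fractions` and `fractions_to_counts` are hypothetical, and scaling by a total read count followed by rounding is an assumption made for the example, not a description of how taxpasta converts values.

```python
from typing import Dict


def to_fractions(counts: Dict[int, int]) -> Dict[int, float]:
    """Convert integer counts to fractions of the total."""
    total = sum(counts.values())
    return {taxonomy_id: count / total for taxonomy_id, count in counts.items()}


def fractions_to_counts(fractions: Dict[int, float], total_reads: int) -> Dict[int, int]:
    """Scale fractions by a total read count and round to integers (lossy)."""
    return {
        taxonomy_id: round(fraction * total_reads)
        for taxonomy_id, fraction in fractions.items()
    }


fractions = to_fractions({1: 1, 2: 3})
counts = fractions_to_counts({1: 0.3334, 2: 0.6666}, total_reads=10)
```

Going from counts to fractions keeps all information, whereas going from fractions to integers drops the decimals, so a flag would mainly matter for profilers that report fractions or percentages.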

jfy133 commented 1 year ago

Hi @bluegenes, your comments on the Tutorial have (hopefully) been addressed here: https://github.com/taxprofiler/taxpasta/pull/95 once merged!

snacktavish commented 1 year ago

@ctb I am delayed (and have 2 yr old sick out of day care for the next few days...) but am still hoping to finish my review this week. I plan to use the revised tutorial, unless it is important that I stick to the original release one.

ctb commented 1 year ago

thx @snacktavish - take the time you need, and your efforts are much appreciated! I think you should use the revised tutorial.

jfy133 commented 1 year ago

The revised tutorials have been merged!

snacktavish commented 1 year ago

Hi all, I have run through the tutorial, and added some comments. I think it is a neat package, that does handy data organization, but could use some further documentation and input checks. I didn't carefully go through the checkboxes in the first section, as I was already quite late on my review.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[ ] Function Documentation: for all user-facing functions.
[ ] Examples for all user-facing functions.
[ ] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[ ] Badges for:
- [ ] Continuous integration and test coverage,
- [ ] Docs building (if you have a documentation website),
- [ ] A repostatus.org badge,
- [ ] Python versions supported,
- [ ] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[ ] Link to your documentation website.
[ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[ ] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software been confirmed.
[ ] Performance: Any performance claims of the software been confirmed.
[ ] Automated tests:
- [ ] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- [ ] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
[ ] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

In the tutorial, a little more biological context on the meaning of the results would be useful.

E.g. in the step head 2612_pe-ERR5766176-db1_kraken2.tsv

These results look a bit concerning to me! What are the things in taxon 0? Are all of the results not id'd to taxon? (This may be a limitation of my own knowledge of taxonomic profiling, but including some biological context with the technical info can be helpful for users.)

I tried going off-script in the tutorial to see if I could merge the Kraken file with the motus files (this may not make a lot of sense to do, biologically, but I was curious) using: taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out 2612_pe-ERR5766176-db1.kraken2.report.txt

It appeared to work... but the results were odd. At the end it showed a taxonomy id of 152 with 152 samples in 2612_pe-ERR5766176-db1.kraken2.report, whereas in the original table it looks like it should be taxon id 9606 with 152 samples. Looking at the docs I think the software is only intended to merge profiles generated by the same tool, but the way that the description is set up, I thought merging profiles across tools was a key advantage of the software. (This is user error - but user error is one thing you can always rely on!!) This looks like a bug? or maybe the software should have thrown an error instead of running? But users will definitely try to run things that are not quite right, and the software should do some checks on input file format rather than chugging along and outputting seemingly sensible but incorrect outputs. (This may connect to the "Danger message" https://taxpasta.readthedocs.io/en/latest/commands/merge/#why here, but it does not even appear to be merging on identifier...)

I wanted to add taxon names to try to make the outputs make more sense to me, but I wasn't able to read the documentation on that at https://taxpasta.readthedocs.io/how-tos/how-to-add-names

The error messages in other places could also use some work - I ran: taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out 2612_pe-ERR5766176-db1_kraken2.tsv Which I think is just not a reasonable file format to try to merge, but the error did not really convey that (although it did point to the problematic file).

I was not able to find the paper.md file to review for JOSS, but I see that you are planning on that later.

Thanks, Emily Jane

Midnighter commented 1 year ago

Dear @snacktavish,

Thank you for your review. As a quick comment before answering in more detail, where did you find the URL to the how-to that is not found? The proper link is https://taxpasta.readthedocs.io/en/latest/how-tos/how-to-add-names/

snacktavish commented 1 year ago

Hi @Midnighter - It was the link from here: https://taxpasta.readthedocs.io/en/latest/tutorials/getting-started/#additional-functionality to here https://taxpasta.readthedocs.io/how-tos/how-to-add-names

The how to is there I think - just formatting issues.

bluegenes commented 1 year ago

Hi folks, looks like you've addressed my concerns, so I'm happy to sign off.

A bit of a tangent but since you're all experts @ctb, @bluegenes, and @snacktavish I wanted to ask your opinions on the following issue: Currently, taxpasta converts all relative abundances to integers. This design decision was based on my experience with some statistical tools requiring this. I think, it was vegan::adonis but I can't even remember properly. The other decent option is to convert everything to fractions. A third option is to provide a flag that lets users choose between either output. Do you have any opinions on this?

No strong opinions -- my personal preference is to not drop information (decimals) when possible, but I think it's fine to stick with integers if there are tools that require this.

Midnighter commented 1 year ago

Dear @snacktavish,

Thank you again for your review. I will respond below to your points (quoted).

In the tutorial, a little more biological context on the meaning of the results would be useful.

E.g. in the step head 2612_pe-ERR5766176-db1_kraken2.tsv

These results look a bit concerning to me! What are the things in taxon 0? Are all of the results not id'd to taxon? (This may be a limitation of my own knowledge of taxonomic profiling, but including some biological context with the technical info can be helpful for users.)

From my point of view, providing interpretation or meaning is down to the individual profilers. Taxpasta is a pure utility and I would assume that its only users are those who have already chosen to run a particular profiler on their data. (For the specific example, taxon 0 is indeed unclassified. That's not surprising since it comes from a very small sequencing file that we run for testing purposes only.)

I tried going off-script in the tutorial to see if I could merge the Kraken file with the motus files (this may not make a lot of sense to do, biologically, but I was curious) using: taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out 2612_pe-ERR5766176-db1.kraken2.report.txt

It appeared to work... but the results were odd. At the end it showed a taxonomy id of 152 with 152 samples in 2612_pe-ERR5766176-db1.kraken2.report, whereas in the original table it looks like it should be taxon id 9606 with 152 samples. Looking at the docs I think the software is only intended to merge profiles generated by the same tool, but the way that the description is set up, I thought merging profiles across tools was a key advantage of the software. (This is user error - but user error is one thing you can always rely on!!) This looks like a bug? or maybe the software should have thrown an error instead of running? But users will definitely try to run things that are not quite right, and the software should do some checks on input file format rather than chugging along and outputting seemingly sensible but incorrect outputs. (This may connect to the "Danger message" https://taxpasta.readthedocs.io/en/latest/commands/merge/#why here, but it does not even appear to be merging on identifier...)

Thank you for doing that. Your experience here prompted us to do some fairly major refactoring, which made the whole profile validation part much stricter. We have thus ensured that providing an input from a profiler other than the chosen one will always result in an error. (Except for centrifuge and kraken2, which use the same six-column layout.) We have also added tests to verify that errors occur for all profilers. Due to that work, our response is also a bit later than hoped.
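For readers wondering what this kind of strict check might look like in principle, here is a minimal Python sketch. The layout table, the name `validate_profile`, and the sample row are all hypothetical and much simpler than real profiler output; this is not taxpasta's validation code. It also shows why two profilers that share a column layout cannot be told apart by structure alone.

```python
import csv
import io

# Hypothetical layouts: expected column count and indices of integer columns.
LAYOUTS = {
    "six_column": {"n_columns": 6, "int_columns": (1, 2, 4)},
    "three_column": {"n_columns": 3, "int_columns": (2,)},
}


def validate_profile(text: str, layout: str) -> None:
    """Raise ValueError if any row does not match the expected layout."""
    spec = LAYOUTS[layout]
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    for line_number, row in enumerate(rows, start=1):
        if len(row) != spec["n_columns"]:
            raise ValueError(
                f"Line {line_number}: expected {spec['n_columns']} columns "
                f"for layout '{layout}', found {len(row)}."
            )
        for index in spec["int_columns"]:
            try:
                int(row[index])
            except ValueError:
                raise ValueError(
                    f"Line {line_number}: column {index + 1} is not an integer."
                ) from None


# Made-up six-column row for illustration.
row = "0.50\t152\t152\tS\t9606\tHomo sapiens\n"
validate_profile(row, "six_column")  # passes silently
```

Calling `validate_profile(row, "three_column")` on the same row raises a `ValueError` instead of silently producing a table with misplaced values.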

I wanted to add taxon names to try to make the outputs make more sense to me, but I wasn't able to read the documentation on that at https://taxpasta.readthedocs.io/how-tos/how-to-add-names

Thank you for noticing this. We have fixed links in the documentation.

The error messages in other places could also use some work - I ran: taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out 2612_pe-ERR5766176-db1_kraken2.tsv Which I think is just not a reasonable file format to try to merge, but the error did not really convey that (although it did point to the problematic file).

This is the same point as above, as far as I can tell.

I was not able to find the paper.md file to review for JOSS, but I see that you are planning on that later.

We hope that we addressed your major points and will begin work on this paper. Please note that if you wanted to test the changes, you will have to install the dev branch, since we have not released these changes yet.

jfy133 commented 1 year ago

Just a reminder we believe we have addressed all the comments, and waiting for final sign off so we can complete the JOSS manuscript :)

ctb commented 1 year ago

Hi @jfy133 sounds good to me - I think the JOSS manuscript is indeed the last thing remaining! I think I can review that all on my own without @snacktavish or @bluegenes so lmk when I should take a look :).

lwasser commented 1 year ago

hi friends! just chiming in here. @ctb you are more than welcome to look at the JOSS manuscript BUT normally that part of this review happens on the JOSS side of things as they need to actually implement the "Accept to JOSS" part. Instructions for wrapping things up (IF this review is at this stage based upon Titus' evaluation of reviews / responses etc) can be found here! . there is a template that you paste into the issue if all of the reviewer elements have been addressed in this review to your satisfaction - then this can move on to JOSS. I hope that helps ✨ i know this partnership can be confusing and sometimes even the JOSS editors need a bit of guidance (ie that they don't have to review the code only the paper through our partnership!).

jfy133 commented 1 year ago

I think we meant it's not present a la the checklist in the OP, but indeed the more the merrier in terms of :eyes:!

lwasser commented 1 year ago

ahhh! if i follow you correctly you can absolutely work on the JOSS paper now even if the review is not quite wrapped up. AND i see that note about not submitting to JOSS separately. I need to fix that as the process has changed. JOSS now DOES want a submission but you do NOT need to go through another review . You need to link to and mention this review and tell them it was accepted by us. They will and should ONLY review the paper once this review is complete. does that help? NOTE to myself to update the reviewer template and remove that note
