pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/
93 stars 36 forks source link

`sourmash` submission #129

Closed bluegenes closed 2 months ago

bluegenes commented 1 year ago

Submitting Author: Tessa Pierce-Ward (@bluegenes) All current maintainers: @ctb, @luizirber, @bluegenes Package Name: sourmash One-Line Description of Package: sourmash is a command line tool and Python library for sketching collections of DNA, RNA, and amino acid k-mers for biological sequence search, comparison, and analysis. Repository Link: https://github.com/sourmash-bio/sourmash Version submitted: 4.8 Editor: @snacktavish Emily Jane McTavish EiC: @NickleDave Reviewer 1: @LilyAnderssonLee Lili Andersson-Li Reviewer 2: @elais Elais Player Archive: DOI Version accepted: v4.8.8 JOSS: DOI Date accepted (month/day/year): 4/4/2024


Code of Conduct & Commitment to Maintain Package

Description

Large collections of genomes, transcriptomes, and raw sequencing data sets are readily available in biology. With the scale of data now available, the field needs lightweight computational methods for searching and summarizing the content of both public and private collections. sourmash implements FracMinHash sketching, a lightweight technique that supports accurate estimation of overlap and containment between two sequencing data sets. sourmash provides a flexible set of programmatic functionality for sequence search and comparison, together with a robust and well-tested command-line interface.

Scope

Domain Specific & Community Partnerships

Community Partnerships

If your package is associated with an existing community please check below:

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

The target audience for sourmash is biologists looking to compare biological sequencing datasets of any kind. sourmash's FracMinHash sketching produces a small, irreversible representation of each dataset, allowing users to search and compare without compromising private data. We provide a user-friendly CLI with mature functionality as well as a python API and experimental Rust API for computational biologists with advanced use cases.

  • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There are a number of sketching tools for genomic data (e.g. Mash, kmindex, Dashing), each of which implements a slightly different sketching technique. sourmash's FracMinHash enables accurate comparisons of sets of very different sizes (unlike standard MinHash implemented in, e.g. Mash), enabling analysis of genomes contained within metagenomes (Irber et al. 2022), which supports taxonomic profiling (Portik et al., 2022) and content-based search of publicly-available metagenomes. sourmash now additionally supports estimation of Average Nucleotide Identity from FracMinHash (Rahman Hera et al., 2023). The sourmash team has focused on maintaining a robust user interface and continually improving functionality and user experience, with ~ monthly software releases.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication Options

JOSS Checks - [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [x] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. - [x] The package is deposited in a long-term repository with the DOI: 10.21105/joss.00027 *Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Confirm each of the following by checking the box.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

NickleDave commented 1 year ago

Thank you @bluegenes for this detailed submission! I am adding initial checks here:


NickleDave commented 1 year ago

@bluegenes @ctb @luizirber when you have a chance can all three of you please fill out the pre-survey review linked above?

In the meantime I will proceed with finding an editor 🙂

snacktavish commented 1 year ago

Sorry for the delay! I am signed on a guest editor and will proceed with finding reviewers. Looking forward to checking out your package.

snacktavish commented 1 year ago

I have contacted some reviewers! Hope to have reviewers onboard soon.

NickleDave commented 1 year ago

Awesome, thank you for the updates @snacktavish, we really appreciate you acting as editor for this one. Ok I'll let you take over now :zipper_mouth_face:

snacktavish commented 11 months ago

Waiting on responses from reviewers - got some No's, but also some recommendations of possible reviewers. Asking more folks and hoping for some yes's!

snacktavish commented 11 months ago

Emailing some more potential reviewers today! @bluegenes @ctb @luizirber do you have any reviewer suggestions?

lwasser commented 11 months ago

@snacktavish would it be helpful for us to post about this specific review on social? And then we also have a connection with the single cell community if that might be helpful. Finally, we have a reviewer spreadsheet that might be helpful - people sign up to review for us. if you join us in our pyOpenSci slack we can chat and provide some additional resources that may help in this search for reviewers. you can email me at leah at pyopensci.org ✨

snacktavish commented 11 months ago

Great! We have @LilyAnderssonLee signed on as a reviewer, and she will get started next week. I also have a few more emails out to other folks, hopefully will have a second on-boarded shortly as well!

snacktavish commented 10 months ago

Editor comments

:wave: Hi @LilyAnderssonLee and @elais! Thank you for volunteering to review for pyOpenSci!

Please fill out our pre-review survey

Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.

The following resources will help you complete your review:

  1. Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
  2. Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: Friday, December 8

snacktavish commented 10 months ago

Hi @LilyAnderssonLee and @elais - we would appreciate getting this review at the end of next week! Thank you,

LilyAnderssonLee commented 10 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

@bluegenes Outstanding work with sourmash! Your commitment to creating a package that's both easily maintainable and well-documented truly shines through. The code is impressively organized, accompanied by clear comments explaining each section, making it easy to comprehend the purpose of each file and function. Furthermore, the documentation for high-level files and directories is exceptionally clear — I intend to adopt this clarity in my own work. I've noted a few minor comments for your consideration.

def test_compare_serial(runtmp): """Test compare command serially.""" c = runtmp

testsigs = utils.get_test_data('genome-s1*.sig')
testsigs = glob.glob(testsigs)

c.run_sourmash('compare', '-o', 'cmp', '-k', '21', '--dna', *testsigs)

cmp_outfile = c.output('cmp')
assert os.path.exists(cmp_outfile)
cmp_out = numpy.load(cmp_outfile)

sigs = load_signatures(testsigs)

cmp_calc = numpy.zeros([len(sigs), len(sigs)])
for i, si in enumerate(sigs):
    for j, sj in enumerate(sigs):
        cmp_calc[i][j] = si.similarity(sj)

    sigs = load_signatures(testsigs)

assert (cmp_out == cmp_calc).all()
 - 5.  I may have overlooked it in your documentation. It would be helpful if you could provide some recommendations for the selection of `k-mers` and `scales`.
 - 6. What are the consequences of using different scales? Do you recommend using the same scale to generate the signatures for genomes or databases? 

sourmash gather ecoli-genome.sig inter/genbank-k31.lca.json.gz WARNING: final scaled was 10000, vs query scaled of 1000


- 7. Is it possible to add explanations and default values to `--threshold` in the help message of `sourmash lca classify -h`
ctb commented 10 months ago

thanks for this review! We'll get back to you shortly ;)

snacktavish commented 9 months ago

Thanks @LilyAnderssonLee!

elais commented 9 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

@bluegenes This is one of the best maintained packages I've seen. The development environment is easy to setup and work in, each commit has meaningful messages attached, and the documentation is extensive and up to date. I plan on showing this repository to my own team concerning how projects should look when maintained well. I ran into a few hiccups along the way however

snacktavish commented 9 months ago

Thank you reviewers! I’ll look over these when I’m back at work next week.

snacktavish commented 8 months ago

OK! I looked over the reviews, which were both very positive. Congratulations @bluegenes on your very nice package! Post here when the reviews have been addressed and the reviewers will look over the changes. Thanks @elais and @LilyAnderssonLee for your thoughtful reviews!

ctb commented 8 months ago

a tracking update - not yet a response ;) - because life does not proceed without github issues!

@elais review issues:

  • The nix flake and envrc needs to be updated to make sure the entire environment is available with a simple call to direnv allow.
  • There are unused imports in benchmarks/benchmarks.py
  • Not all tests ran out of the box in the dev environment via the nix flake but were easily made to work and seem to work in the GitHub actions.
  • Flake8 revealed a lot of errors, it might not be the worst idea to run code through a linter and fix everything that can be automatically fixed, this may help with the unused imports as well. These are none critical but it helps.
  • [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
  • [ ] Python versions supported,

If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.

* [ ]  **References:** With DOIs for all those that have one (e.g. papers, datasets, software).
ctb commented 8 months ago

tracking links for @LilyAnderssonLee review -

* [ ]  Badges for:
  * [ ]  A [repostatus.org](https://www.repostatus.org/) badge,
  * [ ]  Python versions supported,

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

* [ ]  If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
* [ ]  **Packaging guidelines**: The package conforms to the pyOpenSci [packaging guidelines](https://www.pyopensci.org/python-package-guide).
  * [ ]  Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

* [ ]  **References:** With DOIs for all those that have one (e.g. papers, datasets, software).
* 1. `mamba` was not callable after installing `miniforge3` on my Mac OS.
* 2. Is it possible to include arguments to allow users to adjust the legends of both matrix and dendrogram plots in the function mentioned in the example: `sourmash plot --pdf --lables ecoli_cmp` ?
* 3. Regarding `benchmarks/benchmarks.py` :

     * 1. Unused libraries: `os` & `Path`
     * 2. Magic numbers: There are some magic numbers in the code (e.g., 500, 21, 3000, 10000, 1000, 20). Is it possible to define these numbers as constants with meaningful names or add the comments to make the code more self-explanatory?
* 4. The download link doesn't work in the section [#Building your own LCA database](https://sourmash.readthedocs.io/en/latest/tutorials-lca.html). The correct link should be `curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz`
* 4. Regarding `test/test_sourmash.py` : Refactoring duplicate code into helper functions to avoid redundancy and improve maintainability. 
* 5. I may have overlooked it in your documentation. It would be helpful if you could provide some recommendations for the selection of `k-mers` and `scales`.
* 6. What are the consequences of using different scales? Do you recommend using the same scale to generate the signatures for genomes or databases?
sourmash gather ecoli-genome.sig inter/genbank-k31.lca.json.gz 
WARNING: final scaled was 10000, vs query scaled of 1000
luizirber commented 8 months ago
  • The nix flake and envrc needs to be updated to make sure the entire environment is available with a simple call to direnv allow.

re https://github.com/sourmash-bio/sourmash/issues/2905: @elais, are you per chance running this on ARM macOS? This works on Linux and I tried on x86_64 macOS before, but I don't have access to an ARM macOS to try it out...

elais commented 8 months ago

Yes I am running it on an M1 Mac. I could try running it in a virtual machine to see if I get better results.

luizirber commented 8 months ago

Yes I am running it on an M1 Mac. I could try running it in a virtual machine to see if I get better results.

I think I found the underlying problem on an x86_64 macOS, will work on fixing it. Thanks for raising the issue!

bluegenes commented 7 months ago

Hi @LilyAnderssonLee , @elais and @snacktavish,

Thank you so much for your reviews and your enthusiasm for sourmash! We have put a lot of effort into maintenance over the past few years, and it's fantastic to get some external feedback for improvement.

@ctb, @luizirber, and I (the sourmash maintainers) have addressed each of your comments via the issues posted above. As those address each point in-line, I will summarize here and provide text and explanations where needed.

We have cut a new release with these changes (v4.8.6), and plan to release version v4.9.0 when accepted and we have a new DOI.

JOSS manuscript fixes

README updates

Badges added:

Comparison to similar packages:

Our goal with sourmash is to maintain a swiss-army knife of k-mer based sequence analysis tools for lightweight sequence analysis. There are a myriad of software tools that conduct one or a few sourmash functions, such as pairwise sequence comparisons (e.g. Mash) or taxonomic classification (Kraken, sylph, etc), albeit with different approaches. To avoid a laundry list of comparisons, we have instead added the following text to the README, which we hope helps distinguish sourmash:

sourmash is a k-mer analysis multitool, and we aim to provide stable, robust programmatic and command-line APIs for a variety of sequence comparisons. Some of our special sauce includes:

  • FracMinHash sketching, which enables accurate comparisons (including ANI) between data sets of different sizes
  • sourmash gather, a combinatorial k-mer approach for more accurate metagenomic profiling Please see the sourmash publications for details.

Documentation Improvements

k-mer and scaling recommendations

User Experience Improvements

Code format standardization

Duplicated code in test_sourmash.py

While we absolutely agree that helper functions improve maintainability, the duplicated code in our tests is a result of a specific testing philosophy: tests should be simple, "flat" (not call a lot of test-specific functions), have low code complexity (minimize if and for statements), and otherwise not be viewed as primary code to be maintained and refactored regularly.

This has indeed led to a lot of repeated code in the tests (which currently number 2753!), but often these tests must load or modify test data in a unique manner in order to trigger edge cases and check bugfixes we've completed. So we want to avoid editing the tests and potentially reducing branch coverage and their coverage of edge cases.

For the future, we would appreciate any suggestions on large-scale ways to discover and/or fix instances of widely-shared code blocks and update some of the legacy items (e.g. decorators being used rather than our updated text fixtures).

LilyAnderssonLee commented 7 months ago

@bluegenes @ctb, @luizirber Thank you for your effort in addressing those comments. I am very pleased with your response. Well done!

ctb commented 7 months ago

hi folks, please let me know if there's anything more we need to do! thanks!

snacktavish commented 6 months ago

Hi all, Sorry for the delay! @bluegenes @ctb, @luizirber great work!!

@elais - were your comments sufficiently addressed? Any further comments to add on the updated version?

Thanks everyone!

lwasser commented 6 months ago

hi friends! 👋 thank you everyone for all of the work you've done here. it looks like we could use some input from @elais on whether comments were addressed. @elais can you kindly respond here and let us know so we can keep things moving ✨ many many thanks! 👐

snacktavish commented 6 months ago

yay!! I heard back, and you are officially good to go! Congratulations and I apologize for the slow review process. I will post the wrap up checklist now!!

snacktavish commented 6 months ago

🎉Sourmash has been approved by pyOpenSci! Thank you @bluegenes @luizirber and @ctb for submitting sourmash and many thanks to @elais and @LilyAnderssonLee for reviewing this package! 😸

Author Wrap Up Tasks

There are a few things left to do to wrap up this submission:

It looks like you would like to submit this package to JOSS. Here are the next steps:

Editor Final Checks

Please complete the final steps to wrap up this review. Editor, please do the following:


If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.

ctb commented 5 months ago

thanks all! really happy! :tada:

question: if we want to go for JOSS, should we do a new release now, for pyopensci, and then another, assuming JOSS accepts? or can we wait and just do a release if/when JOSS accepts? we don't have a real reason for a new release just yet (we've been doing them regularly anyway!) so could go either way. Just curious.

(we have citation instructions in the software that we would change for the JOSS release.)

snacktavish commented 5 months ago

I checked in with @lwasser and we say - Do a release now for pyos then another bump for the JOSS paper which than can be linked to the DOI that joss provides!

Thank you!

bluegenes commented 5 months ago

Thanks @snacktavish @lwasser!!

for tracking:

Author Wrap Up Tasks

There are a few things left to do to wrap up this submission:

It looks like you would like to submit this package to JOSS. Here are the next steps:

lwasser commented 3 months ago

hey there everyone! i'm just checking in on editor wrap up tasks here!! a few things

@bluegenes @LilyAnderssonLee @elais - if you have time, can you kindly complete our post-review survey? we really appreciate your input as it helps us improve our review process and it also helps us when we write writeups for funders, etc demonstrate impact. It looks like titus did complete it

ctb commented 3 months ago
  • So the only reason why it wouldn't be accepted is that this is a second review so you have to demonstrate significant updates to the package (i suspect there are because of that @ctb has mentioned in slack). but this is JOSS' call not ours!

I misread this the first time to be "did you make substantial changes b/t pyopensci and JOSS reviews", and was like, nope!

And then I reread it and was like, oh, yes, in the ~5 years and 3 major versions of sourmash, we have indeed made substantial updates to the package 😆

ctb commented 3 months ago

It looks like titus did complete it

a statement rarely found in the wild

lwasser commented 3 months ago

Ok a few more items

@bluegenes @ctb

We'd LOVE for you and your team to write a blog post on SourMash for us to promote your work! if you are interested - here are a few examples of other blog posts:

you can direct any questions to @kierisi on the blog post if that is of interest!

here is a markdown example that you could use as a guide when creating your post.

it can even be a tutorial like post that highlights what your package does. Then we can share it with people to get the word out about your package.

If you are too busy for this no worries. But if you have time - we'd love to spread the word about your package!

Finally - EVERYONE involved in this review (authors, editors, reviewers, etc) is invited to join our pyOpenSci slack. If you are interested, please reply and @kierisi can ensure that you get an invite! Thank you everyone. You may here from me again as i just cross additional t's and dot i's in these last steps of the review! 🚀

I really appreciate ALL of the effort htat went into this review from EVERYONE here in this issue. Thank you all for being a part of the wonderful pyOpenSci community. You all make it the supportive community that it is 💟

lwasser commented 3 months ago

It looks like titus did complete it

a statement rarely found in the wild

@ctb 🤣 🤣 but we have it now IN WRITING! (and under version control at that!)

bluegenes commented 3 months ago

@bluegenes @LilyAnderssonLee @elais - if you have time, can you kindly complete our post-review survey? we really appreciate your input as it helps us improve our review process and it also helps us when we write writeups for funders, etc demonstrate impact. It looks like titus did complete it

I filled it out - do you not have mine? I can redo if you need

bluegenes commented 3 months ago

We'd LOVE for you and your team to write a blog post on SourMash for us to promote your work!

we'd love to -- I'll reach out again when we have something.

Finally - EVERYONE involved in this review (authors, editors, reviewers, etc) is invited to join our pyOpenSci slack. If you are interested, please reply and @kierisi can ensure that you get an invite! Thank you everyone. You may here from me again as i just cross additional t's and dot i's in these last steps of the review! 🚀

yes - I'd be happy to join the slack!

lwasser commented 3 months ago

@bluegenes @LilyAnderssonLee @elais - if you have time, can you kindly complete our post-review survey? we really appreciate your input as it helps us improve our review process and it also helps us when we write writeups for funders, etc demonstrate impact. It looks like titus did complete it

I filled it out - do you not have mine? I can redo if you need

i believe you - i wonder what happened. i see titus but not anyone else from this review. i just tested out the form and it seems to work... if you have the bandwidth to take 5-10 minutes to fill it out i'd greatly appreciate it! thank you so much. maybe something odd happened in the google-form-verse?

kierisi commented 3 months ago

@bluegenes sending the invite now!

lwasser commented 3 months ago

thank you @kierisi !! 🚀 also @bluegenes if you have any questions about the blog side of things - jesse can help!!

lwasser commented 2 months ago

Closing this as it was accepted by joss in June!! thank you everyone!

snacktavish commented 2 months ago

Yay!!! Congrats to all!!

lwasser commented 2 months ago

hey there @snacktavish 👋🏻 just saying hello!! 💖 thank you for the work that you did on this review!!