Robaina commented 1 year ago

Submitting Author: Semidán Robaina (@Robaina) All current maintainers: @Robaina Package Name: Pynteny One-Line Description of Package: Query sequence database by HMMs arranged in predefined synteny structure Repository Link: https://github.com/Robaina/Pynteny Version submitted: v0.0.5 Editor: @arianesasso Reviewer 1: @Batalex Reviewer 2: @c-thoben
Archive: JOSS DOI: Version accepted: 1.0.0 Date accepted (month-day-year): 03-10-2023

Description

Include a brief paragraph describing what your package does:j

Pynteny is Python tool to search for synteny blocks in (prokaryotic) sequence data through HMMs of the ORFs of interest and HMMER. By leveraging genomic context information, Pynteny can be employed to decrease the uncertainty of functional annotation of unlabelled sequence data due to the effect of paralogs. Pynteny can be accessed (i) through the command line, (ii) as a Python module, or (iii) as a (locally served) web application.

Scope

Please indicate which category or categories this package falls under:
- [x] Data retrieval
- [x] Data extraction
- [ ] Data munging
- [ ] Data deposition
- [ ] Data visualization
- [ ] Reproducibility
- [ ] Geospatial
- [x] Education
- [ ] Unsure/Other (explain below)

Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

Pynteny's main objective is to provide a means to query NGS (unannotated) sequence databases, such as metagenomic/metatranscriptomic datasets using syntenic blocks (i.e. spatial arrangements of genes) rather than single target genes/protein domains. In this sense, I would classify Pynteny within Data Extraction.

Who is the target audience, and what are the scientific applications of this package?

Pynteny was designed to be used by researchers working with large, unannotated sequence databases, such as those typically encountered in metagenomic analyses. It can be accessed through a command line interface or easily integrated into pipelines as a Python package. Pynteny can also be used through a graphical interface running locally in the browser, which is more suitable for educational purposes.

Are there other Python packages that accomplish similar things? If so, how does yours differ?

To the extent of my knowledge, there isn't any Python package that provides the functionality provided by Pynteny.

If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussions or @tag the editor you contacted: https://github.com/pyOpenSci/software-review/issues/65

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others: GitHub Action

Publication options

I had submitted this package for publication at JOSS prior to pyOpenSci. The submission is currently under consideration for scope in this issue: https://github.com/openjournals/joss-reviews/issues/4978

[x] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [x] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. - [x] The package is deposited in a long-term repository with the DOI: [![DOI](https://joss.theoj.org/papers/10.21105/joss.05289/status.svg)](https://doi.org/10.21105/joss.05289) *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text-based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Please fill out our survey

[x] Last but not least, please fill out our pre-review survey. This helps us track submissions and improve our peer review process. We will also ask our reviewers and editors to fill this out.

P.S. *Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

[Editor and review templates can be found here]: https://www.pyopensci.org/software-peer-review/how-to/editors-guide.html and https://www.pyopensci.org/software-peer-review/how-to/reviewer-guide.html

NickleDave commented 1 year ago

Thank you @Robaina for submitting Pynteny to pyOpenSci. As you noted, we have already discussed on #65 and you have done extensive development to prepare for submission.

Here are the initial editor checks:

[x] Fit: The package meets criteria for fit and overlap.
- As discussed in #65 scope here includes data extraction and I think also data retrieval, and the package can and has been used for educational purposes. I additionally checked "retrieval" and "education" above
[x] Documentation The package has sufficient documentation available online (README, sphinx docs) to allow us to evaluate package function and scope without installing the package. This includes:
- [x] tutorials or vignettes that help a user understand how to use the package and what it can do for them (often these have a name like "Getting started")
- see e.g. https://robaina.github.io/Pynteny/examples/example_cli/
- [x] API documentation - this includes clearly written doc strings with variables defined using a standard docstring format
- [x] README that clearly articulates the function of the package
- [x] Contributing documentation that details how to install a development environment and how to contribute to the package
- there is not a separate contributing page yet but there is a section in the README: environment
[x] Issue Submission Documentation: All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
- Will need to update version, editor, reviewers
[x] Automated tests: Package has a testing suite and is tested via GitHub actions or another Continuous Integration service.
[x] License: The package has an OSI approved license
- Uses Apache license
[x] Repository: The repository link resolves correctly
[x] README: The package has a README with clear explanation of what the package does and instructions on how to install it along with development instructions.
[x] Contributing statement: The package has a contributing.md file that details how to contribute to the package.
- Not yet, see above
[x] Package overlap: That package doesn't fully overlap with functionality of other packages that have already been submitted to pyOpenSci
[x] Archive (JOSS only, may be post-review): The repository DOI resolves correctly
[x] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)? JOSS in version 1.1.0

[ ] Initial onboarding survey was filled out We appreciate each maintainer of the package filling out this survey individually. :raised_hands: Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:

Editor comments

The author has done a ton of work to get ready for submission. Pynteny is looking really good. We will need a contributing.md before the review finishes but that's the only thing I really see missing right now.

Thank you again @Robaina. We will get an editor and reviewers as soon as possible.
I need to confirm timelines with @lwasser but I would expect it might take a bit longer because of the holidays, my guess is two weeks. Will let you know for sure by the end of this week, but wanted to make sure you knew we have seen your submission and we are beginning the review.

Robaina commented 1 year ago

Hi @NickleDave,

thanks for accepting the submission! I have included a "CONTRIBUTING.md" in the root directory of the repo, also moved instructions on how to set a developer environment to that file. Looking forward to the review process -- Btw, I think I did fill out the initial onboarding survey, please confirm.

Best, Semidán

NickleDave commented 1 year ago

Perfect, thank you Semidán!

Following up on my last message: we do expect to get review started within the next two weeks. People are off-line because of the holidays but we will get the ball rolling now.

arianesasso commented 1 year ago

Dear @Robaina, nice to meet you!

I just wanted to say hi 👋🏼 and inform you that I am pleased to be the editor for this review. I see that you and @NickleDave already did great work during the pre-submission phase! We are recruiting the reviewers now, and I will update you once they are onboard.

Robaina commented 1 year ago

Hi @arianesasso, nice to meet you too!

Awesome. Thanks for letting me know and for being the editor. Looking forward to the review! (und guten Rutsch ins neue Jahr)

arianesasso commented 1 year ago

Dear Robaina,

I hope you had a great start to the year as well! I am sorry for the delay. We are still looking for our second reviewer.

However, I would like to introduce you to @Batalex, who is already on the job as the first reviewer 😄 . We have previously talked about the package, and he already has access to the reviewer's guide, the reviewer template, and the Python packaging standards detailed in the packaging guide. So, we will be able to continue moving 🚀 .

And please feel free to ping me at any time if you need help or have questions!

Robaina commented 1 year ago

Hi @arianesasso,

no problem, thanks for the update :) Hi @Batalex, thanks the review!

Batalex commented 1 year ago

Hey there! I am almost done with my review. No promises, but I think I can deliver this weekend. Looking forward to discussing it with you! 🙌

Batalex commented 1 year ago

Here comes a big one! @Robaina @arianesasso

Foreword

I want to thank both the author and the editors from pyOpenSci for the opportunity to tackle this package review. The following comments reflect my experience as I maintain private packages, and all remarks come with pynteny's best interest at heart.

Some comments are regrouped at a "general" level if they concern the project structure or several files. If I encountered a situation that I had already pointed out on another file, I did not comment again. However, I encourage taking into account the remarks from one file to the others.

Several comments I made are only valid for the submitted version (0.0.5) and were fixed in the meantime. Nonetheless, I will mark off what has been done later.

With that being said, let us get started!

I used the black cat🐈‍⬛ to highlight my personal thoughts in the review template below.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.

🐈‍⬛ The submitted version uses a personal conda channel in README, which is something I would not recommend. However, it uses bioconda in the main branch, so it is all good.

[ ] Vignette(s) demonstrating major functionality that runs successfully locally.

🐈‍⬛ There are some improvements to be made regarding running the examples locally, see below.

[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.

🐈‍⬛ In Pynteny 0.0.5, contributions guidelines are included in the README file. A more detailed CONTRIBUTING file has been added since then.

[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

🐈‍⬛ I noticed that two of the three links to the Pynteny repo use the HTTP scheme, and the third uses HTTPS.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[ ] Badges for:
[x] Continuous integration and test coverage,
[x] Docs building (if you have a documentation website),
[x] A repostatus.org badge,
[x] Python versions supported,
[x] Current package version (on PyPI / Conda).

🐈‍⬛ Additional badges have been added since the 0.0.5 release.

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[ ] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
[x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[x] Link to your documentation website.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use
[x] The package is easy to install

🐈‍⬛ I ran the package in WSL since Apple silicon is not supported. The installation went smoothly, except for the really long resolution time. I am glad to see that the main branch encourages to use mamba.

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.

🐈‍⬛ As far as I can tell, it seems to do what it claims.

[x] Performance: Any performance claims of the software been confirmed.

🐈‍⬛ NA, no performance claim to be found in the documentation.

[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

🐈‍⬛ Tests covers 45% of pynteny. The coverage is heterogenous: the application and CLI parts of the package are less tested than the rest.

[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)

🐈‍⬛ Unit tests did not run on CI as of 0.0.5. This has changed since then. The documentation website is built and deployed using CI, nice!

[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
[x] Package supports modern versions of Python and not End of life versions.
[x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

🐈‍⬛ The code base does not fully follow PEP 8. See below for more information.

For packages also submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software
[x] Authors: A list of authors with their affiliations
[x] A statement of need clearly stating problems the software is designed to solve and its target audience.
[x] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: ~8 hours

Review Comments 🐈‍⬛

General comments:

The package is very well laid out and easy to navigate.
I would encourage the author to follow the stylistic style guide defined by the PEP8. The most immediate benefit this would provide to the code base is consistency with the rest of the Python ecosystem and a lower entry bar for potential contributors. Two easy fixes to apply in this code base are:
Using a code formater such as black.
Using snake_case to name functions and methods rather than camelCase
Another suggestion would be to use a .editorconfig file. There again, this is a simple way to enforce consistency in the code base by defining a few options to control the files’ format. This is especially true for Python projects since mixing tabs and spaces would crash the program. There are other benefits as well, e.g., removing trailing spaces and adding new lines at the end of files.
There are a few discrepancies in types, some of them could be fixed by converting argparse.ArgumentParser instances to CommandArgs before using them
Empty parentheses in class definitions without inheritance are not necessary.
“magic values” (such as lists of columns) could be defined at the module level rather than in the class/function definition, but I am nitpicking.
My personal recommendation is to define the text encoding whenever a text file descriptor is opened. Even if this is not feasible at the moment, this is a tiny step toward Windows compatibility, and it removes a little bit of uncertainty.
There are a lot of places where the author uses classes with only static methods and no attributes. This adds an unnecessary layer of complexity to the code base, where one could simply define functions in their dedicated file if keeping the namespace clean is important.
With the author's permission, I would be glad to provide a simple, sensitive configuration to get started with a few quality code tools.

CLI

The defaults paths are quite inconvenient if I run the cli from the installed package rather than the source directory, the config file as well as the downloaded database ends up in /home/<user>/miniconda3/envs/Pynteny/lib/python3.10/site-packages/

Documentation
- I recommend adding "Installation" & "Getting Started" sections in the documentation. The rationale is as follows: any part of the project may be a potential user's first contact. This includes the README file, the package description on Pypi (not applicable here) / a conda channel, and the documentation.
- I would advise adding instructions on how to locally build the documentation in CONTRIBUTING.md in the section 3
- The following mention is incorrect in the example notebook:
  
  To follow this example, you don't need to download _E. coli's_ genome, since it has been already downloaded during Pynteny's installation.
The MG1655.gb file is stored in the tests folder, which is not included in the package. Therefore, when the package is installed using conda, it is not downloaded.

pynteny/init.py
- import * are usually frowned upon in Python, but IMO this is ok in this case
pynteny/api.py
- importlib.metadata is only available in the standard library from 3.8 and onwards (3.7 is technically allowed in 0.0.5). This is partially fixed in the main branch. For good measure, I would increase the minimum Python version in the pyproject.toml file as well, even if poetry is only used as a build backend.
- The author could use a simple type hint in the Command class to avoid a linter warning in the update method:
```
class Command():
"""Parent class for Pynteny command
"""
_args: CommandArgs
```
- Empty __init__ methods are not necessary, the smaller the code base the better!
- 🔥There are really nice uses of the package’s metadata in _repr_html_ and cite
- I am pleased to see that arguments with None default value are correctly typed as optional
- There is no need to wrap Path(__file__).parent in another Path() in the getTestDataPath method since it returns a Path object as well. This comment applies to utils.py as well.
pynteny/cli.py
- args is not used in the Pynteny.app method, so there is no need to parse it.
- I believe getHelpStr could be simpler by writing to an io.StringIO instead of a temp file.
- f-strings without placeholder are considered code smells
- Overall, this is a really nice example of a cli with subcommands, congrats!
- I saw a few unused required and pynteny assignment
pynteny/filter.py
- typo L40 specified
- As far as I can tell, the hmm_order_dict is not needed in SyntenyPatternFilters init method.
- The lambda expr + map is quite difficult to read. A simpler alternative could be:
```
strand_map = dict(neg=-1, pos=1) # could be set as a module-wide constant
self.strand_order_pattern = [strand_map.get(strand, 0) for strand in parsed_structure["strands"]]
```
- contains_hmm_pattern, contains_distance_pattern, and contains_strand_pattern could use Python’s idioms and return boolean values instead of integers, this is not an issue with their use in pandas rolling method. Boolean values can be aggregated numerically in pandas. This would simplify those methods. In contains_strand_pattern, there is a True equality comparison.
- Using inplace=True in pandas operation is not recommended and could be deprecated: https://github.com/pandas-dev/pandas/issues/16529.
- The author could use the size method instead of a filter with a lambda in getAllHMMhits, but this is a matter of preference.
- There again, preferences, but the loop in filterHitsBySyntenyStructure could be made on a pandas’ groupby object. This is not a prominent feature, but one may do the following:
```
for group_key, group_df in df.groupby(key):
pass
```
- the form len(contig_hits.hmm.unique()) could be simplified to contig_hits.hmm.nunique()
- Using iterrows is quite inefficient if one only need to loop over the index. The author may use for i in matched_rows.index
- Idiomatic Python does not usually use getter methods such as getSyntenyHits, a public attribute would work just fine.
- I have not profiled the code to see if there is a performance bottleneck there, but I would use itertuples instead of iterrows in addHMMmetaInfoToHits. itertuples is much faster if you need to iterate over data frames one row at a time
- I suggest using isinstance rather than type equality in filterFASTAbySyntenyStructure. isinstance is a more robust (and marginally faster) solution because it allows nominal subtyping: any subtype of str or list would pass the test conditions. Using isinstance is therefore considered more pythonic than type equality.
pynteny/hmm.py
- Nice use of properties
- Since the author is using pathlib, they might use / or Path.joinpath instead of os.path.join
- The author may use the usecols argument for pd.read_csv
- Why is thetry-except block needed in getHMMnamesByGeneSymbol? (+ bare except is considered a code smell). The code would work without this. The same goes for getHMMgeneID, with a little twist to return None, and for getMetaInfoForHMM
pynteny/parser.py
- The two classes in this file only use static methods and do not hold any attribute. Is there a reason for not replacing all those methods with simple functions? Edit: even preprocessing.py does isLegitSequence = RecordSequence.isLegitPeptideSequence
- There are too many statements in the try-except block in LabelParser.parse. I do not know what could be failing (a split, an int cast?), and the exception caught is too broad. Tip: I suggest extracting meta.split("_") in a variable since it appears in four lines in a row
- Typing tuples is quite unintuitive: tuple[str] means “a tuple of exactly one str.” This means that splitStrandFromLocus is incorrectly typed
pynteny/preprocessing.py
- In FASTA class, we could replace the getter/setter with a property, which is more “pythonic”. Moreover, we could also make _input_file_str a read-only property.
- I would prefer if the inner function unique_records was defined outside the removeDuplicates method. I understand that the purpose is to generate the records lazily, which could still work as intended.
- number_prodigal_record_fields could be a global constant
- There again with fromGenBankData, I think the three inner functions would be better off defined outside the method. Contrary to the previous time, they do not use self and are even easier to extract. This change would reduce fromGenBankData complexity and make it even easier to debug.
- If pynteny targets python ≥3.10 as the main branch suggests, then it may use https://docs.python.org/3/whatsnew/3.10.html#parenthesized-context-managers in GeneAnnotator
pynteny/subcommands.py
- There is a piece of code duplicated four times. The following could be extracted in its own function.
```
if args.logfile is None:
args.logfile = Path(os.devnull)
elif not Path(args.logfile.parent).exists():
Path(args.logfile.parent).mkdir(parents=True)
logging.basicConfig(format='%(asctime)s | %(levelname)s: %(message)s',
handlers=[
logging.FileHandler(args.logfile),
logging.StreamHandler(sys.stdout)
],
level=logging.NOTSET)
logger = logging.getLogger(__name__)
```
In addition, this would limit the complexity of synteny_search
- wget is quite old and does not seem to be active. How about replacing it with a more maintained alternative? e.g. httpx, requests
- As I said in the general comments, I would prefer if the default dirs were not relative to the package files. When installing pynteny inside a virtual env, this means that the database could be downloaded inside the venv, in a totally different place from the current working dir. I suggest using ~/.pynteny
pynteny/utils.py
- As I said in the comments above, using the file location is quite inconvenient because when installed by conda it ends up in the venv directory
- Is there a reason for not using https://docs.python.org/3/library/tarfile.html?highlight=tar#tarfile.is_tarfile?
pynteny/wrappers.py
- ✅
pynteny/app
- The embedded streamlit app is really cool 🔥
- There again, it seems a bit odd to have classes with only static methods

Edits

crossed items to mark progression toward completion, then uncrossed for ease of read once everything was addressed

Robaina commented 1 year ago

Hey @Batalex ,

thanks so much for your very detailed review! I quickly run through your comments and I see that they will be very useful. I'll address them carefully in the following days. But wanted to thank you now.

arianesasso commented 1 year ago

@Batalex, what a beautiful and thoughtful review! I really enjoyed the 🐈‍⬛ haha and, of course, all the meaningful recommendations you made! Thank you for all your effort 🙏🏼.

@Robaina, hopefully, we will have the second reviewer onboarded soon. I will keep you posted!

arianesasso commented 1 year ago

Hi @Robaina, I want to introduce you to our second reviewer @c-thoben. She will be starting her first review with us 🥳 . Keep tuned for more updates soon!

Robaina commented 1 year ago

Hi @arianesasso, great, thanks for letting me know. Hi @c-thoben, thanks for offering to review!

Batalex commented 1 year ago

Hey @Robaina, if you would like, I can free some time in case you want a reviewer for your upcoming PRs on Pynteny. This may be a little easier to discuss the suggested changes.

Robaina commented 1 year ago

Hi @Batalex ,

thanks for the offer! Will do that for the ones I'm currently working on.

arianesasso commented 1 year ago

Great job @Robaina! Thank you also @Batalex, for offering more time to help with the PRs 🙏🏼

arianesasso commented 1 year ago

We are onboarding @c-thoben and soon you will have your second review 🥳 .

Robaina commented 1 year ago

Hi @Batalex,

thanks again for your review. I noticed that some of your comments were not addressed in the previous PRs. I'll clarify some of them here (preparing a PR for the remaining):

There are a few discrepancies in types, some of them could be fixed by converting argparse.ArgumentParser instances to CommandArgs before using them

I understand what you mean here: Subcommand functions in pynteny.subcommands currently accept both an argparse.ArgumentParser object or a pynteny.utils.CommandArgs object. This flexible type in the argument was made on purpose so subcommand could be called both from the CLI and from within Python. I don't understand why this is undesirable. I could, like you pointed out transform argparse objects to CommandArgs, but find it unnecessary (unless there is a good reason behind that I'm missing).

Additionally, type hints in pynteny.subcommands could be fixed using typing's union

What do you think?

The defaults paths are quite inconvenient if I run the cli from the installed package rather than the source directory, the config file as well as the downloaded database ends up in /home//miniconda3/envs/Pynteny/lib/python3.10/site-packages/

This has been changed already in this PR.

Since the author is using pathlib, they might use / or Path.joinpath instead of os.path.join

This was changed in this PR

I have not profiled the code to see if there is a performance bottleneck there, but I would use itertuples instead of iterrows in addHMMmetaInfoToHits. itertuples is much faster if you need to iterate over data frames one row at a time

This was changed in this PR

(Will get rid of getattr in the next PR)

My personal recommendation is to define the text encoding whenever a text file descriptor is opened. Even if this is not feasible at the moment, this is a tiny step toward Windows compatibility, and it removes a little bit of uncertainty.

Alright, will add that in the new PR.

The MG1655.gb file is stored in the tests folder, which is not included in the package. Therefore, when the package is installed using conda, it is not downloaded.

Right, will modify this in the new PR.

Batalex commented 1 year ago

Thanks, @Robaina; I needed to make a second pass at my comment, but you have already taken care of it!

Aside from the remaining getattr to take care of, I am pleased to see that you found interest in my comments. While we wait for the second review, can I make some PR to add a few quality tools to ease the maintenance in the future?

I am thinking of:

adding isort to sort imports consistently
adding flake8, a very common linter in the python ecosystem
if you are feeling fancy, using ruff which can replace the two above
adding pre-commit hooks for black, isort, flake8 / ruff
adding CI jobs to check the tools above before merging a PR

Robaina commented 1 year ago

Sure @Batalex, go ahead with this(these) PR(s). I'll be glad to help. These changes should make the codebase very robust and pretty.

c-thoben commented 1 year ago

Here's my review! @Robaina @arianesasso

Thanks for onboarding me :) I am not that experienced, but I enjoyed looking into the review process. You did some really nice work @Robaina and I hope my comments are helpful.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.The README should include, from top to bottom:
[x] The package name
[x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [x] A repostatus.org badge,
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
[x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[x] Link to your documentation website.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.Package structure should follow general community best-practices. In general please consider whether:

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality

[ ] Installation: Installation succeeds as documented. -> I had some trouble installing the software via conda. I have described them below.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition in their submission requirements. Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software
[x] Authors: A list of authors with their affiliations
[x] A statement of need clearly stating problems the software is designed to solve and its target audience.
[x] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

*Estimated hours spent reviewing: 10 h

Review Comments

When trying to install the software, I had some issues with anaconda resolving the package dependencies if i used the recommended command "conda install -c pynteny". I got some help from the community and with "conda create -n pynteny -c conda-forge -c bioconda python pynteny" the installation worked fine. To test Pynteny, I used a VM with Ubuntu 22.04.1 LTS. The anaconda version was 23.1.10 and python3 inside the pynteny environment 3.10.6. If you need furhter information I am happy to provide them.
I have seen that if the PGAP database is downloaded, it is marked in the configuration file and the download is skipped at future exections. This can lead to problems if e.g. the user has deleted the database locally and wants to redownload it. Maybe a force option can be helpful.
All tests passed for me. However, I have seen that in the scripts "test_database_build.py" and "test_integration_search.py", the unittest main is not called so the tests are not executed if the module is run locally. Is this right?
I enjoyed the quotes in the CLI :)

Robaina commented 1 year ago

Thank you so much for your review @c-thoben! I really appreciate the time you have taken to do this review.

Find below my answers to your comments:

When trying to install the software, I had some issues with anaconda resolving the package dependencies if i used the recommended command "conda install -c pynteny". I got some help from the community and with "conda create -n pynteny -c conda-forge -c bioconda python pynteny" the installation worked fine. To test Pynteny, I used a VM with Ubuntu 22.04.1 LTS. The anaconda version was 23.1.10 and python3 inside the pynteny environment 3.10.6. If you need furhter information I am happy to provide them.

The recommended way to install pynteny is via conda, in its own environment:

conda create -n pynteny
conda activate pynteny
conda install -c bioconda pynteny

as indicated in the README. It looks like this is what you did at the end. Python should be included in the environment by default.

I have to update the package in bioconda. There have been a lot of changes in pynteny's codebase since then, but did not want to bother the bioconda community with further updates until this review was over.

The new features can be tested installing the developers version, following instructions in the contributing file.

I have seen that if the PGAP database is downloaded, it is marked in the configuration file and the download is skipped at future exections. This can lead to problems if e.g. the user has deleted the database locally and wants to redownload it. Maybe a force option can be helpful.

I had the exact same thought. This feature is already included in the newer version of Pynteny (not yet in bioconda).

All tests passed for me. However, I have seen that in the scripts "test_database_build.py" and "test_integration_search.py", the unittest main is not called so the tests are not executed if the module is run locally. Is this right?

Right, I missed that. Thanks! It's been changed.

I enjoyed the quotes in the CLI :)

I'm glad you enjoyed the quotes :D

c-thoben commented 1 year ago

Hey @Robaina, thanks for your answer! It seems that most of my comments are already resolved. Onto the installation issue. I had the installation problems when I followed the recommended way in the README:

conda create -n pynteny
conda activate pynteny
conda install -c bioconda pynteny

The error I got is due to conda not being able to resolve the dependencies, I am also able to reproduce it. The user who helped me had the same incompatibility on fedora37 shipping python3.11 with conda environments with python 3.10, 3.9 and 3.8. She found out that the installation works if the dependencies are resolved by conda at the same time:

conda create -n pynteny
conda activate pynteny
conda install -c conda-forge -c bioconda python pynteny

If you need more informations I am happy to provide them!

Robaina commented 1 year ago

Hi @c-thoben,

thank you so much for spotting and clarifying this issue. I have added some lines to README to point out this issue: try your solution and also install with mamba instead of conda (which is usually much faster to resolve dependencies). This should be enough for now. However, I'm thinking of further restricting dependency versions in the next update on bioconda, this should also help reduce installation time / solving dependencies.

lwasser commented 1 year ago

@Robaina i'm using a pure mamba forge installation and this install is not working for me on a MAC in a clean environment. something is up with the dependencies that i think needs to be resolved in a way that users can understand.

@c-thoben had a difficult time installing on linux as well and her points need to be addressed!

can you explain a few things to us?

conda install -c conda-forge -c bioconda python pynteny

why do you need to specify both conda-forge and bioconda channels when installing? are there some specific dependencies that are getting in the way?

the instructions in the readme are confusing because there are now 3 installation options as follows: there should be a single way to install a package that works.

Pynteny can be install with the following command:

conda install -c bioconda pynteny
If conda takes a long time to solve the dependencies (e.g., more than a few minutes) or fails to solve it, you may try:

conda install -c conda-forge -c bioconda python pynteny
Or install it with [mamba](https://github.com/mamba-org/mamba) instead of conda:

mamba install -c bioconda pynteny

finally there is a personal channel in your environment here. . we want to ensure that authors are in fact using public channels. can please explain why a personal channel is listed in the environment?

Many thanks. It's critical that you address this installation issue as so many of us are running into confusion installing this package. While i understand the complexity of dealing with specific channels and complex installs, our goal here it so support usability in all of our vetted packages.

Robaina commented 1 year ago

Hi @lwasser,

thanks for your comments.

@Robaina i'm using a pure mamba forge installation and this install is not working for me on a MAC in a clean environment. something is up with the dependencies that i think needs to be resolved in a way that users can understand.

I agree. For what @c-thoben and you tell me, it seems like something is wrong with the package (now outdated) version in bioconda. Pynteny is built and installed without problems in the tests GitHub Action, both in Ubuntu and Mac.

can you explain a few things to us?

conda install -c conda-forge -c bioconda python pynteny

why do you need to specify both conda-forge and bioconda channels when installing? are there some specific dependencies that are getting in the way?

bioconda channel needed for pynteny, conda-forge channel needed for python (solution suggested by @c-thoben). This was meant as a provisional fix until the new version was ready in bioconda. I have no trouble installing pynteny from bioconda in my Ubuntu 20.04 machine, so really could not know that this fix wouldn't work in other settings.

the instructions in the readme are confusing because there are now 3 installation options as follows: there should be a single way to install a package that works.
Pynteny can be install with the following command:

conda install -c bioconda pynteny
If conda takes a long time to solve the dependencies (e.g., more than a few minutes) or fails to solve it, you may try:

conda install -c conda-forge -c bioconda python pynteny
Or install it with [mamba](https://github.com/mamba-org/mamba) instead of conda:

mamba install -c bioconda pynteny

I agree. Ideally only conda install -c bioconda pynteny, perhaps suggesting usage of mamba only to accelerate installation. Will update pynteny on bioconda and try installing it in the GitHub Action (Ubuntu and Mac) directly from bioconda once updated as a way to check installation on MacOS (I don't have access to a Mac machine)

finally there is a personal channel in your environment here. . we want to ensure that authors are in fact using public channels. can please explain why a personal channel is listed in the environment?

Thanks for spotting this. This channel was a remanent of a previous version in which pynteny was hosted in my personal conda channel. It was requested for me to upload pynteny to bioconda prior to review. Already removed.

Many thanks. It's critical that you address this installation issue as so many of us are running into confusion installing this package. While i understand the complexity of dealing with specific channels and complex installs, our goal here it so support usability in all of our vetted packages.

Of course. I totally agree. Thanks for raising these issues. I will update pynteny in bioconda, test installation again, and come back to you all.

Robaina commented 1 year ago

@lwasser

Sorry, I just realized. There is still one dependency, Streamlit-Aggrid, that is hosted in my conda channel. That's why my channel was still listed in the environment.

This is a non-critical dependency, it is employed to make an interactive results table in the web application. While installable from pip, I had to upload it to conda to be able to list it as a dependency in bioconda (pip dependencies not allowed).

Since this dependency is not hosted in a public channel, I will first try to contact Streamlit-Aggrid developer to suggest uploading it to bioconda, if unsuccessful, I will remove this dependency altogether and make a static results table in the web application.

lwasser commented 1 year ago

no worries @Robaina -Many thanks for your prompt reply. i know that this review has been a LOT of work and you've been incredibly committed throughout the process! I just want us to figure out the install items and we can do that together if need be! just to be sure i've pinged a colleague who works on conda-forge to help us sort this out as well. but it ALSO sounds like part of this issue are things you've fixed that haven't been released yet which is very understandable.

i have seen packages (such as geopandas) that did have a dependency that you had to install manually. but i think it's not ideal for users. perhaps there is some way for you to provide instructions for using the interactive approach (if you wish) if you can't find a way to have this on conda.

however, If the package is on pypi already, it's actually quite simple to move it to conda forge using greyskull. please let us know if you need resources to share with the authors. i think this is quite doable and might then be less work for you assuming the authors are ok with it. then every time they deploy to pypi they also merge a pr on conda forge a few hours later. Then it's in both places. i've done this with several packages and it's easy to maintain.

Robaina commented 1 year ago

Thank you for the support, @lwasser. I'm sure we will figure this out.

Alright, so I went to Streamlit-Aggrid's repo and found out that I had already contacted the authors in October 2022 (so many things going on, I totally forgot). Would be nice if we could set the conda-forge pipeline if the author agrees (contacted him again). I let you all know.

lwasser commented 1 year ago

ok awesome! yup, we will figure this out! i'll follow your issue referenced above as well. it looks like the maintainers have a lot of issues and not enough time to address them all which is very understandable and common in this space!

Robaina commented 1 year ago

Hi @lwasser,

I've been thinking about removing the Streamlit web application from pynteny and making it a separate pip-installable package pynteny-app which would contain this functionality. This package would work as an optional add-on to pynteny (which would not be submitted to PyOpenSci)

This change would solve the issue with Streamlit-Aggrid, and probably also help solve dependencies by conda. All the functionality would still be accessible via CLI and within Python as a module (which are the main ways to access pynteny, anyway)

Would this change be acceptable with respect to the status of the current submission?

Thanks!

lwasser commented 1 year ago

@Robaina i think so. but let's check in with @arianesasso and our reviewers @Batalex @c-thoben so they are aware of the update. i see that this dependency is really not core to your functionality so it could be a nice add on as you mention and would streamline your installation significantly! Ari, Alex and Corinna - does this seem reasonable? i don't want to take over this review i just want to help ensure things are installable. many thanks. i can stop complicatin gthe review once we have decided :)

lwasser commented 1 year ago

hey @Robaina it looks like that package may get added to conda-forge by a maintainer. i can keep you posted here if that will save you some time.

lwasser commented 1 year ago

it actually was already there - https://anaconda.org/conda-forge/streamlit-aggrid @Robaina can you confirm that this is what you need? if so it should just be a simple environment update to make everything work.

Robaina commented 1 year ago

Yes, awesome, thank you @lwasser !

Robaina commented 1 year ago

Hi @arianesasso, @Batalex, @c-thoben and @lwasser,

after testing the new packages Streamlit and Streamlit-Aggrid in conda-forge, I am now facing a new issue with Streamlit, basically this one.

While I like the Streamlit app, I would like to prioritize other issues regarding core functionality of pynteny (e.g. adding support for the PFAM database). Therefore, I am now considering the idea previously commented of removing the web interface from pynteny and instead create a new addon pip package containing it.

Please, confirm that this would be an acceptable change to this submission. Here is the branch without the app.

Thanks for all the help :)

Batalex commented 1 year ago

On the one hand, I think that the streamlit app is orthogonal to the package itself. I mean that removing it does not change the package's "elevator pitch." On the other hand, the web app is mentioned in the paper.md (https://github.com/Robaina/Pynteny/blob/main/ms/paper.md) submitted to JOSS.

I would be +1 on moving it to another package, if it means that a larger pool of users can use the core package, but I wonder if it could be an issue with JOSS.

lwasser commented 1 year ago

Thanks @Batalex or maybe just removing it from the JOSS paper or mentioning it as a plugin / add on? it sounds like this won't impact our review at all - i think we can just retest installation once all of this is added on our end. And then on the joss end you can always edit the paper to clarify that component.

Robaina commented 1 year ago

Great, I agree. I will edit the paper. I should be fine since it is not even submitted yet. I'll let you know once I test installation after these new changes.

Robaina commented 1 year ago

Hi there @arianesasso, @c-thoben @Batalex ,

I have updated pynteny on bioconda, and tested installation directly from bioconda within a GitHub Action (under ubuntu-lastest and macos-latest).

I also updated setup instructions in the README, and modified python version, since now pynteny depends on python 3.10.

Please, could you verify that you can install pynteny in your settings? (perhaps @lwasser could also try installing it in MacOS?)

Thanks!

lwasser commented 1 year ago

sure thing i'll try to install on macOS @c-thoben can you please try to install again too following the new readme instructions? @Robaina thank you for your dedication and promptness in getting all of these things sorted out and fixed! it's much appreciated.

Batalex commented 1 year ago

I am not able to install on MacOS (M2)

> mamba create -n pynteny -c bioconda -c conda-forge python=3.10 pynteny

Looking for: ['python=3.10', 'pynteny']

bioconda/osx-arm64                                 129.0 B @ 296.0 B/s  0.4s
bioconda/noarch                                    @   2.3MB/s  2.0s
conda-forge/noarch                                 @   4.2MB/s  3.2s
conda-forge/osx-arm64                              @   1.7MB/s  3.5s
Could not solve for environment specs
Encountered problems while solving:
  - nothing provides requested pynteny

The environment can't be solved, aborting the operation

lwasser commented 1 year ago

i can confirm the same issue and response on my mac!

whats odd is i think the error is it can't find pynteny on bioconda :

PackagesNotFoundError: The following packages are not available from current channels:

  - pynteny

but it is there!

@Robaina do you need help with this? i think we can get you some help. can i invite you to our slack? we can talk there? if so please email me at leah @ pyopensci.org and i'll send you an invite so we can try to figure this out together.

Robaina commented 1 year ago

Sure thing @lwasser, will send you an email for that.

I think the issue might be related to the "new" ARM64 processor architecture in MAC. The machine in which @Batalex tried to install pynteny seems to have one (and all the new apple machines starting November 2020, I think).

Several dependencies of pynteny don't provide support for ARM64, e.g. hmmer, prodigal, pyfastx. Thus, pynteny itself doesn't have support for ARM64 architectures as of now. (To compare, here is another dependency psutil which does have ARM64 support on bioconda).

If that's the case, there is a workaround, basically telling conda to construct an environment using x86 architecture on ARM64 machines to install dependencies not yet supporting ARM64. Here is a blog post with instructions, for what I read, something like:

CONDA_SUBDIR=osx-64 conda create -n pynteny_x86 python=3.10
conda activate pynteny_x86
conda config --env --set subdir osx-64

and then, I guess:

(pynteny_x86) conda install -c bioconda pynteny

That's something new I learned today... I have no way to test if this workaround will be sufficient, though. Will send you the email now, anyway.

Thanks!

arianesasso commented 1 year ago

@Robaina Thank you for going the extra mile and making all those changes!

Batalex commented 1 year ago

The joys of novelty 😬

c-thoben commented 1 year ago

@Robaina I am able to install Pynteny with the README instructions on my Linux VM and also on MacOS without problems. As I have a 2018 MacBook, this may support the ARM64 processor theory. Thanks for looking into this issue @Robaina and thanks for your support @lwasser!

Robaina commented 1 year ago

Awesome! then perhaps I could add that workaround solution to the setup instructions for the people working on ARM64 machines.

lwasser commented 1 year ago

Ok i just installed in my M1 - the sequence below worked!

CONDA_SUBDIR=osx-64 conda create -n pynteny_x86 python=3.10
conda activate pynteny_x86
conda config --env --set subdir osx-64
conda install -c bioconda pynteny

it did take it a moment to import into python but it did import successfully NOTE: i talked with a conda forge maintainer about this - IF those three packages on bioconda could be also added to conda-forge then they can actually be added to a list where a bot tries to build the appropriate arch wheels (arm64 for osx in this case). of course there might be some reason that they aren't already there so this could be a non trivial task. but i think for our purposes if the package installs on everyone's computer who is here in this review, i think we'd done due diligence. and all that is needed is a documentation update for how to install.

Robaina commented 1 year ago

Excellent. Thanks for consulting this with conda-forge @lwasser . I have added these extra lines to README and also the documentation. This issue can always be updated in future releases.

Thanks so much everyone for the review!

pyOpenSci / software-submission

Pynteny: a Python package to perform synteny-aware, profile HMM-based searches in sequence databases #67

Description

Scope

Technical checks

Publication options

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

Code of conduct

Please fill out our survey

Editor and Review Templates

[ ] Initial onboarding survey was filled out We appreciate each maintainer of the package filling out this survey individually. :raised_hands: Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:

Editor comments

Foreword

Package Review

Documentation

Usability

Functionality

For packages also submitting to JOSS

Final approval (post-review)

Review Comments 🐈‍⬛

General comments:

CLI

Documentation

pynteny/init.py

pynteny/api.py

pynteny/cli.py

pynteny/filter.py

pynteny/hmm.py

pynteny/parser.py

pynteny/preprocessing.py

pynteny/subcommands.py

pynteny/utils.py

pynteny/wrappers.py

pynteny/app

Edits

Package Review

Documentation

Usability

Functionality

For packages also submitting to JOSS

Final approval (post-review)

Review Comments