pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/


harmonize-wq #157

Open jbousquin opened 6 months ago

jbousquin commented 6 months ago

Submitting Author: Justin Bousquin (@jbousquin)
All current maintainers: (@jbousquin)
Package Name: harmonize-wq
One-Line Description of Package: Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats
Repository Link: https://github.com/USEPA/harmonize-wq
Version submitted: 0.4.0
Editor: @Batalex
Reviewer 1: @rcaneill
Reviewer 2: @jacqui-123
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): 08/10/2024

Code of Conduct & Commitment to Maintain Package

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package after should it be accepted.
[x] I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.

Description

Include a brief paragraph describing what your package does: The US EPA's Water Quality Portal (WQP) is a data warehouse that facilitates access to data stored in large water quality databases in a common format. Tools exist to facilitate both publishing data to and retrieving data from WQP; harmonize-wq focuses on retrieved data: (1) cleaning it to ensure it meets the required quality standards, and (2) wrangling it into a more analytic-ready format. Although there are many examples where this has been done, standardized tools to perform this task could make it less time-intensive, more standardized, and more reproducible.

Scope

Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data processing/munging
- [ ] Data deposition
- [ ] Data validation and testing
- [ ] Data visualization[^1]
- [ ] Workflow automation
- [ ] Citation management and bibliometrics
- [ ] Scientific software wrappers
- [ ] Database interoperability

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo

Community Partnerships

If your package is associated with an existing community please check below:

[ ] Pangeo
- [ ] My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Who is the target audience and what are scientific applications of this package?
  Water quality domain experts trying to synthesize available data in a stream, bay, estuary, etc. More standardized data cleansing and wrangling allows outputs to be integrated into other tools in the water quality data pipeline, e.g., for integration into dashboards for visualization (Beck et al., 2021) or decision support tools (Booth et al., 2011).
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
  No Python packages to my knowledge; there is one in R: USEPA/TADA
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted: #132

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] uses an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a tutorial with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

[x] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [x] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

[x] I have read the author guide.
[x] I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

[x] Last but not least please fill out our pre-review survey. This helps us track submission and improve our peer review process. We will also ask our reviewers and editors to fill this out.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

isabelizimm commented 6 months ago

Hello there @jbousquin, thank you for submitting this issue--welcome to the pyOpenSci community! Just wanted to let you know we've seen your issue. The next step is for us to run some initial checks; we will give that first feedback soon.

In the meantime, if you have any questions you can ask here or in our discourse.

isabelizimm commented 6 months ago

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements below.

[X] Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
- [X] The package imports properly into a standard Python environment import package.
[X] Fit The package meets criteria for fit and overlap.
[X] Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
- [X] User-facing documentation that overviews how to install and start using the package.
- [X] Short tutorials that help a user understand how to use the package and what it can do for them.
- [X] API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
[X] Core GitHub repository Files
- [X] README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
- [X] Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
- [x] Code of Conduct The package has a CODE_OF_CONDUCT.md file.
- [X] License The package has an OSI approved license. NOTE: We prefer that you have development instructions in your documentation too.
[X] Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
[X] Automated tests Package has a testing suite and is tested via a Continuous Integration service.
[X] Repository The repository link resolves correctly.
[X] Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
[x] Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
[X] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

[ ] Initial onboarding survey was filled out We appreciate each maintainer of the package filling out this survey individually. :raised_hands: Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:

Editor comments

As a Floridian, I do appreciate your tutorial locations 🐊

A few quick fixes:

For the CODE_OF_CONDUCT file, it is optimal to have it at the root of the repository. Right now, it looks like yours is in docs/source/Code of Conduct.rst. I'd recommend moving that file, since that is the typical place people look for a CoC. Also, if it is in the root, it will show up as a "tab" next to your README, sort of how the MIT License is shown here 🎉

Second, pending some sort of tool that requires it, you shouldn't need a separate [metadata] section in your pyproject.toml.

In the meantime, I'll start hunting for an editor to facilitate a review for you!

jbousquin commented 6 months ago

Thanks @isabelizimm - made those suggested changes on pyOpenSci-review branch. Let me know if there is anything else while we wait.

isabelizimm commented 6 months ago

No other tasks yet! That should be good to start. I think I've got an editor just about figured out, I will let you know for sure mid-next week.

isabelizimm commented 5 months ago

Update: @Batalex will be the editor for harmonize-wq, guiding you through the review process. He will be the point of contact for things from here on out (although I am still happy to answer any questions if you need me!), and I've updated the Editor field in the initial comment on this issue.

Batalex commented 5 months ago

Hey @jbousquin, I am Alex, and I am delighted to be the editor for harmonize-wq!
During the coming week(s), I'll be looking into harmonize-wq's codebase and reaching out to potential reviewers. Meanwhile, feel free to ask me any questions you might have.

jbousquin commented 5 months ago

Thanks @Batalex. No questions so far, let me know if anything comes up.

Batalex commented 5 months ago

:wave: Hi @rcaneill and @jacqui-123! Thank you for volunteering to review for pyOpenSci!

Please don't hesitate to introduce yourselves. @jbousquin, I am pleased to announce that we found our A-team to proceed with the review.

Please fill out our pre-review survey

Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.

[x] @rcaneill survey completed.
[ ] @jacqui-123 survey completed.

The following resources will help you complete your review:

Here is the reviewers guide. This guide contains all the steps and information needed to complete your review.
Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: April 8th

Reviewers: @rcaneill, @jacqui-123 Due date: 2024/04/08

rcaneill commented 5 months ago

@rcaneill survey completed.

I just filled the survey

rcaneill commented 5 months ago

Hi @jbousquin I am happy to review this package and will start soon :)

jbousquin commented 5 months ago

Thanks @rcaneill! Let me know as things come up :)

rcaneill commented 5 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work.

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
[ ] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
- The package name is located after the badges, I guess that it is not an issue
[x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [x] A repostatus.org badge,
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

[x] Short description of package goals.
[x] Package installation instructions
[ ] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[ ] Link to your documentation website.
[ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[ ] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[ ] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software been confirmed.
[ ] Performance: Any performance claims of the software been confirmed.
[x] Automated tests:
- [x] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- branch new_release_0_4_0
- branch main at commit 81448a9
- [ ] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
  - is WIP https://github.com/USEPA/harmonize-wq/pull/58

For packages also submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software
[x] Authors: A list of authors with their affiliations
[x] A statement of need clearly stating problems the software is designed to solve and its target audience.
[x] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 8-10

Review Comments

Missing some instructions for the devs: https://github.com/USEPA/harmonize-wq/issues/63
request: automatic versions locking: https://github.com/USEPA/harmonize-wq/issues/65
missing link to doc: https://github.com/USEPA/harmonize-wq/issues/68
missing citation: https://github.com/USEPA/harmonize-wq/issues/69
add badges: https://github.com/USEPA/harmonize-wq/pull/70
https://github.com/USEPA/harmonize-wq/pull/72
https://github.com/USEPA/harmonize-wq/issues/73

Batalex commented 4 months ago

Please find below a list of comments, with my own format (editor's privilege 🐈‍⬛). I tried to rank them so that you can prioritize your work. I'll complete this list as I revisit the package.

Praises

praise (general): The code and the docs are extra clean.
praise (general): Whenever I see pint, I'm happy!

Typos

typo (readme.md): l7 on package name
typo (readme.md, contributing.rst): double spaces

Nitpicks

nitpick (general): I recommend adding a new line at each full stop in a markdown or rst paragraph. This way, we keep the lines short in git (easier to spot diffs in PR, easier to pinpoint a line with an issue). No worries, a single new line is not rendered.
nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL

Discussions

discussion (convert.py): About the TODO - both points of view (regrouping constants in a single place, or having them defined near their place of use to avoid jumping around the code base) are valid. I am usually in favor of the former.

Suggestions

suggestion (domain.py): In harmonize_TADA_dict, we could use a groupby operation to avoid looping through the dataframe using python. TOCHECK
suggestion (domain.py): We could replace the following pattern for x in list(set(pandas_series)) by using the .unique method
suggestion (domain.py, basic.py): out_col_lookup does not need to be a function. Same for all other functions returning a dict. If we make those simple module-level dicts, we can still list the sources in the module docstring.
suggestion (convert.py): We could add "references" sections in the docstrings so that the sources are present in the website and not only in the source code.
suggestion (basis.py, general): By using pandas' methods, we could streamline a little some operations. The choice is ultimately yours; I prefer using existing methods over rolling my own implementations, even if that means that other folks need to go to the documentation website to understand what is going on. For instance, here is my proposition for set_basis
```
import numpy as np

def set_basis(df, mask, basis, basis_col):
    return df.assign(**{basis_col: np.where(mask, basis, np.nan)})
```
I find this implementation easier to read (though I understand that this is debatable), and it is also more efficient. I have noticed that you use this pattern quite a few times throughout the code base, so I figured this might interest you.
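To illustrate the `.unique` suggestion above, here is a minimal sketch using a made-up Series. The values and the variable name `characteristic_names` are hypothetical and not taken from harmonize-wq. It shows that `Series.unique()` returns the distinct values in order of first appearance, whereas `list(set(...))` gives no ordering guarantee.

```python
import pandas as pd

# Hypothetical data standing in for a column of characteristic names.
characteristic_names = pd.Series(
    ["pH", "Salinity", "pH", "Temperature", "Salinity"]
)

# Pattern flagged in the review: builds a set, then a list; order is arbitrary.
via_set = list(set(characteristic_names))

# Suggested alternative: distinct values, in order of first appearance.
via_unique = list(characteristic_names.unique())

for name in via_unique:
    print(name)
```

Both approaches yield the same distinct values; the difference is mainly readability and the stable ordering of `.unique()`.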

Todos

todo (pyproject.toml): We should remove the metadata section.
todo (__init__.py): importlib.metadata was added in python 3.8, which is the minimal version supported by the package according to its pyproject.toml. The try .. except block should not be needed, even more so considering that importlib_metadata is not listed in the project requirements.
todo (basis.py): We could regroup the conditions branches in update_result_basis
todo (contributing.rst): To lower the cost of entry for potential contributors, let's make sure that we provide all the information they need. Consider adding a section describing how to set up their development environment (e.g. installing the test and docs dependencies).

Issues

issue (general): code quality (see below)
issue (domain.py): requests should be listed in the project's dependencies. The rationale is as follows: we should not import in our code any transitive dependency, because we have no guarantee that the primary dependency will not drop the former in a future update. As far as we know, dataretrieval could replace requests by httpx without notice in a patch release, which would break new harmonize-wq installations. The same can be said about pandas, though I agree it is unlikely that geopandas will change its backend dataframe lib.
issue (domain.py): We should specify what kind of exception we are expecting in re_case. Making a try except block too wide can lead to hard-to-debug issues.
issue (general): It seems that there are circular dependencies: harmonize -> visualize -> wrangle -> harmonize or clean -> wrangle -> clean as well. They do not raise an exception for now, but they will if any imported object is used at the module level. I strongly advise that we rework the project structure so that the files get imported in an acyclic fashion. It is also way easier to get familiar with the code base as a new contributor if the structure is predictable and linear.
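To make the point about overly wide `try`/`except` blocks concrete, here is a minimal sketch. The function `normalize_unit` and its `aliases` mapping are hypothetical and do not reproduce `re_case`; they only show the idea of catching the single exception we expect so that any other failure remains visible.

```python
def normalize_unit(unit, aliases):
    """Map a unit alias to its canonical spelling; leave unknown units as-is."""
    try:
        return aliases[unit]
    except KeyError:
        # Expected case: the unit has no alias. Other errors (e.g. a TypeError
        # for an unhashable key) are not swallowed and reach the caller.
        return unit


aliases = {"mg/L": "mg/l"}  # hypothetical alias table
print(normalize_unit("mg/L", aliases))
print(normalize_unit("ppt", aliases))
```

With a bare `except:` the unhashable-key case would silently return its input, which is the kind of hard-to-debug behavior the comment warns about.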

General recommendations

Code quality is important in a public package. It is obvious that a great amount of care went in making harmonize_wq, but what I mean by code quality is having tools enforcing conventions across the code base. Such conventions usually cover code format, and catching simple anti patterns.

To do so, I would advise you to use both a linter and a formatter. I usually recommend:

black for formatting the code
ruff to validate that the code follows good practices, and do quick fixes.

This is up for debate of course, some people might prefer one tool over another, but the point is that a project using such tools:

is more welcoming to external contributors
needs less time dedicated to low-value maintenance.

If you are ok with everything I said so far, I'd be happy to propose a PR to help you set up everything.

jbousquin commented 4 months ago

I'll start addressing these on a pyOpenSciReview branch (I'll try to be better about merging to main so other reviewers aren't running into the same things). Will generate an issue task list w/ any that are more involved. Let me know if there is anything else that I should be doing for review/edit tracking.

Would love a PR for black & ruff setup - have been running a linter and code analysis locally and definitely see the value for contributors/maintenance. Only concern is being able to easily ignore certain conventions when appropriate.

jbousquin commented 4 months ago

@Batalex fixing issue (general): circular dependencies - will be a breaking change. To resolve I moved functions from harmonize, df_checks()/add_qa_flag() to clean, convert_unit_series() to convert and units_dimension() to wq_data (to become a method). These seemed as logical a place to find them as harmonize. Now importing specific functions from other modules where practical. This breaks docs - before addressing that I wanted to confirm this is what you had in mind?

Batalex commented 4 months ago

@jbousquin Based on a quick look through the PR, yes that's exactly what I had in mind

Jacqui-123 commented 4 months ago

Great package! I hope these comments are helpful. This was my first package review so please let me know if there is anything I missed or if I was misguided with any of my comments.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
[ ] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[ ] Badges for:
- [ ] Continuous integration and test coverage,
- [ ] Docs building (if you have a documentation website),
- [ ] A repostatus.org badge,
- [ ] Python versions supported,
- [x] Current package version (on PyPI / Conda).

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[ ] Link to your documentation website.
[ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[ ] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality (Skipped this)

[ ] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software been confirmed.
[ ] Performance: Any performance claims of the software been confirmed.
[ ] Automated tests:
- [ ] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- [ ] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
[ ] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [ ] Package supports modern versions of Python and not End of life versions.
- [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software
[x] Authors: A list of authors with their affiliations
[x] A statement of need clearly stating problems the software is designed to solve and its target audience.
[x] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

approximately 8

Review Comments

1) Harmonize_Pensacola.Rmd:
- Small language changes suggested to make the installation process more user-friendly and clear:
  - Make it clear when something is an option to run and when it's a step-by-step instruction, as it switches back and forth in this demo. For example, could add "# Install the harmonize-wq package... [#option 1] package install... [#option 2] development version..."
  - Clearer separation of code chunks by task, so each code chunk focuses on a specific task. This makes debugging/error message interpretation easier, i.e., a new code chunk after options(reticulate.conda_binary = "..."), new code chunks after conda_install() section (lines 72, 81). (For good examples see the .ipynb demo files for this package.)
- I think use_condaenv("wq_harmonize") should be use_condaenv("wq-reticulate") (line 90)

2) Comments for Harmonize_CapeCod_Simple.ipynb:
- Easy to follow and clearly documented
- Attribute errors for harmonize_all(df, errors='ignore'): AttributeError: 'float' object has no attribute 'upper' (these attribute errors happened a few times in the other demos, too)

3) Usability: "All functions have documentation and associated examples for use" -> I wasn't completely clear on exactly what each function did, particularly some of the cleaning/tidying ones and how they changed the resulting dataframe. For example, what are all the flag options in the QA_flag column and what does each of them mean? The overall package was really clear in terms of what it was doing and how, but some of the nuances were less clear to me.

4) I am curious to know if the package looks at or flags the different method detection limits (mdl) that different analytical laboratories often use, or if that is an issue with this dataset? I tend to run into this issue in my work but I don't typically work with EPA datasets.
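The `AttributeError` reported in comment 2 is typical of missing values: a missing entry in a text column is commonly represented as a float NaN, and floats have no `.upper` method. The sketch below reproduces the error with plain Python and shows one defensive pattern. `safe_upper` is a hypothetical helper, not a function from harmonize-wq, and the actual cause in the package may differ.

```python
def safe_upper(value):
    """Upper-case strings; pass everything else (e.g. NaN) through unchanged."""
    return value.upper() if isinstance(value, str) else value


# Hypothetical column values with one missing entry.
values = ["mg/l", float("nan"), "ppt"]

# Calling .upper() directly on the NaN entry reproduces the reported error.
try:
    [v.upper() for v in values]
except AttributeError as err:
    print(err)  # 'float' object has no attribute 'upper'

cleaned = [safe_upper(v) for v in values]
print(cleaned[0], cleaned[2])
```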

Batalex commented 3 months ago

Hey @jbousquin, I just want to give you a brief update. @rcaneill privately reached out to me, and needs some more time to proceed with the review due to personal reasons. Meanwhile, you can proceed with the two reviews you have here so that this issue does not go stale for @Jacqui-123. Does this arrangement work for you?

jbousquin commented 3 months ago

Hey @Batalex - that works for me. I've already been working through issues/suggestions as received/as I can.

rcaneill commented 3 months ago

Hi @Batalex and @jbousquin, I finished my review (cf https://github.com/pyOpenSci/software-submission/issues/157#issuecomment-2015045269) I have 0 knowledge about the water quality field, but I found the doc quite clear :)

Batalex commented 2 months ago

Hey @jbousquin, I noticed that this review has been quite stale lately, and so has harmonize_wq's codebase.

Would you mind giving us a rough rundown on how and when you plan to address the reviews? My goal here is to set the proper expectations for everyone and manage our reviewers' time effectively.

jbousquin commented 2 months ago

Hey @Batalex - yes a couple PRs in the pipeline I need to check tutorials on but had to back-burner with the holiday and field season coming up. Hoping to get those merged this week and that should resolve most of the major changes. I've been sitting on the ruff PR to see if I can work it out as a pre-commit, trying to avoid contributors having to have the dev depends where possible.

Batalex commented 1 month ago

Hello @jbousquin,

I sent you a reminder a month ago about the review going stale, and I have not seen any public activity on the repository ever since.

As I said before, the deadlines in our process are more like guidelines as to when we expect things to move forward. However, as the editor for this submission, I have a responsibility to the volunteers who gave their personal time to do an in-depth review of harmonize-wq, and even submitted PRs. It is okay to be late, but I expect you to be transparent and committed to moving forward with the review. Per our review policy, I am putting this submission on hold, and will close it one month from now if I do not see any change. Thank you for your understanding.

jbousquin commented 1 month ago

Hello @Batalex,

Apologies - I was hoping to have gotten the tutorials checked against changes and summaries of changes/responses copied over here before getting buried in field work in June. My intent is not to be non-transparent, just hopeful I would have had a chance to do those small tasks by now. Should see some movement this week. Thank you for your understanding.

jbousquin commented 1 month ago

Hey @Jacqui-123, Thanks again for your review. Several changes over on the package repo I wanted to draw your attention to/responses to comments:

1. Harmonize_Pensacola.Rmd:
- Small language changes suggested to make the installation process more user-friendly and clear:
- Make it clear when something is an option to run and when it's a step-by-step instruction, as it switches back and forth in this demo. For example, could add "# Install the harmonize-wq package... [#option 1] package install... [#option 2] development version..."
- Clearer separation of code chunks by task, so each code chunk focuses on a specific task. This makes debugging/error message interpretation easier, i.e., a new code chunk after options(reticulate.conda_binary = "..."), new code chunks after the conda_install() section (lines 72, 81). (For good examples see the .ipynb demo files for this package.)
- I think use_condaenv("wq_harmonize") should be use_condaenv("wq-reticulate") (line 90)

Two PRs (67 & 78 from branch 62) were used to make these suggested updates to the example for setting up and running the python package in R. The second PR focused on CI/CD tests via GitHub Actions that will render the Rmd to help ensure there are no errors. One of the runners generates an artifact for easier inspection (e.g., https://github.com/USEPA/harmonize-wq/actions/runs/9811884248/artifacts/1672085755). Hopefully that will make it easier to identify any further text edit suggestions you have.

2. Harmonize_CapeCod_Simple.ipynb - attribute errors for harmonize_all(df, errors='ignore'): AttributeError: 'float' object has no attribute 'upper' (these attribute errors happened a few times in the other demos, too.)

Generated an issue for this, hard to reproduce but I have a feeling it has to do with dependency management and how you installed the package. I'm hopeful changes to the pyproject file will fix it (https://github.com/USEPA/harmonize-wq/commit/b125f65631d2395dcf3c15a3b3444afdaafd7389), but if not we can try to dig into this error a bit more.
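For readers unfamiliar with this message, here is a minimal sketch of one common way it arises: a missing value (a float NaN) sitting in a column that is otherwise text. This is not a diagnosis of the reported issue, which the author suspects is dependency related; `safe_upper` and the sample values are hypothetical.

```python
import math

def safe_upper(value):
    """Upper-case a string; pass anything else (e.g. NaN) through unchanged."""
    if isinstance(value, str):
        return value.upper()
    return value

# A text column where one cell is missing and was read in as float NaN.
units = ["mg/l", float("nan"), "ug/l"]

# Calling .upper() on every cell fails as soon as it reaches the NaN.
try:
    [u.upper() for u in units]
    message = None
except AttributeError as err:
    message = str(err)

# Guarding on the type avoids the error and keeps the missing value as-is.
cleaned = [safe_upper(u) for u in units]
```

If the error does come from data rather than from the environment, a type guard like this (or dropping missing values first) is usually enough to confirm it.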

3. usability: -"All functions have documentation and associated examples for use" -> I wasn't completely clear on exactly what each function did, particularly some of the cleaning/tidying ones and how they changed the resulting dataframe. For example, what are all the flag options in the QA_flag column and what does each of them mean? The overall package was really clear in terms of what it was doing and how, but some of the nuances were less clear to me.

I suspect the issue here is having something that goes deeper than the function documentation, which is what the tutorials are meant to do. There are a lot of functions (50+), each should currently be documented in numpy style (with input/return parameter types/descriptions and examples). clean.add_qa_flag() is meant to be used by higher level functions to add QA_flags as the data are cleaned and harmonized, i.e., to make changes/assumptions that might have quality issues more transparent to the user and allow them to filter/remove on them if it doesn't meet their QA standards. Those higher-level functions should document the specific flag string used, e.g., basis.basis_from_unit() provides an example where speciation was updated by conflicting meta-data. However, the add_qa_flag() function is exposed to the user because we can't anticipate all data they may want to flag. The example shows a custom mask and flag text (it's a bit of a spam and eggs type example, simplified to show how it works) whereas examples in the tutorials are more 'real-world'. In e.g., Harmonize_Pensacola_Detailed.ipynb, code-block 11 we show how the docstring for harmonize_locations can be displayed, and that references how it implements 'QA_flag' to identify 'any row that has location based problems like limited decimal precision or an unknown input CRS'. In code-block 15 of the same demo we examine what QA_flags were assigned. In code-block 27-29 we look to this flag to help explain why ResultMeasure/MeasureUnitCode is NaN. There are several additional examples of this in that notebook and all of the detailed notebooks should follow a similar structure. Please let me know if any of the functions are missing documentation or examples in the docs, if you have suggestions for improving any of those descriptions/examples, if you have suggestions for improving the detailed tutorials to make the use of QA_flags clearer, etc.
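To make the flagging idea above concrete, here is a rough plain-Python sketch of "apply a mask, append flag text, let the user filter on it". It is not the package's implementation: `add_flag`, the row layout, and the flag strings are all made up for illustration, and the real functions operate on DataFrames.

```python
def add_flag(rows, predicate, flag, column="QA_flag"):
    """Append `flag` to the QA column of every row where `predicate` is true."""
    for row in rows:
        if predicate(row):
            existing = row.get(column)
            # Keep earlier flags so a row can carry more than one reason.
            row[column] = f"{existing}; {flag}" if existing else flag
    return rows

# Hypothetical records; the third already carries a location-related flag.
rows = [
    {"site": "A", "value": 7.1, "unit": "mg/l"},
    {"site": "B", "value": None, "unit": None},
    {"site": "C", "value": 6.4, "unit": "mg/l", "QA_flag": "imprecise location"},
]

add_flag(rows, lambda r: r["unit"] is None, "missing unit")
add_flag(rows, lambda r: r["value"] is not None and r["value"] < 7, "below 7 mg/l")

# A user with strict QA standards can now filter on the flag column.
flagged_sites = [r["site"] for r in rows if r.get("QA_flag")]
unflagged = [r for r in rows if not r.get("QA_flag")]
```

The point of keeping the flag as text on the row, rather than dropping the row, is that the decision to exclude data stays with the user.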

4. I am curious to know if the package looks at or flags the different method detection limits (mdl) that different analytical laboratories often use, or if that is an issue with this dataset? I tend to run into this issue in my work but I don't typically work with EPA datasets.

This is 100% the direction of some future feature adds. Specifically, 17 plans to address detection limits. It is a multi-part problem though. The existing function will pull in detection limits from that specific meta-data table, but then it needs to be compared against results to determine if the result value was under it and a QA_flag needs to be assigned. If the result value is under the limit there are several alternatives to estimate values statistically (the user would have to actively choose to alter results in this way but we could port the functionality from USEPA/EPATADA). However, as you've also identified, the data-provider may have specified a method with a standard MDL, in which case the detection limit might not be in the meta-data table and might have to be inferred from those methods. Methods filtering 37 is the first step for that, where we start to develop a table/dict of standard methods and try to recognize them (a lot of differences in how they are entered). MDL could be associated with each as a col in that table/lookup or in a related table/lookup.
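The comparison described above could look roughly like the sketch below. It is only an outline of the planned logic under simple assumptions, not existing package code: `detection_flag`, `METHOD_MDL`, the method names, and every number are hypothetical.

```python
# Hypothetical lookup of standard methods to their detection limits (mg/l).
METHOD_MDL = {"method-A": 0.01}

def detection_flag(result, reported_limits, method_mdl=METHOD_MDL):
    """Return flag text for a result, or None when it is above the limit."""
    # Prefer the limit reported in the detection-limit meta-data table.
    limit = reported_limits.get(result["id"])
    if limit is None:
        # Fall back to a limit inferred from a recognized standard method.
        limit = method_mdl.get(result.get("method"))
    if limit is None:
        return "detection limit unknown"
    if result["value"] < limit:
        return "below detection limit"
    return None

results = [
    {"id": 1, "value": 0.004, "method": "method-A"},
    {"id": 2, "value": 0.25, "method": "method-A"},
    {"id": 3, "value": 0.02, "method": None},
    {"id": 4, "value": 0.03, "method": "in-house"},
]
reported_limits = {2: 0.05, 3: 0.05}  # stand-in for the meta-data table

flags = {r["id"]: detection_flag(r, reported_limits) for r in results}
```

Estimating replacement values for flagged results is deliberately left out, since that would be an active choice by the user.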

jbousquin commented 1 month ago

@Batalex - weird I commented your responses a couple weeks ago, but just came back to make sure I hadn't missed anything from you and don't see that comment here... I'll try to re-create, mainly just copying over month old status from the repo (there is also follow-up on your draft PR that I'd written after as follow-up in case you didn't see it here)

jbousquin commented 1 month ago

@Batalex If you would like additional links/line numbers just let me know:

Typos should be resolved as suggested

Nitpicks

nitpick (general): should be resolved as suggested

nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL

This url is only used once at the moment, but is currently a raw string (1) to allow it to be easily integrated into feature adds (i.e., intend to use it more places, especially w/ WQX 2->3), and (2) for easier maintenance given the repo is still under development (e.g., like when the url recently changed).

Discussions

Kept it in the convert module because fewer module references made it easier to ensure there are no circular references. Already importing registry_adds_list from domains so there isn't a strong reason not to move it there if the need arises in the future.

Suggestions

suggestion (domain.py): In harmonize_TADA_dict, we could use a groupby operation to avoid looping through the dataframe using python. TOCHECK

should be resolved as suggested, was there more to the TOCHECK?

suggestion (domain.py): We could replace the following pattern for x in list(set(pandas_series)) by using the .unique method

should be resolved as suggested
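The two domain.py suggestions above (groupby instead of a Python loop, and `.unique` instead of `list(set(...))`) can be sketched on a toy frame as follows. The column names and values are invented for illustration and are not taken from the package.

```python
import pandas as pd

# Toy data; column names are hypothetical.
df = pd.DataFrame(
    {
        "characteristic": ["Nitrogen", "Nitrogen", "Phosphorus", "Nitrogen"],
        "unit": ["mg/l", "ug/l", "mg/l", "mg/l"],
    }
)

# Pattern flagged in the review: round-trip through set(), then loop in Python.
looped = {}
for name in list(set(df["characteristic"])):
    subset = df.loc[df["characteristic"] == name, "unit"]
    looped[name] = sorted(set(subset))

# Alternative: .unique() keeps first-appearance order, and groupby does the
# per-characteristic split without filtering the frame once per name.
names = df["characteristic"].unique()
grouped = df.groupby("characteristic")["unit"].unique()
vectorized = {name: sorted(units) for name, units in grouped.items()}
```

Both routes produce the same mapping here; the groupby version mostly pays off on larger frames and reads closer to the intent.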

suggestion (domain.py, basic.py): out_col_lookup does not need to be a function. Same for all other functions returning a dict. If we make those simple module-level dicts, we can still list the sources in the module docstring.

These have been updated to be module-level dicts, but I'm not sure how you are proposing the docstrings could be included. Hate to lose all the examples etc. on these, have you seen this in documentation for other projects you could point me to?

suggestion (convert.py): We could add "references" sections in the docstrings so that the sources are present in the website and not only in the source code.

When a conversion function has equation or method references, the documentation has a reference section for that (e.g., conductivity_to_PSU). However, if the information is for code/checks then it goes in as a comment in the code (e.g., the url in DO_concentration points to a converter written in JS). In those cases is it adequate/suggested to add contextual comments, e.g., # To check compare against:

suggestion (basis.py, general): By using pandas' methods, we could streamline a little some operations. The choice is ultimately yours; I prefer using existing methods over rolling my own implementations, even if that means that other folks need to go to the documentation website to understand what is going on.

I agree on using existing methods, I really tried to implement this suggestion but ran into issues. In the provided example if there are existing values in columns those need to be preserved. That can be done with an if/else. Additionally, numpy.where will coerce the other values (y) to the dtype which is problematic for nan. Do-able, but more complex than the current solution.

Todos

pyproject.toml & __init__: should be resolved as suggested

basis.py: regroup conditions in update_result_basis

Admittedly these additional basis columns haven't received much attention yet (not frequently leveraged by those entering data), and it was coded this way to make it easy to come back to and write additional specific handling. For now we combined weight/time, left particleSize as is with added notes specific to its handling.

contributing.rst: Added dev section

Issues

domain.py: dependencies

Added the suggested dependencies (stopped short of pandas but did include numpy). pyproject.toml should populate dependencies from requirements now, decreasing maintenance/risk of differences.

domain.py: specify exception expected by re_case

Resolved as suggested

Circular dependencies: should be resolved as suggested

General recommendations

To summarize, working on implementing black. All the code changes are sitting on the pyOpenSci-review branch. It runs locally as suggested in your PR. I'm trying to get my head around pre-commits so that contributors will have style/format checks without having to run it locally.

jbousquin commented 1 month ago

@rcaneill - Really appreciate your doing issues/PRs over on the repo (saves steps!). I think we resolved everything over there (leaving the citation issue open so it gets resolved after), but let me know if I missed anything from your review here.

Batalex commented 3 weeks ago

@jbousquin, here is some quick feedback.

nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL

This url is only used once at the moment, but is currently a raw string (1) to allow it to be easily integrated into feature adds (i.e., intend to use it more places, especially w/ WQX 2->3), and (2) for easier maintenance given the repo is still under development (e.g., like when the url recently changed).

I am not sure how using a raw string is relevant to the reasons you mentioned. Maybe we are not talking about the same thing: I am speaking about the r prefix in r"http://url.com". Raw strings are usually used in regular expressions.
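A quick illustration of the point, using a placeholder URL rather than the real TADA one: the r prefix only changes how backslashes are read, so it makes no difference for a URL without backslashes but does matter for regular expressions.

```python
import re

# Placeholder URL (not the real TADA_DATA_URL): no backslashes, so the
# raw and plain literals are the same string.
plain_url = "https://example.org/data/table.csv"
raw_url = r"https://example.org/data/table.csv"
same_url = plain_url == raw_url

# With backslashes the prefix matters: "\n" is one newline character,
# while r"\n" is a backslash followed by the letter n.
newline_len = len("\n")
raw_len = len(r"\n")

# Typical use of raw strings: regex patterns, where escapes stay readable.
pattern = r"\d+\.\d+"
same_pattern = pattern == "\\d+\\.\\d+"
numbers = re.findall(pattern, "pH 7.25 at 12.5 C")
```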

suggestion (domain.py, basic.py): out_col_lookup does not need to be a function. Same for all other functions returning a dict. If we make those simple module-level dicts, we can still list the sources in the module docstring.

These have been updated to be module-level dicts, but I'm not sure how you are proposing the docstrings could be included. Hate to lose all the examples etc. on these, have you seen this in documentation for other projects you could point me to?

The idea would be to add the sources and any relevant information in the module docstring:

constant.py

"""
Constants submodule.

References
----------

Planck:
The NIST Reference on Constants, Units, and Uncertainty. [NIST](https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology). 20 May 2019.
"""

planck = 6.62607015e-34

Then you can access the source using help on the submodule, just like you would on a function:

python -c "import constant; help(constant)"

Help on module constant:

NAME
    constant - Constants submodule.

DESCRIPTION

    References
    ----------

    Planck:
    The NIST Reference on Constants, Units, and Uncertainty. NIST. 20 May 2019.

DATA
    planck = 6.62607015e-34

As for the rest of my original points, I am okay with the changes / reasons not to change. Nice job!

Batalex commented 3 weeks ago

@Jacqui-123, @rcaneill Were your concerns addressed?

jbousquin commented 3 weeks ago

@Batalex RE quick feedback:

Ah! You really did mean it being raw string not it being a constant, resolved on branch (passing, will merge with the linting).

Docstrings for dict constants - what I was stuck on was what to document them as at the module level ('Attributes' for sphinx). I'm not sure how to do the child level of an attribute, e.g., Examples, but I'll play around with it. For a docstring at the variable, I wasn't sure how to associate it (still not sure of that, but looking at the sphinx doc helped me understand it needed to come after the assignment). Documented that way the child level works, but I see that it doesn't seem to be part of the module-level help, and I'm not sure how you would get help() to retrieve the variable-level docstring (will look into that if module level doesn't work out).
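On the module-level versus variable-level question, a small self-contained sketch of what Python keeps at runtime may help. It builds a stand-in module in memory; the module name, constant, and docstring text are illustrative only. The string literal placed after the assignment is the form Sphinx autodoc reads from the source, but Python itself does not attach it to the value.

```python
import types

# Source for an illustrative constants module (names are hypothetical).
SOURCE = '''"""Constants submodule.

References
----------
Planck:
    The NIST Reference on Constants, Units, and Uncertainty.
"""

planck = 6.62607015e-34
"""Attribute docstring: picked up by Sphinx autodoc, discarded at runtime."""
'''

# Build the module in memory instead of importing a file from disk.
constants = types.ModuleType("constants")
exec(SOURCE, constants.__dict__)

# help(constants) draws its DESCRIPTION from the module docstring.
module_doc = constants.__doc__

# The value is a plain float, so its __doc__ is the float type's docstring.
value_doc = type(constants.planck).__doc__
```

So anything that needs to show up in help() has to live in the module docstring, while the per-variable text is only visible in the rendered docs.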

Jacqui-123 commented 3 weeks ago

@jbousquin Thanks so much for the detailed response to my review/comments. The changes look great, and I appreciate your explanations. @Batalex I don't have anything further to add but let me know if you need anything else.

Batalex commented 3 weeks ago

Perfect, I just need you to check the approval box in your review above. Thank you so much for contributing to this review!

jbousquin commented 3 weeks ago

@Batalex RE:RE quick feedback: module level doc-strings are passing for both help() and docs.

pre-commits are very close to working, just need ruff to see settings in pyproject.toml like it does when local. Tried a few things based on pre-commit issues but haven't solved it yet. Close to just writing them out in the config - but reluctant since that duplicates what is in the toml (more maintenance making sure they always match)

rcaneill commented 3 weeks ago

@Batalex I am happy with the changes made / the answers when the authors disagreed with me

jbousquin commented 3 weeks ago

@Batalex - resolved ruff checks with pre-commits on PR 89, please let me know if there is anything unresolved from your review. Really happy getting lint/formatting as part of this workflow and thank you as the edits to the pyproject.toml in your draft PR helped immensely!

Batalex commented 1 week ago

🎉 harmonize-wq has been approved by pyOpenSci! Thank you @jbousquin for submitting harmonize-wq and many thanks to @rcaneill and @Jacqui-123 for reviewing this package! 😸