Closed JacksonBurns closed 8 months ago
Hi @JacksonBurns,
Thanks for submitting astartes
to pyOpenSci and welcome to our submission procedure!
I'm Chiara and I will take care of the process as Editor.
After checking with the editorial board, I'm glad to confirm that your package is considered in scope for a pyOpenSci submission. I will perform the editor checks by the end of the week, and start looking for reviewers.
In the meantime I have a first question for you, if you don't mind. While the statistical side of the package is clear and answers a well-known need of the scikit-learn user community, I would like to read a bit more about the package's connections with chemical engineering. I guess your scientific application was the motivation for this development in the first place, but in the documentation the two aspects look quite unrelated. Perhaps a small pedagogical section about chemical applications in the documentation would also be helpful? Thanks!
@cmarmo thanks for the quick response and intro!
Thank you for sharing the update about scope and reviewers.
We will add a section to the documentation about applications to chemical engineering problems in particular - thank you for this suggestion.
Hello @JacksonBurns,
Below are the basic checks that your package needs to pass to begin our review.
Please check our Python packaging guide for more information on the elements below.
- The package can be imported: `import package-name`.
- `README.md` file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
- `CONTRIBUTING.md` file that details how to install and contribute to the package.
- `Code of Conduct` file.
- `YAML` header of the issue (located at the top of the issue template).

The package is already in excellent shape, congratulations and thanks to all the co-authors for your hard work! I have some comments; they are not meant to be blockers for the review process, just suggestions for possible improvements.
- Some external links are not correctly rendered: for example, `` ``AIMSim` <https://vlachosgroup.github.io/AIMSim/README.html>`_ `` can be replaced by `` `AIMSim <https://vlachosgroup.github.io/AIMSim/README.html>`_ ``.
- The `CONTRIBUTING.md` file is clear and complete. As users are explicitly asked to provide the package version when opening an issue, I'm suggesting to define an `astartes.__version__` attribute and to detail how to obtain this information when debugging.

I think this package may be relevant for the scikit-learn user community; perhaps you might be interested in linking `astartes` in the related project page among the preprocessor tools, once the review process has come to an end.
I'm going to look for the two reviewers now (good news, I already found one! :) ). I will be back to you in a week (hopefully sooner), to check if we are able to start the process.
Hi @cmarmo thanks for the speedy work!
- [ ] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?
I have just turned on automatic Zenodo archiving of releases (which I also just learned was a thing), so hopefully this doesn't fall behind again! I have also uploaded the current version (1.1.1) to Zenodo manually.
The package is already in excellent shape, congratulations and thanks to all the co-authors for your hard work! I have some comments, they are not meant to be blockers for the review process, just suggestions for possible improvements.
Thank you for the kind words!
- It seems to me that the introduction about splitting algorithms is missing some references for those, like me, not really into chemical engineering. In particular I looked for the original work defining the Kennard-Stone algorithm (Kennard, R.W., and Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137-148) and I would like to see it cited in your documentation. The other algorithms also would benefit from some more references, when possible openly accessible, as unfortunately the links to most of the articles are pay-walled.
The reference for the Kennard-Stone algorithm is in the Implemented Sampling Algorithms table, but it is not at all obvious what the links in the `References` column actually are. We will clarify this, as well as add references where missing.
As far as accessible articles, we can also try and find open-access versions of those which are pay-walled.
- Some external links are not correctly rendered ... I'm afraid I couldn't find a solution for backquoted text with an external link in Sphinx... but I think having the link rendered is more readable. (For example, `` ``AIMSim` <https://vlachosgroup.github.io/AIMSim/README.html>`_ `` can be replaced by `` `AIMSim <https://vlachosgroup.github.io/AIMSim/README.html>`_ ``.)
:+1: will do.
- Having jupyter notebooks in the documentation helps the user to understand how to use the package: however, I wasn't able to run the binder links (because of the initialization failing with the error "no space left on device"). I know Binder is downsizing, if you don't have alternatives perhaps it would be safer to just remove the links, in order to not unnecessarily pressure the system.
We can definitely do that. We will perhaps try to migrate these to target Google Colab.
- The `CONTRIBUTING.md` file is clear and complete. As users are explicitly asked to provide the package version when opening an issue, I'm suggesting to define an `astartes.__version__` attribute and to detail how to obtain this information when debugging.
:+1: sounds good!
- The API section of the documentation would benefit from more examples: the user discovers in it that the package contains a lot more than the train-test-split functions, but it is not always clear how to exploit all these tools.
:+1: will do.
I think this package may be relevant for the scikit-learn user community; perhaps you might be interested in linking `astartes` in the related project page among the preprocessor tools, once the review process has come to an end.
Didn't know this existed, thanks for sharing! We will do that.
I'm going to look for the two reviewers now (good news, I already found one! :) ). I will be back to you in a week (hopefully sooner), to check if we are able to start the process.
:tada:
@cmarmo quick question - we are getting these suggested changes made now (https://github.com/JacksonBurns/astartes/pull/147). Since we submitted astartes `v1.1.1`, can we go ahead and merge into `main`, or will that interfere with the review process?
Since we submitted astartes `v1.1.1`, can we go ahead and merge into `main`, or will that interfere with the review process?
Would you be ok with making a quick release of the changes before the review starts and updating the description? As long as the version does not change during the review process, I think this is not an issue. Thanks for addressing the comments already!
That sounds good - we will get the changes in and make a release for the reviewers (and update the description) asap. Thanks!
Hi @cmarmo we have completed the edits mentioned above and pushed a new version (v1.1.2) and updated the description for this PR. Thanks again!
Hi @JacksonBurns, thank you for your new release: I've still spotted some small issues in the documentation, but I think it's better to move forward with the review. I'm still looking for the second reviewer. I hope to find someone by the end of the week. Thank you for your patience.
Hello @JacksonBurns , I'm finally back! Thank you for your patience! Sorry it took more time than expected.
@BerylKanali, @du-phan welcome! :wave: Thank you for volunteering to review for pyOpenSci!
Please take some time to introduce yourself here before starting the review.
Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.
The following resources will help you complete your review:
Note that we ask reviewers to complete reviews in three weeks; the review for astartes is therefore due August 22nd. Please let me know if you have any major issue preventing you from meeting this deadline.
Please get in touch with any questions or concerns!
Thanks again for your commitment!
Hi team, happy to help with this review! I'm a research scientist at Dataiku, a software company developing a data science platform. We are part of the scikit-learn consortium and have collaborated with them for quite some time. We have also developed and maintained our own open-source packages for some specific ML use cases, for example Cardinal for active learning.
Hi everyone,
My name is Beryl Kanali, I am a Data Scientist and Open Source Advocate. I am also currently a graduate student. I like contributing to open source and I am happy to be a reviewer for astartes.
@cmarmo I have completed the pre-review survey.
Thanks @du-phan and @BerylKanali for volunteering your time for this review, we appreciate it! Continued thanks to @cmarmo as well for orchestrating the review process.
We look forward to seeing your reviews.
p.s. @du-phan I hear about Dataiku all the time - you are one of the key sponsors to my hometown National Public Radio Station WVXU!
Hi team, I am almost done with the review process; I will be done by the end of tomorrow, which is the deadline.
Hi Jackson,
Sorry that my review took a while; August is always a slow month.
Thank you for your contribution! It is a nice package.
The package includes all the following forms of documentation:
`pyproject.toml` file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that the badge for pyOpenSci peer review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing: 4-5
The `astartes` package tackles an important step in the Machine Learning pipeline that is usually overlooked: generating robust data splits for training and validating model performance. It provides several strategies to go beyond classical random sampling and thus allows data practitioners to evaluate their models under different scenarios of interpolation/extrapolation.

- The package offers `train_test_split()` and `train_val_test_split()`, which are intuitive enough for users as they just need to swap their sklearn function for these. That's great!
- `generate_regression_results_dict` is a helpful util function, but it looks a bit "experimental" to me as 1) it just supports regression models and 2) users cannot choose the desired metrics. It feels like with a bit of work we can have something more generic.
- It was only after reading the `split_comparison` notebook that I understood the "weird" behaviors of `train_test_split()` that I observed above. Maybe we can make this information easier for users to find.

Hi @du-phan thank you for the review!
- However, the devil is in the details: depending on the kind of sampler used (extrapolative vs. interpolative), the number of returned arrays is not the same (as cluster ids can be returned too). As this point is not clear in the docs, users can get confused (I was).
We have clarified this in the README, and in doing so referenced the demonstration notebook which you mentioned was useful as an explainer.
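To make the distinction concrete, here is a toy stand-in (hypothetical code, not astartes' implementation; check the docs for the real signatures): interpolative samplers return the four usual arrays, while extrapolative samplers additionally append the cluster labels.

```python
# Toy illustration of the two return shapes described above.
# These names are placeholders, not the real astartes API.
def toy_split(extrapolative):
    base = ("X_train", "X_test", "y_train", "y_test")
    if extrapolative:
        # extrapolative samplers additionally return cluster labels
        return base + ("labels_train", "labels_test")
    return base

print(len(toy_split(False)))  # 4 arrays for an interpolative sampler
print(len(toy_split(True)))   # 6 arrays for an extrapolative sampler
```

Unpacking into the wrong number of variables is exactly the confusion described above, so documenting both shapes side by side helps.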
- Some methods do not support categorical features, which is understandable once you know how the sampler works, but maybe we can add those details somewhere in the docs.
:+1: added details to the docs!
- The `generate_regression_results_dict` is a helpful util function, but it looks a bit "experimental" to me as 1) it just supports regression models and 2) users cannot choose the desired metrics. It feels like with a bit of work we can have something more generic.
We have added the ability to pass custom metrics to the function, and have ideas about adding a more generic `generate_results_dict` function which would be wrapped by `generate_regression_results_dict` and `generate_classification_results_dict` - although we tend to focus on regression.
- The statement of needs is lacking in the README, maybe you can add the one in the paper to it?
:+1: added!
The changes I have described here are in this draft pull request on `astartes` if you would like to take a look to see if they satisfy your comments here. After the review from @BerylKanali we will convert this into a normal pull request to finish the review process.
Thanks again!
Hi Jackson,
Thank you for your patience, I am now settled and here is my review.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
`pyproject.toml` file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that the badge for pyOpenSci peer review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing: 4 hours 40 minutes
As a person who has used sklearn many times, I am impressed by the `astartes` package. Being able to quickly analyze the impact of sampling algorithms and to evaluate a model even further under different interpolation/extrapolation cases is an important step that will help in machine learning.
Documentation
Functionality: The package installation was easy and went as per the instructions. The functional and performance claims are satisfied.
Usability: The functions are easy to use and remember, especially if you have used sklearn before. It is also great that ...

- Are there any suggestions or ongoing work to improve `generate_regression_results_dict` to make it more generic? Also, is there adequate information on what it can presently do?

Thank you Du and Beryl for your work! I have changed the status of the issue to "awaiting changes". @JacksonBurns do you think a three-week deadline will allow you to address the comments?
@BerylKanali thank you for the review!
@cmarmo yes, three weeks is sufficient. I plan to have the comments addressed sooner, even.
Hello @JacksonBurns, I have been helping out the reviewers, and in doing so have looked over some of `astartes`. I hope you will allow me to submit some additional comments, even as someone who has not before used sklearn.

- there are several places where a subclass overrides a method only to call `super`, adding a call frame for no real change in logic.
- almost all of your `__init__.py`s hoist their submodules' public interfaces. Although this does shorten the dotted namespaces a user must type, it also forces every module to be found and loaded into the Python interpreter before any import from your package can be used. Yours is not a very large package, but this will slow down all of its imports and increase its memory footprint, and this for a package in which, it seems to me, a user would never need all available interpolation/extrapolation variants.

Review Comments
As a person who has used sklearn many times, I am impressed by the `astartes` package. Being able to quickly analyze the impact of sampling algorithms and to evaluate a model even further under different interpolation/extrapolation cases is an important step that will help in machine learning.
Thanks for the kind words!
Documentation
- The statement of need is missing. A user should be able to know why they need the package. However, I have gone through the draft pull request as well, and I believe the need has been addressed: I can see two clear needs, which is good.
:+1:
- I also think it is better to have a clear target audience: with that statement of need we should be able to answer whether the package is useful in machine learning generally or only for those applying it to chemical data. This is mostly helpful to beginners when they are trying to determine if the package is useful to them. That is not yet clear.
Added!
- Any plans on adding vignettes?
I've expanded our Quick Start example to be copy-pastable. As far as other examples, we have some great demo notebooks and are now trying to point to those in multiple places in the README.
- Suggestion: in future you could add small sections on installation for Windows, macOS, and Linux, like what you have done with pip and conda in the documentation. This has proven to help, especially in promoting the inclusion of underrepresented persons whose first language is not English: it goes directly to what they might need, reducing what they have to read through.
:+1: we moved the platform installation tip above the conda/pip/source installation section. The only difference between platforms is the inclusion of quotes around one argument.
Functionality: The package installation was easy and went as per the instructions. The functional and performance claims are satisfied.
Usability: The functions are easy to use and remember, especially if you have used sklearn before. It is also great that ...
- Do all methods support categorical data? This has been addressed in the draft PR and it is easier now to know which methods support it and how to use them.
- Are there any suggestions or ongoing work to improve `generate_regression_results_dict` to make it more generic? Also, is there adequate information on what it can presently do?
One plan we have internally is to make another `generate_classification_results_dict`, which would involve a small internal refactor to create a more general `generate_results_dict`. We are more familiar with regression, so we are reading up on what metrics would be good defaults, and then we will consider where to go.
For this submission, though, I have added some additional details clarifying what the function actually returns.
- From the draft PR I can see the issue on number of returned arrays as mentioned by @du-phan.
:+1:
Thank you for the review! All of the changes I mentioned making above are on this draft PR (the same you already found). Please let me know how they look.
Hello @JacksonBurns, I have been helping out the reviewers, and in doing so have looked over some of `astartes`. I hope you will allow me to submit some additional comments, even as someone who has not before used sklearn.
The additional feedback is appreciated!
- the version metadata can be dynamically determined from a code variable, meaning that the version in the REPL and in package METADATA can never be out of sync (instructions here)
Thanks for the link - just done now.
- more and more open source projects are providing a pronunciation guide at the top of their readme. IMO this would help as I don't know where the syllables break up.
:+1: this has come up a lot, should have added sooner!
- there are several places where a subclass overrides a method only to call `super`, adding a call frame for no real change in logic.
Thanks for catching this, just removed them.
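For reference, the removed pattern looks like this (hypothetical classes): the override adds a frame to every call without changing behavior, so deleting it is a pure win.

```python
class Base:
    def fit(self):
        return "fitted"

class Redundant(Base):
    def fit(self):
        # override exists only to call the parent: an extra call frame,
        # no change in logic
        return super().fit()

class Lean(Base):
    pass  # inheriting Base.fit directly is equivalent and cheaper

print(Redundant().fit(), Lean().fit())  # fitted fitted
```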
- almost all of your `__init__.py`s hoist their submodules' public interfaces. Although this does shorten the dotted namespaces a user must type, it also forces every module to be found and loaded into the Python interpreter before any import from your package can be used. Yours is not a very large package, but this will slow down all of its imports and increase its memory footprint, and this for a package in which, it seems to me, a user would never need all available interpolation/extrapolation variants.
I didn't know that incurred a cost like that, thanks for explaining. Since we are using a `SamplerFactory` (in `astartes/utils/sampler_factory.py`) which imports all of the samplers, would we still get the performance benefits of reworking the import statements? Is the extra cost only caused when the imports are in the `__init__.py`s?
@cmarmo less a couple small clarifications, I think we have responded to all of the reviewer comments. Thanks again to all who have helped out!
I didn't know that incurred a cost like that, thanks for explaining. Since we are using a SamplerFactory (in astartes/utils/sampler_factory.py) which imports all of the samplers, would we still get the performance benefits of reworking the import statements? Is the extra cost only caused when the imports are in the `__init__.py`s?
Hmm, I see. Yes, since you import all samplers in that util, and that util is used by main, it will slow down most public imports of this package. It might be best to leave it as is. There is a workaround, where the factory only does the imports needed within the `get_sampler` method, rather than taking that hit up front for all possibilities when the required samplers aren't yet known. But that might be too confusing to maintain. Plus, doing an import during instance creation will slow it down a lot (still a faster overall execution time, but the specific factory calls could be noticeably slow).
@JacksonBurns thank you for your work and the answers provided to the reviewers. As we are still in the three-week window for answering the reviews, I was wondering if it would be too much work for you to split the draft PR into multiple pull requests addressing specific items in the review. The comments for each commit are indeed clear, but there are 19 modified files and 13 commits that are not always separable. Having smaller and more focused pull requests will simplify the work of the reviewers looking specifically at the modifications addressing their comments. Do you think this is doable in a week? Thank you very much for your time!
Hmm, I see. Yes, since you import all samplers in that util, and that util is used by main, it will slow down most public imports of this package. It might be best to leave it as is. There is a workaround, where the factory only does the imports needed within the `get_sampler` method, rather than taking that hit up front for all possibilities when the required samplers aren't yet known. But that might be too confusing to maintain. Plus, doing an import during instance creation will slow it down a lot (still a faster overall execution time, but the specific factory calls could be noticeably slow).
So the alternative would be something like:

```python
class SamplerFactory:
    def get_sampler_1(self):
        # import deferred until this sampler is actually requested
        from sampler_source import sampler_1
        return sampler_1()
```

I think my auto-formatters would be mad at that - I wonder why the default suggestion is to always have imports at the top? Agreed that regardless, this would have some maintenance and runtime costs. I've learned something, though!
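For completeness, a registry-based variant of the same deferred-import idea avoids writing one method per sampler (the registry entries below are stdlib stand-ins so the sketch runs; astartes' real sampler modules would go there instead):

```python
import importlib

# Map public names to (module path, attribute); nothing is imported yet.
_REGISTRY = {
    "random": ("random", "Random"),         # stand-in for a sampler class
    "counter": ("collections", "Counter"),  # another stand-in
}

def get_sampler(name):
    module_path, attr = _REGISTRY[name]
    # the import cost is paid only when this sampler is first requested
    module = importlib.import_module(module_path)
    return getattr(module, attr)

counter_cls = get_sampler("counter")
print(counter_cls(["a", "a", "b"]))  # Counter({'a': 2, 'b': 1})
```

This keeps the factory's public surface stable while deferring the per-sampler import cost, at the price of slightly less obvious imports.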
Having smaller and more focused pull requests will simplify the work of the reviewers looking specifically at the modifications addressing their comments. Do you think this is doable in a week? Thank you very much for your time!
Happy to do that - will go about it now.
@cmarmo, I have split the review responses into two small, easier to review pull requests. All updates to the documentation are in https://github.com/JacksonBurns/astartes/pull/154 (statement of need, example updates, etc.) and fixes to the source code (single-source package version, remove pointless statements, etc.) are in https://github.com/JacksonBurns/astartes/pull/155. I hope this is better - I can subdivide further if it would be helpful.
Thank you @JacksonBurns , this is already more reviewer friendly... :) Let's see if @du-phan and @BerylKanali are happy with the changes then. Dear reviewers, do you mind checking the PRs and let us know in the next two weeks if they address your comments? Thank you!
Hello @du-phan and @BerylKanali, do you mind letting us know if Jackson addressed all your concerns in https://github.com/JacksonBurns/astartes/pull/154 and https://github.com/JacksonBurns/astartes/pull/155? Thank you so much for your collaboration! :pray:
Hi @cmarmo and @JacksonBurns, thanks for the reminder. I have looked at both PRs.

Documentation: Good job on addressing the concerns and naming the commits in a way that makes it easy to identify what was being worked on. I have also gone through the README and the documentation page. I believe all my concerns have been addressed.

Code: I have gone through the commits and the issues they address, and I believe you have addressed and explained everything well. I have no additional concerns about the code or functionality.
I have updated the checklist above.
Good job and congratulations on a job well done!
Thank you @BerylKanali!
@cmarmo I will wait for approval from @du-phan before merging any changes!
Hi @JacksonBurns,
The new document and code update are in great shape, thank you for your time and contribution!
I have updated the check list in my initial comment!
Thanks @du-phan!
@cmarmo can I go ahead and merge the two PRs?
can I go ahead and merge the two PRs?
Sure! Go ahead, and congratulations! I'm going to finalize the editor's checks in the coming days.
Dear @JacksonBurns,
🎉 `astartes` has been approved by pyOpenSci! Thank you for your submission and many thanks to @BerylKanali and @du-phan for reviewing this package! 😸
Maintainer and reviewers, do you mind filling out the post-review survey?
The actions needed to finalize this review are detailed below.
There are a few things left to do to wrap up this submission:
`[![pyOpenSci](https://tinyurl.com/y22nb8up)](https://github.com/pyOpenSci/software-review/issues/issue-number)`.

It looks like you would like to submit this package to JOSS. Here are the next steps:
Please complete the final steps to wrap up this review. Editor, please do the following:
Add the `6/pyOS-approved` 🚀🚀🚀 label (keep the `6/pyOS-approved` label). Once accepted, add the label `9/joss-approved` to the issue. Skip this check if the package is not submitted to JOSS.

If you have any feedback for us about the review process, please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.
Dear @JacksonBurns,
🎉 `astartes` has been approved by pyOpenSci! Thank you for your submission and many thanks to @BerylKanali and @du-phan for reviewing this package! 😸

Maintainer and reviewers, do you mind filling out the post-review survey?
🎉! @kspieks, @himaghna, and I are filling out the survey!
* [ ] Activate [Zenodo](https://zenodo.org/) watching the repo if you haven't already done so.
Already done :+1:
* [ ] Tag and create a release to create a Zenodo version and DOI.
* [ ] Add the badge for pyOpenSci peer-review to the README.md of . The badge should be `[![pyOpenSci](https://tinyurl.com/y22nb8up)](https://github.com/pyOpenSci/software-review/issues/issue-number)`.
Will do!
It looks like you would like to submit this package to JOSS. Here are the next steps:
* [ ] Once the JOSS issue is opened for the package, we strongly suggest that you subscribe to issue updates. This will allow you to continue to update the issue labels on this review as it goes through the JOSS process.
Will do!
* [ ] Login to the JOSS website and fill out the JOSS submission form using your Zenodo DOI. **When you fill out the form, be sure to mention and link to the approved pyOpenSci review.** JOSS will tag your package for expedited review if it is already pyOpenSci approved.
:+1:
I will keep this issue up-to-date as the reviews progress.
Thanks again to @cmarmo, @BerylKanali, and @du-phan for volunteering your time!
@JacksonBurns congratulations! astartes has been accepted by JOSS! I have announced it "officially" on our Discourse too... ;) If you and your colleague maintainers have been happy with the pyOpenSci review process, feel free to let the world know via a blog post or any kind of testimonial... :)
@BerylKanali and @du-phan, thank you once more for your work as reviewers: if you have some time to fill out the post-review survey, that will help us a lot.
I am going to close the issue now. Let's go celebrate! :partying_face:
Congratulations @JacksonBurns! Thank you @cmarmo, @lwasser, and @ucodery for the opportunity and mentorship as a first-time reviewer.
@cmarmo I have responded to the post-review survey.
Submitting Author: @JacksonBurns
All current maintainers: @kspieks, @himaghna
Package Name: astartes
One-Line Description of Package: Better Data Splits for Machine Learning
Repository Link: https://github.com/JacksonBurns/astartes
Version submitted: v1.1.2
Editor: @cmarmo
Reviewer 1: @BerylKanali
Reviewer 2: @du-phan
Archive:
JOSS DOI:
Version accepted: v1.1.3
Date accepted (month/day/year): 10/15/2023
Code of Conduct & Commitment to Maintain Package
Description
note: this is a selection from the abstract of the JOSS paper
Machine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model's capacity to interpolate. Testing errors from random splits may be overly optimistic if given new data that is dissimilar to the scope of the training set; thus, there is a growing need to easily measure performance for extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. Separate from astartes, users can then use these splits to better assess out-of-sample performance with any ML model of choice.
Scope
Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
Domain Specific & Community Partnerships
Community Partnerships
If your package is associated with an existing community please check below:
For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
The target audience is data scientists, machine learning scientists, and domain scientists using machine learning. The applications of `astartes` include rigorous ML model validation, automated featurization of chemical data (with flexibility to add others, and instructions for doing so), and reproducibility.

We position `astartes` as a replacement for `scikit-learn`'s `train_test_split` function, but with greater flexibility in sampling algorithms and the availability of `train_val_test_split` for more rigorous validation.

@tag the editor you contacted: N/A
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication Options
JOSS Checks
- [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [x] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. __on a separate `joss-paper` branch__
- [x] The package is deposited in a long-term repository with the DOI: [10.5281/zenodo.7884532](https://zenodo.org/record/7884532)

*Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Confirm each of the following by checking the box.
Please fill out our survey
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
The editor template can be found here.
The review template can be found here.