rdata, read R datasets from Python

pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/

94 stars 36 forks source link

rdata, read R datasets from Python #144

Open vnmabus opened 1 year ago

vnmabus commented 1 year ago

Submitting Author: Name (@vnmabus) All current maintainers: (@vnmabus) Package Name: rdata One-Line Description of Package: Read R datasets from Python. Repository Link: https://github.com/vnmabus/rdata Version submitted: 0.9.2.dev1 Editor: @isabelizimm
Reviewer 1: @rich-iannone Reviewer 2: @has2k1 Archive: JOSS DOI: TBD Version accepted: 0.11.0 Date accepted (month/day/year): 2/29/2024

Code of Conduct & Commitment to Maintain Package

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package after should it be accepted.
[x] I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.

Description

The package rdata allows to parse .rda and .rds files, containing serialized R objects, and convert them to Python. The users can influence this conversion and provide conversion routines for custom classes.

Scope

Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data processing/munging
- [ ] Data deposition
- [ ] Data validation and testing
- [ ] Data visualization[^1]
- [ ] Workflow automation
- [ ] Citation management and bibliometrics
- [ ] Scientific software wrappers
- [ ] Database interoperability

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo

Community Partnerships

If your package is associated with an existing community please check below:

[ ] Pangeo
- [ ] My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

For all submissions, explain how the and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

Its main purpose is to be able to read .rda and .rds files, the files used for storing data in the R programming language, and convert them to Python objects for further processing.

Who is the target audience and what are scientific applications of this package?

The target audience includes users that want to open in Python datasets created in R. These include scientists working in both Python and R, scientists who want to compare results among the two languages using the same data, or simply Python scientists that want to be able to use the numerous datasets available in CRAN, the R repository of packages.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

The package rpy2 can be used to interact with R from Python. This includes the ability to load data in the RData format, and to convert these data to equivalent Python objects. Although this is arguably the best package to achieve interaction between both languages, it has many disadvantages if one wants to use it just to load RData datasets. In the first place, the package requires an R installation, as it relies in launching an R interpreter and communicating with it. Secondly, launching R just to load data is inefficient, both in time and memory. Finally, this package inherits the GPL license from the R language, which is not compatible with most Python packages, typically released under more permissive licenses. The recent package pyreadr also provides functionality to read some R datasets. It relies in the C library librdata in order to perform the parsing of the RData format. This adds an additional dependency from C building tools, and requires that the package is compiled for all the desired operating systems. Moreover, this package is limited by the functionalities available in librdata, which at the moment of writing does not include the parsing of common objects such as R lists and S4 objects. The license can also be a problem, as it is part of the GPL family and does not allow commercial use.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

https://github.com/pyOpenSci/software-submission/issues/143

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] uses an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a tutorial with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration setup, such as GitHub Actions CircleCI, and/or others.

Publication Options

[x] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. - [x] The package is deposited in a long-term repository with the DOI: 10.5281/zenodo.6382237 *Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

[x] I have read the author guide.
[x] I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

[x] Last but not least please fill out our pre-review survey. This helps us track submission and improve our peer review process. We will also ask our reviewers and editors to fill this out.

P.S. Have feedback/comments about our review process? Leave a comment here

I received feedback in the presubmission inquiry, indicating that the package could benefit from more detailed examples in the vignette format. The infrastructure is here (I use scikit-gallery for creating the online notebook) but I would like to know more concretely which examples you think that would be more beneficial to highlight.
Although I want to submit the paper to JOSS if possible, I did not write the paper file yet (I hope that this is not necessary at this step).

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

NickleDave commented 1 year ago

Hi @vnmabus, just letting you know we have started the search for an editor for this review.

In the meantime, here are the initial editor checks. I am happy to report that rdata passes with flying colors.

Please see a couple of comments below. You are not required to address these for us to start the review, but I do think doing so would benefit your package--just trying to help, you get a free review from the editor in chief too :slightly_smiling_face:

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements below.

[x] Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
- [x] The package imports properly into a standard Python environment import package.
[x] Fit The package meets criteria for fit and overlap.
[x] Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
- [x] User-facing documentation that overviews how to install and start using the package.
- [x] Short tutorials that help a user understand how to use the package and what it can do for them.
- [x] API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
[x] Core GitHub repository Files
- [x] README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
- [x] Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
- [x] Code of Conduct The package has a Code_of_Conduct.md file.
- [x] License The package has an OSI approved license. NOTE: We prefer that you have development instructions in your documentation too.
[x] Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
[x] Automated tests Package has a testing suite and is tested via a Continuous Integration service.
[x] Repository The repository link resolves correctly.
[x] Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
[ ] Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
[ ] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

[x] Initial onboarding survey was filled out We appreciate each maintainer of the package filling out this survey individually. :raised_hands: Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:

Editor comments

Two suggestions:

include data in the package (more on this below)
it's great that you have examples, but add more concrete examples, e.g. using existing datasets on OSF, FigShare, or Zenodo. This will help your users understand how rdata fits into their workflow

More on adding data: I tried to run the snippet in the README but got this error.

>>> parsed = rdata.parser.parse_file(rdata.TESTDATA_PATH / "test_vector.rda")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pimienta/Documents/repos/coding/opensci/pyos/test-rdata/.venv/lib/python3.10/site-packages/rdata/parser/_parser.py", line 1002, in parse_file
    data = path.read_bytes()
  File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/pathlib.py", line 1126, in read_bytes
    with self.open(mode='rb') as f:
  File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/home/pimienta/Documents/repos/coding/opensci/pyos/test-rdata/.venv/lib/python3.10/site-packages/rdata/tests/data/test_vector.rda'

I think what's going on is that the test data is not in the built package?

You may have intended this snippet to be run with just the development version, but I would strongly suggest you add small amounts of data to the package itself. It's also possible you meant to "include" the data in your built package, but the setuptools build is not configured correctly? I have found the way setuptools does this a bit confusing in the past, but I think it's now possible for you to include all the files without using a MANIFEST.in file (which if I understand correctly is now a "legacy" way of including files with setuptools that should no longer be used). Note that other build backends have their own methods for this; for example flit would include everything by default, although you can have more fine grained control.

You can then provide access to this data to users (and yourself, for tests) by using importlib-resources as discussed in this talk. Here's an example of using importlib-resources in a package I develop--it might be a little easier to read than more general scientific Python packages that have a lot of internal infrastructure around their built-in datasets. I have a draft section of the guide on how to do this here--it's very rough still but the core ideas are there and might help you. If you have any feedback on this section, we'd love to hear it.

vnmabus commented 1 year ago

@NickleDave I have added the test data to the manifest and changed the package to use importlib.resources in the develop branch (but I have not yet released a version with the changes).

Some observations of the process:

The Traversable protocol used by importlib.resources was not available in Python<3.11 so I had to redefine it in that case.
I made parse_file able to load objects implementing the Traversable protocol. I think most Python developers are not aware of that interface, and thus most code is not capable to deal with general resources that are not in a filesystem. This probably should be discussed more widely in the Python community.
I am not really sure that there is a more modern alternative to MANIFEST.in, so I keep using it. Thank you for noticing that the example data was missing!

NickleDave commented 1 year ago

Great, thank you @vnmabus--this feedback is really helpful. I will read this in more detail, and share in our Slack team, but I want to let you know right away I appreciate it.

It might be worth sharing your observations on the process you went through to include data in our Discourse, if you wouldn't mind starting a new topic there, since that's open and we can get more eyeballs on it: https://pyopensci.discourse.group/. We could share the topic on social media too (e.g. Mastodon).

I just sent out an email to another potential editor today--will update you as soon as we have one! We want to make sure we find someone with the right expertise in both Python and R.

NickleDave commented 1 year ago

It might be worth sharing your observations on the process you went through to include data in our Discourse, if you wouldn't mind starting a new topic there, since that's open and we can get more eyeballs on it: https://pyopensci.discourse.group/.

or if you're ok with it @vnmabus I can start the topic on our Discourse and you could feel free to reply there. Please let me know

We'd like to get input from maintainers of setuptools there, to better understand when we need MANIFEST.IN files -- this has come up before in GitHub review of our packaging guide

vnmabus commented 1 year ago

If you can start the topic and give me a link, that would be awesome, thank you!

NickleDave commented 1 year ago

Hi again @vnmabus I asked a question about MANIFEST.in here, please feel free to chime in if I'm not understanding all the complexities of your particular situation: https://pyopensci.discourse.group/t/is-manifest-in-still-needed-to-include-data-in-a-package-built-with-setuptools/392

NickleDave commented 12 months ago

Hi @vnmabus, I'm very happy to report that @isabelizimm will be the editor for this review. We are now looking for reviewers.

vnmabus commented 12 months ago

@NickleDave And I am also happy to report that I have added a couple of more detailed examples, as per your second suggestion. Please, tell me if that is what you had in mind:

https://rdata.readthedocs.io/en/latest/auto_examples/index.html

NickleDave commented 12 months ago

These look great @vnmabus, thanks so much. This is exactly what I had in mind.

isabelizimm commented 12 months ago

Hey there! I'm excited to be the editor for this package--I can't count the amount of times I've done the R->Python data dance. I'm on the hunt for reviewers, and will check back end of next week with an update!

isabelizimm commented 11 months ago

:wave: Hi @rich-iannone and @has2k1! Thank you SO MUCH for volunteering to review for pyOpenSci 🎉

@vnmabus these two reviewers are individuals who are deeply involved in both the R and Python data and packaging world. I am super excited to be part of and learn from them through this review process!

Please fill out our pre-review survey

Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.

[ ] reviewer 1 survey completed.
[ ] reviewer 2 survey completed.

The following resources will help you complete your review:

Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

If anyone has any questions/comments/concerns, please don't hesitate to get in touch! Your review is due: Dec 8

has2k1 commented 11 months ago

Hi @vnmabus,

First, I am happy to review rdata.

Second, currently there is no tag for 0.9.2.dev1. Though, as there aren't many changes between when this issue was opened (after v0.9.1) and today, I am fine with addressing my review to any version/commit after that.

vnmabus commented 11 months ago

Yes, sorry, the 0.9.2.dev1 version was the number of the develop version when I opened this issue. As I added some breaking changes, it was renamed and published until version 0.10 instead.

That said, it would be better if you include in your review also the recent changes in develop (I know it is a moving target, but this package does not usually change that much). In particular, there is a recent PR from a user that greatly improves XDR parsing speed, removing the xdrlib dependency in the process (which is one of the "old batteries" that will be removed from Python 3.13). I wanted to ask you (and the pyOpenSci community in general) if it wouldn't be better to have an official xdrlib replacement where these changes can be made (I do not mind having the XDR parser inside my project, as it is rather small, but maybe other projects find that useful too). I do not know if you can tag the relevant people here to answer this question, or maybe I should open a Discourse/Discord/Slack thread.

has2k1 commented 11 months ago

That said, it would be better if you include in your review also the recent changes in develop

That is okay with me.

On xdrlib. I had seen that it is deprecated and getting removed and then thankfully a PR showed up. I think, eventually the best place for the parser is as a separate package.

rich-iannone commented 11 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[ ] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [ ] A repostatus.org badge,
- [ ] Python versions supported,
- [x] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[ ] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[x] Link to your documentation website.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

[x] Package documentation is clear and easy to find and use.
[ ] The need for the package is clear
[x] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[ ] Performance: Any performance claims of the software been confirmed.
[x] Automated tests:
- [x] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- [x] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3

Review Comments

The software is good and useful. I find myself having a need for this very thing. In terms of letting people know what the package does, it's somewhat terse at the moment (along the lines of 'if you know you know'). That said, it would be valuable to demonstrate (perhaps through examples or an end-to-end solution) the value of taking an R dataset in .rda/.rds format and incorporating it into a Python workflow. This could involve using some collection of R packages that don't have a good correspondance in Python, generating one or more .rds files, and finishing the work in Python (using Python packages that have no parallel in R, to drive home the point).

Other use cases can be presented to get the casual user's imagination going, as well. This may involve hosting .rds files (maybe in a public GH repo) and demonstrating how to retrieve and convert the data. Another idea is to demonstrate how to extract datasets from R packages. I know that such demos would involve operations that exist outside of the functionality of the package but it can be inspiring to tie I/O related things to the use of the package itself. It sort of shows complete solutions to a problem that can involve your software in a specific, key step.

Some additional documentation things:

add some benchmarking information; compare with other solutions (could use https://github.com/airspeed-velocity/asv for this)
provide information on which R object translations are supported
provide some examples on how different translated objects can be used in a real-world analysis

Package README and Documentation Site

The organization of the project's GH page is pretty good. Without a direct link, however, it's a bit hard to search for (appearing on the 4th page of results) and that's due to many similarly named projects. To circumvent this, maybe add a few more search tags in the About section of the project page. Also, include a link to the documentation in the About section (I know it's linked from a badge and within the README but having a link at the top makes a difference). It's great to see that the package is getting a fair amount of use (100+ usages); you can draw more attention to that by removing the empty Packages section in the sidebar (this'll bring that useful stat higher up on the page).

A useful badge to have is the Python versions supported badge. The list of versions could match those that you're testing and the relevant "Programming Language :: Python :: 3.*" classifiers should be added to the pyproject.toml file.

Community Considerations

To make things better for community contributions, I suggest adding these two things:

add an issue template
add a pull request template

You already have a CONTRIBUTING document that describes ways to make contributions and this is great. The PR and issue templates can link to that document (this increases the chance that users will read it).

I recommend adding a CODE_OF_CONDUCT.md file. These are very commonplace so you can get a file from another project's repo.

Code Comments

I found the code to well organized and easy to follow. All conventions seem to be followed and many helpful docstrings were present for virtually all classes and methods. This definitely makes it easier for people to contribute to the project.

Testing

It's good to see that testing is being done on a solid matrix of recent Python versions and platforms (Linux/Ubuntu, Windows, MacOS). There does need to be some updates to the actions used in the testing workflow file (actions/checkout@v2 -> actions/checkout@v4 and actions/setup-python@v2 -> actions/setup-python@v4).

Another recommendation is to run pylint for linting, perhaps through pre-commit (see https://pylint.pycqa.org/en/latest/user_guide/installation/pre-commit-integration.html).

There are a lot of tests in place and very good line coverage. Having the coverage results sent to CodeCov makes it very easy to see which code paths go through tests. It's really good to see that a lot of .rda files (stored in tests/data) are being tested.

Perhaps consider adding Python 3.12 to the testing matrix.

If you want to go even further with testing, have different R versions on different platforms generate .rds/.rda files and test those directly.

Citation File

The citation file should have a version field, updated every time a version of the software is released. This can be a bit of a pain but it makes for a better, more complete citation and tools exist to help with this (see https://github.com/citation-file-format/citation-file-format#tools-to-work-with-citationcff-files-wrench).

has2k1 commented 10 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[x] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [x] A repostatus.org badge,
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[x] Link to your documentation website.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests:
- [x] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- [x] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5 hours

Review Comments

Notes & Suggestions

For the code in the README (and since as it is not a doctest), it would be convenient to be able to copy & paste it and run without any editing.
- The default python kernel does ignore >>> & ..., though jupyter kernel does.
- Having the output in the same code block breaks the copy-paste-run for all kernels.
I feel like there is are missing read functions that wraps rdata.parser.parse_file and rdata.parser.convert and works with just a filepath. e.g.
```
parsed = rdata.parser.parse_file("dataframe.rds")
data = rdata.conversion.convert(parsed)
```
would be
```
data = rdata.read_rds("dataframe.rds")
```
And maybe also a read_rda. A single read_rdata can do for both cases but that does not communicate the difference between reading an object vs an environment with objects.

These functions could include optional arguments to pass on to the parser and/or converter.

This would simplify the user experience as the convert(parse_file(...)) usage seems to cover majority of the use cases. e.g. none of the current test cases deviate from that pattern.
It is not documented anywhere what base R objects are supported and what they are translate to as python object. I think a table would be helpful.
On packaging configuration, there is a setup.cfg and pyproject.toml. As even the configuration for setuptools is in pyproject.toml, it makes sense to migrate the configs for isort, mypy and pytest. That would leave only the flake8 configuration in setup.cfg as flake8 does not yet support pyproject.toml.
xdrlib has been deprecated in python 3.12 and it is good there is a plan (and some work) to deal with it.

Final Comment

While I have noted some suggestions, the project is otherwise well motivated, implemented and tested and I do not consider the above suggestions to be "blockers".

isabelizimm commented 10 months ago

THANK YOU!!!! To @has2k1 and @rich-iannone for your very thorough reviews 🙌 your time is so appreciated, and I think a great asset to this project!

I wanted to ask you (and the pyOpenSci community in general) if it wouldn't be better to have an official xdrlib replacement where these changes can be made (I do not mind having the XDR parser inside my project, as it is rather small, but maybe other projects find that useful too). I do not know if you can tag the relevant people here to answer this question, or maybe I should open a Discourse/Discord/Slack thread.

xdrlib replacements are a little out of the realm of my knowledge, so I just started a thread on Slack for this! My 2c from a general packaging viewpoint-- the new parser in that PR doesn't seem too wild, if you feel comfortable maintaining it, I would probably roll my own before bringing in another dependency (with appropriate tests, which seem to be missing from that PR for now). But, others might know of a great, well maintained replacement that I am not aware of!

I have gone ahead and changed the status of the issue to awaiting changes. There were a number of updates suggested by the reviewers; some of these comments will be quick to implement, and others might take a bit more time.

This part of the review is a bit more back and forth, as @vnmabus updates rdata from reviewer comments. Generally, pyOpenSci gives a timeframe of about 3 weeks to give updates from the review. I am cognizant that it is nearing the end of 2023 and holiday season for many people across the world, so if you need more time, that is completely understandable, just let us know what timeline is comfortable for you 😄

Thank you all again for your collaboration!

vnmabus commented 10 months ago

This part of the review is a bit more back and forth, as @vnmabus updates rdata from reviewer comments. Generally, pyOpenSci gives a timeframe of about 3 weeks to give updates from the review. I am cognizant that it is nearing the end of 2023 and holiday season for many people across the world, so if you need more time, that is completely understandable, just let us know what timeline is comfortable for you 😄

I will be on Christmas holiday from 14th December to 14th January, and will likely not have access to my computer. I will try to address as many issues as possible today and tomorrow, but it is likely that I won't be able to address them all until January. I hope that does not cause you many inconveniences.

isabelizimm commented 10 months ago

No inconvenience at all, please unplug and enjoy your Christmas holiday! I'm adding an on-hold tag on this review to signify to everyone that no further work is expected until mid-January upon your return. When you're back and ready to pick up changes to rdata again, just give a shout here and we will resume 📺

vnmabus commented 9 months ago

So I am back, and I think that I implemented most of the requested changes (in the develop branch, not yet released). First, for @rich-iannone:

* [ ]  Badges for:

  * [x]  Continuous integration and test coverage,
  * [x]  Docs building (if you have a documentation website),
  * [ ]  A [repostatus.org](https://www.repostatus.org/) badge,
  * [ ]  Python versions supported,
  * [x]  Current package version (on PyPI / Conda).

Missing badges have been added

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)
* [ ]  Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.

I am not sure if you reviewed the latest version, because I think the links were in the examples documentation page before your review. Please, tell me if that is not enough.

* [ ]  The need for the package is clear

Again, I am not sure if you reviewed the latest version. I made changes to both the README and the main page of the documentation after submitting but before your review. Please tell me if something can be improved.

* [ ]  **Performance:** Any performance claims of the software been confirmed.

We do not have performance claims (although with the recent changes to the XDR parser, performance has been greatly improved).

  * [ ]  Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

I added Ruff to the CI and fixed all violations.

* [ ]  **A short summary** describing the high-level functionality of the software

* [ ]  **Authors:** A list of authors with their affiliations

* [ ]  **A statement of need** clearly stating problems the software is designed to solve and its target audience.

* [ ]  **References:** With DOIs for all those that have one (e.g. papers, datasets, software).

I plan to write it after the review, if possible, to accommodate all the changes.

The software is good and useful. I find myself having a need for this very thing. In terms of letting people know what the package does, it's somewhat terse at the moment (along the lines of 'if you know you know'). That said, it would be valuable to demonstrate (perhaps through examples or an end-to-end solution) the value of taking an R dataset in .rda/.rds format and incorporating it into a Python workflow. This could involve using some collection of R packages that don't have a good correspondance in Python, generating one or more .rds files, and finishing the work in Python (using Python packages that have no parallel in R, to drive home the point).

Other use cases can be presented to get the casual user's imagination going, as well. This may involve hosting .rds files (maybe in a public GH repo) and demonstrating how to retrieve and convert the data. Another idea is to demonstrate how to extract datasets from R packages. I know that such demos would involve operations that exist outside of the functionality of the package but it can be inspiring to tie I/O related things to the use of the package itself. It sort of shows complete solutions to a problem that can involve your software in a specific, key step.

As mentioned before, I am not sure if you reviewed the latest changes because a set of examples was added in response to @NickleDave comments. It is true that the examples do not illustrate a complete workflow, just the loading part. Do you consider that is really necessary?

* add some benchmarking information; compare with other solutions (could use https://github.com/airspeed-velocity/asv for this)

I added asv tests. Currently I only test the array parsing routines, as their performance has greatly improved recently, but if you have additional proposals for performance tests I will consider adding them.

* provide information on which R object translations are supported

I created a page in the documentation for this. Please, tell me if this is what you had in mind.

* provide some examples on how different translated objects can be used in a real-world analysis

I am not sure if I follow you here. Could you please clarify your intention?

Package README and Documentation Site

The organization of the project's GH page is pretty good. Without a direct link, however, it's a bit hard to search for (appearing on the 4th page of results) and that's due to many similarly named projects. To circumvent this, maybe add a few more search tags in the About section of the project page. Also, include a link to the documentation in the About section (I know it's linked from a badge and within the README but having a link at the top makes a difference). It's great to see that the package is getting a fair amount of use (100+ usages); you can draw more attention to that by removing the empty Packages section in the sidebar (this'll bring that useful stat higher up on the page).

I added the link (thanks!) and removed the "Packages" section. I am unsure of which additional search tags I could use.

A useful badge to have is the Python versions supported badge. The list of versions could match those that you're testing and the relevant "Programming Language :: Python :: 3.*" classifiers should be added to the pyproject.toml file.

Added.

Community Considerations

To make things better for community contributions, I suggest adding these two things:
* add an issue template

* add a pull request template

I added both of them.

You already have a CONTRIBUTING document that describes ways to make contributions and this is great. The PR and issue templates can link to that document (this increases the chance that users will read it).

I linked to it in the PR template.

I recommend adding a CODE_OF_CONDUCT.md file. These are very commonplace so you can get a file from another project's repo.

That was already added before your review IIRC.

Testing

It's good to see that testing is being done on a solid matrix of recent Python versions and platforms (Linux/Ubuntu, Windows, MacOS). There does need to be some updates to the actions used in the testing workflow file (actions/checkout@v2 -> actions/checkout@v4 and actions/setup-python@v2 -> actions/setup-python@v4).

Done.

Another recommendation is to run pylint for linting, perhaps through pre-commit (see https://pylint.pycqa.org/en/latest/user_guide/installation/pre-commit-integration.html).

I have finally chosen Ruff for its performance. I added it to CI in order to report violations (but I do not want them to be automatically fixed without human intervention, so it just checks them).

Perhaps consider adding Python 3.12 to the testing matrix.

Added.

If you want to go even further with testing, have different R versions on different platforms generate .rds/.rda files and test those directly.

I want to improve testing, and I had several ideas. I created an issue for tracking that.

Citation File

The citation file should have a version field, updated every time a version of the software is released. This can be a bit of a pain but it makes for a better, more complete citation and tools exist to help with this (see https://github.com/citation-file-format/citation-file-format#tools-to-work-with-citationcff-files-wrench).

I am a bit unsure about it. Adding a version field has the risk that the user does a copy-paste of the generated Bibtex from the Github repo, which may not be the same version that he is using. If I do not add it, either the user cites the package or he retrieves the version from his copy if he is interested in reproducibility. In both cases, the information should not be wrong. Is there a compelling argument for having the version there?

vnmabus commented 9 months ago

Now, for @has2k1:

  * [ ]  A [repostatus.org](https://www.repostatus.org/) badge,
  * [ ]  Python versions supported,

Added.

* [ ]  Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.

I just noted that the link is not in the README, added now.

* [ ]  If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.

I do not have it yet. I should probably include a comparison at least with pyreadr, but I am not sure how to frame it.

* [ ]  Citation information

There is a CITATION.cff file. I plan to add citation to the README after publishing in JOSS (if they approve it).

* [ ]  **Performance:** Any performance claims of the software been confirmed.

I made no performance claims.

* [ ]  **Continuous Integration:** Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)

I have tests, typing and style as Github actions.

  * [ ]  Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

I added Ruff as a linter and fixed all violations.

* [ ]  **A short summary** describing the high-level functionality of the software

* [ ]  **Authors:** A list of authors with their affiliations

* [ ]  **A statement of need** clearly stating problems the software is designed to solve and its target audience.

* [ ]  **References:** With DOIs for all those that have one (e.g. papers, datasets, software).

I plan to add it after the review by PyOpenSci is complete.

1. For the code in the README (and since as it is not a doctest), it would be convenient to be able to copy & paste it and run without any editing.

Done.

2. I feel like there is are missing `read` functions that wraps `rdata.parser.parse_file` and
   `rdata.parser.convert` and works with just a filepath. e.g.
parsed = rdata.parser.parse_file("dataframe.rds")
data = rdata.conversion.convert(parsed)
would be
data = rdata.read_rds("dataframe.rds")
And maybe also a read_rda. A single read_rdata can do for both cases but that does not communicate the difference between reading an object vs an environment with objects.

These functions could include optional arguments to pass on to the parser and/or converter.

These convenience functions have been incorporated.

3. It is not documented anywhere what base R objects are supported and what they are translate to as python object. I think a table would be helpful.

I added a page in the documentation. Please tell me if that is what you had in mind.

4. On packaging configuration, there is a `setup.cfg` and `pyproject.toml`. As even the configuration for `setuptools` is in `pyproject.toml`, it makes sense to migrate the configs for `isort`, `mypy` and `pytest`. That would leave only the `flake8` configuration in `setup.cfg` as `flake8` [does not yet support](https://github.com/PyCQA/flake8/issues/234) `pyproject.toml`.

Everything moved to pyproject.toml. Flake8 config has been removed as now the project uses Ruff.

5. `xdrlib` has been deprecated in python 3.12 and it is good there is a plan (and some work) to deal with it.

The PR has been merged and xdrlib is no longer required.

isabelizimm commented 8 months ago

Ah, this is a great round of improvements-- thank you for the update @vnmabus!

For @has2k1 and @rich-iannone, are you able to comment on if these fixes are what you expected?

has2k1 commented 8 months ago

@vnmabus has resolved the all core queries and addressed the 2 conditional issues.

To keep track of one, I have filed an issue about the comparison with other python packages, and the paper.md file for JOSS submission can be resolved when it is time to submit.

In between, the tables for default conversions are great and super informative.

Great work Carlos.

rich-iannone commented 8 months ago

Sorry for being very late here, but thank you @vnmabus for all the improvements made. I've made a few comments in error previously so I thank you for the clarifications on those!

The examples are great, no need to go further beyond the loading part (which is the key thing here).

As far as adding a version number to the citation file, I take what I said back since I'm in favor of your reasoning (and you bring up some good points about practical usage of the citation text).

Testing is always a potential marathon, you're doing great with this and the inclusion of asv tests is very helpful.

The page in the documentation about the supported R translations is quite useful. Thanks for adding that in.

To wrap up, this package fully meets my quality and usability expectations. Excellent work!

isabelizimm commented 8 months ago

With two reviewers giving the 👍 ... it is my absolute pleasure to say that ... 🎉 ... rdata has been approved by pyOpenSci!

Thank you @vnmabus for submitting rdata and many thanks to @has2k1 and @rich-iannone for reviewing this package! 😸

Author Wrap Up Tasks

There are a few things left to do to wrap up this submission:

[x] Activate Zenodo watching the repo if you haven't already done so.
[x] Tag and create a release to create a Zenodo version and DOI.
[x] Add the badge for pyOpenSci peer-review to the README.md of rdata. The badge should be [![pyOpenSci](https://tinyurl.com/y22nb8up)](https://github.com/pyOpenSci/software-review/issues/issue-number).
[ ] Please fill out the post-review survey. All maintainers and reviewers should fill this out.

It looks like you would like to submit this package to JOSS. Here are the next steps:

[ ] task for @isabelizimm (that's me!) Once the JOSS issue is opened for the package, we strongly suggest that you subscribe to issue updates. This will allow you to continue to update the issue labels on this review as it goes through the JOSS process.
[ ] Login to the JOSS website and fill out the JOSS submission form using your Zenodo DOI. When you fill out the form, be sure to mention and link to the approved pyOpenSci review. JOSS will tag your package for expedited review if it is already pyOpenSci approved.
[ ] Wait for a JOSS editor to approve the presubmission (which includes a scope check).
[ ] Once the package is approved by JOSS, you will be given instructions by JOSS about updating the citation information in your README file.
[ ] When the JOSS review is complete, add a comment to your review in the pyOpenSci software-review repo here that it has been approved by JOSS. An editor will then add the JOSS-approved label to this issue.

🎉 Congratulations! You are now published with both JOSS and pyOpenSci! 🎉

Editor Final Checks

Please complete the final steps to wrap up this review. Editor, please do the following:

[ ] Make sure that the maintainers filled out the post-review survey
[ ] Invite the maintainers to submit a blog post highlighting their package. Feel free to use / adapt language found in this comment to help guide the author.
[x] Change the status tag of the issue to 6/pyOS-approved6 🚀🚀🚀.
[ ] Invite the package maintainer(s) and both reviewers to slack if they wish to join.
[ ] If the author submits to JOSS, please continue to update the labels for JOSS on this issue until the author is accepted (do not remove the 6/pyOS-approved label). Once accepted add the label 9/joss-approved to the issue. Skip this check if the package is not submitted to JOSS.
[ ] If the package is JOSS-accepted please add the JOSS doi to the YAML at the top of the issue.

If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.

vnmabus commented 8 months ago

Thank you @isabelizimm for the smooth reviewing experience, and also thanks to the reviewers @rich-iannone and @has2k1!

I already did the steps at section "There are a few things left to do to wrap up this submission:". I will plan to do the necessary steps to submit to JOSS and create a blog post in the next weeks.

vnmabus commented 7 months ago

@isabelizimm I wanted to tell you that I have written a draft of a blog post, and I am not sure how to send it to you for review.

On an unrelated note: rdata shows now in the main pyOpenSci page, but it is not shown in a particular category (maybe it should be shown in "data munging"?).

lwasser commented 7 months ago

hey @vnmabus just a quick note that i saw this message and noticed a bug in our website! things should be working and rdata should filter as expected now: thanks for letting us know about this 🙌 and welcome to our community!

vnmabus commented 6 months ago

So @isabelizimm , @lwasser : is it possible to add a blog post or not?

isabelizimm commented 6 months ago

Apologies for late response-- you are welcome to make a PR with your post to the pyOpenSci website's repo! Here is an example of what this PR can look like 😄 https://github.com/pyOpenSci/pyopensci.github.io/pull/42

lwasser commented 4 months ago

hi team! i'm circling back on this review and doing a bit of cleanup. rdata was approved - yay and i see the pyos-accepted label! i also believe we published your blog post @vnmabus did this package however get to joss? and if it did it should have a doi if it was inscope. we should close the issue if we you decided to forgo joss! but if it had a joss submission we should link to that issue in JOSS and update the label to also include joss!

many thanks!!

vnmabus commented 4 months ago

Sorry, I got a new job the same day this package was accepted in PyOpenSci, so I had a few months without a lot of time to invest on it. My plan is still to submit to JOSS after https://github.com/vnmabus/rdata/pull/40 is merged.

lwasser commented 4 months ago

hey @vnmabus no worries. no need to apologize!

A few notes

because JOSS accepts the review that we implemented and as such does NOT re-review your code, adding code to your package via tha referenced pr which looks significant between acceptance here and publication at JOSS is not ideal. The release for this review - 0.11.0 should be the same accepted by JOSS. they will ask you to create a new release with the joss badge and doi i believe.
the review from joss should be a fast track - they will only review your paper - not the code. so it should be quick and not a lot of additional effort on your part (aside from writing the paper!).

there is no huge rush but I highly encourage you to submit to joss before adding more code to your package to keep our existing review synced with what joss accepts as a publication. If not, there will be an issue with a lot of new functionality that has not been reviewed associated with a doi.

please let me know if you have any questions!

vnmabus commented 4 months ago

Is there no way to communicate to them the changes since your review, so that they can review the new commits if they want to? I feel that the changes in that PR, and a couple of PRs before by the same author, enrich the package in such a way that it seems unfair not to mention them in the JOSS paper. In fact, I wanted to offer the author of these changes to co-author the JOSS paper with me if he wants.

Sorry, because I completely understand the issue with the fast-tracking and DOI. This was not planned when I submitted the package for pyOpenSci review: I wanted to submit the same version to JOSS at that time. Otherwise, I would have just waited to do these changes before submitting to pyOpenSci.

So, what do you think it is the best way to approach this problem in a way convenient for everyone?

lwasser commented 4 months ago

@vnmabus i'm going to ping @arfon on this specific issue from JOSS. We ask authors to submit a package to us when the API has stabilized and similar to an academic paper being reviewed, it is not ideal if changes are being made while the software / (or paper!) is in active review.

Arfon - in this case we have reviewed the rdata package and it has been pyOpenSci accepted. However, since the pyOpenSci acceptance of rdata there are new commits that look to me like fairly significant enhancements that have been added to the package.

Because there is new functionality, we may not want to move forward with a fast track (these changes might require another review). Generally we ask the maintainer to submit based on the release / doi created at the end of our pyOpenSci review and as such JOSS can trust that we have reviewed the entire code base. In this case we have a divergence.

How would you prefer to handle this on the JOSS side of things? I believe that this is the first time this has happened. many thanks!

arfon commented 4 months ago

Hrm, yes isn't ideal (I agree with the idea that the software should essentially be the same between the pyOpenSci and JOSS reviews). That said, we do have the ability to associate a specific version of a software package with a JOSS publication (e.g., v1.0.0) so provided there is a release/version number that pertains to the pyOpenSci submission, we could simply do that?

lwasser commented 4 months ago

@arfon that makes sense @vnmabus what do you think about going with that suggested route? So in this case we'd want to submit version 0.11.0 to JOSS as that was the version accepted by pyOpenSci.

vnmabus commented 4 months ago

I could do that, but I wonder if it would at least be possible to mention the post-reviews PRs, maybe as future/ongoing work? Again, I would like to offer @trossi to be a coauthor, given that his contributions added a lot of value to the package (and a JOSS paper is mostly an academic/citable record of the software itself), so it would make sense to at least mention them in passing.

(Again, I want to apologize and thank everyone for their time providing guidance in this matter).

lwasser commented 4 months ago

@vnmabus I appreciate that! I'd really like @arfon or someone from JOSS to answer that question. From our end, you are always welcome to update a blog post (or write a blog post) noting new features, etc post review. We know that development goes on. From JOSS's perspective however i'd like for @arfon or someone on that team to make the call on what should (and shouldn't go into the paper).

pyOpenSci is here to support you through the process as well!

@arfon would it make sense for @vnmabus to submit the package to review for JOSS noting this specific item (when they have bandwidth!)? And you / JOSS can make a call on addressing the new features in that paper over there? We will link back to the review here as we always do so there is a clear communication trail.

if we do that our fearless editor @isabelizimm can watch the package go through JOSS to ensure things get wrapped up on both ends. ✨

arfon commented 2 months ago

@vnmabus – you're welcome to include future-looking work in the paper, and @trossi as author too given their current role on the project.

@arfon would it make sense for @vnmabus to submit the package to review for JOSS noting this specific item (when they have bandwidth!)? And you / JOSS can make a call on addressing the new features in that paper over there? We will link back to the review here as we always do so there is a clear communication trail.

Sounds good to me!

lwasser commented 2 months ago

Ok, wonderful. @vnmabus, why don't you go ahead and submit it to JOSS? Please provide the link to the issue when the review opens here. Could you include me in the conversation? That way, the editors at JOSS can make a call.

One of two things might happen: Once they accept your package at the release point received here, it would be 0.11.0. OR there is another review. This decision needs to be made by the JOSS editors rather than us. Please let me know if you have any questions.

pyOpenSci / software-submission