vnmabus / rdata

Reader of R datasets in .rda format, in Python
https://rdata.readthedocs.io
MIT License
40 stars 2 forks source link

Add writers for xdr and ascii files #40

Closed trossi closed 1 month ago

trossi commented 4 months ago

References to issues or other PRs

Closes #31.

Describe the proposed changes

This PR will add basic writing functionality.

Demo (demo.py):

import numpy as np
from rdata.io import write
from rdata.conversion import convert_to_r_data

matrix = np.pi * np.arange(12.).reshape(3, 4)
data = [matrix, 'hello', dict(my_key='my_value'), None, 1, 2.2, 3.3+4.4j]
print(data)

r_data = convert_to_r_data(data)
write("demo.rds", r_data)

Output:

$ python3 demo.py 
[array([[ 0.        ,  3.14159265,  6.28318531,  9.42477796],
       [12.56637061, 15.70796327, 18.84955592, 21.99114858],
       [25.13274123, 28.27433388, 31.41592654, 34.55751919]]), 'hello', {'my_key': 'my_value'}, None, 1, 2.2, (3.3+4.4j)]

$ Rscript -e "readRDS('demo.rds')"
[[1]]
         [,1]      [,2]      [,3]      [,4]
[1,]  0.00000  3.141593  6.283185  9.424778
[2,] 12.56637 15.707963 18.849556 21.991149
[3,] 25.13274 28.274334 31.415927 34.557519

[[2]]
[1] "hello"

[[3]]
[[3]]$my_key
[1] "my_value"

[[4]]
NULL

[[5]]
[1] 1

[[6]]
[1] 2.2

[[7]]
[1] 3.3+4.4i

Additional information

Testing

Two tests looping over all test files in the repository have been added:

  1. rdata/tests/test_write.py::test_write: Test that read-parse-write results in bit-wise unchanged files (omitting compression)
  2. rdata/tests/test_write.py::test_convert_to_r: Test that R-to-Python-to-R conversion results in unchanged RData object.

Implemented features are indicated in the test output: XFAIL: writing/converting of the given type has not implemented in this PR, SKIPPED: R-to-Python-to-R conversion might be ambiguous (not enough information in Python object to construct the original R object).

Test output (click to expand) ``` rdata/tests/test_write.py::test_write[test_altrep_compact_intseq.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_altrep_compact_intseq_asymmetric.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_altrep_compact_realseq.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_altrep_compact_realseq_asymmetric.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_altrep_deferred_string.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_altrep_wrap_logical.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_altrep_wrap_real.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_altrep_wrap_string.rda] XFAIL (RObjectType.ALTREP) rdata/tests/test_write.py::test_write[test_ascii_v2.rda] PASSED rdata/tests/test_write.py::test_write[test_ascii_v2.rds] PASSED rdata/tests/test_write.py::test_write[test_ascii_v3.rda] PASSED rdata/tests/test_write.py::test_write[test_ascii_v3.rds] PASSED rdata/tests/test_write.py::test_write[test_ascii_win_v2.rda] PASSED rdata/tests/test_write.py::test_write[test_ascii_win_v2.rds] PASSED rdata/tests/test_write.py::test_write[test_ascii_win_v3.rda] PASSED rdata/tests/test_write.py::test_write[test_ascii_win_v3.rds] PASSED rdata/tests/test_write.py::test_write[test_builtin.rda] PASSED rdata/tests/test_write.py::test_write[test_complex.rda] PASSED rdata/tests/test_write.py::test_write[test_dataframe.rda] XFAIL (RObjectType.REF) rdata/tests/test_write.py::test_write[test_dataframe.rds] XFAIL (RObjectType.REF) rdata/tests/test_write.py::test_write[test_dataframe_rownames.rda] XFAIL (RObjectType.REF) rdata/tests/test_write.py::test_write[test_dataframe_v3.rda] XFAIL (RObjectType.REF) rdata/tests/test_write.py::test_write[test_dataframe_v3.rds] XFAIL (RObjectType.REF) rdata/tests/test_write.py::test_write[test_empty_function.rda] XFAIL (RObjectType.CLO) rdata/tests/test_write.py::test_write[test_empty_function_uncompiled.rda] XFAIL (RObjectType.CLO) rdata/tests/test_write.py::test_write[test_empty_str.rda] PASSED rdata/tests/test_write.py::test_write[test_emptyenv.rda] XFAIL (RObjectType.EMPTYENV) rdata/tests/test_write.py::test_write[test_encodings.rda] PASSED rdata/tests/test_write.py::test_write[test_encodings_v3.rda] PASSED rdata/tests/test_write.py::test_write[test_environment.rda] XFAIL (RObjectType.ENV) rdata/tests/test_write.py::test_write[test_expression.rda] PASSED rdata/tests/test_write.py::test_write[test_file.rda] XFAIL (RObjectType.EXTPTR) rdata/tests/test_write.py::test_write[test_full_named_matrix.rda] PASSED rdata/tests/test_write.py::test_write[test_full_named_matrix.rds] PASSED rdata/tests/test_write.py::test_write[test_function.rda] XFAIL (RObjectType.CLO) rdata/tests/test_write.py::test_write[test_function_arg.rda] XFAIL (RObjectType.CLO) rdata/tests/test_write.py::test_write[test_half_named_matrix.rda] PASSED rdata/tests/test_write.py::test_write[test_list.rda] PASSED rdata/tests/test_write.py::test_write[test_list_attrs.rda] PASSED rdata/tests/test_write.py::test_write[test_logical.rda] PASSED rdata/tests/test_write.py::test_write[test_matrix.rda] PASSED rdata/tests/test_write.py::test_write[test_minimal_function.rda] XFAIL (RObjectType.CLO) rdata/tests/test_write.py::test_write[test_minimal_function_uncompiled.rda] XFAIL (RObjectType.CLO) rdata/tests/test_write.py::test_write[test_na_string.rda] PASSED rdata/tests/test_write.py::test_write[test_named_matrix.rda] PASSED rdata/tests/test_write.py::test_write[test_nullable_int.rda] PASSED rdata/tests/test_write.py::test_write[test_nullable_logical.rda] PASSED rdata/tests/test_write.py::test_write[test_s4.rda] XFAIL (RObjectType.S4) rdata/tests/test_write.py::test_write[test_ts.rda] PASSED rdata/tests/test_write.py::test_write[test_vector.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_altrep_compact_intseq.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_altrep_compact_intseq_asymmetric.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_altrep_compact_realseq.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_altrep_compact_realseq_asymmetric.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_altrep_deferred_string.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_altrep_wrap_logical.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_altrep_wrap_real.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_altrep_wrap_string.rda] SKIPPED (Type RObjectType.ALTREP not implemented) rdata/tests/test_write.py::test_convert_to_r[test_ascii_v2.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_ascii_v2.rds] PASSED rdata/tests/test_write.py::test_convert_to_r[test_ascii_v3.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_ascii_v3.rds] PASSED rdata/tests/test_write.py::test_convert_to_r[test_ascii_win_v2.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_ascii_win_v2.rds] PASSED rdata/tests/test_write.py::test_convert_to_r[test_ascii_win_v3.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_ascii_win_v3.rds] PASSED rdata/tests/test_write.py::test_convert_to_r[test_builtin.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_complex.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_dataframe.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_dataframe.rds] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_dataframe_rownames.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_dataframe_v3.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_dataframe_v3.rds] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_empty_function.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_empty_function_uncompiled.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_empty_str.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_emptyenv.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_encodings.rda] SKIPPED (ambiguous R->py->R transformation) rdata/tests/test_write.py::test_convert_to_r[test_encodings_v3.rda] SKIPPED (ambiguous R->py->R transformation) rdata/tests/test_write.py::test_convert_to_r[test_environment.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_expression.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_file.rda] SKIPPED (ambiguous R->py->R transformation) rdata/tests/test_write.py::test_convert_to_r[test_full_named_matrix.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_full_named_matrix.rds] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_function.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_function_arg.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_half_named_matrix.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_list.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_list_attrs.rda] SKIPPED (ambiguous R->py->R transformation) rdata/tests/test_write.py::test_convert_to_r[test_logical.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_matrix.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_minimal_function.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_minimal_function_uncompiled.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_na_string.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_named_matrix.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_nullable_int.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_nullable_logical.rda] PASSED rdata/tests/test_write.py::test_convert_to_r[test_s4.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_ts.rda] XFAIL () rdata/tests/test_write.py::test_convert_to_r[test_vector.rda] PASSED ```

These tests do not cover all the functionality of writer, so extra tests need to be added still.

Documentation

No documentation in README or https://rdata.readthedocs.io is included in this PR yet.

Checklist before requesting a review

codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 82.26221% with 69 lines in your changes missing coverage. Please review.

Project coverage is 89.76%. Comparing base (4d160ff) to head (05c4eac). Report is 16 commits behind head on develop.

:exclamation: Current head 05c4eac differs from pull request most recent head 7cf643f

Please upload reports for the commit 7cf643f to get more accurate results.

Files Patch % Lines
rdata/io/__init__.py 29.72% 26 Missing :warning:
rdata/conversion/to_r.py 85.45% 16 Missing :warning:
rdata/tests/test_write.py 81.42% 13 Missing :warning:
rdata/io/xdr.py 88.88% 4 Missing :warning:
rdata/parser/_parser.py 77.77% 4 Missing :warning:
rdata/io/ascii.py 91.66% 3 Missing :warning:
rdata/io/base.py 96.29% 3 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #40 +/- ## =========================================== + Coverage 89.20% 89.76% +0.56% =========================================== Files 10 16 +6 Lines 1176 1563 +387 =========================================== + Hits 1049 1403 +354 - Misses 127 160 +33 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

trossi commented 4 months ago

Hi @vnmabus! Sorry for the delay with this PR. Could you have a look what you think in general about this PR and if such a writing feature would fit within the rdata package?

There is still some work to be done, for example some usage documentation and more tests would still be needed, and the organization of files and the names of classes could be streamlined (e.g., should "Writer" be called something else to match better as the counterpart of "Parser"?).

trossi commented 3 months ago

Thank you for the review! I did the changes and pushed some extra tests too.

I think that naming and file organization could be improved, but probably is better to do that after the merge. I see no other obvious problems.

I agree that the current naming does not align with the existing names. I left the naming open for you to decide what should be the suitable naming convention (e.g. should the current "Writer" be "Deparser" or something else instead to align with "Parser"). I can do some renaming too if you prefer.

But it would be great if we could have already in this PR somewhat stable high-level user interface. For example, is it ok for you if I add write_rds etc (analogously to read_rds etc) to rdata/__init__.py, so that a script like below would keep working even when internal names are changed (similar to the demo code in PR description)?

import numpy as np
from rdata import write_rds

matrix = np.pi * np.arange(12.).reshape(3, 4)
data = [matrix, 'hello', dict(my_key='my_value'), None, 1, 2.2, 3.3+4.4j]

write_rds("demo.rds", data)

Regarding our earlier discussion about authors in pyproject.toml, would you mind including me there? I feel that implementing this initial writer support is somewhat sizeable contribution to the package.

trossi commented 3 months ago

Is there a way to regenerate the code coverage analysis? The coverage should be now better than reported above, but I could still add more tests to cover missing branches.

vnmabus commented 2 months ago

I agree that the current naming does not align with the existing names. I left the naming open for you to decide what should be the suitable naming convention (e.g. should the current "Writer" be "Deparser" or something else instead to align with "Parser"). I can do some renaming too if you prefer.

But it would be great if we could have already in this PR somewhat stable high-level user interface. For example, is it ok for you if I add write_rds etc (analogously to read_rds etc) to rdata/__init__.py, so that a script like below would keep working even when internal names are changed (similar to the demo code in PR description)?

write_rds and write_rdata are clear candidates for a stable high-level user interface.

For the parsing step, we could use unparse_file and unparse_data, as mirror images of parse_file and parse_data. The unparse nomenclature was used in the ast module, so I hope it is not too surprising.

The convert analog is less clear, as the name convert does not imply a particular direction. I would propose to add two methods to Converter, as well as the corresponding functions: python_to_r and r_to_python. convert would then be an alias to r_to_python, and we may decide to deprecate it.

Regarding our earlier discussion about authors in pyproject.toml, would you mind including me there? I feel that implementing this initial writer support is somewhat sizeable contribution to the package.

I do not mind at all. There is no particular contribution threshold to appear there. Please add yourself to pyproject.toml, either as part of this PR or in a different one, and I will review it when I have time (hopefully soon). Is there a particular reason for you to want to appear there?

trossi commented 2 months ago

write_rds and write_rdata are clear candidates for a stable high-level user interface.

I added this interface to rdata/_write.py and rdata/__init__.py.

For the parsing step, we could use unparse_file and unparse_data, as mirror images of parse_file and parse_data. The unparse nomenclature was used in the ast module, so I hope it is not too surprising.

I have renamed writing to unparsing throughout to have more symmetric naming with parsers and added also unparse_data() operating on bytestring similarly to parse_data().

The convert analog is less clear, as the name convert does not imply a particular direction. I would propose to add two methods to Converter, as well as the corresponding functions: python_to_r and r_to_python. convert would then be an alias to r_to_python, and we may decide to deprecate it.

I feel that having both conversions under a single Converter class might make it rather complex. For example, some state variables would not likely be needed for both directions (e.g., the current R-to-Python Converter has default_encoding and force_default_encoding for handling various encodings the input RData object might contain, but Python-to-R is simpler in this respect as we need to only specify which encoding to use for the output RData).

I didn't do any refactoring for the conversion part and it is as before.

I do not mind at all. There is no particular contribution threshold to appear there. Please add yourself to pyproject.toml, either as part of this PR or in a different one, and I will review it when I have time (hopefully soon). Is there a particular reason for you to want to appear there?

Great, thanks, added! I'm thinking this in case it might be useful when reporting work (not sure if it makes any difference though). I left emails out from the authors list so that you remain as the main contact point (in the maintainers list too with an email).

I also added a few more tests and the following section to pyproject.toml to exclude type checking sections from coverage report (3d815f1):

[tool.coverage.report]
exclude_also = [
    "if TYPE_CHECKING:",
    ]

Let me know if this change is ok for you.

@vnmabus This PR would be ready from my side now unless you have some change requests.

vnmabus commented 2 months ago

I will review as soon as I can.

I feel that having both conversions under a single Converter class might make it rather complex. For example, some state variables would not likely be needed for both directions (e.g., the current R-to-Python Converter has default_encoding and force_default_encoding for handling various encodings the input RData object might contain, but Python-to-R is simpler in this respect as we need to only specify which encoding to use for the output RData).

I am not heavily opposed to have two kinds of converters either, although the advantage of having one is that we could try to make roundtrip (R -> Python -> R and Python -> R -> Python) work for at least the common cases. If we use different classes for each conversion it is not immediately obvious for which combinations rountrip even makes sense.

Great, thanks, added! I'm thinking this in case it might be useful when reporting work (not sure if it makes any difference though). I left emails out from the authors list so that you remain as the main contact point (in the maintainers list too with an email).

I asked because one of the things I wanted to do when I am less busy is to submit the package to the Journal of Open Source Software (JOSS). As the package was already approved by PyOpenSci, it should be fast-tracked (provided that they consider it in scope), so I would only need to write a small paper. I already have a description of the library from my PhD thesis, so I would only need to adapt it to the final format. If we merge this before submitting to JOSS, I would need to also add a small description about the writing capabilities, and I could add you as coauthor, if you need/want to be recognized. Note that although the JOSS is a peer-reviewed journal, it is still not indexed in Web of Science, so it may not be taken into account for academic merits in some countries, such as Spain, which tend to disregard publications not in JCR.

I also added a few more tests and the following section to pyproject.toml to exclude type checking sections from coverage report (https://github.com/vnmabus/rdata/commit/3d815f1b8c8fdcd672a8e4b4bed40f9d2a746379).

That looks useful indeed. I would have done it in a separate PR, but it does not matter much.

trossi commented 2 months ago

I am not heavily opposed to have two kinds of converters either, although the advantage of having one is that we could try to make roundtrip (R -> Python -> R and Python -> R -> Python) work for at least the common cases. If we use different classes for each conversion it is not immediately obvious for which combinations rountrip even makes sense.

Such roundtrip is currently implemented in the added test, but it's true that it's not obvious without running the code which types work for the roundtrip. This could indeed be refactored.

I asked because one of the things I wanted to do when I am less busy is to submit the package to the Journal of Open Source Software (JOSS). As the package was already approved by PyOpenSci, it should be fast-tracked (provided that they consider it in scope), so I would only need to write a small paper. I already have a description of the library from my PhD thesis, so I would only need to adapt it to the final format. If we merge this before submitting to JOSS, I would need to also add a small description about the writing capabilities, and I could add you as coauthor, if you need/want to be recognized. Note that although the JOSS is a peer-reviewed journal, it is still not indexed in Web of Science, so it may not be taken into account for academic merits in some countries, such as Spain, which tend to disregard publications not in JCR.

I think JOSS is a great journal and I would be happy to co-author if that is fine with you. Thanks a lot for asking! I can also contribute in writing the paper and adding documentation about the writing functionality to the readthedocs pages. But it's up to you if you would prefer to restrict the JOSS paper to the reading capabilities of the package only.

trossi commented 4 weeks ago

@vnmabus Thank you for the thorough review and the update regarding JOSS!