vnmabus / rdata

Reader of R datasets in .rda format, in Python
https://rdata.readthedocs.io
MIT License

Faster xdr reader #32

Closed trossi closed 6 months ago

trossi commented 7 months ago

This PR adds a faster reader for files in XDR format. Full arrays are now read directly with NumPy instead of element by element. As a positive side effect, the deprecated xdrlib module is no longer needed.
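To illustrate the difference between the two approaches, here is a minimal sketch, not the code from this PR. It relies only on the fact that XDR stores doubles as big-endian IEEE 754 values; the function names `read_doubles_elementwise` and `read_doubles_bulk` are hypothetical and chosen for illustration.

```python
import struct

import numpy as np


def read_doubles_elementwise(buf: bytes, count: int, offset: int = 0) -> list:
    """Decode `count` big-endian float64 values one at a time."""
    values = []
    for i in range(count):
        # Each XDR double occupies 8 bytes, most significant byte first.
        (value,) = struct.unpack_from(">d", buf, offset + 8 * i)
        values.append(value)
    return values


def read_doubles_bulk(buf: bytes, count: int, offset: int = 0) -> np.ndarray:
    """Decode `count` big-endian float64 values with a single NumPy call."""
    data = np.frombuffer(buf, dtype=">f8", count=count, offset=offset)
    # Convert to native byte order so callers get an ordinary float64 array.
    return data.astype(np.float64)


# Small self-check: both paths decode the same payload to the same values.
payload = struct.pack(">3d", 1.5, -2.0, 3.25)
assert read_doubles_elementwise(payload, 3) == [1.5, -2.0, 3.25]
assert read_doubles_bulk(payload, 3).tolist() == [1.5, -2.0, 3.25]
```

The bulk version replaces a Python-level loop with one buffer view and one vectorized byte-order conversion, which is likely where most of the speedup for large arrays comes from.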

Related to #31. I have cherry-picked and rebased the commits related to the XDR reader improvements into this PR. There are also some structural changes that are open for discussion; for example, the XDR reader is moved to rdata/io/xdr.py to simplify the separation between different readers, such as the (upcoming) rdata/io/ascii.py. @vnmabus Could you review?

codecov-commenter commented 7 months ago

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (d4c1d83) 92.02% compared to head (3a4631e) 91.98%. Report is 11 commits behind head on develop.

| Files | Patch % | Lines |
|---|---|---|
| rdata/parser/_parser.py | 96.15% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop      #32      +/-   ##
===========================================
- Coverage    92.02%   91.98%   -0.05%
===========================================
  Files            6        7       +1
  Lines         1104     1086      -18
===========================================
- Hits          1016      999      -17
+ Misses          88       87       -1
```


trossi commented 7 months ago

Here is approximate timing data for reference:

| Array size (MiB) | Time to read before this PR (s) | Time to read with this PR (s) |
|---|---|---|
| 16 | 1.2 | 0.3 |
| 32 | 2.2 | 0.3 |
| 64 | 4.2 | 0.3 |
| 128 | 8.0 | 0.4 |
| 256 | 18.5 | 0.5 |
| 512 | 39.4 | 0.8 |
| 1024 | 77.3 | 1.5 |

This data was created with the following (in bash):

  1. Generate test data (without compression to skip time spent in decompression): `for i in {1..7}; do n=$(( 2 ** $i )); Rscript -e "saveRDS(runif(n=$n*1024**2), file='array_$i.rds', compress=FALSE)"; done`
  2. Read and measure time: `for i in {1..7}; do echo $i; time -p python -c "from rdata.parser import parse_file; parse_file('array_$i.rds')"; done`
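For anyone who prefers to time only the parsing step from within Python, the measurement loop could be sketched as follows. This is an assumption-laden variant, not what was used for the table above: `time_reader` and `benchmark_rdata` are hypothetical helper names, and unlike `time -p python -c ...` the numbers exclude interpreter startup and import time.

```python
import time
from typing import Callable, Dict, Iterable


def time_reader(reader: Callable[[str], object], paths: Iterable[str]) -> Dict[str, float]:
    """Return the wall-clock seconds `reader` takes for each path."""
    timings = {}
    for path in paths:
        start = time.perf_counter()
        reader(path)
        timings[path] = time.perf_counter() - start
    return timings


def benchmark_rdata() -> None:
    """Time parse_file on array_1.rds ... array_7.rds.

    Assumes rdata is installed and the files from step 1 exist in the
    current directory.
    """
    from rdata.parser import parse_file  # imported lazily on purpose

    files = [f"array_{i}.rds" for i in range(1, 8)]
    for path, seconds in time_reader(parse_file, files).items():
        print(f"{path}: {seconds:.2f} s")
```

Calling `benchmark_rdata()` in the directory that holds the generated files prints one line per file.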

trossi commented 6 months ago

> Sorry for the delay in accepting this. I was on vacation and had a "forced" digital detox. Approving and merging now.

No problem, thank you for merging! I hope you had a relaxing vacation!

I'll open a PR for the ASCII reader next.

vnmabus commented 5 months ago

> Here is approximate timing data for reference:

Just to let you know: I added your example (limited to 5 iterations) as an asv benchmark in the package, to catch future performance regressions. I also added a new testing module (currently undocumented) that retrieves and executes R snippets from strings, so that each test can carry its own R snippet for creating its data, instead of relying on one big script for all tests.
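For readers unfamiliar with asv, a benchmark follows a simple convention: a class with `params`, a `setup` method, and one or more `time_*` methods that asv times for each parameter value. The sketch below is a hypothetical example of that shape, not the benchmark that was added to rdata. To stay self-contained it times a bulk NumPy decode of a synthetic big-endian payload instead of parsing a real .rds file, and the class name and parameter values are invented for illustration.

```python
import numpy as np


class TimeBulkDoubleRead:
    """asv-style benchmark; asv calls setup() before timing each time_* method."""

    # Hypothetical sizes (number of float64 values) used as benchmark parameters.
    params = [2**10, 2**14, 2**18]
    param_names = ["n_values"]

    def setup(self, n_values):
        # Build a deterministic big-endian float64 payload, as XDR would store it.
        rng = np.random.default_rng(0)
        self.payload = rng.random(n_values).astype(">f8").tobytes()

    def time_frombuffer(self, n_values):
        # The timed operation: one bulk decode plus conversion to native order.
        np.frombuffer(self.payload, dtype=">f8", count=n_values).astype(np.float64)
```

Placed in a project's asv benchmark directory, a class like this is discovered and run once per parameter value, so a regression in read time would show up as a change between commits.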