Why don't you compress files?

ofajardo / pyreadr

Python package to read and write R RData and Rds files into/from pandas dataframes. No R or other external dependencies required.

GNU Affero General Public License v3.0


Why don't you compress files? #41

Closed pablodegrande closed 4 years ago

pablodegrande commented 4 years ago

I investigated a bit, and I believe that creating compressed rdata files is no more than calling:

import sys
import gzip
import shutil

with open('uncompressedfile.rdata', 'rb') as f_in:
    with gzip.open('compressedfile.gz.rdata', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

I was wondering why your library doesn't do that while saving... Is there any other format issue I am not aware of?

Thanks a lot, Pablo.

ofajardo commented 4 years ago

No, it is as you say, but it increases the writing time, and there is no general agreement on what the compression should be: you like gzip, but others prefer bzip2, and so on.
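To make the time versus size trade-off concrete, here is a minimal sketch that compresses the same bytes with the three codecs in the Python standard library and reports size and elapsed time. The `compare_codecs` helper and the sample payload are made up for illustration; the payload is repetitive text, not a real RData file, so the numbers will differ on real data.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(payload: bytes) -> dict:
    """Return {codec_name: (compressed_size, seconds)} for each stdlib codec."""
    codecs = {"gzip": gzip.compress, "bzip2": bz2.compress, "xz": lzma.compress}
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(payload)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed), elapsed)
    return results


# Stand-in payload (assumption): repetitive CSV-like bytes, not a real RData file.
sample = b"id,value\n" + b"".join(b"%d,%d\n" % (i, i % 7) for i in range(20000))

results = compare_codecs(sample)
for name, (size, seconds) in results.items():
    print(f"{name}: {len(sample)} -> {size} bytes in {seconds:.4f}s")
```

Which codec wins depends on the data and on whether write time or file size matters more, which is probably why there is no agreed default.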

The other reason is that writing is meant for quick and dirty data exchange with R, so the files will be destroyed after use and their size is not very relevant. If you need to store the files, doing it as rdata or rds sounds like a very bad idea ... Better use arrow.

pablodegrande commented 4 years ago

Great! I will use your library in this project https://github.com/poblaciones/poblaciones (which renders a collaborative, data-oriented map https://poblaciones.org). Users will be OK downloading an rdata file, and I will gzip it for them before retrieval... Thanks a lot!!
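For the download case, the compression could also be done in memory just before the response is sent. This is only a sketch: `gzip_for_download` is a hypothetical helper, and the file contents are placeholders rather than real RData.

```python
import gzip
import tempfile
from pathlib import Path


def gzip_for_download(path: Path) -> bytes:
    """Read a file and return its gzip-compressed bytes (hypothetical helper)."""
    return gzip.compress(path.read_bytes())


# Demonstration with a stand-in file; the bytes are placeholders, not real RData.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "dataset.rdata"
    source.write_bytes(b"placeholder contents " * 100)
    body = gzip_for_download(source)
    print(f"{source.stat().st_size} bytes on disk, {len(body)} bytes to send")
```

This reads the whole file into memory, so for large files the streaming `shutil.copyfileobj` approach from the first comment is likely the safer choice.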

ofajardo commented 4 years ago

Yeah, I see, for your case you need it compressed. Maybe I will add it as an option in the future (the default will be no compression so that it doesn't break existing code).

Just as a piece of advice, the interoperability of R files is terrible. Only R can read and write them correctly, because the format is undocumented and changes all the time. For that reason it would be better to provide files in an interoperable, documented format. But of course, if you have a lot of R users they won't like it (and if you have users from other systems they won't like R formats).

ofajardo commented 4 years ago

OK, gzip compression is implemented as an option in pyreadr 0.3.2:

pyreadr.write_rdata("test.RData", df, df_name="dataset", compress="gzip")

Now I also remember that this was not implemented before partly because it was not high priority, as explained above, but also because there was a bug on Windows that prevented deleting the created files (https://github.com/Roche/pyreadstat/issues/49), which was blocking this.

Hope it helps
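One way to check whether a written file really came out compressed is to look at its first two bytes, since gzip streams start with the magic number 1f 8b. The sketch below assumes the `compress="gzip"` output uses standard gzip framing, which this thread does not confirm; `is_gzip` is a hypothetical helper, and the demo files are stand-ins created with the standard library rather than with pyreadr.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def is_gzip(path: str) -> bool:
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Stand-in files (assumption): written with the stdlib, not with pyreadr.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.bin")
    packed = os.path.join(tmp, "packed.bin.gz")
    with open(plain, "wb") as f:
        f.write(b"not compressed")
    with gzip.open(packed, "wb") as f:
        f.write(b"compressed")
    plain_result = is_gzip(plain)
    packed_result = is_gzip(packed)

print(plain_result, packed_result)
```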

pablodegrande commented 4 years ago

Nice! Thanks!

Pablo De Grande - IDICSO (USAL) / CONICET http://www.aacademica.org/pablo.de.grande
