Open turicas opened 7 years ago
cc @cuducos
Sounds awesome, I'd love to code on this issue.
This is my first contact with the rows source code. I think I could add more functions to utils.py, such as local_zip_file, local_lzma_file, etc., that would be basic wrappers around the existing local_file. Does that sound like a good strategy?
@cuducos, I like the idea but was thinking of a more automatic function. Something like this:
from rows.utils import decompress
table1 = rows.import_from_csv(decompress('filename.csv.gz'))
table2 = rows.import_from_csv(decompress('filename.csv.xz'))
table3 = rows.import_from_csv(decompress('filename.zip', inner='path/filename.csv'))
The same would apply to rows.export_to_*, using compress.
What do you think about this API?
To clarify: the functions would receive a filename or a file object of a compressed file (in any of the supported formats) and then return a file object of the uncompressed content.
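A minimal sketch of what such a helper could look like, assuming the format is picked from the file extension. The `_OPENERS` table and the `algorithm` parameter are hypothetical names used for illustration, not part of rows; zip is left out because it also needs the `inner` path.

```python
import bz2
import gzip
import lzma
import os
from pathlib import Path

# Hypothetical table mapping an extension to a standard library opener.
_OPENERS = {"gz": gzip.open, "xz": lzma.open, "lzma": lzma.open, "bz2": bz2.open}


def decompress(filename_or_fobj, algorithm=None):
    """Return a readable binary file object with the uncompressed content.

    Accepts a filename or an already opened binary file object.
    `algorithm` (assumed parameter) overrides the extension-based guess.
    """
    if algorithm is None:
        if isinstance(filename_or_fobj, (str, os.PathLike)):
            name = os.fspath(filename_or_fobj)
        else:
            # File objects opened from disk usually expose `.name`
            name = getattr(filename_or_fobj, "name", "")
        algorithm = Path(str(name)).suffix.lower().lstrip(".")
    if algorithm not in _OPENERS:
        raise ValueError("Unknown compression format: {!r}".format(algorithm))
    # gzip.open, lzma.open and bz2.open all accept a filename or a file object
    return _OPENERS[algorithm](filename_or_fobj, mode="rb")
```

The returned object is binary, so a caller that needs text would still have to wrap it (for example with io.TextIOWrapper) before handing it to a plugin.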
Hum… I must say I'm not fully convinced…
What are the pros & cons of these two other possibilities?
# example 1
table1 = rows.import_from_gz('filename.csv.gz')
table2 = rows.import_from_xz('filename.csv.xz')
table3 = rows.import_from_zip('filename.zip', inner='path/filename.csv')
# example 2
table1 = rows.import_from_csv('filename.csv.gz')
table2 = rows.import_from_csv('filename.csv.xz')
table3 = rows.import_from_csv('filename.zip', inner='path/filename.csv')
Maybe example 2 could guess the compression type from the file extension or MIME type (not sure) and offer an explicit argument, compression='zip' for instance.
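As a rough illustration of the guessing step, here is a sketch that looks at the first bytes of the file instead of the extension or MIME type (a different technique from the one suggested above, shown because it also works when the extension is missing). The function name `guess_compression` is hypothetical; the `compression` argument mirrors the explicit override proposed above. The legacy .lzma format has no reliable signature, so it is not detected here.

```python
# Well-known file signatures for gzip, xz and bzip2.
_SIGNATURES = (
    (b"\x1f\x8b", "gz"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"BZh", "bz2"),
)


def guess_compression(filename, compression=None):
    """Return 'gz', 'xz', 'bz2' or None (unknown or not compressed).

    An explicit `compression` value always wins over the guess.
    """
    if compression is not None:
        return compression
    with open(filename, "rb") as fobj:
        header = fobj.read(6)  # the longest signature above has 6 bytes
    for signature, name in _SIGNATURES:
        if header.startswith(signature):
            return name
    return None
```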
Let's say we would like to import a compressed HTML file instead of a compressed CSV file. In example 1 the methods would need to identify the file type and then call the correct plugin, which has implications such as looking inside the uncompressed file contents if there's no known extension.
In example 2 we delegate the ability to decompress to the plugin, so some plugins would use it and some would not. It requires changes to all plugins, and the final code would be more coupled.
The idea of having a separate helper function is to solve these problems:
csv module, for example).
The idea, in my opinion, is to have an open()-like function that does the job (compress will return a writable file object and decompress a readable one). We can start with it receiving the filename and maybe support receiving a file object too. We still need to check whether this approach would interfere with rows.plugins.utils.get_filename_and_fobj.
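To make the open()-like idea concrete, here is a sketch of the writing side only, assuming the format is chosen from the extension and the caller wants a text file object. The `_WRITERS` table and the `encoding` parameter are assumptions for illustration. The legacy .lzma extension is left out because lzma.open writes the .xz container by default.

```python
import bz2
import gzip
import lzma
import os

# Hypothetical extension table; not part of rows.
_WRITERS = {".gz": gzip.open, ".xz": lzma.open, ".bz2": bz2.open}


def compress(filename, encoding="utf-8"):
    """Open `filename` for writing and return a writable text file object.

    Whatever is written to the returned object is compressed on the fly.
    """
    extension = os.path.splitext(filename)[1].lower()
    if extension not in _WRITERS:
        raise ValueError("Unknown compression format: {!r}".format(extension))
    # newline="" leaves line-ending handling to the csv module
    return _WRITERS[extension](filename, mode="wt", encoding=encoding, newline="")
```

Because the result behaves like the return value of open(), it can be passed directly to csv.writer, and closing it (or leaving a with block) flushes the compressed stream.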
Makes tons of sense… thanks for expanding on it. I hope it was as worthwhile for you as it was for me ;)
Gonna try some code one of these days and open a PR!
If anyone else is interested in the progress of this issue I just got started.
I've changed the description of this issue to focus on compress/decompress only (so we're not going to implement support for Zip here, since it both compresses and archives files). Archive support will be implemented in https://github.com/turicas/rows/issues/236.
This code sample may help (note that get_filename_and_fobj receives and returns binary file objects, while this sample returns text ones):
import io
import lzma
# NOTE: should `try: import lzma except: ...` because it's not always compiled
# may use https://github.com/peterjc/backports.lzma
# READING:
# get `output_filename` and `encoding` variables
fobj = io.TextIOWrapper(lzma.open(output_filename, mode='r'), encoding=encoding)
# WRITING:
# get `output_filename` and `encoding` variables
fobj = io.TextIOWrapper(lzma.open(output_filename, mode='w'), encoding=encoding)
There are a lot of use cases which involve storing CSV files compressed since the compression ratio is high, so we can save space. Sometimes we'd like to work directly on compressed files (so decompress on the fly) instead of storing the uncompressed version before working on the data.
Some of the following compression formats are widely used:
filename.csv.gz
filename.csv.xz
filename.csv.lzma
filename.bz2
The Python standard library already provides modules to work with these formats (gzip, lzma and bz2), so we just need some helper functions to use these features easily when importing and exporting data.
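As a quick way to see the three standard library modules side by side, the sketch below compresses the same made-up CSV payload in memory with each of them. The helper name `compressed_sizes` and the sample data are purely illustrative; real ratios depend on the data.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data):
    """Return the compressed size in bytes of `data` for each module."""
    return {
        "gzip": len(gzip.compress(data)),
        "lzma": len(lzma.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


# Made-up, highly repetitive CSV content (illustrative only)
sample = b"id,name,city\n" + b"".join(
    b"%d,example,somewhere\n" % number for number in range(10000)
)
sizes = compressed_sizes(sample)
for name, size in sorted(sizes.items()):
    print("{}: {} -> {} bytes".format(name, len(sample), size))
```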