turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0

Compressed files input and output #230

Open turicas opened 7 years ago

turicas commented 7 years ago

There are many use cases that involve storing CSV files compressed: the compression ratio is high, so we save space. Sometimes we'd like to work directly on compressed files (decompressing on the fly) instead of storing an uncompressed copy before working on the data.

Some of the following compression formats are widely used:

The Python standard library already provides modules to work with these formats (gzip, lzma and bz2), so we just need some helper functions to use these features easily when importing and exporting data.
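As a minimal illustration of why thin helpers should be enough: the three standard-library modules expose the same `open()` call, so a single function can round-trip CSV bytes through each of them. The `MODULES` mapping, the `round_trip` function and the sample data are made up for this sketch and are not part of rows.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Map a file extension to the stdlib module that handles it
# (illustrative mapping, not part of rows).
MODULES = {".gz": gzip, ".xz": lzma, ".bz2": bz2}


def round_trip(data, extension):
    """Write `data` compressed, read it back and return what was read."""
    module = MODULES[extension]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv" + extension)
        with module.open(path, mode="wb") as fobj:
            fobj.write(data)
        with module.open(path, mode="rb") as fobj:
            return fobj.read()


sample = b"id,name\n1,Alice\n2,Bob\n"
results = {extension: round_trip(sample, extension) for extension in MODULES}
```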

turicas commented 7 years ago

cc @cuducos

cuducos commented 7 years ago

Sounds awesome — I'd love to code on this issue.

This is my first contact with the rows source code. I think I could add more functions to utils.py, such as local_zip_file, local_lzma_file, etc., that would be basic wrappers around the existing local_file. Does that sound like a good strategy?

turicas commented 7 years ago

@cuducos, I like the idea but was thinking of a more automatic function. Something like this:

import rows
from rows.utils import decompress
table1 = rows.import_from_csv(decompress('filename.csv.gz'))
table2 = rows.import_from_csv(decompress('filename.csv.xz'))
table3 = rows.import_from_csv(decompress('filename.zip', inner='path/filename.csv'))

The same would apply to rows.export_to_*, using compress.

What do you think about this API?

turicas commented 7 years ago

To clarify: the functions would receive a filename or a file object of a compressed file (in any of the supported formats) and then return a file object of the uncompressed content.
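One possible shape for such a helper, as a rough sketch only: because a file object carries no extension, this version detects the format from the leading magic bytes instead. The function body, the `MAGIC` table and the requirement that the file object be seekable are assumptions of this sketch, not the implementation adopted by rows.

```python
import bz2
import gzip
import io
import lzma
import os

# Leading "magic" bytes that identify each compressed format.
MAGIC = (
    (b"\x1f\x8b", gzip.open),
    (b"\xfd7zXZ\x00", lzma.open),
    (b"BZh", bz2.open),
)


def decompress(filename_or_fobj):
    """Hypothetical helper: return a readable binary file object
    with the uncompressed content."""
    if isinstance(filename_or_fobj, (str, bytes, os.PathLike)):
        # NOTE: a real implementation would also need to close this file.
        fobj = open(filename_or_fobj, mode="rb")
    else:
        fobj = filename_or_fobj  # assumed to be binary and seekable
    header = fobj.read(6)
    fobj.seek(0)
    for magic, opener in MAGIC:
        if header.startswith(magic):
            return opener(fobj, mode="rb")
    return fobj  # not compressed, or an unknown format


raw = b"id,name\n1,Alice\n"
payloads = [gzip.compress(raw), lzma.compress(raw), bz2.compress(raw), raw]
decoded = [decompress(io.BytesIO(payload)).read() for payload in payloads]
```

The returned object is binary, so a caller that needs text would still have to wrap it (for example with `io.TextIOWrapper`).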

cuducos commented 7 years ago

Hum… I must say I'm not fully convinced…

What are the pros & cons of these two other possibilities?


# example 1
table1 = rows.import_from_gz('filename.csv.gz')
table2 = rows.import_from_xz('filename.csv.xz')
table3 = rows.import_from_zip('filename.zip', inner='path/filename.csv')

# example 2
table1 = rows.import_from_csv('filename.csv.gz')
table2 = rows.import_from_csv('filename.csv.xz')
table3 = rows.import_from_csv('filename.zip', inner='path/filename.csv')

Maybe example 2 could guess the compression type from the file extension or MIME type (not sure) and offer an explicit argument, compression='zip' for instance.
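A minimal sketch of the extension-based guess with an explicit override. The function name `guess_compression`, the `EXTENSIONS` table and the compression names are hypothetical. Zip is left out because it is an archive format as well as a compression format.

```python
import os

# Extension -> compression name (illustrative names, not part of rows).
EXTENSIONS = {".gz": "gzip", ".xz": "lzma", ".bz2": "bz2"}


def guess_compression(filename, compression=None):
    """Return the compression name for `filename`, or None if the
    extension is not a known compressed format."""
    if compression is not None:
        # An explicit argument always wins over the guess.
        if compression not in EXTENSIONS.values():
            raise ValueError("Unknown compression: {!r}".format(compression))
        return compression
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(extension)


guessed = guess_compression("filename.csv.gz")
forced = guess_compression("filename.bin", compression="bz2")
plain = guess_compression("filename.csv")
```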

turicas commented 7 years ago

Let's say we want to import a compressed HTML file instead of a compressed CSV file. In example 1, the functions would need to identify the file type and then call the correct plugin, which has implications such as inspecting the uncompressed contents when there is no known extension.

In example 2 we delegate decompression to the plugins, so some plugins would support it and some would not. It requires changes to all plugins, and the final code will be more tightly coupled.

The idea of having a separate helper function is to solve these problems:

The idea, in my opinion, is to have an open()-like function which does the job (compress will return a writable file object and decompress a readable one). We can start with it receiving the filename, but maybe support receiving a file object too. We still need to check whether this approach would interfere with rows.plugins.utils.get_filename_and_fobj.
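For the writing side, an open()-like `compress` could look roughly like the sketch below. It picks the opener from the file extension and returns a writable file object. The `encoding` parameter and the fallback to the built-in `open` for unknown extensions are assumptions made for this example.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Extension -> stdlib opener (illustrative mapping, not part of rows).
OPENERS = {".gz": gzip.open, ".xz": lzma.open, ".bz2": bz2.open}


def compress(filename, encoding=None):
    """Hypothetical helper: open `filename` for writing, compressing on
    the fly. Returns a binary file object, or a text one when `encoding`
    is given."""
    extension = os.path.splitext(filename)[1].lower()
    opener = OPENERS.get(extension, open)  # unknown extension: plain file
    if encoding is None:
        return opener(filename, mode="wb")
    return opener(filename, mode="wt", encoding=encoding)


# Usage: write text through the helper, then read it back with lzma.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv.xz")
    with compress(path, encoding="utf-8") as fobj:
        fobj.write("id,name\n1,Alice\n")
    with lzma.open(path, mode="rt", encoding="utf-8") as fobj:
        content = fobj.read()
```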

cuducos commented 7 years ago

Makes tons of sense… thanks for expanding on it. I hope it was worth it for you as much as it was for me ; )

Gonna try some code one of these days and open a PR!

cuducos commented 7 years ago

If anyone else is interested in the progress of this issue I just got started.

turicas commented 7 years ago

I've changed the description of this issue to focus on compress/decompress only (so we're not going to implement Zip support here, since Zip both compresses and archives files). Archive support will be implemented in https://github.com/turicas/rows/issues/236.

turicas commented 6 years ago

This code sample may help (note that get_filename_and_fobj receives and returns binary file objects and this sample will return text ones):

import io
try:
    import lzma
except ImportError:
    # NOTE: lzma is not always compiled into Python; a backport such as
    # https://github.com/peterjc/backports.lzma may be used instead
    from backports import lzma

# READING:
# get `output_filename` and `encoding` variables
fobj = io.TextIOWrapper(lzma.open(output_filename, mode='r'), encoding=encoding)

# WRITING:
# get `output_filename` and `encoding` variables
fobj = io.TextIOWrapper(lzma.open(output_filename, mode='w'), encoding=encoding)
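For reference, a small round trip under the same idea. `lzma.open` also accepts text modes (`"rt"`/`"wt"`) with an `encoding` argument, which wraps the binary file in an `io.TextIOWrapper` internally. The temporary file name and sample text below are placeholders for this sketch.

```python
import io
import lzma
import os
import tempfile

encoding = "utf-8"
text = "id,name\n1,Alice\n"

with tempfile.TemporaryDirectory() as tmp:
    output_filename = os.path.join(tmp, "sample.csv.xz")

    # WRITING: explicit TextIOWrapper around the binary LZMA file.
    binary_fobj = lzma.open(output_filename, mode="w")
    with io.TextIOWrapper(binary_fobj, encoding=encoding) as fobj:
        fobj.write(text)

    # READING: text mode, where lzma.open does the wrapping itself.
    with lzma.open(output_filename, mode="rt", encoding=encoding) as fobj:
        recovered = fobj.read()
```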