turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0

Not being able to import csv after opening compressed file #284

Closed berinhard closed 6 years ago

berinhard commented 6 years ago

I have a sample.csv.gz file and I'm getting the following error when trying to open it with rows.utils.open_compressed:

In [3]: from rows.utils import open_compressed

In [4]: import rows

In [5]: with open_compressed('sample.csv.gz') as fd:
   ...:     data = rows.import_from_csv(fd)
   ...:     for d in data:
   ...:         print(d)
   ...:         
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-63853ffa5786> in <module>()
      1 with open_compressed('sample.csv.gz') as fd:
----> 2     data = rows.import_from_csv(fd)
      3     for d in data:
      4         print(d)
      5 

~/.virtualenvs/poder360-resultados/local/lib/python3.6/site-packages/rows/plugins/plugin_csv.py in import_from_csv(filename_or_fobj, encoding, dialect, sample_size, *args, **kwargs)
    104     if dialect is None:
    105         dialect = discover_dialect(sample=read_sample(fobj, sample_size),
--> 106                                    encoding=encoding)
    107 
    108     reader = unicodecsv.reader(fobj, encoding=encoding, dialect=dialect)

~/.virtualenvs/poder360-resultados/local/lib/python3.6/site-packages/rows/plugins/plugin_csv.py in discover_dialect(sample, encoding, delimiters)
     62         while not finished:
     63             try:
---> 64                 decoded = sample.decode(encoding)
     65 
     66             except UnicodeDecodeError as exception:

AttributeError: 'str' object has no attribute 'decode'
turicas commented 6 years ago

For now, all the functions expect a file opened in binary mode (if you're passing a file-like object) and an encoding parameter if it's different from UTF-8. We may adapt the functions to also work with text file objects or, for now, raise an exception if a text-mode file object is provided (could you please create a PR for the CSV plugin?). Also, open_compressed is not meant (for now) to be used as a context manager (sorry, I still need to test/implement some things).
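The "raise an exception" idea could be sketched like this; `ensure_binary_fobj` is a hypothetical helper name, and the real CSV plugin may implement the check differently:

```python
import io


def ensure_binary_fobj(fobj):
    """Fail fast when a text-mode file object is passed.

    Hypothetical helper illustrating the check suggested above,
    not the actual rows implementation.
    """
    # Text-mode objects (open(..., "r"), io.StringIO) yield str, but the
    # CSV plugin needs bytes so it can decode them with `encoding` itself.
    if isinstance(fobj, io.TextIOBase):
        raise TypeError(
            "expected a binary file object (e.g. opened with mode='rb'); "
            "got a text-mode file object"
        )
    return fobj


# Binary objects pass through untouched:
ensure_binary_fobj(io.BytesIO(b"a,b\n1,2\n"))

# Text objects fail fast with a helpful message instead of the
# confusing "'str' object has no attribute 'decode'" seen above:
try:
    ensure_binary_fobj(io.StringIO("a,b\n1,2\n"))
except TypeError as exc:
    print(exc)
```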

A few small changes to your code will do the job:

import rows
from rows.utils import open_compressed

filename = 'data/balneabilidade-bahia/balneabilidade.csv.xz'
encoding = 'utf-8'
fobj = open_compressed(filename, mode='rb', encoding=encoding)
table = rows.import_from_csv(fobj)
for row in table:
    print(row)

Note that the code above is greedy, so it will load everything into memory to create the table (I'm still working on lazy evaluation of tables). If your CSV is big, you can use this approach instead (it will work for any CSV dialect):

import csv

import rows
from rows.utils import open_compressed

filename = 'data/balneabilidade-bahia/balneabilidade.csv.xz'
encoding = 'utf-8'

# First, open the file (in binary mode) to detect its dialect using a 1MiB sample
fobj = open_compressed(filename, mode='rb')
dialect = rows.plugins.csv.discover_dialect(fobj.read(1024 ** 2), encoding=encoding)

# Now open again (in text mode) to read it lazily
fobj = open_compressed(filename, encoding=encoding)
reader = csv.DictReader(fobj, dialect=dialect)
for row in reader:
    print(row)

If you'd like to use the same rows interface and have all values converted, you can take a sample, import a table from that sample (so the library can detect the column types) and then import the file lazily (using a simple monkey patch) with the detected types. This is done in the csv2sqlite function (I forgot to detect the dialect there - will create an issue).
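The sampling pattern (detect column types from a few rows, then convert the full file lazily) can be sketched with the stdlib alone. This is an illustration of the idea, not how rows implements it; `detect_type` and the in-memory data here are made up for the example:

```python
import csv
import io


def detect_type(values):
    """Naive column-type detection from sample values (illustrative only)."""
    for cast in (int, float):
        try:
            for value in values:
                cast(value)
            return cast
        except ValueError:
            continue
    return str


# A small in-memory "file" standing in for the real CSV.
data = "id,score,name\n1,4.5,ana\n2,3.0,bob\n3,5.0,carla\n"

# 1) Detect column types from a sample (here, the first two rows).
sample_reader = csv.DictReader(io.StringIO(data))
sample_rows = [next(sample_reader) for _ in range(2)]
types = {
    field: detect_type([row[field] for row in sample_rows])
    for field in sample_reader.fieldnames
}

# 2) Read the whole file again, lazily converting each value
#    with the type detected from the sample.
reader = csv.DictReader(io.StringIO(data))
converted = ({f: types[f](v) for f, v in row.items()} for row in reader)
for row in converted:
    print(row)
```

The generator in step 2 keeps only one row in memory at a time, which is the point of splitting detection (greedy, on a sample) from conversion (lazy, on the whole file).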

In the future we're going to create a "universal rows import function" that will automatically handle compressed files (like rows.utils.import_from_uri does when discovering the plugin, but also handling compressed files).
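The extension-based dispatch such a function needs could look roughly like this; `pick_plugin`, `COMPRESSION_SUFFIXES` and `PLUGINS` are invented names for the sketch, and the real rows code discovers plugins dynamically:

```python
import pathlib

# Hypothetical registries for the sketch only.
COMPRESSION_SUFFIXES = {".gz", ".xz", ".bz2"}
PLUGINS = {".csv": "import_from_csv", ".json": "import_from_json"}


def pick_plugin(uri):
    """Map a (possibly compressed) filename to an import function name."""
    suffixes = pathlib.PurePath(uri).suffixes
    # Strip a trailing compression suffix, if any: data.csv.gz -> data.csv
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in PLUGINS:
        raise ValueError(f"no plugin found for {uri!r}")
    return PLUGINS[suffixes[-1]]


print(pick_plugin("balneabilidade.csv.xz"))  # import_from_csv
print(pick_plugin("sample.json"))            # import_from_json
```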