jdunck / python-unicodecsv

Python2's stdlib csv module is nice, but it doesn't support unicode. This module is a drop-in replacement which *does*. If you prefer python 3's semantics but need support in py2, you probably want https://github.com/ryanhiebert/backports.csv

working with streams #73

Open willembressers opened 8 years ago

willembressers commented 8 years ago

I'm writing a MapReduce script (and am thus working with input/output streams).

If I use the unicodecsv module:

#!/usr/bin/python
import sys
import unicodecsv as csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        print line

Then I get the error:

Traceback (most recent call last):
  File "scripts/streaming/adwords/mapper.py", line 30, in <module>
    mapper()
  File "scripts/streaming/adwords/mapper.py", line 10, in mapper
    for line in reader:
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 117, in next
    row = self.reader.next()
_csv.Error: line contains NULL byte

If I read the file with pandas:

data = pandas.read_csv(input_file, encoding='utf-16', sep='\t', skiprows=5, skip_footer=1, engine='python')

then everything works like a charm.

I don't know how to resolve this issue. I have tried almost everything, even opening the file in LibreOffice and saving it as UTF-8, but that can't be a solution because my CSV files are too big for LibreOffice.

If I open the file in LibreOffice, save it as UTF-8, and run the script again, the strings in the lines are prefixed with u. I know this has something to do with encodings, but it's not clear to me how it works.

Preferably I want to read the (Unicode, I guess) input stream, map it line by line (encoding it to UTF-8), and write it with something like writer.writerow((line[0] + line[2], line[5])) so that my reducer.py doesn't have to deal with encodings.

Any help would be deeply appreciated.

ryanhiebert commented 8 years ago

The first issue is that you're not using the right encoding. The unicodecsv reader requires a file opened in binary mode. The reader function takes an encoding argument, which defaults to utf-8; that is incorrect for your file.

Unfortunately, unicodecsv doesn't yet support utf-16, because the underlying reader doesn't allow any null bytes. We have talked about a couple of ideas for fixing it, but none has been implemented yet.
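
One possible workaround, as a rough sketch rather than anything unicodecsv provides: transcode the UTF-16 stream to UTF-8 before the csv reader sees it, so no null bytes reach the underlying parser. The helper name utf16_to_utf8_lines is hypothetical, and the sample data is made up for illustration; only the standard library's codecs and io modules are used.

```python
import codecs
import io


def utf16_to_utf8_lines(binary_stream):
    """Yield UTF-8 encoded lines from a UTF-16 encoded binary stream.

    Hypothetical helper, not part of unicodecsv. binary_stream is any
    object with a read() method that returns bytes.
    """
    # The utf-16 stream reader uses the BOM to detect byte order.
    text_stream = codecs.getreader('utf-16')(binary_stream)
    for line in text_stream:
        yield line.encode('utf-8')


# Made-up sample standing in for the UTF-16 input stream.
sample = u'id\tname\n1\tcaf\u00e9\n'.encode('utf-16')
lines = list(utf16_to_utf8_lines(io.BytesIO(sample)))
```

In the mapper, the generator would presumably be passed in place of the raw stream, for example csv.reader(utf16_to_utf8_lines(sys.stdin), delimiter='\t') on Python 2 (on Python 3 the binary stream is sys.stdin.buffer). I have not verified that against unicodecsv, so treat it as an assumption. The header and footer rows that the pandas call skips would still need to be handled separately.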