Working with streams

I'm writing a MapReduce script, so I'm working with input/output streams.

If I use the unicodecsv module:

#!/usr/bin/python
import sys
import unicodecsv as csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        print line

mapper()

Then I get this error:

Traceback (most recent call last):
  File "scripts/streaming/adwords/mapper.py", line 30, in <module>
    mapper()
  File "scripts/streaming/adwords/mapper.py", line 10, in mapper
    for line in reader:
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 117, in next
    row = self.reader.next()
_csv.Error: line contains NULL byte

If I read the file with pandas:

data = pandas.read_csv(input_file, encoding='utf-16', sep='\t', skiprows=5, skip_footer=1, engine='python')

then everything works like a charm.
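The encoding='utf-16' argument in the pandas call suggests the file is UTF-16 encoded, which would be consistent with the error above: in UTF-16 every ASCII character is stored with an accompanying zero byte, and Python 2's byte-oriented csv parser rejects NUL bytes. A minimal check of that assumption, using a made-up sample row rather than the real file:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample row; the real file's contents are not shown in this issue.
sample = u"campaign\tclicks\n"

utf16_bytes = sample.encode("utf-16")
utf8_bytes = sample.encode("utf-8")

# In UTF-16 each ASCII character takes two bytes, one of which is 0x00.
utf16_has_nul = b"\x00" in utf16_bytes
utf8_has_nul = b"\x00" in utf8_bytes

print("UTF-16 contains NUL bytes: %s" % utf16_has_nul)
print("UTF-8 contains NUL bytes: %s" % utf8_has_nul)
```

If this is indeed the cause, the stream has to be decoded from UTF-16 before the csv parser sees it.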

I don't know how to resolve this. I have tried almost everything, even opening the file in LibreOffice and saving it as UTF-8, but that can't be the solution because my CSV files are too big for LibreOffice.

If I open the file in LibreOffice, save it as UTF-8, and run the script again, the strings in each line are prefixed with u. I know this has something to do with encodings, but it's not clear to me how it works.

Preferably, I want to read the input stream (Unicode, I guess), map it line by line, encode it to UTF-8, and write it with something like writer.writerow((line[0] + line[2], line[5])), so that my reducer.py doesn't have to deal with encodings.

Any help would be deeply appreciated.
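One way the recoding step described above might look, as a sketch only: it assumes the input really is UTF-16 (as the pandas call suggests) and uses a hypothetical helper named utf8_lines that is not part of unicodecsv. The helper wraps a binary stream, decodes it line by line, and yields each line as UTF-8 bytes, so nothing downstream sees NUL bytes. The sample data is made up for illustration.

```python
# -*- coding: utf-8 -*-
import io


def utf8_lines(byte_stream, source_encoding="utf-16"):
    """Decode a binary stream line by line and yield each line as UTF-8 bytes.

    Hypothetical helper; source_encoding="utf-16" is an assumption about the input.
    """
    # newline="" keeps the original line endings instead of translating them.
    text_stream = io.TextIOWrapper(byte_stream, encoding=source_encoding, newline="")
    for line in text_stream:
        yield line.encode("utf-8")


# Made-up UTF-16 input standing in for the real stdin stream.
sample = u"id\tname\n1\tcaf\u00e9\n".encode("utf-16")
recoded = list(utf8_lines(io.BytesIO(sample)))

for utf8_line in recoded:
    print(repr(utf8_line))
```

In a streaming job the binary stream would presumably be stdin reopened in binary mode, for example io.open(sys.stdin.fileno(), 'rb'), and the yielded lines could then be passed to csv.reader(..., delimiter='\t') in place of sys.stdin. The sketch does not skip the header and footer rows that the pandas call drops with skiprows and skip_footer.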

jdunck / python-unicodecsv

working with streams #73