Open willembressers opened 8 years ago
The first issue is that you're not using the right encoding. The `unicodecsv` reader requires a binary-opened file. The `reader` function takes an `encoding` argument, which defaults to `utf-8`, which is incorrect for you.
Unfortunately, `unicodecsv` doesn't yet support `utf-16`, because the underlying reader doesn't allow for any null bytes. We have talked about a couple of ideas for fixing it, but it hasn't been implemented yet.
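Until that is implemented, one possible workaround is to decode the stream before the CSV parser sees it, so the parser never encounters the null bytes. The following is only a rough sketch: it assumes Python 3, assumes the input really is UTF-16, and uses the standard-library `csv` module instead of `unicodecsv`. The function name `read_utf16_csv` is made up for illustration.

```python
import csv
import io


def read_utf16_csv(binary_stream, encoding="utf-16"):
    """Yield rows (lists of str) from a binary stream of CSV data.

    Hypothetical helper: the default encoding is an assumption about the input.
    """
    # Decode first; newline="" lets the csv module handle line endings itself.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    for row in csv.reader(text_stream):
        yield row


# Small in-memory demonstration with made-up data.
sample = "a,b,c\r\n1,é,3\r\n".encode("utf-16")
rows = list(read_utf16_csv(io.BytesIO(sample)))
```

If the file turns out to use a different encoding, only the `encoding` argument would need to change.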
I'm writing a MapReduce script (and thus am working with input/output streams).
If I use the `unicodecsv` module then I get the error:
If I read the `file` with `pandas` then everything works like a charm.
I don't know how to resolve this issue. I have tried almost everything, even opening the file with LibreOffice and saving it as utf-8, but that can't be a solution because my CSV files are too big for LibreOffice.
If I open / save the file with LibreOffice in `utf-8` and run the script again, the strings in the lines are prefixed with `u`. I know this has something to do with encodings, but it's not clear to me how it works.
Preferably I want to read the (unicode, I guess) input stream, map it line by line (and encode it to utf-8) and write it like
writer.writerow((line[0] + line[2], line[5]))
so that my reducer.py doesn't have to deal with encodings. Any help would be deeply appreciated.
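For what it's worth, the mapper described above could look roughly like the sketch below. It assumes Python 3, a UTF-16 input stream, and rows with at least six columns. `run_mapper` is a hypothetical name, and the standard-library `csv` module stands in for `unicodecsv`.

```python
import csv
import io


def run_mapper(binary_in, binary_out, input_encoding="utf-16"):
    """Read CSV rows from binary_in and write (col0 + col2, col5) as UTF-8 CSV.

    Hypothetical sketch: the input encoding is an assumption, not a known fact.
    """
    text_in = io.TextIOWrapper(binary_in, encoding=input_encoding, newline="")
    text_out = io.TextIOWrapper(binary_out, encoding="utf-8", newline="")
    writer = csv.writer(text_out)
    for line in csv.reader(text_in):
        if len(line) < 6:
            continue  # skip rows that are too short to map
        writer.writerow((line[0] + line[2], line[5]))
    text_out.flush()
    # Detach so the caller's underlying binary streams stay open.
    text_in.detach()
    text_out.detach()


# In-memory demonstration with made-up data.
source = io.BytesIO("né,x,k2,x,x,ü\r\n".encode("utf-16"))
sink = io.BytesIO()
run_mapper(source, sink)
result = sink.getvalue()
```

In a real streaming job the call would presumably be `run_mapper(sys.stdin.buffer, sys.stdout.buffer)`, so the reducer only ever receives UTF-8 bytes.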