jdunck / python-unicodecsv

Python 2's stdlib csv module is nice, but it doesn't support Unicode. This module is a drop-in replacement which *does*. If you prefer Python 3's semantics but need support in Python 2, you probably want https://github.com/ryanhiebert/backports.csv
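
For context, typical usage looks something like this (a minimal sketch; the file name is hypothetical):

```python
import unicodecsv

# unicodecsv wraps the stdlib csv module: it reads bytes and decodes
# each cell value with the given encoding.
with open('data.csv', 'rb') as f:
    for row in unicodecsv.reader(f, encoding='utf-8'):
        print row  # each cell is a unicode object
```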

Unsupported encodings? #59

Open kengruven opened 9 years ago

kengruven commented 9 years ago

I wanted to parse a UTF-16 CSV file, so I did something like this:

r = unicodecsv.reader(f, encoding='UTF-16')

Unfortunately, this just raises an exception as soon as I try to read from it. I looked at the unicodecsv source code, and I don't think the unicodecsv approach can ever work for this case: it reads the input stream as 8-bit bytes and only then decodes each cell value. Python 2's csv module can't handle NUL bytes, which are pervasive in UTF-16, so this fails.
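
To illustrate (an interactive Python 2 session; the exact error text may vary across versions):

```python
>>> import csv
>>> data = u'a,b\r\n'.encode('utf-16-le')
>>> data
'a\x00,\x00b\x00\r\x00\n\x00'
>>> list(csv.reader([data]))
Traceback (most recent call last):
  ...
_csv.Error: line contains NULL byte
```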

I think the answer is that the unicodecsv library only works for encodings like UTF-8 or Latin-1 that are ASCII-compatible supersets and never produce 0x00 bytes. Is this true? We should put it in the documentation.

(Also, I think this means I should really upgrade to Python 3!)

jdunck commented 9 years ago

You're right that the underlying reader is byte-centric and the wrapper approach falls down on null bytes. None of CSV's control characters fall outside ASCII -- are you saying that UTF-16 encodings of these control characters include null bytes? (I've not encountered UTF-16 CSVs in my work.)

kengruven commented 9 years ago

Yes. The UTF-16 encoding of an ASCII file is simply that ASCII file with a null byte inserted after each byte (for UTF-16LE) or before each byte (for UTF-16BE).
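
For example, in a Python 2 shell (the plain 'utf-16' codec also prepends a BOM; shown here on a little-endian machine):

```python
>>> u'a'.encode('utf-16-le')  # NUL after each ASCII byte
'a\x00'
>>> u'a'.encode('utf-16-be')  # NUL before each ASCII byte
'\x00a'
>>> u'a'.encode('utf-16')     # BOM first, then native-endian code units
'\xff\xfea\x00'
```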

jdunck commented 9 years ago

... I'm a bit surprised I haven't heard complaints about this before. :)

You're right that the approach would need to change to fix this -- namely wrapping the given file in a decoder before handing it to the underlying csv.reader.

ryanhiebert commented 8 years ago

Does this only affect the reader, or does it also affect the writer?

I think the "right" solution is to create a backport of the Python 3 csv module, which only works on unicode, and wrap the file being read in a decoder. However, that is, at best, a ways off.

One possible, but performance-terrible, approach we could take would be to wrap the file in a decoder, and then in an encoder to some acceptable encoding (UTF-8), which would then be fed into the underlying csv.reader.
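
A rough sketch of that decode-then-re-encode idea (a hypothetical helper, not unicodecsv's actual API; it mirrors the recipe in the Python 2 csv docs):

```python
import codecs
import csv

def recoded_reader(f, encoding, **kwargs):
    """Decode f from `encoding`, re-encode as UTF-8, and feed csv.reader."""
    def utf_8_lines(stream):
        # codecs.getreader yields unicode lines, which we re-encode so the
        # byte-oriented csv module never sees a NUL byte.
        for line in codecs.getreader(encoding)(stream):
            yield line.encode('utf-8')

    for row in csv.reader(utf_8_lines(f), **kwargs):
        yield [cell.decode('utf-8') for cell in row]

# Usage: for row in recoded_reader(open('utf16.csv', 'rb'), 'utf-16'): ...
```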

ryanhiebert commented 8 years ago

> One possible, but performance-terrible, approach we could take would be to wrap the file in a decoder, and then in an encoder to some acceptable encoding (UTF-8), which would then be fed into the underlying csv.reader.

It turns out the Python 2 csv module has exactly that as an example at the end of its documentation: https://docs.python.org/2/library/csv.html#examples

I've now written a pure-Python backport of the Python 3 csv module, so we could choose that approach to solving the problem. It would mean using the same code as the Python 3 version of unicodecsv whenever we detect an encoding other than ASCII or UTF-8.
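
With the backport, reading UTF-16 would follow the Python 3 idiom (a sketch assuming the backports.csv package linked above; the file name is hypothetical):

```python
import io
from backports import csv  # pure-Python backport of the Python 3 csv module

# The py3-style reader consumes unicode text, so we open the file with a
# decoder and any codec Python supports -- including UTF-16 -- just works.
with io.open('data.csv', encoding='utf-16', newline='') as f:
    for row in csv.reader(f):
        print row
```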

But because that implementation is pure Python, it would likely be slower than the special encoding wrapper that ensures csv is always dealing with UTF-8 bytes.

@jdunck: I'm interested in writing up a solution to this problem, but I'm not sure which approach would be better. Is it better to use the encoding wrapper, or to use the backport?