jdunck / python-unicodecsv

Python2's stdlib csv module is nice, but it doesn't support unicode. This module is a drop-in replacement which *does*. If you prefer python 3's semantics but need support in py2, you probably want https://github.com/ryanhiebert/backports.csv
Other
595 stars 90 forks source link

python-unicodecsv 0.14.1 broke read/write on text-mode files on Py3 #65

Closed zarnovican closed 8 years ago

zarnovican commented 8 years ago

Hi guys,

Unittest on python-cliff project are failing on py34 env. This is due to recent upgrade to python-unicodecsv 0.14.1. When tested with python-unicodecsv 0.14.0 (from pip) it is passing. The commit that broke it is c68ae77dccd15a88b629ef42d882ab601626f1ee.

What's worrying is that version bump to 0.14.1 (b5301ad762c9fa8aead92a563d24b255eb7e8da9) was before that commit. If using unicodecsv from this commit, it is actually passing. So it looks like the build was made from tip of the master at some point.

Simple test:

(unicodecsv34)nlm:~/tmp$ pip install unicodecsv
You are using pip version 6.0.8, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting unicodecsv
  Using cached unicodecsv-0.14.1.tar.gz
Installing collected packages: unicodecsv
  Running setup.py install for unicodecsv
Successfully installed unicodecsv-0.14.1
(unicodecsv34)nlm:~/tmp$ 
(unicodecsv34)nlm:~/tmp$ python
Python 3.4.2 (default, Jan 12 2015, 12:13:20) 
[GCC 4.9.2 20150107 (Red Hat 4.9.2-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodecsv
>>> import sys
>>> writer = unicodecsv.writer(sys.stdout)
>>> writer.writerow(('a', 'b'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zarnovic/venv/unicodecsv34/lib/python3.4/site-packages/unicodecsv/py3.py", line 28, in writerow
    return self.writer.writerow(row)
  File "/home/zarnovic/venv/unicodecsv34/lib/python3.4/site-packages/unicodecsv/py3.py", line 15, in write
    return self.binary.write(string.encode(self.encoding, self.errors))
TypeError: must be str, not bytes
>>> 
ryanhiebert commented 8 years ago

The commit that bumped to version 0.14.1 was made before that commit, but it was released after that commit. The commit history isn't necessarily a good record of releases. He does try to tag commits appropriately, though.

The strategy for Python 3 had been to just proxy the built-in csv module. Unfortunately, that means that it was a completely different API between Python 2 and Python 3, and I'd essentially have to write 2 versions of my CSV parsing and writing code either way. The new strategy allows you to write a single codebase that will work on Python 2 and 3.

I can't speak for him, but I suspect that the reason that this wasn't seen as a backward incompatible change is because this 1) only affected Python 3, and 2) on Python 3, unicodecsv was just a pure, completely untested proxy to the built-in csv module. As I mentioned above, the csv module has different semantics from unicodecsv, on both Python 2 and Python 3.

The change referenced added real support for Python 3, instead of just punting, because the APIs are so different.

Since your build failure is on Python 3 only, I would change your import to something like:

import sys
if sys.version_info[0] == 2:
    import unicodecsv as csv
else:
    import csv

This will give you the same behavior that unicodecsv did before 0.14.1. I'll have to leave it as a decision for @jdunck whether the versions should be changed, but I do think that the way we've landed on is the better path forward. So if you're wanting the fallback to the built-in csv module, I think that you should make it explicit in your code.

I hope that helps clarify the reasons that we switched strategies, and gives you a good path forward to making your work Python 3 compatible.

zarnovican commented 8 years ago

Hi Ryan,

thanks for the fast and informative feedback. I have looked at the problem a little bit deeper and manually tested several combinations (on Linux). The issue is with text vs binary file mode.

Py3,csv,text - :white_check_mark: Py3,csv,binary - :x: "TypeError: 'str' does not support the buffer interface" (is it bug or feature?) Py3,unicodecsv,text - :x: "TypeError: must be str, not bytes" Py3,unicodecsv,binary - :white_check_mark: (I assume that pre-0.14.1 unicodecsv behaves exactly like csv)

You have implemented binary-mode write, but broke text-mode in the process. My guess is that most people would use text mode on CSVs. I have also noticed that most (all?) your unittest are running tests on files opened in binary mode.

The new strategy allows you to write a single codebase that will work on Python 2 and 3.

Gotcha.. It's a compatibility module, bringing Py2's csv semantics into Py3. I believe it's a genuine bug that text-mode write does not work..

I will use your suggestion with conditional import, since python-cliff's usage of CSV is quite minimal (mostly output to text-mode stdout).

ryanhiebert commented 8 years ago

Gotcha.. It's a compatibility module, bringing Py2's csv semantics into Py3. I believe it's a genuine bug that text-mode write does not work..

I suspect that your usage of unicodecsv simply wasn't aware of what it was doing. In order for you to have written the code identically on Python 2 and 3, you would have written this, with the notably absent declaration of the encoding:

import unicodecsv as csv

writer = csv.writer(open('myfile.csv', 'w'))
writer.writerow(...)

The problem with the above code is that it previously had completely different semantics on Python 2 and 3.

On Python 2, csv.writer(file) is the same as csv.writer(file, encoding='utf-8'). So you didn't realize it, but your file was always being written in binary mode. Python 2's loose string semantics, though, would still let you pretend like it was text. But it wasn't, it was bytes.

On Python 3 though, the built-in csv module doesn't have an encoding argument. That means that if I was using that argument, it wouldn't work on Python 3, because the API is different.

ryanhiebert commented 8 years ago

You're using unicodecsv as a drop-in replacement for csv. The problem is that it's only a reasonable drop-in replacement for Python 2 csv. If it attempted to be a drop-in replacement for Python 3's as well, it would be a different API, as was already demonstrated. I don't see a reasonable way to make it be both compatible between Python 2 and 3 and be a drop-in replacement for both versions. That's because they have radically different semantics.

The new Python 3 work doesn't add Python 2's semantics to Python 3, but rather adds Python 3's clarity of purpose to the API presented on Python 2. The Python 2 API was the invariant that couldn't be changed, I just had to codify what that API was doing using Python 3's strict split between binary and text.

zarnovican commented 8 years ago

So you didn't realize it, but your file was always being written in binary mode.

Sorry, I should have been more explicit. When referring to binary vs text mode I was talking about the flag used to open the file (eg, "w" vs "wb"). Simple test (results are on my last comment):

#import unicodecsv as csv
import csv

f = open('myfile.csv', 'w')
#f = open('myfile.csv', 'wb')
ret = csv.writer(f).writerow(('a', 'b', '\u017d'))
f.close()

Py3 text-mode files have associated encoding f.encoding == 'UTF-8' (that's the default), while binary-mode files are just a stream of bytes. I reckon, that's why Py3 csv module does not have encoding= anymore. It is handled at file-level.

ryanhiebert commented 8 years ago

And I should have said "your file was always being written as if it were in binary mode". Since Python 2's text mode still gives byte strings (str), it can be treated like bytes.

jdunck commented 8 years ago

0.14.0 and 0.14.1 are now correctly tagged in git. Now that there's more than 1 contributor, I should be using a develop branch to make it clear what's a pre-release bump and the actual release...

ryanhiebert commented 8 years ago

I'm going to close this issue, as this bug is simply part of the design of the library. Feel free to let me know if you believe that is a mistake.