Correct handling of Unicode in py2

reenberg commented 7 years ago

I'm not a py3 coder, so this is regarding py2.

When trying to include a .csv file which is utf-8 encoded, the filter fails miserably as panflute expects unicode data (it calls text.encode('utf-8') on line 338 of panflute/tools.py)

You are using the built in csv module, which clearly states in the docs, that "The csv module doesn’t directly support reading and writing Unicode [...]". Aka, you should convert the read data to unicode before returning it as raw_table_list in read_data() at line 197 of pantable.py.

Or perhaps more preferrable, just use the unicodecsv module, which claims to be a "drop-in replacement for Python 2’s csv module which supports unicode strings without a hassle."

I tested unicodecsv in a local checkout, and it seems to work flawlessly. I just added import unicodecsv as csv, not caring about anything than py2 :)

ickc commented 7 years ago

A quick look into the unicodecsv's README seems to suggest backport.csv is a better fit here. I mostly use python 3 (and code in python 3 while pasteurized into python 2). So a backport of csv to python 2 seems better.

ickc commented 7 years ago

I'm thinking if we should avoid the extra dependency for Python 3 users. Do you know a quick and clean way to do so?

reenberg commented 7 years ago

I don't have any preferences for which module. I just ended up not being able to load in my .csv files when they contained Danish language instead of English, which I had primarily used this filter for in the past.

Well we could obviously do a conditional import of the different modules, but that doesn't take care of the dependency in setup.py. I don't have a solution on top of my head for doing conditional dependencies in there.

I thought the py3 csv module had the same issues as the py2, but it seems that it was only an issue in py3.0, and it was fixed in 3.1.

It seems that the backport.csv module is a pure python implementation, meaning it is slow (according to the author), However I don't know how bit of a deal this is in practice.

I could try and make a proposal using this and the io.open() as changes instead? Also trying to figure out how to deal with setup.py. Alternatively we could just try and import it, and if it works, then great, if not then just use the default csv module. This way it the user install the module herself, then it will work and the setup.py file wouldn't polute py3 installations.

reenberg commented 7 years ago

Dealing with this in setup.py actually seems quite easy. Environment Markers (PEP 508) is designed for this.

However there seems to be some fuzz about old versions of setuptools. Instead of specifying the environment markers in install_requires, it should be set in extra_requires as this is supported since setuptools 18 (e.g., here)

It seems you need version 20.6.8 (May 2016) for support to be fully functional in install_requires.

reenberg commented 7 years ago

See https://github.com/reenberg/pantable/commit/7e3fa9467625d135f7a0aeace292c4f5bb825ed3 for an initial test at using environment markers. It works like a charm. And I don't think it is unreasonable to depend on a fairly new version of setuptools.

reenberg commented 7 years ago

@ickc And with the added backports.csv module instead for py2 (https://github.com/reenberg/pantable/commit/bb13cd2dfd731780015ed9ec76853f5a5b6243ba).

It Actually simplified the code a bit, as there wasn't a need for differentiating between io.BytesIO and io.StringIO any more.

However the tests seems to fail on python 2, as some of the parameters is unicode instead of str:

- [...] TableRow(TableCell(Para(Math(E=mc^2; format=u'InlineMath'))) [...]
?
+ [...] TableRow(TableCell(Para(Math(E=mc^2; format='InlineMath'))) [...]

However this is an issue in the master branch as well, so doesn't seem related to what i have changed.

See full log: pantable.test.txt

Should I make a PR for this, or do you see anything that needs changing? It has minimum impact on py3 as requested.

reenberg commented 7 years ago

Actually I can see that something changed, since the test_read_data now also fails for assert read_data(True, '') is None in py3, which it doesn't on master, due to the fact that i removed the str() call inside io.open(str(include)).

Is there a particular reason why this is done so? Can there be any valid ways to actually sneak a bool or anything else than a string into this variable? It comes from the yaml meta-data block.
So you would have to pass something else than a string here, for example a list or something, but calling str() on that would still yield something that is not useful.

ickc commented 7 years ago

Also see #21

ickc commented 7 years ago

@reenberg, please check pantable v0.11 in #25 fixes your problem. Thanks.

ickc / pantable

Correct handling of Unicode in py2 #14