Issue with Non-Ascii Characters

roskakori / cutplace

validate data stored in CSV, PRN, ODS or Excel files

http://cutplace.readthedocs.org/

GNU Lesser General Public License v3.0


Issue with Non-Ascii Characters #52

Closed by reggiegentle 9 years ago

reggiegentle commented 11 years ago

cutplace seems to bomb out when a source file contains non-ASCII characters. For example, if the file has an e with a tilde over it, cutplace just bombs out. Is this as designed, could it be addressed, or is there a potential workaround?

roskakori commented 10 years ago

In what context are you using a non-ASCII character?

If the character is part of the input file, you have to specify the encoding in the ICD as described in http://roskakori.github.io/cutplace/writing-an-icd.html#data-formats

If the character is part of the ICD (e.g. a comment in the ICD), no special steps are required unless you use plain CSV as the format for the ICD itself. In that case, specify the encoding of the ICD using --icd-encoding as described in http://roskakori.github.io/cutplace/command-line-usage.html#icds-containing-non-ascii-characters.

In my experience the most common encodings for western languages are cp1252, iso-8859-15 and utf-8.
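If it is unclear which of these encodings a data file uses, a quick check with plain Python can narrow it down. The following is only a sketch that is independent of cutplace: the helper name `decodable_encodings` and the sample bytes are made up for illustration. Note that a successful decode does not prove the encoding is the right one, it only rules out the ones that fail.

```python
# Candidate encodings taken from the comment above.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "iso-8859-15")


def decodable_encodings(data, candidates=CANDIDATE_ENCODINGS):
    """Return the candidate encodings that decode ``data`` without error.

    Hypothetical helper for illustration; it is not part of cutplace.
    """
    result = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence.
            continue
        result.append(name)
    return result


# Sample bytes: "señor €5" as written by a program using cp1252.
sample = "se\u00f1or \u20ac5".encode("cp1252")

print(decodable_encodings(sample))

# iso-8859-15 decodes the bytes without an error, but the euro sign
# turns into a control character, so the text is silently wrong.
print(repr(sample.decode("iso-8859-15")))
```

Here utf-8 is rejected outright, while both cp1252 and iso-8859-15 decode without complaint, so the result still needs a visual check of a few non-ASCII values before putting the encoding into the ICD.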

roskakori commented 9 years ago

The code base for 0.8.0 should be a lot more solid concerning UnicodeError, but this still needs to be tested consistently. The goal is that under Python 3 every UnicodeError results in an at least somewhat helpful DataError that points to the cell (for Excel and ODS) or line (for delimited and fixed) that caused the error. With Python 2 this goal might be harder to reach, so a few cases may remain a major PITA - which is just another reason to upgrade to Python 3.
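To illustrate the idea for line-based formats, here is a rough Python 3 sketch. It is not cutplace's actual implementation: `LineDecodeError` and `decoded_lines` are hypothetical names, and the real DataError may carry different information. The sketch also assumes an ASCII-compatible encoding, so that splitting the raw bytes at newlines is safe.

```python
import io


class LineDecodeError(ValueError):
    """Hypothetical stand-in for a data error that knows its line number."""

    def __init__(self, line_number, cause):
        super().__init__(
            "line %d: cannot decode byte at offset %d using %s: %s"
            % (line_number, cause.start, cause.encoding, cause.reason)
        )
        self.line_number = line_number


def decoded_lines(binary_stream, encoding):
    """Yield decoded lines; report the line number when decoding fails."""
    for line_number, raw_line in enumerate(binary_stream, start=1):
        try:
            yield raw_line.decode(encoding)
        except UnicodeDecodeError as error:
            # Wrap the low-level error so the user sees where it happened.
            raise LineDecodeError(line_number, error) from error


# The second line contains the cp1252 byte for "ñ", which is invalid UTF-8.
data = b"id,name\n1,se\xf1or\n"

try:
    for line in decoded_lines(io.BytesIO(data), "utf-8"):
        pass
except LineDecodeError as error:
    print(error)
```

Reading the stream in binary mode and decoding each line separately is what makes it possible to attach the line number; a text-mode file would raise the UnicodeDecodeError without that context.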

We should consider exhaustive testing for one of the next sprints.

roskakori commented 9 years ago

With the test improvements of the past weeks, all readers in rowio and validio seem to be able to process Unicode, so I consider this solved. If something still slipped through the cracks, we'll address it when it shows up.