Closed gugek closed 6 years ago
I'm having the same problem (in py3) – put in unicode strings with non-ascii characters, get back unicode strings with weird escape sequences.
We're all a little confused about str / bytes, and supporting unicode on python 2 and 3 must be a nightmare, so I just wanted to offer my own explanation of why @gugek's fix works:
My understanding largely comes from Ch. 4 of this book: http://shop.oreilly.com/product/0636920032519.do
And please correct me if I'm wrong! I get as confused by unicode as everyone else!
Okay, it looks like sickle isn't at fault – at least in my case, requests is wrongly detecting ISO-8859-1 instead of UTF-8.
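To illustrate what that kind of misdetection does to the data, here is a minimal sketch using only the standard library. No requests call is involved, and the sample string is made up; it just decodes UTF-8 bytes as ISO-8859-1, which is roughly what happens when the wrong charset is assumed for the response text:

```python
# Hypothetical sample text containing a right single quote (U+2019).
original = "Ireland\u2019s history"

# What a server would actually send: UTF-8 bytes.
wire_bytes = original.encode("utf-8")

# Decoding with the wrong charset never raises, because every byte maps to
# some ISO-8859-1 character, but the three UTF-8 bytes of U+2019 turn into
# three separate characters.
misdecoded = wire_bytes.decode("iso-8859-1")

# Decoding with the right charset recovers the original text.
correct = wire_bytes.decode("utf-8")

print(repr(misdecoded))
print(repr(correct))
```

The damage is still reversible at this point: `misdecoded.encode("iso-8859-1").decode("utf-8")` gives back the original string, as long as nothing has re-encoded the text as UTF-8 in between.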
Also, I didn't realize that lxml requires bytes – so that explains why Sickle encodes its input. So sorry for getting all pedantic about that.
Fixed in the latest release 0.6.3
Is there a reason that the utf-8 (str/unicode) text is being used in the XML property/method in response.py?
I'm finding lxml is having problems with unicode strings (bytes in py3) that include external entities or non-ascii/latin characters.
Here is an example record: there are right single quotes (\u2019) embedded in there.
lxml does some optimizations in py2 where it sometimes will output a unicode object and other times a text one. In py3 it will always output unicode. But for whatever reason, when you get the text from the requests response, encode it back to str/bytes, and then parse it with lxml.etree, some unicode (and maybe external entities) are getting processed incorrectly.
In sickle/sickle/response.py:
I think your tests are passing because you are reading from a file object which is directly providing string/byte, though you may not have coverage of anything past the ASCII character space.
I think the simple fix is in sickle/sickle/response.py to just parse the response content rather than encoding the text, which is already being processed by requests and which needs to be encoded to be handled by lxml.etree.
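A rough sketch of the difference between the two approaches, under some loud assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree so it runs without third-party packages, the XML document is made up, and `raw` and `wrong_text` only simulate what the response content and the response text would hold when the charset is guessed wrong. It is not Sickle's actual code:

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)

# Hypothetical response body: UTF-8 bytes, as the server would send them.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<title>Ireland\u2019s history</title>"
).encode("utf-8")

# Simulates the decoded text when the charset is wrongly taken to be
# ISO-8859-1 instead of UTF-8.
wrong_text = raw.decode("iso-8859-1")

# Approach 1: encode the (mis)decoded text back to bytes, then parse.
# The wrong decoding is now baked into the bytes the parser sees.
from_text = ET.fromstring(wrong_text.encode("utf-8")).text

# Approach 2: parse the raw bytes and let the parser apply the encoding
# declared in the XML itself.
from_content = ET.fromstring(raw).text

print(repr(from_text))
print(repr(from_content))
```

Parsing the raw bytes sidesteps the charset guess entirely, since the XML declaration tells the parser how to decode the document.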