adamlwgriffiths opened 11 years ago
Well, crap. Any hints? I'll try to get to this within the next 20-30 hours.
The character in question is an e-acute, apparently from LATIN-1. This seems to be an issue with non-UTF-8 pages claiming to be UTF-8.
Best way might be for me to put a try / catch around my call to requests and handle this myself. If so, then this is an invalid bug =P.
Sure, your best bet is fixing it yourself if you need it within the next 8 hours; I am busy sleeping. And based on how bad I am at unicode, the fix might be quick for me to write tomorrow/whenever... or not.
Appreciate your promptness =). You're obviously on a different timezone, so just ignore these messages until you've got time. Apologies if they're waking you up with a notification.
The following seems to work, but may not be the correct fix.

io.py:

    def write_response(obj, return_string=True):
        <snip>
        # serialization["content"] = obj.content
        serialization["content"] = requests.utils.get_unicode_from_response(obj)
        <snip>
From the requests API docs http://docs.python-requests.org/en/latest/api/
requests.utils.get_unicode_from_response(r)
Returns the requested content back in unicode.
Parameters: r – Response object to get unicode content from.
Tried:
- charset from content-type
- every encoding from <meta ... charset=XXX>
- fall back and replace all unicode characters
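The fallback strategy described in those docs can be sketched roughly like this (a Python 3 approximation, not the library's actual code; the function name `best_effort_unicode` and the header-only charset sniffing are my simplifications, and the real `get_unicode_from_response` also inspects `<meta>` tags):

```python
import re

def best_effort_unicode(body, content_type=""):
    """Decode response bytes, preferring the declared charset."""
    # 1. charset from the Content-Type header, if any
    match = re.search(r"charset=([\w-]+)", content_type)
    if match:
        try:
            return body.decode(match.group(1))
        except (LookupError, UnicodeDecodeError):
            pass
    # 2. fall back and replace undecodable characters
    return body.decode("utf-8", errors="replace")

# The e-acute byte decodes cleanly when the charset is honoured:
print(best_effort_unicode(b"caf\xe9", "text/html; charset=ISO-8859-1"))  # café
```

With no usable charset the same byte survives only as the U+FFFD replacement character, which is the "replace all unicode characters" fallback in action.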
If it's not appropriate, I can perform the serialisation myself. Should just be a matter of:
- passing False for the 'return_string' parameter of write_response(...)
- replacing the 'content' value with the one returned by the above function
- json'ing the value
Edit: It seems, however, that once you deserialise the json and try to access the 'text' attribute, it causes an exception.
File "... /lib/python2.7/site-packages/requests/models.py", line 633, in text
content = str(self.content, errors='replace')
TypeError: decoding Unicode is not supported
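That traceback comes from `str(self.content, errors='replace')`, which is a *decode* call and therefore only accepts bytes; if the deserialised content is already a unicode string, the second decode blows up. The same failure mode reproduces in Python 3 (a standalone sketch, not the library's code path; Python 3 phrases the error as "decoding str is not supported"):

```python
raw = b"caf\xe9"  # e-acute in ISO-8859-1 / Latin-1

# Decoding bytes works fine:
text = str(raw, "iso-8859-1", errors="replace")
assert text == "caf\u00e9"

# Decoding an already-decoded string does not:
try:
    str(text, "iso-8859-1", errors="replace")
except TypeError as exc:
    print(exc)  # decoding str is not supported
```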
So one way around this (which I have been experimenting with in Betamax) is turning everything into base64-encoded text and using that for serialization. It isn't foolproof, but it might be more reliable than just serializing the Unicode content.
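Roughly what that base64 approach looks like (a minimal sketch using only the stdlib; the `record` layout and `"encoding"` key are illustrative, not Betamax's actual cassette format):

```python
import base64
import json

body = b"caf\xe9"  # raw Latin-1 bytes: not valid UTF-8
record = {
    "body": base64.b64encode(body).decode("ascii"),
    "encoding": "ISO-8859-1",
}
blob = json.dumps(record)  # base64 keeps the JSON plain ASCII

# Round-trip: the original bytes come back untouched,
# and the stored encoding tells us how to decode them.
loaded = json.loads(blob)
restored = base64.b64decode(loaded["body"])
assert restored == body
print(restored.decode(loaded["encoding"]))  # café
```

The trade-off is exactly the readability concern raised below: the serialised body is opaque base64 rather than human-readable text.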
Aside from this, @kanzure don't blame yourself for Python's string/Unicode API being awful and hard to use properly. 90% of us struggle with the same issues.
@kanzure this may be of interest to you.
@adamlwgriffiths you may be interested but it's a different tool (albeit related) and might give you a way to send an easy PR.
Interesting points. This would adversely affect the readability of serialised objects; I'm not sure if that's an issue.
It seems this is a pretty common issue. I just tried 'jsonpickle', and it triggers the same exception with this data. =/
At least the response.encoding is correct.
ISO-8859-1
https://en.wikipedia.org/wiki/ISO/IEC_8859-1 Wikipedia confirms that 0xe9 in ISO-8859-1 is definitely e-acute (https://en.wikipedia.org/wiki/%C3%89).
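A quick sanity check of that code point (a Python 3 sketch):

```python
# 0xE9 is e-acute (é, U+00E9) under ISO-8859-1 / Latin-1,
# but as a lone byte it is not valid UTF-8.
assert b"\xe9".decode("iso-8859-1") == "\u00e9"
print(b"\xe9".decode("iso-8859-1"))  # é
```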
The problem is that json is assuming utf-8. The response.encoding needs to be used (I think).
json.dumps(obj, encoding=response.encoding)
The problem with this is that json.loads requires the encoding to be passed in as well, so you need to know the encoding to de-serialise the response at a later time. Which is not very 'just works'.
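One way to avoid needing the encoding at both ends, sketched in Python 3 (where json.dumps no longer takes an encoding argument at all): decode the bytes to unicode *before* serialising, so the JSON itself is encoding-agnostic. The variable names here are illustrative:

```python
import json

raw = b"caf\xe9"           # body bytes in the page's real encoding
encoding = "ISO-8859-1"    # e.g. taken from response.encoding

# Decode up front; the JSON then stores real unicode text,
# and json.loads needs no encoding hint at deserialisation time.
blob = json.dumps({"content": raw.decode(encoding)})
restored = json.loads(blob)["content"]
assert restored == "caf\u00e9"  # é round-trips
```

This only pushes the problem around, of course: you still need the correct encoding at serialisation time, and the raw bytes are gone afterwards.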
The other possible method would be to drop the non-utf8 characters.
response.text.encode('utf-8', errors='ignore')
But then you'll cause problems with serialising pages in non-utf8 encodings.
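The character-dropping being warned about here is easiest to see on the decode side (a Python 3 sketch):

```python
# Latin-1 bytes for "café". If a response lies and claims UTF-8,
# decoding with errors='ignore' silently drops the e-acute byte.
raw = b"caf\xe9"

assert raw.decode("iso-8859-1") == "caf\u00e9"        # right encoding: é survives
assert raw.decode("utf-8", errors="ignore") == "caf"  # wrong encoding: é is gone
print(raw.decode("utf-8", errors="ignore"))  # caf
```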
I'm not sure what the proper fix is.
I've changed the flow of my code so this is no longer an issue for me.
I'm not sure how you'd fix it cleanly because I think you need the encoding at both serialisation and deserialisation time, which would change your API and cause a lot of issues.
Thanks for your help guys =)
The following test program causes an exception when serialising.