kanzure / python-requestions

Serialization for python-requests based on JSON.

Doesn't handle unicode in some cases #4

Open adamlwgriffiths opened 11 years ago

adamlwgriffiths commented 11 years ago

The following test program causes an exception when serialising.

import requests
import requestions

url = 'http://www.amazon.com/review/RAYTXRF3122TO'
response = requests.get(url)
sresponse = requestions.write_response(response)
response = requestions.read_response(sresponse)
print response.text
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    requestions.write_response(response)
  File "/media/sf_workspace/VirtualEnvs/requestions/src/python-requestions/requestions/io.py", line 79, in write_response
    return json.dumps(convert_to_dict(serialization))
  File "/usr/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 81636: invalid continuation byte
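For reference, the failure can be reproduced without any network access: 0xe9 is a valid ISO-8859-1 byte but, followed by a non-continuation byte, an invalid UTF-8 sequence. A minimal sketch (Python 3 shown; the mechanics are the same in Python 2):

```python
# 0xe9 is 'é' in ISO-8859-1 but, followed by '!', an invalid
# continuation byte in UTF-8 -- exactly the error in the traceback.
raw = b"caf\xe9!"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xe9 ...

# Decoding with the page's real charset succeeds.
print(raw.decode("iso-8859-1"))  # café!
```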
kanzure commented 11 years ago

Well, crap. Any hints? I'll try to get to this within the next 20-30 hours.

adamlwgriffiths commented 11 years ago

The character in question is an e-acute, apparently from Latin-1. This seems to be an issue with non-UTF-8 pages claiming to be UTF-8.

adamlwgriffiths commented 11 years ago

The best way might be for me to put a try/catch around my call to requestions and handle this myself. If so, then this is an invalid bug =P.

kanzure commented 11 years ago

Sure, your best bet is fixing it yourself if you need it within the next 8 hours. I am busy sleeping. And based on how bad I am at unicode, a fix might be quick for me to write tomorrow/whenever... or not.

adamlwgriffiths commented 11 years ago

Appreciate your promptness =). You're obviously on a different timezone, so just ignore these messages until you've got time. Apologies if they're waking you up with a notification.

The following seems to work, but may not be the correct fix.

io.py

def write_response(obj, return_string=True):
    <snip>
    #serialization["content"] = obj.content
    serialization["content"] = requests.utils.get_unicode_from_response(obj)
    <snip>

From the requests API docs http://docs.python-requests.org/en/latest/api/

requests.utils.get_unicode_from_response(r)
Returns the requested content back in unicode.

Parameters: r – Response object to get unicode content from.
Tried:

charset from content-type
every encodings from <meta ... charset=XXX>
fall back and replace all unicode characters
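The behaviour described in that doc excerpt amounts to "try the declared charset, then fall back". A rough, hypothetical equivalent (the real implementation lives in requests.utils; the function name here is illustrative):

```python
def unicode_from_response_sketch(content, declared_encoding):
    """Rough sketch of the charset-guessing described above:
    try the declared charset first, then fall back to UTF-8 with
    replacement characters instead of raising."""
    if declared_encoding:
        try:
            return content.decode(declared_encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    # Fall back and replace undecodable bytes rather than raising.
    return content.decode("utf-8", errors="replace")
```

So unicode_from_response_sketch(b"caf\xe9!", "iso-8859-1") decodes cleanly, while passing no charset falls back to a replacement character.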

If it's not appropriate, I can perform the serialisation myself. It should just be a matter of:

* passing False for the 'return_string' parameter of write_response(...)
* replacing the 'content' value with the one returned by the above function
* JSON-encoding the value
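Sketched out, with a plain dict standing in for whatever write_response(obj, return_string=False) actually returns (an assumption on my part):

```python
import json

def serialise_with_unicode_content(serialization, unicode_content):
    """Swap in the pre-decoded unicode body, then JSON-encode the
    dict ourselves.  `serialization` stands in for the dict that
    write_response(obj, return_string=False) is assumed to return."""
    serialization = dict(serialization)        # don't mutate the caller's copy
    serialization["content"] = unicode_content
    return json.dumps(serialization)

payload = serialise_with_unicode_content(
    {"status_code": 200, "content": None}, "caf\u00e9")
```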

Edit: It seems, however, that once I deserialise the JSON and try to access the 'text' attribute, it raises an exception.

  File "... /lib/python2.7/site-packages/requests/models.py", line 633, in text
    content = str(self.content, errors='replace')
TypeError: decoding Unicode is not supported
sigmavirus24 commented 11 years ago

So one way around this (which I have been experimenting with in Betamax) is turning everything into base64-encoded text and using that for serialization. It isn't foolproof, but it might be more reliable than just serializing the Unicode content.
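A minimal sketch of that idea, assuming the raw body bytes are available: base64 output is pure ASCII, so it is JSON-safe whatever the page's real encoding was, at the cost of readability.

```python
import base64
import json

def dump_body(body):
    """Store the raw bytes as base64 text: pure ASCII, so
    json.dumps never sees undecodable bytes."""
    return json.dumps({"content": base64.b64encode(body).decode("ascii")})

def load_body(payload):
    """Reverse the round trip: JSON -> base64 text -> raw bytes."""
    return base64.b64decode(json.loads(payload)["content"])
```

The round trip preserves the bytes exactly, so load_body(dump_body(body)) gives back the original body unchanged.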

Aside from this, @kanzure don't blame yourself for Python's string/Unicode API being awful and hard to use properly. 90% of us struggle with the same issues.

sigmavirus24 commented 11 years ago

@kanzure this may be of interest to you.

@adamlwgriffiths you may be interested but it's a different tool (albeit related) and might give you a way to send an easy PR.

adamlwgriffiths commented 11 years ago

Interesting points. This would adversely affect the readability of serialised objects; I'm not sure if that's an issue.

It seems this is a pretty common issue. I just tried 'jsonpickle', and it triggers the same exception with this data. =/

adamlwgriffiths commented 11 years ago

At least the response.encoding is correct.

ISO-8859-1

https://en.wikipedia.org/wiki/ISO/IEC_8859-1 Wikipedia confirms that 0xe9 in ISO-8859-1 is definitely e-acute (https://en.wikipedia.org/wiki/%C3%89).

The problem is that json is assuming utf-8. The response.encoding needs to be used (I think).

json.dumps(obj, encoding=response.encoding)

The problem with this is that json.loads requires the encoding to be passed in as well, so you need to know the encoding to de-serialise the response at a later time. Which is not very 'just works'.
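For what it's worth, Python 3's json module dropped the encoding parameter entirely, which suggests the cleaner shape of this idea: decode with the known charset once, on the write side, and the stored JSON is plain text that needs no charset to read back. A sketch:

```python
import json

def dumps_with_charset(body, charset):
    # Decode once, up front; after this the JSON holds plain text
    # and json.loads needs no encoding hint at all.
    return json.dumps({"content": body.decode(charset)})

payload = dumps_with_charset(b"caf\xe9!", "iso-8859-1")
restored = json.loads(payload)["content"]  # 'café!', no charset needed
```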

The other possible method would be to drop the bytes that aren't valid UTF-8 (this has to be done on the raw bytes, since response.text is already unicode):

response.content.decode('utf-8', errors='ignore')

But then you'll cause problems with serialising pages in non-utf8 encodings.
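To make that loss concrete, a sketch on raw bytes (errors='ignore' silently skips the sequences that are not valid UTF-8):

```python
raw = b"caf\xe9 latte"  # ISO-8859-1 body mislabelled as UTF-8

lossy = raw.decode("utf-8", errors="ignore")
print(lossy)  # 'caf latte' -- the 0xe9 byte is silently dropped
```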

I'm not sure what the proper fix is.

adamlwgriffiths commented 11 years ago

I've changed the flow of my code so this is no longer an issue for me.

I'm not sure how you'd fix it cleanly because I think you need the encoding at both serialisation and deserialisation time, which would change your API and cause a lot of issues.

Thanks for your help guys =)