chunk size error for unicode content

ilovenwd commented 10 years ago

I found this code in requests/adapters.py (latest version installed by pip):

https://github.com/kennethreitz/requests/blob/master/requests/adapters.py#L383

for i in request.body:                                                     
       low_conn.send(hex(len(i))[2:].encode('utf-8'))

if i is a unicode, the low_conn send utf8 encoding byte string, but the chunk size is wrong. I think it should change to:

if isinstance(i, unicode):
  i = i.encode('utf8')
low_conn.send(hex(len(i))[2:].encode('utf-8'))

Lukasa commented 10 years ago

Thanks for raising this!

I don't think we should do that, however. If you've passed us a unicode string we should not be guessing at what text encoding you want to use in the body. I think I'd be happier not accepting unicode at all in this case, rather than guessing that 'UTF-8' is what is meant.

This is a bit of a thorny issue though, because that reduces our compatibility: we've implicitly allowed it in the past. Maybe force a decode to ASCII instead? (A choice which is almost certain to work.)

Lukasa commented 10 years ago

@sigmavirus24, can I get your thoughts here?

sigmavirus24 commented 10 years ago

So while my instinct is to insist the user give us everything as a bytes object (and I don't think it's entirely unreasonable), we actively encourage users to do:

requests.post(url,
              data=json.dumps({'my': 'json', 'data': 'here'}),
              headers={'Content-Type': 'application/json'})

If we don't handle this in requests, at least for some deprecation period, we will be forcing users to do:

requests.post(url,
              data=json.dumps({'my': 'json', 'data': 'here'}).encode('utf-8'),
              headers={'Content-Type': 'application/json'})

I'm sure the number of people passing JSON to data is not insignificant. I guess I'm in favor of using a Warning and transitioning to forcing this. This use case I outlined will also become obsolete soon because requests will be handling json.dumps for users. Which reminds me...

ilovenwd commented 10 years ago

UTF8 is the most reasonable default. Besides, python3 string defaults to unicode, many data read from db/http is auto convert to unicode(default utf8). so, why not accept utf8 as default unicode encoding? The python standard library ALREADY AUTO convert unicode to utf8 when write to socket. (that why the chunk size is wrong, but the chunk body is ok)

@sigmavirus24 this bug only appears when using generator as data (chunked encoding) post data=unicode works because

The python standard library ALREADY AUTO convert unicode to utf8 when write to socket.

Lukasa commented 10 years ago

The python standard library ALREADY AUTO convert unicode to utf8 when write to socket. (that why the chunk size is wrong, not the chunk body)

Not in Python 3 it doesn't:

Python 3.4.1 (default, Aug 25 2014, 11:56:02) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.create_connection(('mkcert.org', 80))
>>> s.write("unicode string")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface

In fact, it doesn't even work in Python 2 on my machine:

Python 2.7.8 (default, Aug 25 2014, 11:53:26) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.create_connection(('mkcert.org', 80))
>>> s.send(u"unicode string with ÜBİTAK")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 20: ordinal not in range(128)

The answer to 'why not accept utf8 as default encoding' is because that mistake is exactly what causes this problem in the first place. There is no 'default encoding', there's only right and wrong. We cannot and should not guess in this regard. It makes no sense to send unicode bytes on a socket.

Sometimes, we can guess. JSON has a set of well-defined text encodings, so we can pick one of those. But you could be sending text in any encoding, and we have no way to guess. Getting weird server errors is worse than us blowing up and saying "you have to give us binary data!"

sigmavirus24 commented 10 years ago

Getting weird server errors is worse than us blowing up and saying "you have to give us binary data!"

Yeah I'm surprised we haven't had more bug reports about this frankly. Like I said, I think we should follow a deprecation pattern for this behaviour for 2.5 and 2.6, then make it default in 2.7 (or 3.0).

I think we should issue a DeprecationWarning when we receive data whose type is not bytes. We should then immediately try to encode the data for the user.
For the case that @ilovenwd is encountering (using a generator) we should issue 1 deprecation warning after the first chunk and then encode the data for the user.
In the case of the user passing a file(-like) object to data, we should check the mode to ensure it was opened with 'b' or is an instance/subclass of BytesIO. This case is tougher because some portions of it may be handled by the generator case (i.e., some users don't define __len__ on BytesIO subclasses and so they're treated as generators.).

Once we have a json parameter, we can confidently handle that ourselves, for the user.

psf / requests

chunk size error for unicode content #2235