Open ilovenwd opened 10 years ago
Thanks for raising this!
I don't think we should do that, however. If you've passed us a unicode string we should not be guessing at what text encoding you want to use in the body. I think I'd be happier not accepting unicode at all in this case, rather than guessing that 'UTF-8' is what is meant.
This is a bit of a thorny issue though, because that reduces our compatibility: we've implicitly allowed it in the past. Maybe force an encode to ASCII instead? (A choice which is almost certain to work.)
@sigmavirus24, can I get your thoughts here?
So while my instinct is to insist the user give us everything as a bytes object (and I don't think it's entirely unreasonable), we actively encourage users to do:
requests.post(url,
              data=json.dumps({'my': 'json', 'data': 'here'}),
              headers={'Content-Type': 'application/json'})
If we don't handle this in requests, at least for some deprecation period, we will be forcing users to do:
requests.post(url,
              data=json.dumps({'my': 'json', 'data': 'here'}).encode('utf-8'),
              headers={'Content-Type': 'application/json'})
I'm sure the number of people passing JSON to data is not insignificant. I guess I'm in favor of using a Warning and transitioning to forcing this. This use case I outlined will also become obsolete soon because requests will be handling json.dumps for users. Which reminds me...
UTF-8 is the most reasonable default. Besides, Python 3 strings are unicode by default, and much of the data read from a db or over HTTP is automatically converted to unicode (UTF-8 by default). So why not accept UTF-8 as the default unicode encoding? The Python standard library ALREADY AUTOMATICALLY converts unicode to UTF-8 when writing to a socket. (That's why the chunk size is wrong, but the chunk body is OK.)
@sigmavirus24 this bug only appears when using a generator as data (chunked encoding). Posting data=unicode works because:
The Python standard library ALREADY AUTOMATICALLY converts unicode to UTF-8 when writing to a socket.
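To make the chunk-size complaint concrete, here is a minimal sketch of how a chunk length computed from a text string differs from the length of its encoded bytes. The helper frame_chunk is hypothetical and is not the code in requests/adapters.py; the choice of UTF-8 is an assumption taken from the comment above, not something the library guarantees.

```python
def frame_chunk(chunk):
    """Hypothetical helper: frame one HTTP/1.1 chunk.

    The size line must count the bytes that go on the wire,
    so text is encoded first and the length is taken afterwards.
    """
    if not isinstance(chunk, bytes):
        chunk = chunk.encode("utf-8")  # assumption: UTF-8 is what the caller meant
    size_line = hex(len(chunk))[2:].encode("ascii")
    return size_line + b"\r\n" + chunk + b"\r\n"


text = u"\u00dcB\u0130TAK"        # "ÜBİTAK"
encoded = text.encode("utf-8")

char_count = len(text)            # number of code points
byte_count = len(encoded)         # number of bytes actually sent

framed = frame_chunk(text)
```

If the size line were built from char_count while the body was sent as encoded, the server would read too few bytes for the chunk, which matches the "chunk size is wrong, chunk body is OK" symptom described here.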
The Python standard library ALREADY AUTOMATICALLY converts unicode to UTF-8 when writing to a socket. (That's why the chunk size is wrong, not the chunk body.)
Not in Python 3 it doesn't:
Python 3.4.1 (default, Aug 25 2014, 11:56:02)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.create_connection(('mkcert.org', 80))
>>> s.send("unicode string")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
In fact, it doesn't even work in Python 2 on my machine:
Python 2.7.8 (default, Aug 25 2014, 11:53:26)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.create_connection(('mkcert.org', 80))
>>> s.send(u"unicode string with ÜBİTAK")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 20: ordinal not in range(128)
The answer to 'why not accept utf8 as default encoding' is because that mistake is exactly what causes this problem in the first place. There is no 'default encoding', there's only right and wrong. We cannot and should not guess in this regard. It makes no sense to send unicode bytes on a socket.
Sometimes, we can guess. JSON has a set of well-defined text encodings, so we can pick one of those. But you could be sending text in any encoding, and we have no way to guess. Getting weird server errors is worse than us blowing up and saying "you have to give us binary data!"
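As a small illustration of why guessing is risky, the sketch below encodes the same text under a few encodings and shows that the bytes differ, or that encoding fails outright. It also shows the JSON case, where the body can be serialized and encoded without guessing. The sample string and variable names are only for illustration.

```python
import json

text = u"\u00dcB\u0130TAK"  # "ÜBİTAK"

# The same text yields different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Some encodings cannot represent the text at all.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# JSON is the easier case: json.dumps escapes non-ASCII characters by
# default, so encoding the result as UTF-8 is unambiguous.
json_body = json.dumps({"name": text}).encode("utf-8")
```

A server expecting one of these encodings and receiving another would see garbage rather than a clear error, which is the "weird server errors" case.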
Getting weird server errors is worse than us blowing up and saying "you have to give us binary data!"
Yeah I'm surprised we haven't had more bug reports about this frankly. Like I said, I think we should follow a deprecation pattern for this behaviour for 2.5 and 2.6, then make it default in 2.7 (or 3.0).
Emit a DeprecationWarning when we receive data whose type is not bytes. We should then immediately try to encode the data for the user.
When a file-like object is passed as data, we should check the mode to ensure it was opened with 'b' or is an instance/subclass of BytesIO. This case is tougher because some portions of it may be handled by the generator case (i.e., some users don't define __len__ on BytesIO subclasses and so they're treated as generators).
Once we have a json parameter, we can confidently handle that ourselves, for the user.
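The deprecation pattern proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: coerce_body and is_binary_file are hypothetical names, not requests APIs, the sketch targets Python 3 only, and the UTF-8 fallback is an assumption since the thread does not settle which encoding to use.

```python
import io
import warnings


def coerce_body(data, encoding="utf-8"):
    """Hypothetical helper: return bytes, warning if text was passed."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, str):
        warnings.warn(
            "passing text as data is deprecated; pass bytes instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: fall back to UTF-8 during the deprecation period.
        return data.encode(encoding)
    raise TypeError("data must be bytes or str, got %r" % type(data))


def is_binary_file(fobj):
    """Hypothetical check: BytesIO (or a subclass), or a mode containing 'b'."""
    if isinstance(fobj, io.BytesIO):
        return True
    return "b" in getattr(fobj, "mode", "")
```

For example, coerce_body("text") would return b"text" and emit a DeprecationWarning, while coerce_body(b"text") passes through silently. The generator case mentioned above is not covered by this sketch.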
I found this code in requests/adapters.py (latest version installed by pip):
https://github.com/kennethreitz/requests/blob/master/requests/adapters.py#L383
If i is a unicode string, low_conn sends the UTF-8 encoded byte string, but the chunk size is wrong. I think it should change to: