Closed michaelhelmick closed 11 years ago
So here's my first guess (but I'll obviously look into this more): I think that OAuth1 might be generating unicode and opening the file in binary form will cause this issue due to how httplib sends the message body. Also this may be related to #1250.
@Lukasa, opinions?
That seems like a reasonable diagnosis. We get lots of obscure bugs in Requests from problems with encoding, I'm adding it to my list of things I need to do. =P
still no fix? :(
Virtually all maintained Twitter Python libraries have migrated to requests > 1.0, and posting images is broken. If you want a very specific example, try twython with its twythonAPI.updateProfileImage(open('file','r')) method, where this bug is causing pain.
Ok, so I've tracked this down. Gimme a second to write a fix.
@grillermo it isn't a matter of reproducing the bug or not believing you. The stack trace is fairly clear on the matter. The problem is that we're all quite busy.
So, really, this is because oauthlib is converting all the headers to unicode objects. We're then concatenating these unicode objects with the bytes of the file. Python tries to implicitly decode the bytes into unicode using the locale default codec, and obviously fails.
We can do the 'easy' fix and have requests-oauthlib just encode the headers using Latin-1, but that defers the problem. Alternatively, we can do the 'right' thing and take control of header encoding ourselves. I don't know if Kenneth is up for that, though. @kennethreitz, thoughts?
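For illustration, the 'easy' fix could look roughly like the sketch below. It is written for Python 3, where `str` plays the role of Python 2's `unicode`. The helper name `encode_headers_latin1` and the sample header are hypothetical, and this is not the actual requests-oauthlib patch.

```python
def encode_headers_latin1(headers):
    """Return a new dict with text keys/values encoded as ISO-8859-1 bytes.

    Hypothetical helper for illustration only; bytes entries pass through.
    """
    encoded = {}
    for name, value in headers.items():
        if isinstance(name, str):
            name = name.encode("latin-1")
        if isinstance(value, str):
            value = value.encode("latin-1")
        encoded[name] = value
    return encoded


# Made-up header standing in for what an OAuth1 signer might produce.
signed = {"Authorization": 'OAuth oauth_nonce="abc"'}
print(encode_headers_latin1(signed))
```

A value that cannot be represented in Latin-1 raises `UnicodeEncodeError` at this point, which at least fails before any file bytes are touched.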
Just spoke with Kenneth, and he and I agree that we should fix requests-oauthlib. This is no longer blocking v1.2.
@Lukasa We recently had a discussion about header encoding in shazow/urllib3#164 at which you might want to look.
Thanks Thomas. I was going to mention that when I got home.
So wait, urllib3 is expecting us to provide unicode objects, not bytes?
Cc @shazow
I think the strategy, as with the rest of Python, is to use the appropriate type where appropriate. I would argue that it makes more sense for header keys to be strings rather than bytes. Is there a counter-argument? (Things like request body should definitely be bytes.)
My position on the matter would be that headers have a defined encoding on the wire (ISO-8859-1), which means that the only valid headers are ones that can be encoded in that encoding. You can't send strings on the wire, you can only send bytes, and the user shouldn't have to know what bytes those are. I'm happy to leave that encoding to urllib3 though. =)
Incidentally, if urllib3 is passing unicode headers through to httplib without encoding them it might be the cause of our issue.
Further debugging suggests the problem is in the interface between urllib3 and httplib. This exact problem can be reproduced using the following short program:
import httplib
conn = httplib.HTTPConnection('httpbin.org', 80)
conn.request('POST', u'/post', '\xff', {'test': 'value'}) # Exception here.
Any unicode value, whether in the method, the url, or the keys/values of the headers dict, will cause the entire body of the message to be 'promoted' to a unicode string. This is fine unless you are uploading a file that isn't ascii text, which might contain out-of-range bytes. Exceptions will then be dramatically thrown.
I think either urllib3 or requests needs to ensure that by this stage, everything is bytes.
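The 'promotion' described above cannot be reproduced directly on Python 3, which refuses to mix `str` and `bytes`. The sketch below spells out by hand what Python 2 did implicitly, assuming ASCII as the default codec. `py2_style_concat` is a made-up name, not anything in httplib.

```python
def py2_style_concat(head, body, default_encoding="ascii"):
    """Mimic Python 2's `unicode + str` by decoding the bytes side first."""
    if isinstance(head, str) and isinstance(body, bytes):
        # This is the step Python 2 performed without being asked.
        body = body.decode(default_encoding)
    return head + body


request_head = "POST /post HTTP/1.1\r\nHost: httpbin.org\r\n\r\n"

# ASCII-only body: the promotion succeeds and nobody notices.
ok = py2_style_concat(request_head, b"hello")

# Body with a byte >= 0x80 (JPEG data starts with 0xff): the promotion fails.
try:
    py2_style_concat(request_head, b"\xff\xd8\xff")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print("failed:", exc)
```

If every piece is already bytes, the branch is never taken and the concatenation is a plain byte join, which is the state the comment above asks for.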
If we ensure everything is bytes, this will work well for Python 2 because strs are bytes objects. In Python 3 this seems to produce an issue like @t-8ch mentioned. Naturally it's perfectly fine for there to be multiple header values and there are no bizarre characters in headers (and cannot be if I remember the spec properly) so the coercion to whatever will be fine. You might think this falls on our shoulders because it doesn't seem that too many urllib3 users have reported this issue, but you're wrong.
The problem with doing this is exactly the case where we're reading binary data, which is a very common use case. If we're provided a file (or file-like object), we have no way of knowing if it's binary data or not, and images and the like can't be coerced to text. This makes me think that the burden lies on urllib3 to coerce everything together.
Either way, I feel obligated to leave this behind.
Wouldn't this mean urllib3 has to mess with Content-Length? (and maybe other things I am not aware of)
(Python 3):
>>> requests.post('http://httpbin.org/post', data='u').json()
{
'data': 'u', # data looks correct
'headers': {
'Content-Length': '1',
# [..]
},
}
# note the "ü" v
>>> requests.post('http://httpbin.org/post', data='ü').json()
{
'data': 'data:application/octet-stream;base64,/A==', # ??
'headers': {
'Content-Length': '1',
# [..]
}
# [..]
}
>>> requests.post('http://httpbin.org/post', data='ü'.encode()).json()
{
'data': 'ü', # works
'headers': {
'Content-Length': '2',
# [..]
}
# [..]
}
$ curl --data-binary ü http://httpbin.org/post
{
"data": "\u00fc" # == 'ü'
"headers": {
"Content-Length": "2",
# [..]
},
# [..]
}
OCTET = <any 8-bit sequence of data>
# [..]
The Content-Length entity-header field indicates the size of the
entity-body, in decimal number of OCTETs, sent to the recipient or,
in the case of the HEAD method, the size of the entity-body that
would have been sent had the request been a GET.
I disagree with your assessment of correct. =)
Content-Length, as you rightly pointed out, asks for the length of the data in octets. The unicode string u'a' (using Python 2.7 notation to avoid ambiguity) does not have a length in octets, because it's unicode. Only encoded text has any octet-based length. For example:
>>> len(u'a'.encode('utf-8'))
1
>>> len(u'a'.encode('utf-16')) # Don't forget the BOM will be here too!
4
>>> len(u'a'.encode('utf-32'))
8
This means that if urllib3 gets unicode data, but no explicit Content-Length header, urllib3 should encode that data and then set the content-length based on that encoding. However, if urllib3 gets an explicit Content-Length header, I'd argue that it should just assume the user knows what they're doing and let it go.
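A minimal sketch of that rule, written for Python 3: encode first, count octets afterwards, and leave an explicit Content-Length alone. The helper name `prepare_body` and the UTF-8 default are assumptions for illustration, not urllib3 behaviour, and header names are assumed to be native strings.

```python
def prepare_body(body, headers, encoding="utf-8"):
    """Encode a text body and set Content-Length only if the caller did not.

    Hypothetical helper; the default encoding is an assumption.
    """
    headers = dict(headers)  # work on a copy, leave the caller's dict alone
    if isinstance(body, str):
        body = body.encode(encoding)
    has_length = any(name.lower() == "content-length" for name in headers)
    if not has_length:
        # Length in octets of the encoded body, as the header requires.
        headers["Content-Length"] = str(len(body))
    return body, headers


body, headers = prepare_body("ü", {})
print(body, headers)
```

With no header supplied, the one-character string is sent as two octets and the header says so. With an explicit header, whatever the caller wrote is kept, right or wrong.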
From where I'm sitting, the problem here is that urllib3 needs to assume that it might get unicode values for any of these strings, but the wire needs bytes. httplib isn't doing the right thing here, so to avoid the Python interpreter doing its totally bogus implicit encoding/decoding, urllib3 needs to take it into its own hands. This means encoding the unicode.
It is totally legitimate to ask users of urllib3 to do their own encoding, and if you conclude that that is what you want to do then we can make the fix in requests-oauthlib. However, I think that someone in the stack, either requests or urllib3, needs to take responsibility for this encoding stuff, because Python 2 just does it all wrong.
(Python 2 works, following is Python 3)
This means that if urllib3 gets unicode data, but no explicit Content-Length header,
urllib3 should encode that data and then set the content-length based on that encoding.
urllib3 does get an explicit Content-Length.
>>> r = requests.Request('POST', 'http://httpbin.org', data=u'ü').prepare()
>>> r.headers
{'Content-Length': '1'}
My 2 cents: Urllib3 should assume native strings for headers and bytes for the body.
Yeah, I was excluding Requests' behaviour for a moment, and just trying to nail down what urllib3 should be doing. Then we could change Requests to program to that interface. =)
Thomas, I'm also quite happy with your proposal there. If @shazow thinks that's the way it should go, the fix belongs outside urllib3. :fireworks:
I would prefer to avoid doing aggressive type coercion for every input on the urllib3 side.
So this can not block 1.2.0 unless @kennethreitz really wants it to.
Perhaps to satisfy @michaelhelmick and company we should add a notice to the release that we realize that this is broken and a fix is being worked on in shazow/urllib3.
I was sleeping while all this convo was going on :blush: haha
But, first and foremost I want to thank all of you for the participation in the issue!
Although I feel that it would be weird if 1.2.0 was released while this issue was still valid, and then all of a sudden, 2 weeks later, without any version bump to requests, this issue was just solved and file uploading etc. worked.
Although urllib3 is contained within requests, so I guess Kenneth would have to update the internal package anyway, therefore forcing some sort of version bump? So I guess this technically isn't a block for 1.2.0; my apologies.
@michaelhelmick I hope you got better sleep than I did. :) And yes, as soon as this gets fixed, I would be certain to bug @kennethreitz about a bump to 1.2.1
And there's no need to apologize.
@sigmavirus24 I got about 9 hours, haha. And alright, and if he doesn't bump on your first request; we'll start a trending topic on Twitter ;D
We just won't upload any images with those tweets. :-P
xD hahah, I just lol'd haha
So, everyone in this thread who cares about the requests end of this, I've pushed a fix up to requests-oauthlib. Anyone who cares to test it should download from the unicodedecodeerror branch (yes, you will mis-type that at least once), and I welcome code review on the PR at requests/requests-oauthlib#26.
I would like to try, but I don't know how. Is there any guide? I'm using Debian on a Raspberry Pi, mainly with twython to upload pictures.
So since this seems to be fixed in requests/requests-oauthlib and since it seems like we all agree this should be done in urllib3, can we close this?
Actually I misread shazow's comment. I thought he said he'd prefer to do the coercion in urllib3. It seemed bizarre but at least I got it right the second time around, right?
@sigmavirus24 I'd like to treat urllib3 as more of an expected-input-expected-output library, and Requests to do the "do silly thing to input to make behaviour more user-friendly" stuff. Does that make sense?
Yeah it makes perfect sense. I meant that I found it odd that you would want urllib3 to do the coercion, which is how I first read it.
Seems fair to me. I'll try to take a look into it at some point over the long weekend. No guarantees though!
Possibly in order to punish us (:wink:), requests-oauthlib does not work on Python3 if you upload files. That's because Requests uses encode_multipart_formdata from urllib3, which returns the content-type as bytes. @shazow: is that intentional? If so, I can work around it here. If not, I can offer you a PR to fix it.
According to shazow's previous comment(s) I think he wants the body to be bytes and headers to be strings.
Sure, but the content-type is a header value. =)
Ah, then yeah that might be a mistake. I don't know though.
Hmm yes I think that's a mistake. If everyone agrees, a PR sounds good. :)
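Until that is changed upstream, a workaround on the calling side might look like the sketch below. It targets Python 3 only, where the native string type is `str`. `to_native_str` is a hypothetical helper, and the bytes literal is a stand-in for the content-type value, not real output from urllib3.

```python
def to_native_str(value, encoding="ascii"):
    """Decode bytes to str; pass anything else through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-in for a content-type that arrived as bytes.
content_type = b"multipart/form-data; boundary=abc123"

headers = {"Content-Type": to_native_str(content_type)}
print(headers)
```

ASCII is used here on the assumption that a multipart boundary only contains ASCII characters. The body stays bytes and only the header value is touched, which matches the "headers are strings, body is bytes" split discussed above.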
I think my problem seems to be the same. UnicodeDecodeError appears when the method param is unicode and requests.request() gets a files argument, e.g.:
>>> requests.request(u'post', u'http://httpbin.org/post',
... files={u'file': open('README.rst', 'rb')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1759: ordinal not in range(128)
But requests.request(u'post', u'http://httpbin.org/post') is ok.
@marselester: What version of Requests are you using? I can't reproduce this in Requests v1.2.0, using either Python 2.7 or Python 3.3.
I use Python 2.7.3, Requests 1.2.0:
Python 2.7.3 (default, Mar 9 2013, 17:38:02)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.24)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.__version__
'1.2.0'
>>> requests.request(u'post', u'http://httpbin.org/post',
... files={u'file': open('README.rst', 'rb')})
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 354, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 460, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 211, in send
timeout=timeout
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 421, in urlopen
body=body, headers=headers)
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 273, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 958, in request
self._send_request(method, url, body, headers)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 992, in _send_request
self.endheaders(body)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 954, in endheaders
self._send_output(message_body)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 812, in _send_output
msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1759: ordinal not in range(128)
Oh, hang on, this just occurred to me: are you uploading Requests' README.rst file?
I have tried to upload image file:
>>> requests.request(u'post', u'http://httpbin.org/post', files={u'file': open('IMG_1365.JPG', 'rb')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 134: ordinal not in range(128)
But if I convert the method param to str then it is fine:
>>> requests.request(str(u'post'), u'http://httpbin.org/post', files={u'file': open('README.rst', 'rb')})
<Response [200]>
Oh, that makes perfect sense. I suggest you just use a normal, Python 2.7 native string, e.g. 'POST'. Or, even better, use requests.post() and save yourself this trouble entirely. =)
The issue is that Python 2.7 thinks it can convert between unicode and byte strings without your input, which it can't. When you concatenate two strings, if one of them is unicode, the other is decoded using the default encoding (almost always ASCII). HTTP is a text-based format, so building an HTTP message involves a lot of string concatenation. When you upload anything with non-ascii bytes in it, and you've used unicode in a place we don't change it, Bad Stuff Happens(tm).
Requests is aiming to improve sanitising of this stuff (see #1338). However, there are no plans to sanitise the verb string. You must always provide that verb string as a native string (On Python 2.X, bytes, on Python 3.X, unicode).
@Lukasa, thank you. When I can use requests.post() I use it :)
I'm also hitting this problem. How can I solve it?
python 2.7
Traceback (most recent call last):
File "F:/gitcode/201704/user-profile-waimai-crawler/service/data_service.py", line 3, in
This seems like a bug with asn1crypto: specifically, it looks for libcrypto and bumps into a problem handling the path. I recommend you open a support request there.
Something similar has been posted before: https://github.com/kennethreitz/requests/issues/403
This is using requests 1.1.0.
But this problem is still popping up while trying to post just a file, and a file with data. On top of the similar issue, I've posted about this before in requests_oauthlib, where it was said to have been fixed. If you wish, I'll try and find the issue in that lib; just too lazy to open a new tab now ;P
Error:
I posted a gist with sample code if any of you needed to test it. If you guys are really being lazy, I can post some app/user tokens for you to use (let me know).
Gist: https://gist.github.com/michaelhelmick/5199754