psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 169: ordinal not in range(128) #1252

Closed michaelhelmick closed 11 years ago

michaelhelmick commented 11 years ago

Something similar has been posted before: https://github.com/kennethreitz/requests/issues/403

This is using requests 1.1.0, but the problem still pops up when trying to post just a file, or a file with data.

On top of the similar issue, I've posted about this before, and in requests_oauthlib it was said to have been fixed. If you wish, I'll try and find the issue in that lib; I'm just too lazy to open a new tab now ;P

Error:

Traceback (most recent call last):
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/sessions.py", line 340, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/sessions.py", line 279, in request
    resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, cert=cert, proxies=proxies)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/sessions.py", line 374, in send
    r = adapter.send(request, **kwargs)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/adapters.py", line 174, in send
    timeout=timeout
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 422, in urlopen
    body=body, headers=headers)
  File "/Users/mikehelmick/.virtualenv/twython/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 274, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 955, in request
    self._send_request(method, url, body, headers)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 989, in _send_request
    self.endheaders(body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 809, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 169: ordinal not in range(128)

I posted a gist with sample code if any of you needed to test it. If you guys are really being lazy, I can post some app/user tokens for you to use (let me know).

Gist: https://gist.github.com/michaelhelmick/5199754

sigmavirus24 commented 11 years ago

So here's my first guess (but I'll obviously look into this more): I think that OAuth1 might be generating unicode and opening the file in binary form will cause this issue due to how httplib sends the message body. Also this may be related to #1250.

@Lukasa, opinions?

Lukasa commented 11 years ago

That seems like a reasonable diagnosis. We get lots of obscure bugs in Requests from problems with encoding; I'm adding it to my list of things I need to do. =P

husainihisan commented 11 years ago

still no fix? :(

grillermo commented 11 years ago

Virtually all maintained Twitter Python libraries have migrated to requests > 1.0, and posting images is broken. If you want a very specific example, try twython with its twythonAPI.updateProfileImage(open('file','r')) method, where this bug is causing pain.

Lukasa commented 11 years ago

Ok, so I've tracked this down. Gimme a second to write a fix.

sigmavirus24 commented 11 years ago

@grillermo it isn't a matter of reproducing the bug or not believing you. The stack trace is fairly clear on the matter. The problem is that we're all quite busy.

Lukasa commented 11 years ago

So, really, this is because oauthlib is converting all the headers to unicode objects. We're then concatenating these unicode objects with the bytes of the file. Python tries to implicitly decode the bytes into unicode using its default codec (ASCII), and obviously fails.

We can do the 'easy' fix and have requests-oauthlib just encode the headers using Latin-1, but that defers the problem. Alternatively, we can do the 'right' thing and take control of header encoding ourselves. I don't know if Kenneth is up for that, though. @kennethreitz, thoughts?
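A minimal sketch of what the 'easy' fix amounts to, written in Python 3 syntax for readability. The helper name `encode_headers_latin1` is hypothetical and is not the code that landed in requests-oauthlib; it only shows the idea of turning every text header into Latin-1 bytes before it reaches the transport.

```python
def encode_headers_latin1(headers):
    """Return a copy of `headers` with every text key/value encoded to bytes.

    Hypothetical helper for illustration; not part of requests or
    requests-oauthlib.
    """
    encoded = {}
    for name, value in headers.items():
        # Text is encoded as ISO-8859-1 (Latin-1); bytes pass through untouched.
        if isinstance(name, str):
            name = name.encode("latin-1")
        if isinstance(value, str):
            value = value.encode("latin-1")
        encoded[name] = value
    return encoded


# Example input: one text header (as oauthlib would produce) and one bytes header.
headers = {"Authorization": 'OAuth oauth_nonce="abc"', b"X-Raw": b"1"}
print(encode_headers_latin1(headers))
```

A header containing a character outside Latin-1 would raise UnicodeEncodeError here, which is arguably the right outcome: such a header cannot be sent in that encoding anyway.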

Lukasa commented 11 years ago

Just spoke with Kenneth, and he and I agree that we should fix requests-oauthlib. This is no longer blocking v1.2.

t-8ch commented 11 years ago

@Lukasa We recently had a discussion about header encoding in shazow/urllib3#164 at which you might want to look.

sigmavirus24 commented 11 years ago

Thanks Thomas. I was going to mention that when I got home.

Lukasa commented 11 years ago

So wait, urllib3 is expecting us to provide unicode objects, not bytes?

t-8ch commented 11 years ago

Cc @shazow

shazow commented 11 years ago

I think the strategy, as with the rest of Python, is to use the appropriate type where appropriate. I would argue that it makes more sense for header keys to be strings rather than bytes. Is there a counter-argument? (Things like request body should definitely be bytes.)

Lukasa commented 11 years ago

My position on the matter would be that headers have a defined encoding on the wire (ISO-8859-1), which means that the only valid headers are ones that can be encoded in that encoding. You can't send strings on the wire, you can only send bytes, and the user shouldn't have to know what bytes those are. I'm happy to leave that encoding to urllib3 though. =)

Incidentally, if urllib3 is passing unicode headers through to httplib without encoding them it might be the cause of our issue.

Lukasa commented 11 years ago

Further debugging suggests the problem is in the interface between urllib3 and httplib. This exact problem can be reproduced using the following short program:

import httplib
conn = httplib.HTTPConnection('httpbin.org', 80)

conn.request('POST', u'/post', '\xff', {'test': 'value'}) # Exception here.

Any unicode value, whether in the method, the url, or the keys/values of the headers dict, will cause the entire body of the message to be 'promoted' to a unicode string. This is fine unless you are uploading a file that isn't ascii text, which might contain out-of-range bytes. Exceptions will then be dramatically thrown.

I think either urllib3 or requests needs to ensure that by this stage, everything is bytes.
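To make the 'promotion' concrete, here is a rough Python 3 simulation of what Python 2 does when unicode and byte strings are concatenated. The function `py2_style_concat` is made up for this illustration and is not httplib code; it only mimics the implicit ASCII decode that Python 2 performs.

```python
def py2_style_concat(left, right):
    """Mimic Python 2's `+` on mixed unicode/bytes operands (illustration only).

    If either side is text, the bytes side is implicitly decoded as ASCII,
    which is what Python 2 did behind the scenes.
    """
    if isinstance(left, str) and isinstance(right, bytes):
        right = right.decode("ascii")
    elif isinstance(left, bytes) and isinstance(right, str):
        left = left.decode("ascii")
    return left + right


# Request head built with one unicode piece (the URL), as in the repro above.
request_head = py2_style_concat(
    py2_style_concat(b"POST ", "/post"), b" HTTP/1.1\r\n\r\n"
)

# An ASCII-only body is silently promoted to text ...
print(repr(py2_style_concat(request_head, b"plain ascii")))

# ... but a body containing 0xff cannot be decoded as ASCII.
try:
    py2_style_concat(request_head, b"\xff")
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc)
```

A single text operand anywhere in the chain is enough to turn the whole message into text, which is why the body is the part that finally blows up.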

sigmavirus24 commented 11 years ago

If we ensure everything is bytes, this will work well for Python 2 because strs are bytes objects. In python 3 this seems to produce an issue like @t-8ch mentioned. Naturally it's perfectly fine for there to be multiple header values and there are no bizarre characters in headers (and cannot be if I remember the spec properly) so the coercion to whatever will be fine. You might think this falls on our shoulders because it doesn't seem that too many urllib3 users have reported this issue, but you're wrong.

The problem with doing this is exactly the case where we're reading binary data, which is a very common use case. If we're provided a file (or file-like object), we have no way of knowing whether it's binary data or not, and images and the like can't be coerced to text. This makes me think that the burden lies on urllib3 to coerce everything together.

Either way, I feel obligated to leave this behind.

t-8ch commented 11 years ago

Wouldn't this mean urllib3 has to mess with Content-Length? (and maybe other things I am not aware of)

(Python 3):

>>> requests.post('http://httpbin.org/post', data='u').json()
{
 'data': 'u', # data looks correct
 'headers': {
  'Content-Length': '1',
  # [..]
  },
}
# note the "ü"                                    v
>>> requests.post('http://httpbin.org/post', data='ü').json()
{
 'data': 'data:application/octet-stream;base64,/A==', # ??
 'headers': {
  'Content-Length': '1',
  # [..]
 }
 # [..]
}
>>> requests.post('http://httpbin.org/post', data='ü'.encode()).json()
{
 'data': 'ü', # works
 'headers': {
  'Content-Length': '2',
  # [..]
 }
 # [..]
}
$ curl --data-binary ü http://httpbin.org/post
{
  "data": "\u00fc" # == 'ü'
  "headers": {
    "Content-Length": "2",
    # [..]
  },
  # [..]
}

RFC2616:

OCTET = <any 8-bit sequence of data>

# [..]

The Content-Length entity-header field indicates the size of the
entity-body, in decimal number of OCTETs, sent to the recipient or,
in the case of the HEAD method, the size of the entity-body that
would have been sent had the request been a GET.

Lukasa commented 11 years ago

I disagree with your assessment of correct. =)

Content-Length, as you rightly pointed out, asks for the length of the data in octets. The unicode string u'a' (using Python 2.7 notation to avoid ambiguity) does not have a length in octets, because it's unicode. Only encoded text has any octet-based length. For example:

>>> len(u'a'.encode('utf-8'))
1
>>> len(u'a'.encode('utf-16')) # Don't forget the BOM will be here too!
4
>>> len(u'a'.encode('utf-32'))
8

This means that if urllib3 gets unicode data, but no explicit Content-Length header, urllib3 should encode that data and then set the content-length based on that encoding. However, if urllib3 gets an explicit Content-Length header, I'd argue that it should just assume the user knows what they're doing and let it go.
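A sketch of that rule, under stated assumptions: the function name `prepare_body` is hypothetical, the UTF-8 default is my choice for the example rather than anything urllib3 is documented to do, and a real implementation would need to match the header name case-insensitively.

```python
def prepare_body(body, headers, encoding="utf-8"):
    """Return (body_bytes, headers) ready for the wire.

    Hypothetical helper; the UTF-8 default is an assumption of this sketch.
    """
    headers = dict(headers)  # do not mutate the caller's dict
    if isinstance(body, str):
        # Text has no length in octets until it is encoded.
        body = body.encode(encoding)
    # Respect an explicit Content-Length; otherwise count the encoded octets.
    # (Exact-case lookup only; real header matching is case-insensitive.)
    headers.setdefault("Content-Length", str(len(body)))
    return body, headers


body, headers = prepare_body("\u00fc", {})  # "\u00fc" is the character ü
print(body, headers)
```

With the UTF-8 default the ü becomes two octets and Content-Length is '2'; passing encoding="latin-1" gives one octet and '1'. The length only exists once an encoding has been chosen.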

From where I'm sitting, the problem here is that urllib3 needs to assume that it might get unicode values for any of these strings, but the wire needs bytes. httplib isn't doing the right thing here, so to avoid the Python interpreter doing its totally bogus implicit encoding/decoding, urllib3 needs to take it into its own hands. This means encoding the unicode.

It is totally legitimate to ask users of urllib3 to do their own encoding, and if you conclude that that is what you want to do then we can make the fix in requests-oauthlib. However, I think that someone in the stack, either requests or urllib3, needs to take responsibility for this encoding stuff, because Python 2 just does it all wrong.

t-8ch commented 11 years ago

(Python 2 works, following is Python 3)

This means that if urllib3 gets unicode data, but no explicit Content-Length header,
urllib3 should encode that data and then set the content-length based on that encoding. 

Urllib3 does get an explicit Content-Length.

>>> r = requests.Request('POST', 'http://httpbin.org', data=u'ü').prepare()
>>> r.headers
{'Content-Length': '1'}

My 2 cents: Urllib3 should assume native strings for headers and bytes for the body.

Lukasa commented 11 years ago

Yeah, I was excluding Requests' behaviour for a moment, and just trying to nail down what urllib3 should be doing. Then we could change Requests to program to that interface. =)

Thomas, I'm also quite happy with your proposal there. If @shazow thinks that's the way it should go, the fix belongs outside urllib3. :fireworks:

shazow commented 11 years ago

I would prefer to avoid doing aggressive type coercion for every input on the urllib3 side.

sigmavirus24 commented 11 years ago

So this cannot block 1.2.0 unless @kennethreitz really wants it to.

Perhaps to satisfy @michaelhelmick and company we should add a notice to the release that we realize that this is broken and a fix is being worked on in shazow/urllib3

michaelhelmick commented 11 years ago

I was sleeping while all this convo was going on :blush: haha

But, first and foremost I want to thank all of you for the participation in the issue!

Although I feel it would be weird if 1.2.0 were released while this issue was still valid, and then all of a sudden, 2 weeks later and without any version bump to requests, the issue was just solved and file uploading worked.

michaelhelmick commented 11 years ago

Although, urllib3 is contained within requests, so I guess Kenneth would have to update the internal package anyway, therefore forcing some sort of version bump? So I guess this technically isn't a blocker for 1.2.0; my apologies.

sigmavirus24 commented 11 years ago

@michaelhelmick I hope you got better sleep than I did. :) And yes, as soon as this gets fixed, I would be certain to bug @kennethreitz about a bump to 1.2.1

And there's no need to apologize.

michaelhelmick commented 11 years ago

@sigmavirus24 I got about 9 hours, haha. And alright, and if he doesn't bump on your first request; we'll start a trending topic on Twitter ;D

sigmavirus24 commented 11 years ago

We just won't upload any images with those tweets. :-P

michaelhelmick commented 11 years ago

xD hahah, I just lol'd haha

Lukasa commented 11 years ago

So, everyone in this thread who cares about the requests end of this, I've pushed a fix up to requests-oauthlib. Anyone who cares to test it should download from the unicodedecodeerror branch (yes, you will mis-type that at least once), and I welcome code review on the PR at requests/requests-oauthlib#26.

husainihisan commented 11 years ago

I would like to try it, but I don't know how. Is there any guide? I'm using Debian on a Raspberry Pi, mainly with twython to upload pictures.

sigmavirus24 commented 11 years ago

So since this seems to be fixed in requests/requests-oauthlib and since it seems like we all agree this should be done in urllib3, can we close this?

sigmavirus24 commented 11 years ago

Actually I misread shazow's comment. I thought he said he'd prefer to do the coercion in urllib3. It seemed bizarre but at least I got it right the second time around, right?

shazow commented 11 years ago

@sigmavirus24 I'd like to treat urllib3 as more of an expected-input-expected-output library, and Requests to do the "do silly thing to input to make behaviour more user-friendly" stuff. Does that make sense?

sigmavirus24 commented 11 years ago

Yeah it makes perfect sense. I meant that I found it odd that you would want urllib3 to do the coercion, which is how I first read it.

Lukasa commented 11 years ago

Seems fair to me. I'll try to take a look into it at some point over the long weekend. No guarantees though!

Lukasa commented 11 years ago

Possibly in order to punish us (:wink:), requests-oauthlib does not work on Python3 if you upload files. That's because Requests uses encode_multipart_formdata from urllib3, which returns the content-type as bytes. @shazow: is that intentional? If so, I can work around it here. If not, I can offer you a PR to fix it.

sigmavirus24 commented 11 years ago

According to shazow's previous comment(s) I think he wants the body to be bytes and headers to be strings.

Lukasa commented 11 years ago

Sure, but the content-type is a header value. =)

sigmavirus24 commented 11 years ago

Ah, then yeah that might be a mistake. I don't know though.

shazow commented 11 years ago

Hmm yes I think that's a mistake. If everyone agrees, a PR sounds good. :)

marselester commented 11 years ago

I think my problem is the same. A UnicodeDecodeError appears when the method param is unicode and requests.request() gets a files argument, e.g.:

>>> requests.request(u'post', u'http://httpbin.org/post',
...                  files={u'file': open('README.rst', 'rb')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1759: ordinal not in range(128)

But requests.request(u'post', u'http://httpbin.org/post') is ok.

Lukasa commented 11 years ago

@marselester: What version of Requests are you using? I can't reproduce this in Requests v1.2.0, using either Python 2.7 or Python 3.3.

marselester commented 11 years ago

I use Python 2.7.3, Requests 1.2.0:

Python 2.7.3 (default, Mar  9 2013, 17:38:02) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.24)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.__version__
'1.2.0'
>>> requests.request(u'post', u'http://httpbin.org/post',
...                  files={u'file': open('README.rst', 'rb')})
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 354, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 460, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 211, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 421, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 273, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 812, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1759: ordinal not in range(128)

Lukasa commented 11 years ago

Oh, hang on, this just occurred to me: are you uploading Requests' README.rst file?

marselester commented 11 years ago

I have also tried to upload an image file:

>>> requests.request(u'post', u'http://httpbin.org/post', files={u'file': open('IMG_1365.JPG', 'rb')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 134: ordinal not in range(128)

marselester commented 11 years ago

But if I convert the method param to str, it works:

>>> requests.request(str(u'post'), u'http://httpbin.org/post', files={u'file': open('README.rst', 'rb')})
<Response [200]>

Lukasa commented 11 years ago

Oh, that makes perfect sense. I suggest you just use a normal, Python 2.7 native string, e.g. 'POST'. Or, even better, use requests.post() and save yourself this trouble entirely. =)

The issue is that Python 2.7 thinks it can convert between unicode and byte strings without your input, which it can't. When you concatenate two strings, if one of them is unicode, the other is decoded using the default encoding (almost always ASCII). HTTP is a text-based format, so building an HTTP message involves a lot of string concatenation. When you upload anything with non-ascii bytes in it, and you've used unicode in a place we don't change it, Bad Stuff Happens(tm).

Requests is aiming to improve sanitising of this stuff (see #1338). However, there are no plans to sanitise the verb string. You must always provide that verb string as a native string (On Python 2.X, bytes, on Python 3.X, unicode).
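For anyone who receives the verb from somewhere they don't control, a small normalising helper is one way to guarantee a native string. This is a sketch in Python 3 syntax; `to_native_str` is a name invented for this example, and the ASCII default is an assumption that fits HTTP verbs.

```python
def to_native_str(value, encoding="ascii"):
    """Coerce `value` to the interpreter's native `str` type (Python 3 version).

    Hypothetical helper written for this example. On Python 3 the native
    string is text, so bytes are decoded; on Python 2 the native string
    would be bytes and the conversion would run the other way (encode).
    """
    if isinstance(value, str):
        return value
    return value.decode(encoding)


# Normalise a verb that arrived as bytes before handing it to an HTTP call.
method = to_native_str(b"post").upper()
print(method)
```

On Python 2 the same idea is what str(u'post') does in the example above: it turns the unicode verb back into a byte string so nothing downstream gets promoted.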

marselester commented 11 years ago

@Lukasa, thank you. When I can use requests.post() I use it :)

BigFishhhh commented 7 years ago

I also ran into this problem. How can I solve it? I'm on Python 2.7:

Traceback (most recent call last):
  File "F:/gitcode/201704/user-profile-waimai-crawler/service/data_service.py", line 3, in <module>
    import requests
  File "C:\Python27\lib\site-packages\requests\__init__.py", line 52, in <module>
    from .packages.urllib3.contrib import pyopenssl
  File "C:\Python27\lib\site-packages\requests\packages\urllib3\contrib\pyopenssl.py", line 47, in <module>
    from cryptography import x509
  File "C:\Python27\lib\site-packages\cryptography\x509\__init__.py", line 7, in <module>
    from cryptography.x509.base import (
  File "C:\Python27\lib\site-packages\cryptography\x509\base.py", line 16, in <module>
    from cryptography.x509.extensions import Extension, ExtensionType
  File "C:\Python27\lib\site-packages\cryptography\x509\extensions.py", line 14, in <module>
    from asn1crypto.keys import PublicKeyInfo
  File "C:\Python27\lib\site-packages\asn1crypto\keys.py", line 22, in <module>
    from ._elliptic_curve import (
  File "C:\Python27\lib\site-packages\asn1crypto\_elliptic_curve.py", line 51, in <module>
    from ._int import inverse_mod
  File "C:\Python27\lib\site-packages\asn1crypto\_int.py", line 56, in <module>
    from ._perf._big_num_ctypes import libcrypto
  File "C:\Python27\lib\site-packages\asn1crypto\_perf\_big_num_ctypes.py", line 31, in <module>
    libcrypto_path = find_library('crypto')
  File "C:\Python27\lib\ctypes\util.py", line 53, in find_library
    fname = os.path.join(directory, name)
  File "C:\Python27\lib\ntpath.py", line 85, in join
    result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 1: ordinal not in range(128)

Lukasa commented 7 years ago

This seems like a bug with asn1crypto: specifically, it looks for libcrypto and bumps into a problem handling the path. I recommend you open a support request there.