psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

HeaderParsingError: Failed to parse headers #3098

Closed guyskk closed 8 years ago

guyskk commented 8 years ago
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
DEBUG:requests.packages.urllib3.connectionpool:"POST /kkblog/ HTTP/1.1" 201 None
WARNING:requests.packages.urllib3.connectionpool:Failed to parse headers (url=http://127.0.0.1:5984/kkblog/): [MissingHeaderBodySeparatorDefect()], unparsed data: '³é\x97\xad\r\nETag: "1-967a00dff5e02add41819138abb3284d"\r\nDate: Fri, 15 Apr 2016 14:45:18 GMT\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Length: 69\r\nCache-Control: must-revalidate\r\n\r\n'
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 390, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "/usr/lib/python3.5/site-packages/requests/packages/urllib3/util/response.py", line 59, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
requests.packages.urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: '³é\x97\xad\r\nETag: "1-967a00dff5e02add41819138abb3284d"\r\nDate: Fri, 15 Apr 2016 14:45:18 GMT\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Length: 69\r\nCache-Control: must-revalidate\r\n\r\n'

here is the same request with curl:

curl -v -X POST 127.0.0.1:5984/kkblog/ -H "Content-Type: application/json" -d '{"_id": "关闭"}'
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0)
> POST /kkblog/ HTTP/1.1
> Host: 127.0.0.1:5984
> User-Agent: curl/7.47.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 17
> 
* upload completely sent off: 17 out of 17 bytes
< HTTP/1.1 201 Created
< Server: CouchDB/1.6.1 (Erlang OTP/18)
< Location: http://127.0.0.1:5984/kkblog/关闭
< ETag: "3-bc27b6930ca514527d8954c7c43e6a09"
< Date: Fri, 15 Apr 2016 15:13:14 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 69
< Cache-Control: must-revalidate
< 
{"ok":true,"id":"关闭","rev":"3-bc27b6930ca514527d8954c7c43e6a09"}
* Connection #0 to host 127.0.0.1 left intact

The problem is the Location: http://127.0.0.1:5984/kkblog/关闭 header in the response. I tried other Chinese characters, but they didn't cause the exception.

>>> '关闭'.encode('utf-8')
b'\xe5\x85\xb3\xe9\x97\xad'
>>> 
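
For reference, a minimal script that reproduces the warning against the same local CouchDB instance (mirroring the curl call above; the URL and payload are taken from the logs) might look like this:

import logging
import requests

logging.basicConfig(level=logging.DEBUG)

# Same request as the curl call above. CouchDB echoes the raw UTF-8 id back
# in the Location header, which triggers the "Failed to parse headers" warning.
r = requests.post('http://127.0.0.1:5984/kkblog/', json={"_id": "关闭"})
print(r.status_code)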
Lukasa commented 8 years ago

The header parsing is done by httplib, in the Python standard library; that is the part that failed to parse. The failure to parse is understandable though: servers should not be shoving arbitrary bytes into headers.

Using UTF-8 for your headers is extremely unwise, as discussed by RFC 7230:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII].

In this instance it's not really possible for us to resolve the problem. The server should instead be sending urlencoded URLs, or RFC 2047-encoded header fields. Either way, httplib is getting confused here, and we can't really step in and stop it.
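
For illustration, either of the alternatives mentioned above keeps the header value within ASCII; a sketch of what a server could emit instead (not something requests can do on the server's behalf):

import urllib.parse
import email.header

doc_id = '关闭'

# Alternative 1: percent-encode the URL placed in the Location header
print('Location: http://127.0.0.1:5984/kkblog/' + urllib.parse.quote(doc_id))
# -> Location: http://127.0.0.1:5984/kkblog/%E5%85%B3%E9%97%AD

# Alternative 2: an RFC 2047 encoded-word for the header field value
print(email.header.Header(doc_id, 'utf-8').encode())
# -> =?utf-8?b?5YWz6Zet?=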

guyskk commented 8 years ago

thanks

fake-name commented 7 years ago

In this instance it's not really possible for us to resolve the problem. The server should instead be sending urlencoded URLs, or RFC 2047-encoded header fields. Either way, httplib is getting confused here, and we can't really step in and stop it.

So..... what about the thousands and thousands of potential servers that I can't just SSH into and fix?

Basically, "just follow the RFC" is a complete non-answer, because I don't control the world (I'm taking minion applications, though!).
The fact is, servers out there serve UTF-8 headers. This is not something I can fix, because they're not my servers. My web browser handles this just fine, so it's clearly possible to make it work.

As it is, requests fails on these servers. This is fixable, because I control the code on my local machine.

fake-name commented 7 years ago

Additionally, the RFC only ever states In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data., so UTF-8 header values are even RFC compliant!

StewPoll commented 7 years ago

US-ASCII is a sub-set of UTF-8, not the other way around. Characters like 关闭 are UTF-8, but are NOT US-ASCII. So no, UTF-8 is NOT RFC compliant.
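
(A quick illustration of the subset relationship: pure-ASCII bytes decode identically under US-ASCII and UTF-8, but the UTF-8 bytes for 关闭 are not valid US-ASCII.)

>>> b'ETag'.decode('ascii') == b'ETag'.decode('utf-8')
True
>>> '关闭'.encode('utf-8').decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)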

fake-name commented 7 years ago

[Deleted comments about RFC stuff, since it's irrelevant, due to broad use of UTF-8 headers]


Anyways, the issue of RFC compliance is irrelevant. There are servers out there serving UTF-8 headers, and doing so is broadly supported by all browser vendors.

StewPoll commented 7 years ago

Content-Disposition is defined in RFC2183. It states that "Short" parameters should be US-ASCII. Long Parameters should be encoded as per RFC-2184. I can't see where in RFC-2184 it says that UTF-8 encoding is valid. (But I may be missing that particular line)

@Lukasa knows much more about the relevant RFCs than I do though.

fake-name commented 7 years ago

And again, whether it's compliant is irrelevant. There are servers out there that act like this. And my browser and cURL are completely happy talking to them, yet requests explodes.


Hell, I'm interacting with cloudflare, and it's serving UTF-8 headers. So basically unicode header support is massively, MASSIVELY deployed and available, standards be damned.

If you're dead set on being puritanical about RFC support, the only people who are harmed are people who want to use the requests library.

Lukasa commented 7 years ago

@fake-name I don't recall being puritanical about anything. Here is, word for word, what I said (literally quoting myself from this thread):

The header parsing is done by httplib, in the Python standard library; that is the part that failed to parse. The failure to parse is understandable though: servers should not be shoving arbitrary bytes into headers.

Using UTF-8 for your headers is extremely unwise, as discussed by RFC 7230:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII].

In this instance it's not really possible for us to resolve the problem. The server should instead be sending urlencoded URLs, or RFC 2047-encoded header fields. Either way, httplib is getting confused here, and we can't really step in and stop it.

Note the key part of this comment: "Either way, httplib is getting confused here, and we can't really step in and stop it.".

This is what I mean when I say "it's not really possible for us to resolve the problem". The issue here is in a helper library that sits in the Python standard library. Changing the header parsing logic of that standard library module, while possible, is something that needs to be done as part of the standard CPython development process. Requests already carries more subclasses and monkeypatches to httplib than we're happy with, and we're strongly disinclined to carry more.

So here are the options for resolving this issue:

  1. File a bug with the CPython development team, get the bug fixed and into Python 3.6 or 3.7, upgrade to that Python version.
  2. Work out what the minimal possible invasive change is to httplib in order to allow parsing UTF-8 headers, propose that patch to us to see what we think of it.
  3. Write a patch that removes Requests' requirement to use httplib at all so that the next time we have a problem that boils down to "httplib is stupid", we don't have to have this argument again.

Now, you are welcome to pursue any of those options, but I've been to this rodeo a few times so I'm pursuing (3), which is the only one that actually makes this problem go away for good. Unfortunately, it turns out that replacing our low-level HTTP stack that we have spent 7 years integrating with takes quite a lot of work, and I can't just vomit out the code to fix this on demand.

To sum up: I didn't say I didn't think this was a problem or a bug, I said it was a problem that the Requests team couldn't fix, at least not on a timescale that was going to be helpful to this user. If you disagree, by all means, provide a patch to prove me wrong.


And let me make something clear. For the last 9 months or so I have been the most active Requests maintainer by a long margin. Requests is not all I do with my time. I maintain 15 other libraries and actively contribute to more. I have quite a lot of stuff I am supposed to be doing. So I have to prioritise my bug fixing.

Trust me when I say that a bug where the effort required to fix it is extremely high, and where the flaw comes from a server emitting non-RFC-compliant output, is not a bug that screams out "must be fixed this second". Any time a bug is predicated on the notion that our peer isn't spec compliant, that bug drops several spaces down my priority list. Postel was wrong.

Browsers are incentivised to support misbehaving servers because they are in a competitive environment, and users only blame them when things go wrong. If Chrome doesn't support ${CRAPPY_WEBSITE_X} then Chrome users will just go to a browser that does when they need access.

That's all fine and good, but the reason that Requests doesn't do this is because we have two regular developers. That's it. There are only so many things two developers can do in a day. Neither of us work on just Requests. Compare this to Chrome, which has tens of full-time developers and hundreds of part-time ones. If you want Requests to work on every site where Chrome does, then I have bad news for you my friend because it's just never going to happen.

I say all of this to say: please don't berate the Requests team because we didn't think your particular pet bug was important. We prioritise bugs and close ones we don't think we'll fix any time soon. If you would like to see this bug fixed, a much better option is to write the patch yourself. Shouting at me does not make me look fondly on your request for assistance.

fake-name commented 7 years ago

I apologize. I assumed you were holding an opinion that you were not, and proceeded to be a complete ass.

In any event, I don't disagree that this is an issue with the core library, but, well, in my experience complaining about encoding issues in the core library is non-productive (I have an issue with the built-in ftplib where it decodes some messages as iso-8859-1 even when in utf-8 mode, which I was only able to solve by monkey-patching the stdlib).

Anyways, assuming you're OK with monkey patching, here's a simple snippet that monkey-patches http.client to make it much, MUCH more robust to arbitrary header encodings:


try:
    import cchardet as chardet
except ImportError:
    import chardet as chardet

import http.client
import email.parser

def parse_headers(fp, _class=http.client.HTTPMessage):
    """Parses only RFC2822 headers from a file pointer.

    email Parser wants to see strings rather than bytes.
    But a TextIOWrapper around self.rfile would buffer too many bytes
    from the stream, bytes which we later need to read as bytes.
    So we read the correct bytes here, as bytes, for email Parser
    to parse.

    Note: Monkey-patched version to try to more intelligently determine
    header encoding

    """
    headers = []
    while True:
        line = fp.readline(http.client._MAXLINE + 1)
        if len(line) > http.client._MAXLINE:
            raise http.client.LineTooLong("header line")
        headers.append(line)
        if len(headers) > http.client._MAXHEADERS:
            raise http.client.HTTPException("got more than %d headers" % http.client._MAXHEADERS)
        if line in (b'\r\n', b'\n', b''):
            break

    hstring = b''.join(headers)
    inferred = chardet.detect(hstring)
    if inferred and inferred['confidence'] > 0.8:
        print("Parsing headers!", hstring)
        hstring = hstring.decode(inferred['encoding'])
    else:
        hstring = hstring.decode('iso-8859-1')

    return email.parser.Parser(_class=_class).parsestr(hstring)

http.client.parse_headers = parse_headers

Note: This does require cchardet or chardet. I'm open to better ways to determine the encoding. It simply overrides the http.client.parse_headers() function in the stdlib, which is kind of squicky.

Splatting the above into a file, and then just importing it at the beginning of the requests/__init__.py file seems to solve the problem.
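
For example, assuming the snippet is saved as a module named header_patch.py (a placeholder name), importing it before making any request should be enough, since the patch replaces http.client.parse_headers at module level:

# header_patch.py contains the monkey-patch above (hypothetical file name).
# The import just needs to run before the first response is parsed.
import header_patch  # noqa: F401
import requests

r = requests.post('http://127.0.0.1:5984/kkblog/', json={"_id": "关闭"})
print(r.headers.get('Location'))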

Lukasa commented 7 years ago

So, that patch looks somewhat plausible. I should note for others skimming this thread that it only works on Python 3, but that's ok because this problem only exists on Python 3 (Python 2 doesn't insist on decoding the headers).

There are a few things we'd need to address if we wanted to move forward on this patch:

  1. It's slow. Requests includes chardet, and already cops quite a lot of flak for using it to try to guess the encoding of body content because of how slow it is. Doing a chardet check for every response we get is likely to be pretty unpleasant.
  2. It makes the assertion that every header in the response is encoded the same way. Unfortunately I can confidently tell you that that's not true. It's extremely common to have a situation involving reverse or caching proxies where multiple entities add headers to a response, often using different character encodings. Of course, this patch isn't any worse in that case than the current behaviour, but it doesn't necessarily handle it gracefully either.
  3. Monkeypatching the stdlib is asking for trouble. It hasn't entirely stopped us in the past, but that doesn't make me happy about it.

Part of me wonders whether we can address this issue in a more sledgehammery way. For example, can we avoid the chardet requirement by asserting that we only want to try UTF-8 and ISO-8859-1? That will fail in the rare instance that someone sends a header block encoded in UTF-16, but frankly I'd argue that in that instance the server is beyond help and needs to be put out of its misery.

The other advantage of doing it that way around is that it would let us drop down and put the patch into urllib3. In addition to ensuring that more people get the opportunity to benefit from the patch, it is the more natural place to put it: urllib3 is where we do most of our interfacing with httplib. The other advantage of putting it there is that I'm currently involved in trying to replace httplib in urllib3, and having the patch in urllib3 makes me much more likely to remember to remove the patch once we've resolved the issue. ;)

What do you think @fake-name? Would you be open to a more reduced version of this patch getting added to urllib3 as a (hopefully) temporary measure until we can remove httplib entirely?

fake-name commented 7 years ago

It's definitely an extremely clumsy fix.

It's slow. Requests includes chardet, and already cops quite a lot of flak for using it to try to guess the encoding of body content because of how slow it is. Doing a chardet check for every response we get is likely to be pretty unpleasant.

If requests is using chardet, have you looked at cChardet? It's a Cython-based, API-compatible replacement for chardet, which according to its readme page is about 1000-4000x faster than plain chardet.

It makes the assertion that every header in the response is encoded the same way. Unfortunately I can confidently tell you that that's not true. It's extremely common to have a situation involving reverse or caching proxies where multiple entities add headers to a response, often using different character encodings. Of course, this patch isn't any worse in that case than the current behaviour, but it doesn't necessarily handle it gracefully either.

Ugh. Why can't everything just be UTF-8?

I wonder if it's possible to only bother doing a chardet call if the string contains characters outside the normal printable ASCII range? Something like if any([char > 127 for char in rawheader]). That'd let the system avoid the call entirely when the string is definitely plain ASCII.

I can add per-header decoding after I'm home from work.

Monkeypatching the stdlib is asking for trouble. It hasn't entirely stopped us in the past, but that doesn't make me happy about it.

It's certainly pretty ugly, though this is one of the cleaner monkey-patches I've seen (they were nice enough to break out the entire header decode thing into a single function, rather than having to replace parts of a class or something). Since it looks like parse_headers is part of the httplib public API (albeit an undocumented part), the function signature should be pretty fixed.

I'm not too enthused about it either, but I think trying to convince python core to include cchardet in their dependencies for it is probably not exactly viable.

Part of me wonders whether we can address this issue in a more sledgehammery way. For example, can we avoid the chardet requirement by asserting that we only want to try UTF-8 and ISO-8859-1? That will fail in the rare instance that someone sends a header block encoded in UTF-16, but frankly I'd argue that in that instance the server is beyond help and needs to be put out of its misery.

The problem with putting servers out of their misery is the people who own datacenters tend to get kind of irritable when you break in and hit someone's server with a baseball-bat. And then you have to deal with the person who actually owns the server.

The other advantage of doing it that way around is that it would let us drop down and put the patch into urllib3. In addition to ensuring that more people get the opportunity to benefit from the patch, it is the more natural place to put it: urllib3 is where we do most of our interfacing with httplib. The other advantage of putting it there is that I'm currently involved in trying to replace httplib in urllib3, and having the patch in urllib3 makes me much more likely to remember to remove the patch once we've resolved the issue. ;)

What do you think @fake-name? Would you be open to a more reduced version of this patch getting added to urllib3 as a (hopefully) temporary measure until we can remove httplib entirely?

I don't know the requests architecture well enough to comment coherently, though you should treat my little hack above as basically public domain (or WTFPL if you really want an actual license). If it's useful, do what you want with it.

Lukasa commented 7 years ago

If requests is using chardet, have you looked at cChardet?

Unfortunately, we can't depend on third-party modules, especially not if they require compilation. That rules out cython. =(

I'm pretty strongly inclined to want to ship this as a much more minor patch that tries UTF-8 and then ISO-8859-1 in that order. I suspect that will cover 98% of the problems, and will let us get a better feel for how problematic the remaining cases are.

fake-name commented 7 years ago

Unfortunately, we can't depend on third-party modules, especially not if they require compilation. That rules out cython. =(

Isn't chardet a third-party module? Also, cChardet has Windows builds available on PyPI for 2.7, 3.4 and 3.5.

Anyways, cchardet and chardet are API-compatible. Try to install cchardet, fall back to chardet on failure.

Lukasa commented 7 years ago

chardet is vendored into the requests repository.

I'm open to having cchardet as a fast-path for regular chardet though, via optional import.

sigmavirus24 commented 7 years ago

So, chardet has been vastly improved (but not yet released) to significantly improve its speed. Once they get around to it, the new release should improve things for us.

fake-name commented 7 years ago

Kinda hackier chardet-less version. Also does per-header decoding:

import http.client
import email.parser

cchardet = False

try:
    import cchardet
except ImportError:
    pass

def isUTF8Strict(data):
    '''
    Check if all characters in a bytearray are decodable
    using UTF-8.
    '''
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def decode_headers(header_list):
    '''
    Decode a list of headers.

    Takes a list of bytestrings, returns a list of unicode strings.
    The character set for each bytestring is individually decoded.
    '''

    decoded_headers = []
    for header in header_list:
        if cchardet:
            inferred = cchardet.detect(header)
            if inferred and inferred['confidence'] > 0.8:
                print("Parsing headers!", header)
                decoded_headers.append(header.decode(inferred['encoding']))
            else:
                decoded_headers.append(header.decode('iso-8859-1'))
        else:
            # All bytes are < 127 (e.g. ASCII)
            if all([char & 0x80 == 0 for char in header]):
                decoded_headers.append(header.decode("us-ascii"))
            elif isUTF8Strict(header):
                decoded_headers.append(header.decode("utf-8"))
            else:
                decoded_headers.append(header.decode('iso-8859-1'))

    return decoded_headers

def parse_headers(fp, _class=http.client.HTTPMessage):
    """Parses only RFC2822 headers from a file pointer.

    email Parser wants to see strings rather than bytes.
    But a TextIOWrapper around self.rfile would buffer too many bytes
    from the stream, bytes which we later need to read as bytes.
    So we read the correct bytes here, as bytes, for email Parser
    to parse.

    Note: Monkey-patched version to try to more intelligently determine
    header encoding

    """
    headers = []
    while True:
        line = fp.readline(http.client._MAXLINE + 1)
        if len(line) > http.client._MAXLINE:
            raise http.client.LineTooLong("header line")
        headers.append(line)
        if len(headers) > http.client._MAXHEADERS:
            raise http.client.HTTPException("got more than %d headers" % http.client._MAXHEADERS)
        if line in (b'\r\n', b'\n', b''):
            break

    decoded_headers = decode_headers(headers)

    hstring = ''.join(decoded_headers)

    return email.parser.Parser(_class=_class).parsestr(hstring)

http.client.parse_headers = parse_headers

I'm curious how much of a performance improvement this'd result in. It might be a good idea to work up a testing scaffold (or just tell people doing high-throughput stuff to install cchardet).

Lukasa commented 7 years ago

I suspect not much of a perf improvement: there's a lot of time spent looking at each byte in Python code, which is always going to be slow. The way to get it to be faster is to just say: "try to decode the header block with UTF-8; if that fails, try with latin-1". That's the barest minimum of better, but it might be good enough.
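
(For illustration, the minimal fallback being described might look something like this; a sketch of the idea, not the actual urllib3 patch:)

def decode_header_block(raw):
    # Try UTF-8 first; fall back to ISO-8859-1, which maps every possible
    # byte value and therefore never raises.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-1')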

safo-bora commented 6 years ago

I have the same problem. The code works on Python 2.7 but fails with Python 3.6.4.

import requests
import logging

from requests.auth import HTTPBasicAuth

logging.basicConfig(level=logging.DEBUG)

r = requests.get('http://test.local',
                 auth=HTTPBasicAuth('USER', 'PASSWORD'))

print(r)

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): test-silhouette.3shape.local
DEBUG:urllib3.connectionpool:http://test.local:80 "GET / HTTP/1.1" 200 None
WARNING:urllib3.connectionpool:Failed to parse headers (url=http://test.local:80/): [MissingHeaderBodySeparatorDefect()], 
unparsed data: 'Access-Control-Expose-Headers : WWW-Authenticate\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Methods: 
GET, POST, OPTIONS, PUT, PATCH, DELETE\r\nAccess-Control-Allow-Credentials: true\r\nAccess-Control-Allow-Headers: accept, authorization, 
Content-Type\r\nDate: Thu, 12 Apr 2018 14:06:25 GMT\r\nContent-Length: 2692\r\n\r\n'
Traceback (most recent call last):
  File "D:\Python3.6.4\lib\site-packages\urllib3\connectionpool.py", line 399, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "D:\Python3.6.4\lib\site-packages\urllib3\util\response.py", line 66, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: 'Access-Control-Expose-Headers : 
WWW-Authenticate\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Methods: GET, POST, OPTIONS, PUT, PATCH, 
DELETE\r\nAccess-Control-Allow-Credentials: true\r\nAccess-Control-Allow-Headers: accept, authorization, Content-Type\r\nDate: Thu, 
12 Apr 2018 14:06:25 GMT\r\nContent-Length: 2692\r\n\r\n'
Traceback (most recent call last):
  File "D:\Python3.6.4\lib\site-packages\urllib3\response.py", line 302, in _error_catcher
    yield
  File "D:\Python3.6.4\lib\site-packages\urllib3\response.py", line 384, in read
    data = self._fp.read(amt)
  File "D:\Python3.6.4\lib\http\client.py", line 449, in read
    n = self.readinto(b)
  File "D:\Python3.6.4\lib\http\client.py", line 493, in readinto
    n = self.fp.readinto(b)
  File "D:\Python3.6.4\lib\socket.py", line 586, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
nateprewitt commented 6 years ago

Hi @safo-bora, the issue you're encountering is because the server is returning an illegal header.

The first header: 'Access-Control-Expose-Headers : WWW-Authenticate\r\n'

There is a space between the header name and the colon which violates RFC 7230 section 3.2.4:

No whitespace is allowed between the header field-name and colon. In the past, differences in the handling of such whitespace have led to security vulnerabilities in request routing and response handling. A server MUST reject any received request message that contains whitespace between a header field-name and colon with a response code of 400 (Bad Request).
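
(For illustration, the stdlib's reaction to such a line can be reproduced directly with email.parser, which is what http.client uses to parse headers; a sketch, and exact defect reporting may vary across Python versions:)

import email.parser

# The offending line from the log above: note the space before the colon.
raw = ('Access-Control-Expose-Headers : WWW-Authenticate\r\n'
       'Date: Thu, 12 Apr 2018 14:06:25 GMT\r\n'
       '\r\n')
msg = email.parser.Parser().parsestr(raw)
print(msg.defects)              # expect something like [MissingHeaderBodySeparatorDefect()]
print(repr(msg.get_payload()))  # everything from the bad line onwards is left as unparsed data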

safo-bora commented 6 years ago

Thank you!

We have found the problem on our side.

Now works!


bahoo commented 5 years ago

For folks who need to alter how requests ( and urllib3 as a whole ) parse headers, if changing the response isn't an option — here's a hint; hope it might help:

https://gist.github.com/bahoo/bdfedfc47cb971840cae489a844a2408

RunningFaster commented 3 years ago

How can you solve this problem?