httplib mishandles unknown 1XX status codes

6d5d82f3-c90b-41da-bb7f-abdec4dbac80 commented 7 years ago

BPO	28570
Nosy	@vadmium, @Lukasa
Files	28570.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.7', 'type-bug', 'library'] title = 'httplib mishandles unknown 1XX status codes' updated_at = user = 'https://github.com/Lukasa' ``` bugs.python.org fields: ```python activity = actor = 'martin.panter' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'Lukasa' dependencies = [] files = ['45296'] hgrepos = [] issue_num = 28570 keywords = ['patch'] message_count = 4.0 messages = ['279823', '279826', '279827', '279841'] nosy_count = 3.0 nosy_names = ['SilentGhost', 'martin.panter', 'Lukasa'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue28570' versions = ['Python 3.7'] ```

6d5d82f3-c90b-41da-bb7f-abdec4dbac80 commented 7 years ago

Long story short \~~~~

http.client does not tolerate non-100 or 101 status codes from the 1XX block in a sensible manner: it treats them as final responses, rather than as provisional ones.

Expected behaviour \~~~~~~

When http.client receives a 1XX status code that it does not recognise, it should either surface these to a user that expects them by means of a callback, or it should ignore them and only show the user the eventual final status code. This is required by RFC 7231 Section 6.2 (Informational 1xx), which reads:

A client MUST be able to parse one or more 1xx responses received prior to a final response, even if the client does not expect one. A user agent MAY ignore unexpected 1xx responses.

Actual behaviour \~~~~

http.client treats the 1XX status code as final. It parses the header block as though it were a response in the 2XX or higher range. http.client will assume that there is no content in the 1XX response, and so will leave the following final response unconsumed in the socket buffer. This means that if the HTTPConnection is re-used, the user will see the final response following the 1XX as the response to the next request, even though it is not.

Steps to reproduce \~~~~~~

The following "server" can demonstrate the issue:

import socket
import time

document = b'''<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
    <link rel="stylesheet" href="/other/styles.css">
    <script src="/other/action.js"></script>
  </head>
  <body>
    <h1>Hello, world!</h1>
  </body>
</html>
'''

s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('localhost', 8080))
s.listen(5)

while True:
    new_socket, _ = s.accept()
    data = b''

    while not data.endswith(b'\r\n\r\n'):
        data += new_socket.recv(8192)

    new_socket.sendall(
        b'HTTP/1.1 103 Early Hints\r\n'
        b'Server: socketserver/1.0.0\r\n'
        b'Link: </other/styles.css>; rel=preload; as=style\r\n'
        b'Link: </other/action.js>; rel=preload; as=script\r\n'
        b'\r\n'
    )
    time.sleep(1)
    new_socket.sendall(
        b'HTTP/1.1 200 OK\r\n'
        b'Server: socketserver/1.0.0\r\n'
        b'Content-Type: text/html\r\n'
        b'Content-Length: %s\r\n'
        b'Link: </other/styles.css>; rel=preload; as=style\r\n'
        b'Link: </other/action.js>; rel=preload; as=script\r\n'
        b'Connection: close\r\n'
        b'\r\n' % len(document)
    )
    new_socket.sendall(document)
    new_socket.close()

If this server is run directly, the following client can be used to test it:

from http.client import HTTPConnection

c = HTTPConnection('localhost', 8080)
c.request('GET', '/')
r = c.getresponse()

print("Status: {} {}".format(r.status, r.reason))
print("Headers: {}".format(r.getheaders()))
print("Body: {}".format(r.read()))

This client will print

Status: 103 Early Hints Headers: [('Content-Length', '0'), ('Server', 'socketserver/1.0.0'), ('Link', '\</other/styles.css>; rel=preload; as=style'), ('Link', '\</other/action.js>; rel=preload; as=script')] Body: b''

This response is wrong: the 200 header block is hidden from the client. Unfortunately, http.client doesn't allow us to ask for the next response: another call to "getresponse()" raises a "ResponseNotReady: Idle" exception.

8977102d-e623-4805-b309-cd0ad35b0d72 commented 7 years ago

The fix could be as small as the attached patched, though I'm not sure that is the correct way of handling 101 code.

6d5d82f3-c90b-41da-bb7f-abdec4dbac80 commented 7 years ago

101 should probably be special-cased, because in that case the underlying protocol is being totally changed.

vadmium commented 7 years ago

Consuming and ignoring 1xx responses seems reasonable for general cases. Treating 101 (Switching Protocols) as a special case also seems reasonable.

I also agree it would be good to provide an API to return these responses (at least after the request is completely sent). Perhaps a new flag, c.getresponse(informational=True), or a new method, c.get_informational_response(timeout=...). The timeout is for servers that do not support “Expect: 100-continue” mode; see bpo-1346874.

There is also bpo-25919 discussing how to handle responses (including 1xx) while concurrently sending the request. That is a harder problem, but may be solvable with a few new APIs.

python / cpython

httplib mishandles unknown 1XX status codes #72756