python / cpython

The Python programming language
https://www.python.org
Other
63.1k stars 30.22k forks source link

urllib.request > Request.add_header("abcd","efgh") fails with character ":" in first parameter string #69756

Closed 29d92137-f0d9-4d8f-8395-04e7d2566a1f closed 8 years ago

29d92137-f0d9-4d8f-8395-04e7d2566a1f commented 8 years ago
BPO 25570
Nosy @bitdancer, @berkerpeksag, @vadmium
Files
  • add_header.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['easy', 'type-feature', 'docs'] title = 'urllib.request > Request.add_header("abcd","efgh") fails with character ":" in first parameter string' updated_at = user = 'https://bugs.python.org/crickert' ``` bugs.python.org fields: ```python activity = actor = 'martin.panter' assignee = 'docs@python' closed = True closed_date = closer = 'martin.panter' components = ['Documentation'] creation = creator = 'crickert' dependencies = [] files = ['40984'] hgrepos = [] issue_num = 25570 keywords = ['patch', 'easy'] message_count = 15.0 messages = ['254214', '254216', '254217', '254221', '254227', '254231', '254235', '254238', '254239', '254241', '254243', '254338', '254362', '266793', '266808'] nosy_count = 6.0 nosy_names = ['r.david.murray', 'docs@python', 'python-dev', 'berker.peksag', 'martin.panter', 'crickert'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue25570' versions = ['Python 2.7', 'Python 3.5', 'Python 3.6'] ```

    29d92137-f0d9-4d8f-8395-04e7d2566a1f commented 8 years ago

    A piece of software stopped working after the update from Python 3.4.x to 3.5 (both Linux and Windows).

    I found the cause in the Request.add_header("abcd","efgh") function: If the first parameter string "abcd" contains a colon (":") character, a ValueError is raised and the execution stops.

    This is an example code:

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> data = data.encode('utf-8')
    >>> request = urllib.request.Request("http://requestb.in/xrbl82xr")
    >>> # adding charset parameter to the Content-Type header.
    >>> request.add_header("Content-Type:","application/x-www-form-urlencoded;charset=utf-8")
    >>> with urllib.request.urlopen(request, data) as f:
    ...     print(f.read().decode('utf-8'))
    ...

    This is the error:

    Traceback (most recent call last):
      File "C:\Users\user\Desktop\example.py", line 9, in <module>
        with urllib.request.urlopen(request, data) as f:
      File "C:\python\lib\urllib\request.py", line 162, in urlopen
        return opener.open(url, data, timeout)
      File "C:\python\lib\urllib\request.py", line 465, in open
        response = self._open(req, data)
      File "C:\python\lib\urllib\request.py", line 483, in _open
        '_open', req)
      File "C:\python\lib\urllib\request.py", line 443, in _call_chain
        result = func(*args)
      File "C:\python\lib\urllib\request.py", line 1268, in http_open
        return self.do_open(http.client.HTTPConnection, req)
      File "C:\python\lib\urllib\request.py", line 1240, in do_open
        h.request(req.get_method(), req.selector, req.data, headers)
      File "C:\python\lib\http\client.py", line 1083, in request
        self._send_request(method, url, body, headers)
      File "C:\python\lib\http\client.py", line 1123, in _send_request
        self.putheader(hdr, value)
      File "C:\python\lib\http\client.py", line 1050, in putheader
        raise ValueError('Invalid header name %r' % (header,))
    ValueError: Invalid header name b'Content-Type:'

    Steps to reproduce:

    Add a colon character anywhere into the first parameter of the Request.add_header() function and execute the code using Python 3.5.x.

    29d92137-f0d9-4d8f-8395-04e7d2566a1f commented 8 years ago

    The URL "http://requestb.in/xrbl82xr" is invalid, you'll get a "urllib.error.HTTPError: HTTP Error 404: NOT FOUND" error."

    Instead, please use "http://www.example.com" to confirm.

    bitdancer commented 8 years ago

    This behavior change was part of a security fix, and will appear in the next version of 3.4 as well. See bpo-22928. Header names may not contain colons, the colon separator is added when the header is rendered. Detecting and rejecting them guards against header injection attacks.

    However, that fix was done in httplib. I think it would also be worthwhile to fix add_header so that it rejects invalid header components when called, instead of only having the check done later in httplib, at a point distant from where the problem occurred.

    bitdancer commented 8 years ago

    By the way, that your application worked before was pure luck, here is what httplib sends in 3.4 given your test program:

    POST / HTTP/1.1 Accept-Encoding: identity User-Agent: Python-urllib/3.4 Connection: close Host: xxxx.xxxx.com:9999 Content-Length: 21 Content-Type:: application/x-www-form-urlencoded;charset=utf-8 Content-Type: application/x-www-form-urlencoded

    29d92137-f0d9-4d8f-8395-04e7d2566a1f commented 8 years ago

    Hi David,

    Thanks for looking into this.

    IMHO it would be very helpful for code debugging, if "add_header" rejects invalid header arguments ad hoc. As an alternative, the documentation I used for the implementation [https://docs.python.org/3.6/library/urllib.request.html] could be updated.

    "Content-Type:: application/x-www-form-urlencoded;charset=utf-8" A scope resolution operator thingy in the header? - What could possibly go wrong? ^^

    bitdancer commented 8 years ago

    I don't think the docs need updating if the ValueError is raised by add_header. The 'key' in the docs is the header name, as examples in other sections of that doc make clear. Granted you won't necessarily see that if you just read the add_header entry, but we can't explain every detail in every entry.

    Hmm. Perhaps an example could be added, though.

    29d92137-f0d9-4d8f-8395-04e7d2566a1f commented 8 years ago

    I would suggest a red warning box that informs the reader about her/his responsibility to comply with HTTP specifications - i.e. ensuring compatibility and security.

    vadmium commented 8 years ago

    This change is a result of parameter checking added in bpo-22928. The colon is not meant to be part of a field name. This is what the request being sent by your example looks like in the unpatched 3.4. (I changed to localhost.) Your attempt to override the default Content-Type is not working due to the colon:

    POST /xrbl82xr HTTP/1.1 Accept-Encoding: identity Content-Type: application/x-www-form-urlencoded Host: localhost User-Agent: Python-urllib/3.4 Content-Length: 21 Content-Type:: application/x-www-form-urlencoded;charset=utf-8 Connection: close

    spam=1&eggs=2&bacon=0

    So I don’t think this is a valid bug or regression. What gave you the idea to include the colon? Does the documentation need clarifying?

    Also, I would recommend not trying to set a “charset” attribute with the form-urlencoded content type in general. It is not standardized, and I proposed to remove the recommendation from the documentation in bpo-23360 (feedback welcome). If you really need to send non-ASCII data, you might be able to get away with UTF-8 by default. Also, HTML 5 mentions a _charset_ field, and there is the multipart/form-data content type.

    vadmium commented 8 years ago

    Sorry, it seems I missed some updates to this bug.

    I think the docs have enough big red warning boxes. But an explanation or short example might be good.

    29d92137-f0d9-4d8f-8395-04e7d2566a1f commented 8 years ago

    >So I don’t think this is a valid bug or regression. What gave you the idea to include the colon?

    "We ask that if you are going to consume the API that you pass a custom User-Agent header along with your requests. The User-Agent should primarily include information on how we can contact you, but it is also a good idea to include your application name and version." [https://eveonline-third-party-documentation.readthedocs.org/en/latest/xmlapi/intro/#other-information]

    This is why I created custom User-Agent headers: .add_header("Contact: ","me") .add_header("Software: ","sw") .add_header("Version: ","0.1") I didn't really need the colons, it just worked.

    >Does the documentation need clarifying?

    In the description for "headers" in the "class urllib.request.Request" section, Mozilla's string is given as an example: "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11" Including slashes, semicolons, and braces, I didn't think I would run into problems with using a colon (in the first parameter string).

    >Also, I would recommend not trying to set a “charset” attribute with the form-urlencoded content type in general. It is not standardized, and I proposed to remove the recommendation from the documentation in bpo-23360 (feedback welcome).

    I agree. The examples provided should be good (standard-conform) examples.

    >I think the docs have enough big red warning boxes. But an explanation or short example might be good. Maybe a comment line in the examples?

    vadmium commented 8 years ago

    Ah, I think I see where you are coming from. I guess you aren’t intimately familiar with the HTTP protocol. There are various types of “headers” aka header fields, not only User-Agent, but Content-Type and others. When they suggest passing a User-Agent header, that actually means a call like

    request.add_header("User-Agent", "custom text here")

    using the exact string "User-Agent" as the field name (key).

    If you can given a specific comment line to add to a specific example, that might be useful. Even if you aren’t 100% sure of the details :)

    29d92137-f0d9-4d8f-8395-04e7d2566a1f commented 8 years ago

    >Ah, I think I see where you are coming from. I guess you aren’t intimately familiar with the HTTP protocol.

    Exactly. :)

    >If you can given a specific comment line to add to a specific example, that might be useful. Even if you aren’t 100% sure of the details :)

    [OpenerDirector example] import urllib.request opener = urllib.request.build_opener() # adding custom header value to the 'User-agent' key opener.addheaders = [('User-agent', 'Mozilla/5.0')] opener.open('http://www.example.com/')

    "This is often used to “spoof” the User-Agent header VALUE, which is used by a browser to identify itself ..."

    vadmium commented 8 years ago

    Hmm, I wonder if that OpenerDirector example is a bit obscure. Normally you would use the default urlopen() to set User-Agent, without resorting to a custom OpenerDirector.

    In my patch:

    Let me know what you think (if anything is unnecessary, other suggestions?)

    berkerpeksag commented 8 years ago

    add_header.patch looks good to me. Thanks Martin.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset 320b9a65ac07 by Martin Panter in branch '3.5': Issue bpo-25570: Add example of customizing User-Agent via add_header() https://hg.python.org/cpython/rev/320b9a65ac07

    New changeset 75dc64c8c22b by Martin Panter in branch 'default': Issue bpo-25570: Merge add_header() example from 3.5 https://hg.python.org/cpython/rev/75dc64c8c22b

    New changeset 3aa49900869b by Martin Panter in branch '2.7': Issue bpo-25570: Add example of customizing User-Agent via add_header() https://hg.python.org/cpython/rev/3aa49900869b