Add type checks to urllib.request.Request

ezio-melotti commented 9 years ago

BPO	25439
Nosy	@ezio-melotti, @vadmium, @CuriousLearner
PRs	python/cpython#10616
Files	urllib_request_param_type_check.patch urllib_request_param_type_check_2.patch urllib_request_param_type_check_3.patch urllib_request_param_type_check_4.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/CuriousLearner' closed_at = None created_at = labels = ['3.7', '3.8', 'type-feature', 'library', 'easy'] title = 'Add type checks to urllib.request.Request' updated_at = user = 'https://github.com/ezio-melotti' ``` bugs.python.org fields: ```python activity = actor = 'CuriousLearner' assignee = 'CuriousLearner' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'ezio.melotti' dependencies = [] files = ['40819', '40830', '40850', '40912'] hgrepos = [] issue_num = 25439 keywords = ['patch', 'easy'] message_count = 16.0 messages = ['253173', '253175', '253176', '253231', '253252', '253267', '253519', '253585', '253586', '253587', '253802', '253814', '253815', '257266', '330136', '330192'] nosy_count = 4.0 nosy_names = ['ezio.melotti', 'martin.panter', 'Nan Wu', 'CuriousLearner'] pr_nums = ['10616'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue25439' versions = ['Python 3.6', 'Python 3.7', 'Python 3.8'] ```

ezio-melotti commented 9 years ago

Currently urllib.request.Request seems to accept invalid types silently, only to fail later on with unhelpful errors when the request is passed to urlopen.

This might cause users to go through something similar:

>>> r = Request(b'https://www.python.org')
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
urllib.error.URLError: <urlopen error unknown url type: b'https>

This unhelpful error might lead users to think https is not supported, whereas the problem is that the url should have been str, not bytes.

The same problem applies to post data:

>>> r = Request('https://www.python.org', {'post': 'data'})
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
TypeError: memoryview: dict object does not have the buffer interface
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  ...
ValueError: Content-Length should be specified for iterable data of type <class 'dict'> {'post': 'data'}

This error seems to indicate that Content-Length should be specified somewhere, but AFAICT from the docs, the data should be bytes or None -- so let's try to urlencode them:

>>> r = Request('https://www.python.org', urlencode({'post': 'data'}))
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

OK, maybe I should use bytes in the dict:

>>> r = Request('https://www.python.org', urlencode({b'post': b'data'}))
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

That didn't work, I also needed to encode the output of urlencode().

Most of these problems could be prevented if Request() raised for non-str URLs, and non-bytes (and non-None) POST data. Unless there some valid reason to accept invalid types, I think they should be rejected early.

vadmium commented 9 years ago

Type checking on the URL should be reasonably straightforward.

For the request data, the question is what types do we currently support? This may be a can of worms. We would have to be careful not to break existing use cases, even undocumented ones.

Looking at bpo-23740, the underlying “http.client” accepts a variety of data types (bytes, Latin-1 str, iterables, files), and this support varies depending if the Content-Length header is supplied. I don’t think all combinations work with urlopen(), but judging by bpo-5038, there is at least partial support for bytes-like, iterables, and file objects.

ezio-melotti commented 9 years ago

For the request data, the question is what types do we currently support?

This is what I was wondering as well. The doc says "data must be a bytes object specifying additional data to send to the server, or None if no such data is needed."0 However one of the error messages says "POST data should be bytes or an iterable of bytes." and "Content-Length should be specified for iterable data" so iterables of bytes also seem to be supported. One option is to simply reject str and dict (or mappings in general), since these seem to me the two most likely errors (the first in case the user forgot to encode the output of urlencode(), the second in case the user forgot urlencode() altogether and passed a dict).

3d3643ef-0805-40b0-8106-5600a91a57f5 commented 9 years ago

Uploaded a patch. Also update the test. Seems there are at least two old test cases using empty dict as data param. has dict been supported before?

vadmium commented 9 years ago

The iterable option is documented under the urlopen(data=...) parameter, albeit not under the Request.data attribute. IMO this isn’t right, because you need to use Request to add Content-Length, but that is a separate documentation issue.

Technically, an empty dict() and a dictionary of bytes keys are both iterables of bytes. I admit it is a strange case though; this is one of those canned worms I mentioned. Example:

headers = {"Content-Length": 6}
data = OrderedDict(((b"abc", 1), (b"def", 2))))
urlopen(Request("http://localhost/", headers=headers, data=data))
# Sends request data b"abcdef"

Using the annotate function \https://hg.python.org/cpython/annotate/0a0aafaa9bf5/Lib/test/test_urllib.py#l1167\ reveals that your empty dict() tests were added by revision 0a0aafaa9bf5. I suspect it was an accident that happened to work, which is really an argument for diagnosing passing in a dict, although as I mentioned this is technically breaking compatibility.

3d3643ef-0805-40b0-8106-5600a91a57f5 commented 9 years ago

I see. Empty dict is considered as iterable of bytes makes good sense. I just left the old tests there. And explicitly give warnings when data field is of type str or type dict with non-bytes key.

vadmium commented 9 years ago

There is already a check for request.data being str in dorequest(). Maybe it is okay to remove this check; I think it would be equivalent to the new check.

Regarding the dict check, here is a strange class that currently works, but would be broken by the patch. Maybe it is okay to break it though, since it certainly goes against the urlopen() documentation.

class ReaderMapping(dict):
    def __init__(self):
        super().__init__({"abc": "xyz"})
        self.mode = "r"
        self.data = "123"
    def read(self, size):
        data = self.data
        self.data = ""
        return data

urlopen(Request("http://localhost/", headers={"Content-Length": 3}, data=ReaderMapping()))

If we wanted to continue to support this kind of usage, perhaps the safest option would be to change the test to “if type(data) is dict” rather than use isinstance().

I also left a comment about eliminating urlencode() in the test.

3d3643ef-0805-40b0-8106-5600a91a57f5 commented 8 years ago

The dorequest() method which is defined in AbstractHTTPHandler seems not cover the check at least for the first case Ezio brought up. unknown_open has been called and gives out a relatively confusing message.

3d3643ef-0805-40b0-8106-5600a91a57f5 commented 8 years ago

Will fix the other two issues. Thanks.

vadmium commented 8 years ago

I meant removing one of the checks in AbstractHTTPHandler.dorequest(). I left a note in the review at the relevant line.

One more thing I forgot to mention, although I am afraid it is 90% nit picking the dict iterable check. For the “http:” scheme, an iterable of “bytes-like” objects is supported (again not documented though). For HTTPS (over SSL), not every bytes-like object is supported, but common ones like bytearray() still work I think.

3d3643ef-0805-40b0-8106-5600a91a57f5 commented 8 years ago

Martin: Sorry for missing that line.

Under https, byte iterable seems has not been supported:
>>> r = Request('https://www.python.org', {b'post': 'data'}, {'Content-Length':10})
>>> urlopen(r)

... hanging here...

Meanwhile, I assumed bytearray works as expected.
>>> r = Request('https://www.python.org', bytearray('post=data', 'utf-8'))
>>> urlopen(r)
....
....
urllib.error.HTTPError: HTTP Error 403: FORBIDDEN

In *4.patch, I updated Request class doc to add these observations and fixed issues appeared in last patch.

vadmium commented 8 years ago

The bytes-like thing applies equally well to both the direct and iterable cases. Your first attempt hangs because the client only sends the four bytes in b"post", but the server is waiting for a total of 10 bytes.

The SSL problem doesn’t show up unless you use more obscure types like ctypes or array.array, where len() does not work as expected. Here is a demo:

>>> import ctypes
>>> byteslike = ctypes.c_char(b"x")
>>> urlopen("http://python.org/", data=byteslike)  # Succeeds
<http.client.HTTPResponse object at 0xb6bc876c>
>>> urlopen("https://python.org/", data=byteslike)
  File "/usr/lib/python3.4/ssl.py", line 720, in sendall
    amount = len(data)
TypeError: object of type 'c_char' has no len()

During handling of the above exception, another exception occurred:

TypeError: data should be a bytes-like object or an iterable, got \<class 'ctypes.c_char'>

vadmium commented 8 years ago

This illustrates my concern with the dict-of-bytes check:

>> headers = {"Content-Length": "3"} >> byteslike_iterable = {memoryview(b"123"): None} >> urlopen(Request("http://python.org/", headers=headers, data=byteslike_iterable))

With the current patch I think this would become an error. But maybe I am being too nitpicky and paranoid. Ezio?

ezio-melotti commented 8 years ago

But maybe I am being too nitpicky and paranoid. Ezio?

IIUC your last example used to work because technically it is an iterable of bytes, but the patch would give an error instead. Even in the unlikely case someone is relying on this behavior, simply adding .keys() not only should make the code work again, but also make it more explicit and understandable.

CuriousLearner commented 5 years ago

I'm working on applying the patch cleanly to master branch for this.

Also adding 3.8 and 3.7 as candidates for the bug.

CuriousLearner commented 5 years ago

PR is up for a review: https://github.com/python/cpython/pull/10616

encukou commented 6 months ago

The PR went unreviewed for years. Sorry for that.

As I said on the PR: I think the issue made much more sense when it was opened, Python 3 was new and mixing up str and bytes was much more common than today. Also, we now have generic tools to detect type errors early -- static type checkers. I think that today, it's best to close without action.

python / cpython

Add type checks to urllib.request.Request #69625