python / cpython

The Python programming language
https://www.python.org
Other
63.55k stars 30.45k forks source link

`urllib.request.OpenerDirector.add_handler()` matches too broadly #124098

Open perey opened 2 months ago

perey commented 2 months ago

Bug report

Bug description:

Issue

The add_handler() method of urllib.request.OpenerDirector matches too broadly, and generates spurious handler entries as a result.

Specifically, any method name that starts with zero or one underscore, followed by error, open, request, or response, and that isn't in the small list of exclusions, will be treated as a handler method.

Minimal reproducible example

>>> import urllib.request
>>> class BogusHandler(urllib.request.BaseHandler):
...     def _open(self):
...         pass
...     def open(self):
...         pass
...     def error(self):
...         pass
... 
>>> opdir = urllib.request.OpenerDirector()
>>> opdir.add_handler(BogusHandler())
>>> opdir.handle_open
{'': [<__main__.BogusHandler...>], 'ope': [<__main__.BogusHandler...>]}
>>> opdir.handle_error
{'erro': {'error': [<__main__.BogusHandler...>]}}

Cause

https://github.com/python/cpython/blob/401fff7423ca3c8bf1d02e594edfd1412616a559/Lib/urllib/request.py#L419-L421

In the normal case, this code splits a method name like http_open on the first underscore character, giving protocol == 'http' and condition == 'open'.

If there is a leading underscore character, this produces a false match with an empty protocol. For example, a method called _open will give protocol == '' and condition == 'open'.

If there is no underscore character, then meth.find('_') == -1, and weird results happen. In this case, condition contains the entire method name, and protocol contains all but the last character. For example, a method called open will give protocol == 'ope' and condition == 'open'.

The subsequent lines branch on the content of condition, with only error*, open, request, and response treated as handler methods.

Impact

Low. It is probably impossible that a request would have the empty string as its protocol, and unlikely that it would be ope, reques, respons, or erro. If one of these did happen, the spurious match wouldn't be called: OpenerDirector would go looking for (say) ope_open, probably not find it, and raise an AttributeError. (It's for this reason that I'm positive this is a bug, and not an undocumented feature where open, etc. are meant to be registered as handlers.)

Spurious entries are added to the internal members handle_open, handle_error, process_response, process_request, and (especially) the catch-all handler, using bisect to perform sort-preserving insertions. This is unnecessary busywork, but I have no data on any actual performance impact.

Solution

It suffices to add a check like so, immediately after meth.find("_"):

if i < 1:
    continue

This catches both a leading underscore (i == 0) and no underscore (i == -1). I've forked the repo and made such a change myself, but submitting a pull request is a big step up from filing an issue, and it expects an issue to be filed first in any case, so… here it is! :smile:

Related issues

I can only see one relevant existing issue, #61924. This was a withdrawn proposal to clean up the add_handler() code. It was more far-reaching than the solution above, but (on my inspection at least) it would've fixed this bug. At the time, @bitdancer commented, "I'll try to keep this patch in mind if we have occasion to fix anything in that code".

The code to extract a specific type of error behaves similarly:

https://github.com/python/cpython/blob/401fff7423ca3c8bf1d02e594edfd1412616a559/Lib/urllib/request.py#L424-L425

However, I'm not positive that it's an error for it to give (for example) kind == 'error' when the method name is *_error (or _error or error), so I haven't explored further.

CPython versions tested on:

3.12

Operating systems tested on:

No response

gpshead commented 2 months ago

Thanks for the bug report and writeup. Care to make a PR? It'd also be good to add a regression test (turn a small version of your example code in this issue into a test case and assert that the dictionary doesn't contain the bogus entries perhaps).