python / cpython

The Python programming language
https://www.python.org
Other
63.52k stars 30.43k forks source link

urlparse.urlparse() parses invalid URI without generating an error (examples provided) #73027

Open 9619b491-afe8-42a1-9718-a53e85b466fe opened 7 years ago

9619b491-afe8-42a1-9718-a53e85b466fe commented 7 years ago
BPO 28841
Nosy @orsenthil, @ericvsmith, @bitdancer, @vadmium, @serhiy-storchaka, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'library', '3.9', '3.10', '3.11'] title = 'urlparse.urlparse() parses invalid URI without generating an error (examples provided)' updated_at = user = 'https://bugs.python.org/amrith' ``` bugs.python.org fields: ```python activity = actor = 'iritkatriel' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'amrith' dependencies = [] files = [] hgrepos = [] issue_num = 28841 keywords = [] message_count = 9.0 messages = ['282086', '282116', '282117', '282119', '282120', '282122', '282129', '282135', '408219'] nosy_count = 7.0 nosy_names = ['orsenthil', 'eric.smith', 'r.david.murray', 'martin.panter', 'serhiy.storchaka', 'amrith', 'iritkatriel'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue28841' versions = ['Python 3.9', 'Python 3.10', 'Python 3.11'] ```

9619b491-afe8-42a1-9718-a53e85b466fe commented 7 years ago

The urlparse library incorrectly accepts a URI which specifies an invalid host identifier.

Example:

http://www.example.com:/abc

Looking at the URI specifications, this is an invalid URI.

By definition, you are supposed to specify a "hostport" and a hostport is defined as:

https://www.w3.org/Addressing/URL/uri-spec.html

hostport host [ : port ]

The BNF indicates that : is only valid if a port is also specified.

See current behavior; I submit to you that this should generate an exception.

https://gist.github.com/anonymous/8504f160ff90649890b5a2a82f8028b0

ericvsmith commented 7 years ago

Can you please paste your code into a comment on this issue? Gist content has a habit of vanishing. Thanks!

9619b491-afe8-42a1-9718-a53e85b466fe commented 7 years ago

As requested ...

>>> urlparse.urlparse('http://www.google.com:/abc')
ParseResult(scheme='http', netloc='www.google.com:', path='/abc', params='', query='', fragment='')

I submit to you that this should generate an error.

vadmium commented 7 years ago

The Python documentation refers to RFC 3986, which allows an empty port string introduced by a colon, although it recommends against it: \https://tools.ietf.org/html/rfc3986#section-3.2.3\

The “port” subcomponent of “authority” is designated by an optional port number in decimal following the host and delimited from it by a single colon. . . . URI producers and normalizers should omit the port component and its “:” delimiter if “port” is empty . . .

What problem are you trying to solve by raising an error?

See also bpo-20059, where accessing the “port” field now raises ValueError for out-of-range values, and bpo-20271, discussing invalid URL handling more generally.

vadmium commented 7 years ago

Also, this is in direct contradiction to bpo-20270.

ericvsmith commented 7 years ago

And note that there are tests that explicitly check that the colon with no port works (via bpo-20270).

Given that this behavior has been around for a while, and is explicitly allowed, I would recommend against changing this to an error.

9619b491-afe8-42a1-9718-a53e85b466fe commented 7 years ago

Eric, Martin,

I'm sure this is an interpretation (and I'll be fair and give you both) that a URL of the form:

http://hostname.domain.tld:/path

should be held invalid. The rationale is per the section of the spec you quoted.

The other side of the argument is that the hostname and port are defined as:

hostname [ : port ]

where port is defined as

*DIGIT

This implies that 0 digits is also valid.

I submit to you that the ambiguity would be removed if they:

Absent that, I believe that the paragraph implies that the intent was 1*DIGIT.

And therefore a : with no following number should generate an error.

I raise this because the behavior of urlparse() (the way it does things now) is being cited as the reason why other routines in OpenStack should handle this when the reason we were getting URL's with a : and no port was in fact an oversight.

So, in hearing the rationalization as being that "urlparse() does it, so ..." I figured I'd come to the source and see if we could make urlparse() do things unambiguously.

Now, if the reason it does this is because someone who came before me made the argument that a : with no port should be accepted, I guess I'm out of luck.

bitdancer commented 7 years ago

Yes, I'm afraid so. But the rationale is presumably the same as it is in many RFCs: you should accept broken stuff if it can be interpreted unambiguously, but you should not *produce* the broken stuff. So I'd say the RFC is correct as quoted. It is then up to openstack whether or not it wants to accept such URLs in a given interface, and it if were me I'd base that decision on whether it was an 'internal' interface or an 'external' one. But you can also argue that even an external interface could be strict, depending on the application domain of the interface. urllib, on the other hand, needs to accept it, and we can't change it at this point for backward compatibility reasons if nothing else.

Based on the RFC, one could argue that urlunsplit should omit the colon. That could break backward compatibility, too, though will a lot less likelyhood. So we could still fix it in 3.7 if we decide we should.

iritkatriel commented 2 years ago

Reproduced on 3.11:

>>> urllib.parse.urlparse('http://www.google.com:/abc')
ParseResult(scheme='http', netloc='www.google.com:', path='/abc', params='', query='', fragment='')

The discussion above was mostly leaning towards won't fix, except for the suggestion to

Based on the RFC, one could argue that urlunsplit should omit the colon.