Open 9619b491-afe8-42a1-9718-a53e85b466fe opened 7 years ago
The urlparse library incorrectly accepts a URI which specifies an invalid host identifier.
Example:
Looking at the URI specifications, this is an invalid URI.
By definition, you are supposed to specify a "hostport" and a hostport is defined as:
https://www.w3.org/Addressing/URL/uri-spec.html
hostport host [ : port ]
The BNF indicates that : is only valid if a port is also specified.
See current behavior; I submit to you that this should generate an exception.
https://gist.github.com/anonymous/8504f160ff90649890b5a2a82f8028b0
Can you please paste your code into a comment on this issue? Gist content has a habit of vanishing. Thanks!
As requested ...
>>> urlparse.urlparse('http://www.google.com:/abc')
ParseResult(scheme='http', netloc='www.google.com:', path='/abc', params='', query='', fragment='')
I submit to you that this should generate an error.
The Python documentation refers to RFC 3986, which allows an empty port string introduced by a colon, although it recommends against it: \https://tools.ietf.org/html/rfc3986#section-3.2.3\
The “port” subcomponent of “authority” is designated by an optional port number in decimal following the host and delimited from it by a single colon. . . . URI producers and normalizers should omit the port component and its “:” delimiter if “port” is empty . . .
What problem are you trying to solve by raising an error?
See also bpo-20059, where accessing the “port” field now raises ValueError for out-of-range values, and bpo-20271, discussing invalid URL handling more generally.
Also, this is in direct contradiction to bpo-20270.
And note that there are tests that explicitly check that the colon with no port works (via bpo-20270).
Given that this behavior has been around for a while, and is explicitly allowed, I would recommend against changing this to an error.
Eric, Martin,
I'm sure this is an interpretation (and I'll be fair and give you both) that a URL of the form:
http://hostname.domain.tld:/path
should be held invalid. The rationale is per the section of the spec you quoted.
The other side of the argument is that the hostname and port are defined as:
hostname [ : port ]
where port is defined as
*DIGIT
This implies that 0 digits is also valid.
I submit to you that the ambiguity would be removed if they:
Absent that, I believe that the paragraph implies that the intent was 1*DIGIT.
And therefore a : with no following number should generate an error.
I raise this because the behavior of urlparse() (the way it does things now) is being cited as the reason why other routines in OpenStack should handle this when the reason we were getting URL's with a : and no port was in fact an oversight.
So, in hearing the rationalization as being that "urlparse() does it, so ..." I figured I'd come to the source and see if we could make urlparse() do things unambiguously.
Now, if the reason it does this is because someone who came before me made the argument that a : with no port should be accepted, I guess I'm out of luck.
Yes, I'm afraid so. But the rationale is presumably the same as it is in many RFCs: you should accept broken stuff if it can be interpreted unambiguously, but you should not *produce* the broken stuff. So I'd say the RFC is correct as quoted. It is then up to openstack whether or not it wants to accept such URLs in a given interface, and it if were me I'd base that decision on whether it was an 'internal' interface or an 'external' one. But you can also argue that even an external interface could be strict, depending on the application domain of the interface. urllib, on the other hand, needs to accept it, and we can't change it at this point for backward compatibility reasons if nothing else.
Based on the RFC, one could argue that urlunsplit should omit the colon. That could break backward compatibility, too, though will a lot less likelyhood. So we could still fix it in 3.7 if we decide we should.
Reproduced on 3.11:
>>> urllib.parse.urlparse('http://www.google.com:/abc')
ParseResult(scheme='http', netloc='www.google.com:', path='/abc', params='', query='', fragment='')
The discussion above was mostly leaning towards won't fix, except for the suggestion to
Based on the RFC, one could argue that urlunsplit should omit the colon.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
title = 'urlparse.urlparse() parses invalid URI without generating an error (examples provided)'
updated_at =
user = 'https://bugs.python.org/amrith'
```
bugs.python.org fields:
```python
activity =
actor = 'iritkatriel'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'amrith'
dependencies = []
files = []
hgrepos = []
issue_num = 28841
keywords = []
message_count = 9.0
messages = ['282086', '282116', '282117', '282119', '282120', '282122', '282129', '282135', '408219']
nosy_count = 7.0
nosy_names = ['orsenthil', 'eric.smith', 'r.david.murray', 'martin.panter', 'serhiy.storchaka', 'amrith', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue28841'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```