Open mohmad-null opened 1 month ago
Hmm, maybe I'm missing something, could you clarify? The reproducer is a little odd, in the sense that it accesses hostname when scheme was passed. scheme is not ignored, per the issue title:
from urllib.parse import urlparse
parsed_url = urlparse("www.example.com", scheme="https")
print(parsed_url.scheme) # https
Sure, thanks for checking. Yes, it passes scheme to ParseResult.scheme, but it doesn't use it in the parsing of the URL itself.
To flesh out my example: urlparse("www.example.com", scheme="http") is treated the exact same way as urlparse("www.example.com") for parsing the URL itself. It parses both as www.example.com, while the former should definitely be treated as http://www.example.com.
This means that hostname is empty, and path gets a value of www.example.com, both of which are wrong for http://www.example.com.
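For anyone following along, the behaviour described above can be checked with a short script. This is only a sketch of what current CPython does, not a statement about what it should do; the variable names are mine.

```python
from urllib.parse import urlparse

with_scheme = urlparse("www.example.com", scheme="http")
without_scheme = urlparse("www.example.com")

# The scheme argument only fills in the scheme field of the result.
print(with_scheme.scheme)    # http
print(with_scheme.hostname)  # None
print(with_scheme.path)      # www.example.com

# Every field after scheme is identical with and without the argument.
print(with_scheme[1:] == without_scheme[1:])  # True
```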
Thank you for the clarification! Looks like a bug. See this reproducer, for any triagers:
from urllib.parse import urlparse
o = urlparse("docs.python.org/")
print(o.path) # docs.python.org/, instead of /
With that being said, I think the issue title is wrong. I don't think this is really related to the scheme parameter, just invalid parsing of schemes.
I'm not saying you're wrong, but I always (mis?)understood that if no scheme was provided it just assumed it was a relative path rather than an absolute one.
I assumed this is why the scheme parameter existed, otherwise it seems odd for it to exist at all. None of the other components have parameters, and if all you want is to set a default scheme, you can do that with a simple if not parsed.scheme: parsed.scheme='http'.
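A small caveat on the one-liner above: ParseResult is a named tuple, so direct attribute assignment fails. A minimal sketch of the same idea using _replace follows; the example URL is made up for illustration.

```python
from urllib.parse import urlparse

parsed = urlparse("//www.example.com/index.html")

# ParseResult is a named tuple, so its fields cannot be assigned to.
try:
    parsed.scheme = "http"
except AttributeError:
    print("ParseResult fields are read-only")

# _replace returns a new ParseResult with the given field changed.
if not parsed.scheme:
    parsed = parsed._replace(scheme="http")

print(parsed.geturl())  # http://www.example.com/index.html
```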
Per the docs https://docs.python.org/3/library/urllib.parse.html#url-parsing:
Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
There's an example in there that shows your reproducer to be explicitly intentional.
The scheme argument gives the default addressing scheme, to be used only if the URL does not specify one.
I think your interpretation of this sentence from the docs is incorrect, and perhaps the docs could be improved to make this clearer.
iiuc, providing scheme to urlparse is not intended to be parsed as if scheme is prefixed to the URL, only as a default value for the scheme of the parsed URL if one is not present in the input URL:
>>> print(urlparse('//www.example.com', scheme='https'))
ParseResult(scheme='https', netloc='www.example.com', path='', params='', query='', fragment='')
If a scheme is explicitly present, it will win though:
>>> print(urlparse('https://www.example.com', scheme='http'))
ParseResult(scheme='https', netloc='www.example.com', path='', params='', query='', fragment='')
Yes, but that makes no sense.
Consider:
a) Why does scheme get a default and nothing else? There's certainly a case for a default port given no port raises an exception when parsed_url.port is called (nasty surprise that one!).
b) Why would a user even need a default value for scheme given how easy it is to set your own? (again: if not parsed_url.scheme: parsed_url.scheme='http')
c) To me at least it makes sense that the purpose of the scheme parameter is to be used with the URL, given the bit in the docs about RFC 1808 and non-scheme URLs being treated as relative (another nasty surprise!). The alternative is that the user will have to do some manual, possibly ugly pre-parsing of the URL to determine if it has a scheme in the first place, which would seem to defeat the point of urlparse.
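To make the "manual pre-parsing" in (c) concrete, here is a rough sketch of the kind of wrapper a user might write today. parse_host_first is a hypothetical helper, not part of urllib.parse, and it assumes that any input without a scheme or a leading // is meant to start with a host, which will not hold for genuinely relative paths.

```python
from urllib.parse import urlparse

def parse_host_first(url, default_scheme="http"):
    """Hypothetical helper: parse scheme-less input as if it began with a host."""
    parsed = urlparse(url)
    # Assumption: no scheme and no netloc means the user omitted both,
    # so add the '//' that urlparse needs to recognise a netloc.
    if not parsed.scheme and not parsed.netloc:
        url = "//" + url
    return urlparse(url, scheme=default_scheme)

print(parse_host_first("www.example.com/path"))
# ParseResult(scheme='http', netloc='www.example.com', path='/path', params='', query='', fragment='')

# An explicit scheme in the input still wins over the default.
print(parse_host_first("https://www.example.com").scheme)  # https
```

Whether urlparse itself should do something like this is exactly what is being debated here; the sketch only shows the workaround.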
I suspect the default scheme is there because the parsing behaviour depends on the scheme. Consider urlparse('path;param', scheme='file') vs scheme='' or scheme='https'. I believe the query and fragment parsing used to depend on the scheme as well, but not any more.
This does appear to be the case:
#1
urlparse('some.path;param')
> ParseResult(scheme='', netloc='', path='some.path', params='param', query='', fragment='')
#2
urlparse('some.path;param', scheme='https')
> ParseResult(scheme='https', netloc='', path='some.path', params='param', query='', fragment='')
#3
urlparse('some.path;param', scheme='file')
> ParseResult(scheme='file', netloc='', path='some.path;param', params='', query='', fragment='')
To me it logically follows that if scheme changes the handling of the param component when set to file, it should also affect the other components.
Continuing the example:
#4
urlparse('file://some.path;param')
ParseResult(scheme='file', netloc='some.path;param', path='', params='', query='', fragment='')
It seems extremely inconsistent to me that #3 and #4 come out with very different results. Personally I'd call that a bug, given both are meant to be using the same scheme and so should be parsed the same way.
Bug report
Bug description:
urlparse ignores the scheme parameter when determining what part of a URL is the path and hostname.
Should return:
Per the docs:
CPython versions tested on:
3.8, 3.11
Operating systems tested on:
Windows