python / cpython

The Python programming language
https://www.python.org

`urlparse` ignores the `scheme` parameter when parsing a URL #122565

Open mohmad-null opened 1 month ago

mohmad-null commented 1 month ago

Bug report

Bug description:

urlparse ignores the scheme parameter when determining what part of a URL is the path and hostname.

from urllib.parse import urlparse
# This should be parsed as: http://www.example.com
parsed_url = urlparse('www.example.com', scheme='http')
print(parsed_url)
print(parsed_url.hostname)

ParseResult(scheme='http', netloc='', path='www.example.com', params='', query='', fragment='')
None

Should return:

ParseResult(scheme='http', netloc='www.example.com', path='', params='', query='', fragment='')
www.example.com

Per the docs:

The scheme argument gives the default addressing scheme, to be used only if the URL does not specify one.

CPython versions tested on:

3.8, 3.11

Operating systems tested on:

Windows

ZeroIntensity commented 1 month ago

Hmm, maybe I'm missing something, could you clarify? The reproducer is a little odd, in the sense that it accesses hostname, when scheme was passed. scheme is not ignored, per the issue title:

from urllib.parse import urlparse

parsed_url = urlparse("www.example.com", scheme="https")
print(parsed_url.scheme)  # https

mohmad-null commented 1 month ago

Sure, thanks for checking. Yes, it passes scheme through to ParseResult.scheme, but it doesn't use it when parsing the URL itself.

To flesh out my example: urlparse("www.example.com", scheme="http") is treated exactly the same way as urlparse("www.example.com") when parsing the URL itself. Both are parsed as www.example.com, while the former should definitely be treated as http://www.example.com. This means that hostname is empty and path gets a value of www.example.com, both of which are wrong for http://www.example.com.
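The comparison described above can be checked directly. This is a minimal sketch using only urllib.parse; the variable names are illustrative and not part of the original report.

```python
from urllib.parse import urlparse

bare = urlparse("www.example.com")
with_default = urlparse("www.example.com", scheme="http")

# Apart from the scheme field, the two results are identical:
# the host name ends up in path, and netloc stays empty.
print(bare._replace(scheme="http") == with_default)  # True
print(repr(with_default.netloc), repr(with_default.path))

# With an empty netloc, the hostname property is None.
print(with_default.hostname)
```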

ZeroIntensity commented 1 month ago

Thank you for the clarification! Looks like a bug. See this reproducer, for any triagers:

from urllib.parse import urlparse

o = urlparse("docs.python.org/")
print(o.path)  # docs.python.org/, instead of /

ZeroIntensity commented 1 month ago

With that being said, I think the issue title is wrong. I don't think this is really related to the scheme parameter, just invalid parsing of schemes.

mohmad-null commented 1 month ago

With that being said, I think the issue title is wrong. I don't think this is really related to the scheme parameter, just invalid parsing of schemes.

I'm not saying you're wrong, but I always (mis?)understood that if no scheme was provided, it just assumed the URL was a relative path rather than an absolute one. I assumed this is why the scheme parameter existed; otherwise it seems odd for it to exist at all. None of the other components have parameters, and if all you want is to set a default scheme, you can do that with a simple if not parsed.scheme: parsed.scheme='http'.
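One caveat on the one-liner above: ParseResult is a named tuple, so assigning to parsed.scheme raises AttributeError. A minimal sketch of the same idea using the named tuple's _replace method follows; the input URL is an arbitrary example, not taken from the thread.

```python
from urllib.parse import urlparse

parsed = urlparse("//www.example.com/index.html")

# Direct assignment fails because ParseResult fields are read-only.
try:
    parsed.scheme = "http"
    assignment_failed = False
except AttributeError:
    assignment_failed = True
print(assignment_failed)  # True

# _replace returns a new ParseResult with the given field changed.
if not parsed.scheme:
    parsed = parsed._replace(scheme="http")
print(parsed.geturl())  # http://www.example.com/index.html
```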

Per the docs https://docs.python.org/3/library/urllib.parse.html#url-parsing:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

There's an example in there that shows your reproducer to be explicitly intentional.
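To illustrate the documented rule, here is a small sketch showing that netloc is only populated when the input introduces it with '//'. The three URLs are made-up examples chosen for this comparison.

```python
from urllib.parse import urlparse

candidates = [
    "www.example.com/index.html",         # no '//': everything is path
    "//www.example.com/index.html",       # '//' introduces the netloc
    "http://www.example.com/index.html",  # scheme plus '//'
]

results = {url: urlparse(url) for url in candidates}
for url, parts in results.items():
    print(f"{url!r}: netloc={parts.netloc!r} path={parts.path!r}")
```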

itamaro commented 1 month ago

The scheme argument gives the default addressing scheme, to be used only if the URL does not specify one.

I think your interpretation of this sentence from the docs is incorrect, and perhaps the docs could be improved to make this clearer.

iiuc, providing scheme to urlparse is not intended to be parsed as if scheme is prefixed to the URL, only as a default value for the scheme of the parsed URL if one is not present in the input URL:

>>> print(urlparse('//www.example.com', scheme='https'))
ParseResult(scheme='https', netloc='www.example.com', path='', params='', query='', fragment='')

if a scheme is explicitly present, it will win though:

>>> print(urlparse('https://www.example.com', scheme='http'))
ParseResult(scheme='https', netloc='www.example.com', path='', params='', query='', fragment='')

mohmad-null commented 1 month ago

iiuc, providing scheme to urlparse is not intended to be parsed as if scheme is prefixed to the URL, only as a default value for the scheme of the parsed URL if one is not present in the input URL:

Yes, but that makes no sense. Consider:

a) Why does scheme get a default and nothing else? There's certainly a case for a default port given no port raises an exception when parsed_url.port is called (nasty surprise that one!).

b) Why would a user even need a default value for scheme given how easy it is to set your own? (again: if not parsed_url.scheme: parsed_url.scheme='http')

c) To me at least it makes sense that the purpose of the scheme parameter is to be used with the URL, given the bit in the docs about RFC 1808 and non-scheme URLs being treated as relative (another nasty surprise!). The alternative is that the user will have to do some manual, possibly ugly pre-parsing of the URL to determine if it has a scheme in the first place, which would seem to defeat the point of urlparse.
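For reference, the kind of manual pre-parsing mentioned in (c) might look like the sketch below. The helper name parse_with_default_scheme and its default_scheme parameter are hypothetical and not part of urllib.parse. It assumes the input is known to start with a host name when it has no scheme; a genuinely relative path would be misread as a host, and inputs with a colon before the first slash (such as a bare host:port) may not be handled as hoped.

```python
from urllib.parse import urlparse, urlsplit


def parse_with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: treat a scheme-less, slash-less URL as host-first."""
    parts = urlsplit(url)
    # Assumption: with no scheme and no netloc, the text begins with a host,
    # so prefix '//' to make urlparse recognise it as the netloc.
    if not parts.scheme and not parts.netloc:
        url = "//" + url
    return urlparse(url, scheme=default_scheme)


print(parse_with_default_scheme("www.example.com"))
print(parse_with_default_scheme("https://www.example.com"))
```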

vadmium commented 1 month ago

I suspect the default scheme is there because the parsing behaviour depends on the scheme. Consider urlparse('path;param', scheme='file') vs scheme='' or scheme='https'. I believe the query and fragment parsing used to depend on the scheme as well, but not any more.

mohmad-null commented 1 month ago

I suspect the default scheme is there because the parsing behaviour depends on the scheme. Consider urlparse('path;param', scheme='file') vs scheme='' or scheme='https'. I believe the query and fragment parsing used to depend on the scheme as well, but not any more.

This does appear to be the case:

#1
urlparse('some.path;param')
> ParseResult(scheme='', netloc='', path='some.path', params='param', query='', fragment='')

#2
urlparse('some.path;param', scheme='https')
> ParseResult(scheme='https', netloc='', path='some.path', params='param', query='', fragment='')

#3
urlparse('some.path;param', scheme='file')
> ParseResult(scheme='file', netloc='', path='some.path;param', params='', query='', fragment='')

To me it logically follows that if scheme changes the handling of the params component when set to file, it should also affect the other components.

Continuing the example:

#4
urlparse('file://some.path;param')
> ParseResult(scheme='file', netloc='some.path;param', path='', params='', query='', fragment='')

It seems extremely inconsistent to me that #3 and #4 come out with very different results. Personally I'd call that a bug, given that both are meant to be using the same scheme and so should be parsed the same way.
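As a point of comparison, the sketch below suggests that #3 lines up with the input 'file:some.path;param' (scheme but no '//') rather than with #4, where '//' introduces a netloc. The variable names are illustrative only.

```python
from urllib.parse import urlparse

with_default = urlparse("some.path;param", scheme="file")  # example #3
no_slashes = urlparse("file:some.path;param")              # scheme, no '//'
with_slashes = urlparse("file://some.path;param")          # example #4

# Example #3 matches the scheme-without-slashes form exactly.
print(with_default == no_slashes)  # True

# In example #4 the '//' moves the text into netloc and leaves path empty.
print(repr(with_slashes.netloc), repr(with_slashes.path))
```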