python / cpython

The Python programming language
https://www.python.org
Other
63.48k stars 30.4k forks source link

Improper Input Validation in urlparse #91026

Open 0cb61241-3555-4d74-b141-a27d0b92cdce opened 2 years ago

0cb61241-3555-4d74-b141-a27d0b92cdce commented 2 years ago
BPO 46870
Nosy @orsenthil, @P0cas, @lafolle

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-security', '3.9'] title = 'Improper Input Validation in urlparse' updated_at = user = 'https://github.com/P0cas' ``` bugs.python.org fields: ```python activity = actor = 'P0cas' assignee = 'none' closed = False closed_date = None closer = None components = [] creation = creator = 'P0cas' dependencies = [] files = [] hgrepos = [] issue_num = 46870 keywords = [] message_count = 4.0 messages = ['414132', '414133', '414171', '414345'] nosy_count = 3.0 nosy_names = ['orsenthil', 'P0cas', 'karanchaudhary'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'security' url = 'https://bugs.python.org/issue46870' versions = ['Python 3.9'] ```

0cb61241-3555-4d74-b141-a27d0b92cdce commented 2 years ago

If http:@localhost url is entered as an argument value of the urlpasre() function, the parser cannot parse it properly. Since http:@localhost is a valid URL, the character after the @ character must be parsed as a hostname.

Python 3.9.10 (main, Jan 15 2022, 11:48:04)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlparse
>>> print(urlparse('http:@localhost'))
ParseResult(scheme='http', netloc='', path='@localhost', params='', query='', fragment='')
>>>
0cb61241-3555-4d74-b141-a27d0b92cdce commented 2 years ago
>>> print(urlparse('https:\\google.com'))
ParseResult(scheme='https', netloc='', path='\\google.com', params='', query='', fragment='')
>>> print(urlparse('https://google.com@localhost'))
ParseResult(scheme='https', netloc='google.com@localhost', path='', params='', query='', fragment='')
>>>

Perhaps this parser is not able to parse the URL normally.

3ac729c8-59bf-4488-9724-f05e7b664db5 commented 2 years ago

Here are the results from other languages. Go parses incorrectly at the same time rust does it correctly.

Go- https://go.dev/play/p/nNMhyznuGpn &url.URL{Scheme:"http", Opaque:"@localhost", User:(*url.Userinfo)(nil), Host:"", Path:"", RawPath:"", ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}

Rust- https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=92681b56f7cbd62b7735c962a2f5321e Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("localhost")), port: None, path: "/", query: None, fragment: None }

0cb61241-3555-4d74-b141-a27d0b92cdce commented 2 years ago

Nice Check. So what do you think about this issue? I want to hear your opinions.

vadmium commented 6 months ago

I believe http:@localhost is a valid URL according to RFC 3986. But the at character (@) is part of the path component. There is no hostname in this URL, because there is no double slash (//) introducing an authority component. ~Similarly, for the case with the backslashes (\\). Both at (@) and backslash (\) characters are allowed in path segments in RFC 3986.~

For the google.com@localhost case, the google.com part is the userinfo component.

~In all three cases I believe urlparse is correct, so this can be closed.~

Edit: For the two cases involving at (@) symbols, urlparse is correct. For the case involving backslashes (\\), the URL is not valid.

I misinterpreted reserved and unreserved in the RFC. Here are the rules:

  path-rootless = segment-nz *( "/" segment )
  segment-nz    = 1*pchar
  pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"