Open 344b0e78-fc78-45f0-a0db-3069a4f4db45 opened 3 years ago
urlparse
mishandles certain uses of extra slash or backslash(such as https:/// , https:/, https:) and interprets the URI as a relative path.
A userland logic implementation that bases its decision on the urlparse() function may introduce a security vulnerability due to the unexpected returned values of the function. These vulnerabilities may manifest as an SSRF, Open Redirect, and other types of vulnerabilities related to incorrectly trusting a URL.
from urllib.parse import urlparse
url1=urlparse('https://www.attacker.com/a/b')
url2=urlparse('https:///www.attacker.com/a/b')
url3=urlparse('https:/www.attacker.com/a/b')
url4=urlparse('https:\www.attacker.com/a/b')
print("Normal behaviour: HOSTNAME should be in netloc\n")
print(url1)
print("\nMishandling hostname and returning it as path\n")
print(url2)
print(url3)
print(url4)
OUTPUT:
Normal behaviour: HOSTNAME should be in netloc
ParseResult(scheme='https', netloc='www.attacker.com', path='/a/b', params='', query='', fragment='')
Mishandling hostname and returning it as path
ParseResult(scheme='https', netloc='', path='/www.attacker.com/a/b', params='', query='', fragment='')
ParseResult(scheme='https', netloc='', path='/www.attacker.com/a/b', params='', query='', fragment='')
ParseResult(scheme='https', netloc='', path='\\www.attacker.com/a/b', params='', query='', fragment='')
I don't know if urlparse is actually "mishandling" these URLs.
Looking over RFC 1808 (https://datatracker.ietf.org/doc/html/rfc1808.html) the BNF (https://datatracker.ietf.org/doc/html/rfc1808.html#section-2.2) seems to support what urlparse is reporting, at least for the first two examples.
I'll try to break down each scenario. Maybe you can help me understand what you expect it to report?
https : // /www.attacker.com
^ ^ ^
scheme Net loc Not a valid
Delimeter netloc character, but
is a valid abs_path beginning
(according to the spec net_loc
is allowed to be empty)
https : /www.attacker.com/a/b ^ ^ scheme valid abs_path
https : \ www.attacker.com/a/b ^ ^ scheme This isn't actually matched anywhere in the BNF, so if anything maybe a value error should have been raised?
Some other examples to test this behaviour: urlparse('https:/\/\/\www.attacker.com/a/b') urlparse('https:/\www.attacker.com/a/b')
## Comparing it to other languages/runtimes
How do other languages and their runtimes work with URL parsing functions?
Here's Node.js, also showing that it is missing the host
and hostname
, with a similar behavior to the currently reported "buggy" python urlparse()
one:
node
>require("url").parse("https:/\/\/\www.attacker.com/a/b");
Will return
Url {
protocol: 'https:',
slashes: true,
auth: null,
host: '',
port: null,
hostname: '',
hash: null,
search: null,
query: null,
pathname: '/www.attacker.com/a/b',
path: '/www.attacker.com/a/b',
href: 'https:///www.attacker.com/a/b'
}
But it is already documented that using Node.js url.parse can lead to security issues: https://nodejs.org/dist/latest-v16.x/docs/api/url.html#url_url_parse_urlstring_parsequerystring_slashesdenotehost
Use of the legacy url.parse() method is discouraged. Users should use the WHATWG URL API. Because the url.parse() method uses a lenient, non-standard algorithm for parsing URL strings, security issues can be introduced. Specifically, issues with host name spoofing and incorrect handling of usernames and passwords have been identified.
Here's Ruby, also showing that it is missing the host
and hostname
, with a similar behavior to the currently reported "buggy" python urlparse()
one:
irb(main):001:0> require 'uri'
=> false
irb(main):002:0> uri = URI.parse('https:/www.attacker.com/a/b')
=> #<URI::HTTPS https:/www.attacker.com/a/b>
irb(main):003:0> uri.host
=> nil
irb(main):004:0> uri.hostname
=> nil
irb(main):005:0> uri.scheme
=> "https"
irb(main):006:0> uri.path
=> "/www.attacker.com/a/b"
That said, it seems that Ruby throws on other permutations of the bad URL, which python does not. For example:
irb(main):011:0> other_uri = URI.parse('https:/\/\/\www.attacker.com/a/b')
Traceback (most recent call last):
8: from /usr/bin/irb:23:in `<main>'
7: from /usr/bin/irb:23:in `load'
6: from /Library/Ruby/Gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
5: from (irb):11
4: from (irb):11:in `rescue in irb_binding'
3: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/common.rb:234:in `parse'
2: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:73:in `parse'
1: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:67:in `split'
URI::InvalidURIError (bad URI(is not URI?): "https:/\\/\\/\\www.attacker.com/a/b")
Same for this other URI, which Ruby does not accept (unlike python, which does accept it and returns with a missing host and hostname properties as evident earlier in this report):
irb(main):012:0> other_uri = URI.parse('https:/\www.attacker.com/a/b')
Traceback (most recent call last):
8: from /usr/bin/irb:23:in `<main>'
7: from /usr/bin/irb:23:in `load'
6: from /Library/Ruby/Gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
5: from (irb):12
4: from (irb):12:in `rescue in irb_binding'
3: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/common.rb:234:in `parse'
2: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:73:in `parse'
1: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:67:in `split'
URI::InvalidURIError (bad URI(is not URI?): "https:/\\www.attacker.com/a/b")
Let's look at PHP. PHP's parse_url() function behaves much like python, where it misses to identify the host property for all 3 examples provided in this report:
❯ php -a
Interactive shell
php > var_dump(parse_url('https:/\www.attacker.com/a/b'));
array(2) {
["scheme"]=>
string(5) "https"
["path"]=>
string(22) "/\www.attacker.com/a/b"
}
php > var_dump(parse_url('https:/www.attacker.com/a/b'));
array(2) {
["scheme"]=>
string(5) "https"
["path"]=>
string(21) "/www.attacker.com/a/b"
}
php > var_dump(parse_url('https:/\/\/\www.attacker.com/a/b'));
array(2) {
["scheme"]=>
string(5) "https"
["path"]=>
string(26) "/\/\/\www.attacker.com/a/b"
}
php > var_dump(parse_url('https://www.attacker.com/a/b'));
array(3) {
["scheme"]=>
string(5) "https"
["host"]=>
string(16) "www.attacker.com"
["path"]=>
string(4) "/a/b"
}
The applicability of this vulnerability It seems that, there's no direct way of manipulating a python runtime into a severe impact simply by sending it a malformed URL. However, a userland logic implementation that bases its decision on the python urlparse() function may introduce a security vulnerability due to the unexpected returned values of the function. These vulnerabilities may manifest as an SSRF, Open Redirect and other types of vulnerabilities related to incorrectly trusting a URL.
Node.js is recommending using WHATWG URL API. Which handles all these correctly. We can test the same using https://jsdom.github.io/whatwg-url/
For example test the below and will return the same(correct) for all. https:///www.attacker.com/a/b https:/www.attacker.com/a/b https:\www.attacker.com/a/b https:/\/\/\www.attacker.com/a/b https:/\www.attacker.com/a/b
href https://www.attacker.com/a/b
protocol https:
username (empty string)
password (empty string)
port (empty string)
hostname www.attacker.com
pathname /a/b
search (empty string)
hash (empty string)
SUMMARY: python urlparse() function should also handle all the above in the same way.
If I've understood what you've written correctly, what you want is to change urlparse to use the WHATWG URL Standard (https://url.spec.whatwg.org/).
I'm not a committer or anything, but that seems like a large API change and is not likely to happen quickly if at all. I'm not sure what the path forward here is. There are many options, creating a new function/module, slowly deprecating the current urlparse function, or just say it's not our problem. Hopefully someone else can weigh in.
In the meantime, a project like https://pypi.org/project/whatwg-url/ claims to implement the WHATWG URL standard.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['type-security', '3.8', '3.9', '3.10', '3.11', '3.7', 'library']
title = '[security] Open redirect attack due to insufficient validation in Urlparse'
updated_at =
user = 'https://github.com/ready-research'
```
bugs.python.org fields:
```python
activity =
actor = 'terry.reedy'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'ready-research'
dependencies = []
files = []
hgrepos = []
issue_num = 44744
keywords = []
message_count = 5.0
messages = ['398232', '398273', '398274', '398276', '398295']
nosy_count = 3.0
nosy_names = ['lys.nikolaou', 'loganasherjones', 'ready-research']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'test needed'
status = 'open'
superseder = None
type = 'security'
url = 'https://bugs.python.org/issue44744'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']
```