[security] Open redirect attack due to insufficient validation in Urlparse

344b0e78-fc78-45f0-a0db-3069a4f4db45 commented 3 years ago

BPO	44744
Nosy	@lysnikolaou, @loganasherjones, @ready-research

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-security', '3.8', '3.9', '3.10', '3.11', '3.7', 'library'] title = '[security] Open redirect attack due to insufficient validation in Urlparse' updated_at = user = 'https://github.com/ready-research' ``` bugs.python.org fields: ```python activity = actor = 'terry.reedy' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'ready-research' dependencies = [] files = [] hgrepos = [] issue_num = 44744 keywords = [] message_count = 5.0 messages = ['398232', '398273', '398274', '398276', '398295'] nosy_count = 3.0 nosy_names = ['lys.nikolaou', 'loganasherjones', 'ready-research'] pr_nums = [] priority = 'normal' resolution = None stage = 'test needed' status = 'open' superseder = None type = 'security' url = 'https://bugs.python.org/issue44744' versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11'] ```

344b0e78-fc78-45f0-a0db-3069a4f4db45 commented 3 years ago

urlparse mishandles certain uses of extra slash or backslash(such as https:/// , https:/, https:) and interprets the URI as a relative path.

A userland logic implementation that bases its decision on the urlparse() function may introduce a security vulnerability due to the unexpected returned values of the function. These vulnerabilities may manifest as an SSRF, Open Redirect, and other types of vulnerabilities related to incorrectly trusting a URL.

from urllib.parse import urlparse
url1=urlparse('https://www.attacker.com/a/b')
url2=urlparse('https:///www.attacker.com/a/b')
url3=urlparse('https:/www.attacker.com/a/b')
url4=urlparse('https:\www.attacker.com/a/b')
print("Normal behaviour: HOSTNAME should be in netloc\n")
print(url1)
print("\nMishandling hostname and returning it as path\n")
print(url2)
print(url3)
print(url4)

OUTPUT:

Normal behaviour: HOSTNAME should be in netloc

ParseResult(scheme='https', netloc='www.attacker.com', path='/a/b', params='', query='', fragment='')

Mishandling hostname and returning it as path

ParseResult(scheme='https', netloc='', path='/www.attacker.com/a/b', params='', query='', fragment='')
ParseResult(scheme='https', netloc='', path='/www.attacker.com/a/b', params='', query='', fragment='')
ParseResult(scheme='https', netloc='', path='\\www.attacker.com/a/b', params='', query='', fragment='')

11f658f1-2e5a-4a63-890b-e5942ace6e73 commented 3 years ago

I don't know if urlparse is actually "mishandling" these URLs.

Looking over RFC 1808 (https://datatracker.ietf.org/doc/html/rfc1808.html) the BNF (https://datatracker.ietf.org/doc/html/rfc1808.html#section-2.2) seems to support what urlparse is reporting, at least for the first two examples.

I'll try to break down each scenario. Maybe you can help me understand what you expect it to report?

https : // /www.attacker.com ^ ^ ^ scheme Net loc Not a valid
Delimeter netloc character, but is a valid abs_path beginning (according to the spec net_loc is allowed to be empty)

https : /www.attacker.com/a/b ^ ^ scheme valid abs_path

https : \ www.attacker.com/a/b ^ ^ scheme This isn't actually matched anywhere in the BNF, so if anything maybe a value error should have been raised?

344b0e78-fc78-45f0-a0db-3069a4f4db45 commented 3 years ago

Some other examples to test this behaviour: urlparse('https:/\/\/\www.attacker.com/a/b') urlparse('https:/\www.attacker.com/a/b')

## Comparing it to other languages/runtimes

How do other languages and their runtimes work with URL parsing functions?

Here's Node.js, also showing that it is missing the host and hostname, with a similar behavior to the currently reported "buggy" python urlparse() one:

node
>require("url").parse("https:/\/\/\www.attacker.com/a/b");

Will return

Url {
  protocol: 'https:',
  slashes: true,
  auth: null,
  host: '',
  port: null,
  hostname: '',
  hash: null,
  search: null,
  query: null,
  pathname: '/www.attacker.com/a/b',
  path: '/www.attacker.com/a/b',
  href: 'https:///www.attacker.com/a/b'
}

But it is already documented that using Node.js url.parse can lead to security issues: https://nodejs.org/dist/latest-v16.x/docs/api/url.html#url_url_parse_urlstring_parsequerystring_slashesdenotehost Use of the legacy url.parse() method is discouraged. Users should use the WHATWG URL API. Because the url.parse() method uses a lenient, non-standard algorithm for parsing URL strings, security issues can be introduced. Specifically, issues with host name spoofing and incorrect handling of usernames and passwords have been identified.

Here's Ruby, also showing that it is missing the host and hostname, with a similar behavior to the currently reported "buggy" python urlparse() one:

irb(main):001:0> require 'uri'
=> false
irb(main):002:0> uri = URI.parse('https:/www.attacker.com/a/b')
=> #<URI::HTTPS https:/www.attacker.com/a/b>
irb(main):003:0> uri.host
=> nil
irb(main):004:0> uri.hostname
=> nil
irb(main):005:0> uri.scheme
=> "https"
irb(main):006:0> uri.path
=> "/www.attacker.com/a/b"

That said, it seems that Ruby throws on other permutations of the bad URL, which python does not. For example:

irb(main):011:0> other_uri = URI.parse('https:/\/\/\www.attacker.com/a/b')
Traceback (most recent call last):
        8: from /usr/bin/irb:23:in `<main>'
        7: from /usr/bin/irb:23:in `load'
        6: from /Library/Ruby/Gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        5: from (irb):11
        4: from (irb):11:in `rescue in irb_binding'
        3: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/common.rb:234:in `parse'
        2: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:73:in `parse'
        1: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:67:in `split'
URI::InvalidURIError (bad URI(is not URI?): "https:/\\/\\/\\www.attacker.com/a/b")

Same for this other URI, which Ruby does not accept (unlike python, which does accept it and returns with a missing host and hostname properties as evident earlier in this report):

irb(main):012:0> other_uri = URI.parse('https:/\www.attacker.com/a/b')
Traceback (most recent call last):
        8: from /usr/bin/irb:23:in `<main>'
        7: from /usr/bin/irb:23:in `load'
        6: from /Library/Ruby/Gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        5: from (irb):12
        4: from (irb):12:in `rescue in irb_binding'
        3: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/common.rb:234:in `parse'
        2: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:73:in `parse'
        1: from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/uri/rfc3986_parser.rb:67:in `split'
URI::InvalidURIError (bad URI(is not URI?): "https:/\\www.attacker.com/a/b")

Let's look at PHP. PHP's parse_url() function behaves much like python, where it misses to identify the host property for all 3 examples provided in this report:

❯ php -a
Interactive shell

php > var_dump(parse_url('https:/\www.attacker.com/a/b'));
array(2) {
  ["scheme"]=>
  string(5) "https"
  ["path"]=>
  string(22) "/\www.attacker.com/a/b"
}
php > var_dump(parse_url('https:/www.attacker.com/a/b'));
array(2) {
  ["scheme"]=>
  string(5) "https"
  ["path"]=>
  string(21) "/www.attacker.com/a/b"
}
php > var_dump(parse_url('https:/\/\/\www.attacker.com/a/b'));
array(2) {
  ["scheme"]=>
  string(5) "https"
  ["path"]=>
  string(26) "/\/\/\www.attacker.com/a/b"
}
php > var_dump(parse_url('https://www.attacker.com/a/b'));
array(3) {
  ["scheme"]=>
  string(5) "https"
  ["host"]=>
  string(16) "www.attacker.com"
  ["path"]=>
  string(4) "/a/b"
}

The applicability of this vulnerability It seems that, there's no direct way of manipulating a python runtime into a severe impact simply by sending it a malformed URL. However, a userland logic implementation that bases its decision on the python urlparse() function may introduce a security vulnerability due to the unexpected returned values of the function. These vulnerabilities may manifest as an SSRF, Open Redirect and other types of vulnerabilities related to incorrectly trusting a URL.

344b0e78-fc78-45f0-a0db-3069a4f4db45 commented 3 years ago

Node.js is recommending using WHATWG URL API. Which handles all these correctly. We can test the same using https://jsdom.github.io/whatwg-url/

For example test the below and will return the same(correct) for all. https:///www.attacker.com/a/b https:/www.attacker.com/a/b https:\www.attacker.com/a/b https:/\/\/\www.attacker.com/a/b https:/\www.attacker.com/a/b

href    https://www.attacker.com/a/b
protocol    https:
username    (empty string)
password    (empty string)
port    (empty string)
hostname    www.attacker.com
pathname    /a/b
search  (empty string)
hash    (empty string)

SUMMARY: python urlparse() function should also handle all the above in the same way.

11f658f1-2e5a-4a63-890b-e5942ace6e73 commented 3 years ago

If I've understood what you've written correctly, what you want is to change urlparse to use the WHATWG URL Standard (https://url.spec.whatwg.org/).

I'm not a committer or anything, but that seems like a large API change and is not likely to happen quickly if at all. I'm not sure what the path forward here is. There are many options, creating a new function/module, slowly deprecating the current urlparse function, or just say it's not our problem. Hopefully someone else can weigh in.

In the meantime, a project like https://pypi.org/project/whatwg-url/ claims to implement the WHATWG URL standard.

python / cpython

[security] Open redirect attack due to insufficient validation in Urlparse #88907