python / cpython

The Python programming language
https://www.python.org
Other
63.42k stars 30.37k forks source link

urllib.parse space handling CVE-2023-24329 appears unfixed #102153

Closed AdrianBunk closed 1 year ago

AdrianBunk commented 1 year ago

Everyone (including the submitter of the now public exploit who submitted the issue half a year ago to security@python.org and the NVD) seems to think that #99421 "accidently fixed" CVE-2023-24329.

Did the Python Security Response Team verify that this vulnerability that was reported to them and that is now public was fixed by #99421?

The PoC from the submitter still works for me with the Debian package 3.11.2-4, which surprised me and makes me wonder whether the fix had any effect at all on the stripping of leading blanks issue in the CVE.

Linked PRs

ned-deily commented 1 year ago

@pablogsal

pablogsal commented 1 year ago

The backport was merged here https://github.com/python/cpython/pull/99446 no?

AdrianBunk commented 1 year ago

@pablogsal #99446 is a backport of #99421 that does not seem to fix CVE-2023-24329:

$ cat test.py 
import urllib.request
from urllib.parse import urlparse
def safeURLOpener(inputLink):
    block_host = ["instagram.com", "youtube.com", "tiktok.com", "example.com"]
    input_hostname = urlparse(inputLink).hostname
    if input_hostname in block_host:
        print("input hostname is forbidden")
        return
    target = urllib.request.urlopen(inputLink)
    content = target.read()
    print(content)

safeURLOpener("https://example.com")
safeURLOpener(" https://example.com")  # CVE-2023-24329
safeURLOpener("+https://example.com")  # 99421
$ python3.10 test.py 
input hostname is forbidden
b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
input hostname is forbidden
$ python3.11 test.py 
input hostname is forbidden
b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
Traceback (most recent call last):
  File "/tmp/test.py", line 15, in <module>
    safeURLOpener("+https://example.com")  # 99421
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/test.py", line 9, in safeURLOpener
    target = urllib.request.urlopen(inputLink)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/request.py", line 541, in _open
    return self._call_chain(self.handle_open, 'unknown',
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/lib/python3.11/urllib/request.py", line 1419, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: +https>
$
pablogsal commented 1 year ago

CC: @gpshead

arhadthedev commented 1 year ago
CharlieZhao95 commented 1 year ago

BTW, does this patch (CVE-2023-24329) require a backport to 3.10, 3.9 and older branches?

I noticed that the bug is currently only backported to the 3.11 branch, but it actually affects all versions prior to 3.11.

RSAlderman commented 1 year ago

@CharlieZhao95 that's what I was asking for in https://github.com/python/cpython/issues/102293 - backport of the security vulnerability fix for CVE-2023-24329 to all in-service releases (3.7-3.10).

The request for the backports has been closed as a duplicate of this issue by @gpshead

CharlesBryant-G commented 1 year ago

Maybe it's worth taking a step back and looking at the problem in a wider context.

In the PoC, the vulnerability arises not because parse() returns the wrong answer, but because it interprets the url differently from urlopen(). If they were both wrong in the same way it would be harmless. Why is there more than one piece of code which parses URLs? The DRY principle should apply.

Closely related, note that urlparse() does not have a vulnerability at all - any vulnerability is in code which relies on it and does so in a way in which it creates a vulnerability. In the PoC, the vulnerability is in the code for safeURLOpener().

The way in which urlparse() is implemented is fragile and bug prone. As a general principle, parsing code should not look ahead for known delimiters, it should systematically work from the start, advancing over characters tested to be legitimate. So urlparse('example.com@!$%^&*()_+-={}[]:;"\\|?query#frag') should stop parsing at the '%' as that is not a legal character when not followed by two hex digits. It may return that "example.com@!$" is the path and there are extra characters after the URL (this style of parsing is often convenient when parsing items which may contain things to be parsed), or report failure due to an invalid URL. Instead, an early stage of processing skips ahead to the '?' and '#', so it claims there is a path, query, and fragment. While it could then validate these pieces and realise that the path is invalid, this can be forgotten and makes it unnecessarily difficult and dangerous to make the parser accept a valid URL followed by other characters (because it would need to reliably undo any parsing of anything past the valid part).

gpshead commented 1 year ago

We will backport something that makes sense if we determine this is a security issue, that's why I duped the other issue here. Backporting the existing commit further does not make sense to me until the leading space issue, if present as reported here, is resolved. (I haven't taken the time to look. this is not an emergency)

xiaoge1001 commented 1 year ago
>>> from urllib.parse import urlparse
>>> urlparse(" https://example.com")
ParseResult(scheme='', netloc='', path=' https://example.com', params='', query='', fragment='')

I tested it and the problem doesn't seem to be fixed. I execute urlparse(" https://example.com"), the output before and after merging https://github.com/python/cpython/pull/99421 is the same.

xiaoge1001 commented 1 year ago

CVE-2023-24329 says that supplying a URL that starts with blank characters is bad.

If a URL-scheme is " https", it will jump out of the loop in the following code: https://github.com/python/cpython/blob/50b0415127009119882e32377d25a4d191088a76/Lib/urllib/parse.py#L465

After https://github.com/python/cpython/pull/99421 is merged, it will exit early: https://github.com/python/cpython/blob/2e279e85fece187b6058718ac7e82d1692461e26/Lib/urllib/parse.py#L463

The code in line 468 is not executed before and after the modification, the subsequent code execution will not change: image

when input a URL that starts with blank characters,https://github.com/python/cpython/pull/99421 doesn't seem to have no effect.

xiaoge1001 commented 1 year ago

@gpshead Hello, can you review this pull https://github.com/python/cpython/pull/102470 ?

xiaoge1001 commented 1 year ago

https://nvd.nist.gov/vuln/detail/CVE-2023-24329

Base Score: [7.5 HIGH] vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:H/A:N

This is a high-score vulnerability. Can we fix it as soon as possible? If we don't think it's a vulnerability, can we reject it?

illia-v commented 1 year ago

BTW, in JS both leading and trailing space and C0 control are removed by URL parsers.

https://url.spec.whatwg.org/#url-parsing

xiaoge1001 commented 1 year ago

image For URLs with leading whitespace, ruby throws InvalidURIError.

gpshead commented 1 year ago

Adding some historical context - caution, this is long, but worth understanding:

In August 2022 a discussion about this issue was spawned in private on the Python security response team mailing list after the initial report from Yebo Cao came in. Our intent was to file a public issue about it and continue discussion from there, but that never happened. So lets consider this issue to be that one... thanks for filing it!

In the private discussions, which are longer than I want to paste so I'll summarize some points from that, and paste a bunch of bits of other parts:

cc: @PaulMcMillan who did a lot of the above and below analysis last year.

Paul came up with a nice looking list of urlparse potential test cases and demonstrated their current behavior as of August 2022 here https://gist.github.com/PaulMcMillan/70618ca857a0519379af704d88a1c9af as part of the analysis. (even if some of those have changed with other fixes since, the ones with the spaces in what should've been the scheme do not appear to have - I haven't checked all of those behaviors across time and versions).

URL Schemes:

RFC 1808 2.1 specifies that schemes are named as per RFC1738 section 2.1 which says:

   Scheme names consist of a sequence of characters. The lower case
   letters "a"--"z", digits, and the characters plus ("+"), period
   ("."), and hyphen ("-") are allowed.

This seems to reasonably imply that "scheme" should raise a ValueError if other characters (e.g. spaces) are included.

BUT... The tricky question for the Python stdlib is "do we have a scheme with invalid characters?" vs. our fallback of "Otherwise the input is presumed to be a relative URL and thus to start with a path component." per our long standing documented and implemented urllib.parse API. (pause for audience groans...) But the documented API is talking about the netloc there, not the scheme.

What users think they want is: "A string containing :// will try to parse everything to the left of : as the scheme."

However, this doesn't work. If you look at actual user behavior, they have a tendency to do things like put unencoded urls in the query, or possibly even the path, and it doesn't seem to make sense to break schema-less parsing in that case.

Existing user code depends on that API behavior.

Another past issue demonstrating similar problems with changing the behavior of the urlparse() APIs is the past joy of netloc's containing port numbers. See https://github.com/python/cpython/pull/661 (and its linked to issues)

Leaving it unconclusive that there's much we can do about this.

@PaulMcMillan had one suggestion, perhaps a more heuristic approach: If a urlcontains a ://, split on that, if the left hand side of that does not have a / in it, assume it was supposed to be a scheme and raise a ValueError if it contains any invalid scheme characters. Essentially adding a "what you have looks enough like a url that we're pretty sure your schema is invalid" check here: https://github.com/python/cpython/blob/2e279e85fece187b6058718ac7e82d1692461e26/Lib/urllib/parse.py#L465

We're not sure if this edge case is worth the backwards compatibility change since it doesn't cover a myriad of other ways urllib.parse.urlparse() differs from browsers.

Further discussion covered questions such as "can't we just rstrip() or lstrip() or strip() always?", noting other use patterns and reasons why change here in complicated in practice:


At a minimum we should document what happens with invalid schemes. Perhaps we should be recommending better fully WHATWG compliant PyPI maintained libraries. Ideally we'd have shipped one. But we don't today and changing our existing urlparse API to behave that way will break existing users so it is not a security fix... it'd need to be a thoughtful breaking change API transition with a behavior deprecation period and recommendations for code needing any of the old behaviors. Not a security fix.

(The bulk of the above analysis and a bunch of the words come from Paul and some from Guido. I'm opening them up here for a wider audience, I added or rephrased or editorialized and emphasized a few things along the way.)

xiaoge1001 commented 1 year ago

@gpshead Thank you for reply.

So there's a fix plan at the moment for the main branch? Because this is a high-score vulnerability, I hope it can be fixed as soon as possible.

AdrianBunk commented 1 year ago

@gpshead Thanks for the explanations. In my opinion Python does in recent years already break/remove far too much existing functionality, and I am pleasantly surprised by your awareness for maintaining backwards compatibility.

In addition to the technical side you explained, there is also a process problem you should discuss (perhaps not in public) in case you aren't already doing this:

Right now there is a CVE with a high score links to a description of a vulnerability with a PoC and a merge request with a one-line fix - but the PoC still works with the fix. Something went wrong that resulted in people wrongly thinking that #99421 would fix CVE-2023-24329, and for that it is not even relevant whether the final resolution will be a code fix or a documentation update that this is not considered a bug.

Our intent was to file a public issue about it and continue discussion from there, but that never happened.

Apparently:

gpshead commented 1 year ago

Right now there is a CVE with a high score links to a description of a vulnerability with a PoC and a merge request with a one-line fix

Untrue. There is no one-line fix. It is likely that would break other users. It isn't even clear that this should be called a security problem in Python rather than in users applications because it's making expectations of a very old API that was not really designed... let alone designed to do what some people understandably wish it did.


CVEs are a world writable thing, literally anyone can write one and request a number against most any piece of software for any reason with little to no vetting.

This particular CVE is merely one more example of that: No coordination was done with the CPython project regarding the filing of this CVE, the contents of the CVE, or making a blog post about it.

Sure, solid communication on next steps back to the original reporter didn't appear to have happened from our PSRT end (at least not where I can see it - it sounds like there may have been some private communication I can't see) (this is effectively your last bulletpoint), but, critically, we were never asked as a Python security response team if a CVE should be filed or to help determine how serious this actually was. Volunteers don't have due dates for tasks.

Never take CVEs and CVSS scores at face value. They need validating, do not assume them to be true. A CVE being filed and assigned a score does not actually mean what people want it to mean.

And don't assume you can expect due dates to tasks in an all volunteer organization. https://pyfound.blogspot.com/2023/01/the-psf-is-hiring-security-developer-in.html should be of interest.

AdrianBunk commented 1 year ago

Right now there is a CVE with a high score links to a description of a vulnerability with a PoC and a merge request with a one-line fix - but the PoC still works with the fix.

Untrue. There is no one-line fix.

True. The CVE at NVD does link to the #99421 one-line fix that does not fix the CVE.

Sure, solid communication on next steps back to the original reporter didn't appear to have happened from our PSRT end (at least not where I can see it - it sounds like there may have been some private communication I can't see) (this is effectively your last bulletpoint),

If the timeline from the blog post is correct, it took 3 weeks for the initial reply and there was zero private communication in the following 5 months.

but, critically, we were never asked as a Python security response team if a CVE should be filed or to help determine how serious this actually was. Volunteers don't have due dates for tasks.

You are saying that when asked about filing a CVE, these same volunteers suddenly have time? That doesn't make sense.

Everything I see indicates that there was quite some activity and discussion inside the PSRT at the end of August, but then someone dropped the ball and filing a public issue never happened. One common way to handle such tasks would be to track them somewhere, and check the open tasks during a weekly or monthly team meeting.

This particular CVE is merely one more example of that: No coordination was done with the CPython project regarding the filing of this CVE, the contents of the CVE, or making a blog post about it.

The blog post now says: [added] Between 08/25/2022 and 01/20/2023, I sent about 10 follow-ups to security@python.org and the staff responded me but got no replies.

@gpshead If this is true, you owe Yebo Cao an apology for victim blaming.

I cannot judge whether it is true, but you likely have access to the emails to security@python.org and know who the person is who responded on 08/25/2022 to ask about emails received.

If this is true, I would actually be curious to hear from you whether any email did even mention filing a CVE or writing the blog post.

gpshead commented 1 year ago

This is not an appropriate forum for you to blame volunteers for not doing what you desire you want for free. Please take this discussion elsewhere, it has nothing to do with urllib.parse.

PaulMcMillan commented 1 year ago

@AdrianBunk As the person who dropped the ball on this, after picking it up because nobody else did, and spending several weeks of my very limited volunteer time doing a thorough impact analysis, I personally apologize to Yebo Cao, you, and the entire community for not following through on this during a period of personal change, including 3 employers, a significant period of medical leave, and also acquiring and recovering from COVID-19. I'll do my level best not to let those factors impact my performance in the future, as an unpaid volunteer.

@AdrianBunk since you are clearly currently being paid to attend to these issues, perhaps you might be interested in volunteering to write a WHATWG compatible python module which we can include in core Python for future releases. I believe there are several variants on PyPi which might just need your focused professional contributions before they might reasonably be shephearded through the PIP process to improve core Python. This should not be overly burdensome to a developer of your caliber because WHATWG publish an extremely extensive test suite, and you merely need to write the code to pass the tests. I'll commit to personally advocating for backporting such a module to the appropriate prior versions for security reasons, even though it is not our standard process.

Alternatively, if you do not accept that you are personally responsible for this matter, I'm absolutely certain your employer will be interested in sponsoring the PSF to hire a security developer to respond to this type of issues. It's business critical, after all.

AdrianBunk commented 1 year ago

@AdrianBunk As the person who dropped the ball on this, after picking it up and spending several weeks of my very limited volunteer time doing a thorough impact analysis, I personally apologize to Yebo Cao, you, and the entire community for not following through on this during a period of personal change, including 3 employers, a significant period of medical leave, and also acquiring and recovering from COVID-19. I'll do my level best not to let those factors impact my performance in the future, as an unpaid volunteer.

The main apology might have to come from @gpshead to Yebo Cao, because public accusations like we were never asked as a Python security response team if a CVE should be filed or to help determine how serious this actually was are hard to reconcile with statements (if true) like I sent about 10 follow-ups to security@python.org and the staff and CERT got responses from Python in Jan 2023.

@PaulMcMillan I never asked who the person was who dropped the ball on this, or tried to put all the blame on this individual (who turns out to be you).

The process problem I am suspecting is not that you were (understandably) not available, the problem seems to be rather that the other members of your team did not notice and mitigate that there was a task assigned to a person who was no longer available.

gpshead commented 1 year ago

Expect that further non-technical urllib.parse related comments may lead to us locking this issue @AdrianBunk. https://www.python.org/psf/conduct/

You're drawing invalid conclusions based on incomplete information. I was not accusing any individuals of anything, I do understand how you might misunderstand the fragments you've seen to think otherwise. I presume everything Yebo said in the version of the blog I read is true from their perspective. As is what I said here. I'm not willing to engage with you further on this or share more. It isn't relevant to the technical API issue at hand.

AdrianBunk commented 1 year ago

@gpshead When/How is the technical side how to handle CVE-2023-24329 being decided?

At a minimum we should document what happens with invalid schemes. Perhaps we should be recommending better fully WHATWG compliant PyPI maintained libraries.

From your description I got the impression that last year August you had inside the PSRT a consensus or at least a tendency to address this vulnerability by recommending not to use urllib, but it has to be confirmed by you whether this is true or whether I am drawing invalid conclusions based on incomplete information.

If this is the conclusion then CVE-2023-24329 could be updated to point to the MR updating the documentation, similar to https://bugs.python.org/issue36260 (commit 3ba51d5) for CVE-2019-9674.

A secondary question: Was #99421 done to fix anything the PSRT considered a vulnerability and that should get an own CVE number, or is this a pure bugfix unrelated to any security vulnerabilities?

gpshead commented 1 year ago

If you look at the number of issues past and present with urllib in our issue tracker, you'll find a long pile of them both open and closed. Often related to parsing, splitting, and unsplitting, some with CVEs, some without. (and tangentially related ones for the http code that can be intertwined with this)

One in particular happens to contain what I consider a useful message from someone at MITRE regarding how issues within a language standard library are viewed regarding CVEs (ie: does the CVE belong with the language or with the applications using the APIs): I recommend everybody read this 2015 comment https://github.com/python/cpython/issues/67693#issuecomment-1093675666 for "where does a CVE belong?" context.

My gut feeling that "fixing" this in a patch release is likely to cause application breakage based on past experience as changing urllib behaviors such as in https://github.com/python/cpython/issues/88048 or http behaviors in https://github.com/python/cpython/issues/74643 did. (and others, I won't attempt to keep them straight and list them all here - it's a mess)

Where does all that leave us?

I've at least scheduled a huge pile of diverse tests at work to run over the weekend on a patched runtime containing PR #102508 to see if that provides any meaningful insight into practical breakages or not. I haven't spent enough time looking things over to understand if that is a good way to implement such a fix or not.

I think it is fine to tighten our behavior similar to what that PR aims to do in 3.12 given it is security relevant, but am not yet convinced that'd be safe to backport.

From a CVE perspective that leaves things in maybe muddy waters. We could accept some variant of that fix and say "fixed in 3.12"... which does not exist yet and most people won't be ready to use for a few years, but in reality CVEs are not structured to cover the programming language standard library situation well. It'd be considered an application problem for anyone using this API in a security relevant context if we simply document the problem with a cautionary note, even if we attempt mitigation in 3.12. That just opens questions... does each impacted application get its own CVE? who's going to audit applications and even file those? (definitely not us!) But that being unclear and full of unanswered questions is why I've not tried to decide if we should ask for the CVE to be revoked or reassigned to select unidentified applications. Once any form of code fix or documentation update is landed, it'd make sense for any CVE to be pointed at that.


(99421 was a sensible bug fix - adding more CVEs does not seem helpful from my point of view but it is not up to us to file them, let someone with a demonstrably impacted application that no longer after that change file one)

stratakis commented 1 year ago

Thank you for the analysis @gpshead. The main concern with this CVE is that it received a high base score, causing widespread alarm not because of the potential vulnerability per se, but rather the fact that those scores are used by all sorts of automated tooling and CVE scanners to assess risk for systems and infrastructures. This causes a cascading effect throughout the chain.

And while it's not clear if this CVE poses a security threat, a more definitive answer in regards to that (plans of revoking the CVE/verifying its limited impact hence lowering the base score if possible) would help steer the issue towards more productive discussion IMHO.

frenzymadness commented 1 year ago

Let's not discuss whether the vulnerability is in urllib.parse or in applications using this module with wrong expectations and assumptions. Long-term, Python's stdlib is the best place to fix the problem.

It seems to be similar to the very old CVE in tarfile CVE-2007-4559. The documentation is clear that the tarfile module is very powerful and potentially dangerous if you unpack an archive from an untrusted source. The solution for a problem that big and old required a PEP and new safer unpack mechanisms with the standard deprecation period for the old defaults slowly moving to the new defaults while keeping the possibility to use the old behavior, if you know what you are doing.

Would it make sense to use the same approach here? We could implement a new stricter way of parsing URLs (following one of the standards), give users the option to use it, and make it default after a period of raising DeprecationWarning for ambiguous URLs. Also, strictly following one of the standards might be good for future changes to the parser because fixing a bug in the parser in a backward-incompatible way should be allowed if it makes the implementation closer to the standard.

I did some research and my humble opinion is that RFC 3986/3987 is closer to what Python needs. The WHATWG standard has been implemented in NodeJS and it seems there are still some gaps some users and developers need to be filled and some disagreements about whether it really obsoletes the mentioned RFCs. It seems that WHATWG standard works best for web browsers and developers of server-side applications aren't always happy about it. So essentially they are two competing standards. A Rust implementation of the WHATWG standard is available in the url crate. Ruby provides two parsers - one based on RFC 2396 and another one based on RFC 3986.

So more specifically:

There are implementations of parsers for Python based on both main standards so we can talk to their maintainers about including them (or a subset of their functionalities) somehow into the stdlib.

What do you think about it? Is it doable? Does it make sense? Would it require a PEP? Should I open a new discussion somewhere?

ofek commented 1 year ago

I believe this is the most popular and well tested implementation https://github.com/python-hyper/hyperlink

gpshead commented 1 year ago

There are implementations of parsers for Python based on both main standards so we can talk to their maintainers about including them (or a subset of their functionalities) somehow into the stdlib.

What do you think about it? Is it doable? Does it make sense? Would it require a PEP? Should I open a new discussion somewhere?

@frenzymadness I suggest bringing that up on discuss.python.org (Ideas section). The hyperlink library @ofek links to appears to be a nice interesting API shape... but judging by the existing issues filed against that library I doubt it is in long-term-stale shape sufficient to be included in the stdlib.

Before proposing a new thing in the stdlib as a PEP, you need to have a specific goal in mind that'd be solved. If that goal is to be a pedantic always good for use in all contexts library with zero potential flaws that could cause security issues for any possible application... I doubt such a Python url parsing library exists.

Rather than aiming to get a new standard library module for this, I suggest aiming for a module on PyPI to provide that. The standard library is not an appropriate place for rapid development and behavior changing bug fixes.

frenzymadness commented 1 year ago

Before proposing a new thing in the stdlib as a PEP, you need to have a specific goal in mind that'd be solved. If that goal is to be a pedantic always good for use in all contexts library with zero potential flaws that could cause security issues for any possible application... I doubt such a Python url parsing library exists.

Rather than adding a new module into stdlib I'd add a kw argument switching the behavior of the current parsing functions between the old one and a more strict one following the RFC. We can then have a transition period and the more strict variant can be the default one after a few releases.

I'm not saying that a perfect implementation already exists. But, if we say that our strict implementation follows the rules in RFC/WHATWG, possible future bug fixes will make the implementation closer to the standard and those changes might not be considered as breaking changes.

Rather than aiming to get a new standard library module for this, I suggest aiming for a module on PyPI to provide that. The standard library is not an appropriate place for rapid development and behavior changing bug fixes.

I don't think that the RFC is developing rapidly so there is no need for the implementation to be different.

I'm afraid that the situation here is the same as with the mentioned tarfile CVE. The documentation says that using tarfile with untrusted archives is a bad idea but still, the best place to fix is the stdlib. It took us 14 years to prepare a fix and will take some more months to change the defaults to something safer. I think it's a good idea to do something similar here so we don't have to discuss it anymore.

kenballus commented 1 year ago

I am the author of #99421, the PR that was erroneously credited with fixing the bug that is the subject of this thread. I have spent the last few months doing a survey of the behavior of various URL parsers, mostly written in Python.

Rather than aiming to get a new standard library module for this, I suggest aiming for a module on PyPI to provide that. The standard library is not an appropriate place for rapid development and behavior changing bug fixes.

I have found parsing bugs in every URL parser I have tested form the Python ecosystem. We do not need yet another URL parsing module on PyPI. I agree with @frenzymadness that introducing a better parser into this module is the appropriate action to take. As is mentioned above, the RFCs change very infrequently, so aiming for RFC-compliance parser in the stdlib is a much more achievable goal than aiming for WHATWG compliance.

Before proposing a new thing in the stdlib as a PEP, you need to have a specific goal in mind that'd be solved. If that goal is to be a pedantic always good for use in all contexts library with zero potential flaws that could cause security issues for any possible application... I doubt such a Python url parsing library exists.

My goal is to ensure that the behavior of urllib.parse matches a standard, so that users can more easily reason about how the library will behave without reading its source. A sensible translation of the ABNF from RFCs 3986 and 3987 into regular expressions should be sufficient for this purpose. As a side effect, I expect that we should see fewer potential security flaws than we do with our current shotgun parsing approach.

ofek commented 1 year ago

I have found parsing bugs in every URL parser I have tested form the Python ecosystem. We do not need yet another URL parsing module on PyPI.

Why not choose the one with the least amount of bugs and devote time to opening PRs to fix? I think we should not encourage growth of the standard library. Encouraging the use of packages not only distributes the maintenance burden but also allows for fixes and features independent of the Python version.

kenballus commented 1 year ago

I have found parsing bugs in every URL parser I have tested form the Python ecosystem. We do not need yet another URL parsing module on PyPI.

Why not choose the one with the least amount of bugs and devote time to opening PRs to fix? I think we should not encourage growth of the standard library. Encouraging the use of packages not only distributes the maintenance burden but also allows for fixes and features independent of the Python version.

The URL parsing libraries in PyPI have their own set of backwards-compatibility issues to maintain; submitting PRs will not fix all of these problems. If we can update urllib.parse with a more principled parser, we will eventually reduce the amount of code in the stdlib after deprecated code is removed. As it is, urllib.parse is not particularly maintainable or useful, hence the need for all of these third-party libraries.

frenzymadness commented 1 year ago

I have found parsing bugs in every URL parser I have tested form the Python ecosystem. We do not need yet another URL parsing module on PyPI.

Why not choose the one with the least amount of bugs and devote time to opening PRs to fix? I think we should not encourage growth of the standard library. Encouraging the use of packages not only distributes the maintenance burden but also allows for fixes and features independent of the Python version.

I don't see it as growth. I see it as looking for a way to do backward-incompatible bug fixes in stdlib and limit the need for long discussions in the future.

I don't think people will stop reporting "security issues" in urllib just because we inform them about our special backward-compatible implementation of URL parsing in the documentation.

gpshead commented 1 year ago

(edited with final status) The fixes have landed in the following releases:

phbno1 commented 1 year ago

The fix will appear in the next 3.10 and 3.11 releases (later this month), and pending the release manager merging, in 3.9 as well. That leaves the status of this being fixed in CPython versions:

  • [x] fixed in >= 3.12
  • [x] fixed in 3.11.x >= 3.11.4
  • [x] fixed in 3.10.x >= 3.10.12
  • [x] fixed in 3.9.x >= 3.9.17

I'm not inclined to bother backporting to 3.7 and 3.8 as this isn't a serious issue and the automated backports failed on that (merge conflicts, thus human attention required). If anyone from an OS distro intends to apply backports of this to their 3.8 or 3.7 interpreter, feel free to create it as a PR here and we can merge it in those as well so long as they're still open for security fixes.

I think it is necessary to backport the code to 3.7 and 3.8, and it will help us eliminate the security risks of versions 3.7 and 3.8. However, due to the code differences, we don't know if the fix mentioned 3.9 is feasible, so we need the help of community experts.