Open orsenthil opened 3 years ago
Mike Lissner reported that a set of test suites that exercise extreme conditions with URLs, while conforming to url.spec.whatwg.org, is maintained here:
https://github.com/web-platform-tests/wpt/tree/77da471a234e03e65a22ee6df8ceff7aaba391f8/url
These test cases were run against the urlparse and urljoin functions.
https://gist.github.com/mlissner/4d2110d7083d74cff3893e261a801515
Quoting verbatim:
The basic idea is to iterate over the test cases and try joining and parsing them. The script wound up messier than I wanted b/c there's a fair bit of normalization you have to do (e.g., the test cases expect blank paths to be '/', while urlparse returns an empty string), but you'll get the idea.
The bad news is that of the roughly 600 test cases fewer than half pass. Some more normalization would fix some more of this, and I don't imagine all of these have security concerns (I haven't thought through it, honestly, but there are issues with domain parsing too that look meddlesome). For now I've taken it as far as I can, and it should be a good start, I think.
The final numbers the script cranks out are:
Done. 231/586 successes. 1 skipped.
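To make the normalization point concrete, here is a minimal sketch of the kind of loop described above. It is not the script from the gist: the `CASES` list and its field names (`input`, `base`, `pathname`) are assumptions made up for illustration, and `normalized_path` is a hypothetical helper.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical test cases; the field names are assumptions for illustration
# and are not copied from the real test data.
CASES = [
    {"input": "http://example.com", "base": None, "pathname": "/"},
    {"input": "b/c", "base": "http://example.com/a/", "pathname": "/a/b/c"},
]


def normalized_path(url, base=None):
    """Join against base (if any), split, and map an empty path to '/'."""
    joined = urljoin(base, url) if base else url
    path = urlsplit(joined).path
    # The test cases expect '/' where urllib.parse returns an empty string.
    return path or "/"


results = [normalized_path(c["input"], c["base"]) == c["pathname"] for c in CASES]
print(f"{sum(results)}/{len(results)} successes")
```

Without the `path or "/"` step the first case would count as a failure, which is the sort of mismatch that inflates the failure count without indicating a real parsing bug.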
It would be interesting to test also with the yarl module. It is based on urlparse and urljoin, but does extra normalization of %-encoding.
See also bpo-43882.
FWIW rather than implementing our own URL parsing at all... wrapping a library extracted from a compatible-license major browser (Chromium or Firefox) and keeping it updated would avoid disparities.
Unfortunately, I'm not sure how feasible this really is. Do all of the API surfaces we must support in the stdlib for compatibility's sake with urllib line up with such a browser core URL parsing library?
Something to ponder. Unlikely something we'll actually do.
If urllib.parse is to be tested for whatwg conformance, shouldn't it be urlsplit? And maybe urlparse should finally be deprecated?
urlparse's behaviour was superseded by RFC 2396 back in 1998 (which is what resulted in the addition of urlsplit: #35466). The whatwg url standard is a descendant of that, and itself intends to supersede RFC 3986. As of the url standard (and already RFC 3986), the entire concept of path segment parameters has been dropped.
By its very nature and purpose, urlparse can't pass the conformance suite, because any time it encounters a ; within a path segment it's going to move data out of the pathname, which it should not. And the suite likely has multiple test cases where the path contains ; characters.
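A small illustration of the difference, using a made-up example URL: urlparse moves the text after the ; into a separate params field, while urlsplit leaves the path alone.

```python
from urllib.parse import urlparse, urlsplit

# Example URL chosen for illustration only.
url = "http://example.com/a/b;c=1?q=2"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse strips the parameter off the path and stores it in .params.
print(repr(parsed.path), repr(parsed.params))

# urlsplit has no params field; the ';' stays in the path.
print(repr(split.path))
```

So any conformance check on the pathname would have to glue path and params back together for urlparse, whereas urlsplit needs no such step.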
I'm not sure if this is the correct issue or not, but it seemed related. Chromium (and derivative browsers) recently implemented a change to URL parsing that broke a bunch of custom protocols. We have an application that uses the protocol pw:// which is broken by this change. We have a Django application that was using the Django URLValidator to validate those custom URLs.
The way I understand it, the correct format for custom protocols is to not have the //. I don't believe urlsplit supports these because it is looking for the // in order to extract the netloc. At least it is in 3.12, I haven't looked at 3.13.
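For reference, a quick sketch of what urlsplit does with and without the //. The pw: URLs below are made-up examples, not the actual ones from the application.

```python
from urllib.parse import urlsplit

# Made-up URLs using the custom pw scheme, for illustration only.
with_slashes = urlsplit("pw://host/path")
without_slashes = urlsplit("pw:host/path")

# With '//', the authority is extracted into netloc.
print(with_slashes.scheme, repr(with_slashes.netloc), repr(with_slashes.path))

# Without '//', netloc is empty and everything after the scheme is the path.
print(without_slashes.scheme, repr(without_slashes.netloc), repr(without_slashes.path))
```

The scheme is recognized either way; what differs is that without the // there is no netloc, so code that validates on the host has nothing to look at.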
Thanks and please let me know if this is the wrong issue or if you need more information.
also @sethmlarson as FYI - it isn't clear we'd ever even be able to do what this issue suggests or that we should FWIW. there are multiple competing interests in the world who have valid needs to use URLs (URIs?) in different ways.
I don't think we should chase browsers, and therefore we shouldn't implement WHATWG. They are allowed to update their "fleet" globally on a whim, they primarily deal with users physically typing things into address bars, we have things called backwards compatibility guarantees. I think implementing RFC 3986 is appropriate for software like Python.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/orsenthil'
closed_at = None
created_at =
labels = ['type-bug']
title = 'Making urlparse WHATWG conformant'
updated_at =
user = 'https://github.com/orsenthil'
```
bugs.python.org fields:
```python
activity =
actor = 'gregory.p.smith'
assignee = 'orsenthil'
closed = False
closed_date = None
closer = None
components = []
creation =
creator = 'orsenthil'
dependencies = []
files = []
hgrepos = []
issue_num = 43883
keywords = []
message_count = 4.0
messages = ['391344', '391347', '391427', '392969']
nosy_count = 6.0
nosy_names = ['gregory.p.smith', 'orsenthil', 'vstinner', 'serhiy.storchaka', 'Mike.Lissner', 'xtreak']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43883'
versions = []
```