Making urlparse WHATWG conformant

orsenthil commented 3 years ago

BPO	43883
Nosy	@gpshead, @orsenthil, @vstinner, @serhiy-storchaka, @mlissner, @tirkarthi

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/orsenthil' closed_at = None created_at = labels = ['type-bug'] title = 'Making urlparse WHATWG conformant' updated_at = user = 'https://github.com/orsenthil' ``` bugs.python.org fields: ```python activity = actor = 'gregory.p.smith' assignee = 'orsenthil' closed = False closed_date = None closer = None components = [] creation = creator = 'orsenthil' dependencies = [] files = [] hgrepos = [] issue_num = 43883 keywords = [] message_count = 4.0 messages = ['391344', '391347', '391427', '392969'] nosy_count = 6.0 nosy_names = ['gregory.p.smith', 'orsenthil', 'vstinner', 'serhiy.storchaka', 'Mike.Lissner', 'xtreak'] pr_nums = [] priority = 'normal' resolution = None stage = 'needs patch' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue43883' versions = [] ```

orsenthil commented 3 years ago

Mike Lissner reported that a set test suites that exercise extreme conditions with URLs, but in conformance with url.spec.whatwg.org was maintained here:

https://github.com/web-platform-tests/wpt/tree/77da471a234e03e65a22ee6df8ceff7aaba391f8/url

These test cases were used against urlparse and urljoin method.

https://gist.github.com/mlissner/4d2110d7083d74cff3893e261a801515

Quoting verbatim

The basic idea is to iterate over the test cases and try joining and parsing them. The script wound up messier than I wanted b/c there's a fair bit of normalization you have to do (e.g., the test cases expect blank paths to be '/', while urlparse returns an empty string), but you'll get the idea.

The bad news is that of the roughly 600 test cases fewer than half pass. Some more normalization would fix some more of this, and I don't imagine all of these have security concerns (I haven't thought through it, honestly, but there are issues with domain parsing too that look meddlesome). For now I've taken it as far as I can, and it should be a good start, I think.

The final numbers the script cranks out are:

Done. 231/586 successes. 1 skipped.

serhiy-storchaka commented 3 years ago

It would be interesting to test also with the yarl module. It is based on urlparse and urljoin, but does extra normalization of %-encoding.

vstinner commented 3 years ago

See also bpo-43882.

gpshead commented 3 years ago

FWIW rather than implementing our own URL parsing at all... wrapping a library extracted from a compatible-license major browser (Chromium or Firefox) and keeping it updated would avoid disparities.

Unfortunately, I'm not sure how feasible this really is. Do all of the API surfaces we must support in the stdlib for compatibility's sake with urllib line up with such a browser core URL parsing library?

Something to ponder. Unlikely something we'll actually do.

xmo-odoo commented 1 year ago

If urllib.parse is to be tested for whatwg conformance, shouldn't it be urlsplit? And maybe urlparse should finally be deprecated?

urlparse's behaviour was superseded by RFC 2396 back in 1998 (which is what resulted in the addition of urlsplit: #35466), the whatwg url standard is a descendant of that (itself intending to supersede RFC 3986). As of the url standard (and already RFC 3986) the entire concept of path segment parameter has been dropped.

By its very nature and purpose, urlparse can't pass the conformance suite, because any time it encounters a ; within a path segment it's going to move data out of the pathname which it should not. And the suite likely has multiple test cases where the path contains ; characters.

idahogray commented 23 hours ago

I'm not sure if this is the correct issue or not, but it seemed related. Chromium (and derivative browsers) recently implemented a change to URL parsing that broke a bunch of custom protocols. We have an application that uses the protocol pw:// which is broken by this change. We have a django application that was using the django URLValidator to validate those custom URLs.

The way I understand it, the correct format for custom protocols is to not have the //. I don't believe urlsplit supports these because it is looking for the // in order to extract the netloc. At least it is in 3.12, I haven't looked at 3.13.

Thanks and please let me know if this is the wrong issue or if you need more information.

gpshead commented 23 hours ago

also @sethmlarson as FYI - it isn't clear we'd ever even be able to do what this issue suggests or that we should FWIW. there are multiple competing interests in the world who have valid needs to use URLs (URIs?) in different ways.

sethmlarson commented 23 hours ago

I don't think we should chase browsers, and therefore we shouldn't implement WHATWG. They are allowed to update their "fleet" globally on a whim, they primarily deal with users physically typing things into address bars, we have things called backwards compatibility guarantees. I think implementing RFC 3986 is appropriate for software like Python.

python / cpython

Making urlparse WHATWG conformant #88049