python / cpython

The Python programming language
https://www.python.org

urlsplit manufactures hostnames because it strips off tabs before validating them #122761

Open gh-andre opened 1 month ago

gh-andre commented 1 month ago

Bug report

Bug description:

import urllib.parse

# prints "abcxyz.test"
print(urllib.parse.urlsplit("http://abc\txyz.test/").netloc)

The current urlsplit is implemented according to this spec:

https://url.spec.whatwg.org/#concept-basic-url-parser

The spec does say in item 3 to strip tabs, but I believe there's a bug in the specification (perhaps they meant leading/trailing whitespace), because item 7 in host parsing says

If asciiDomain contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.

and tab is listed as a "forbidden domain code point". If tabs are stripped from the entire input before any other work is done, checking for tabs in host names wouldn't make much sense.
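To illustrate why the two rules seem to conflict: if the tab were not stripped up front, a literal reading of the host-parsing step would reject the example URL outright. A rough sketch, not CPython or spec code; the set of code points below is only a partial transcription from the spec, and host_has_forbidden_code_point is a hypothetical helper:

import urllib.parse

# Partial set of forbidden domain code points from the WHATWG URL spec
# (the full list also includes other C0 controls and DEL).
FORBIDDEN_DOMAIN_CODE_POINTS = set("\x00\t\n\r #%/:<>?@[\\]^|")

def host_has_forbidden_code_point(url):
    # Look at the raw authority component, without the tab stripping
    # that urlsplit performs.
    host = url.split("//", 1)[-1].split("/", 1)[0]
    return any(ch in FORBIDDEN_DOMAIN_CODE_POINTS for ch in host)

print(host_has_forbidden_code_point("http://abc\txyz.test/"))  # True
print(urllib.parse.urlsplit("http://abc\txyz.test/").netloc)   # "abcxyz.test"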

I created a bug in the specification project, so maybe they will provide some guidance later on.

https://github.com/whatwg/url/issues/829

CPython versions tested on:

3.10

Operating systems tested on:

Linux, Windows

gh-andre commented 1 month ago

It was pointed out to me in the referenced issue that the spec already has a provision for tracking tabs and newlines as non-terminating validation errors. Here are the related items 2 and 3 from the spec:

https://url.spec.whatwg.org/#concept-basic-url-parser

If input contains any ASCII tab or newline, invalid-URL-unit validation error.

Remove all ASCII tab or newline from input.
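Taken literally, those two steps amount to something like this (a rough Python sketch of the spec's preprocessing, not CPython's actual code; report_validation_error is a hypothetical hook):

def preprocess(url, report_validation_error=print):
    # Step 2: flag tabs/newlines as a non-terminating validation error.
    if any(ch in "\t\n\r" for ch in url):
        report_validation_error("invalid-URL-unit")
    # Step 3: strip them and keep parsing anyway.
    return url.translate(str.maketrans("", "", "\t\n\r"))

# Reports invalid-URL-unit, then prints "http://abcxyz.test/"
print(preprocess("http://abc\txyz.test/"))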

Validation errors are indeed not supposed to terminate parsing, but given that in many contexts silently stripping tabs and newlines manufactures modified domains, my suggestion would be to provide an optional parameter in urlsplit which would fail early if tabs, newlines, or non-URL code points are encountered in the URL.

In other words, allow callers to request urlsplit to fail on validation errors in this step.

This keeps the existing behavior exactly as-is, but allows callers operating in contexts where URLs are expected to be syntactically valid to fail early, rather than introduce modified domains and other URL components with tabs and newlines stripped, or spend CPU cycles fully parsing malformed URLs.
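Until (and unless) such a parameter exists, callers can approximate the behavior with a small wrapper; strict_urlsplit below is a hypothetical helper for illustration, not a proposed API:

import urllib.parse

def strict_urlsplit(url):
    # Hypothetical wrapper: refuse URLs that would trigger the spec's
    # invalid-URL-unit validation error instead of silently stripping.
    if any(ch in "\t\n\r" for ch in url):
        raise ValueError("URL contains ASCII tab or newline")
    return urllib.parse.urlsplit(url)

print(strict_urlsplit("http://abcxyz.test/").netloc)  # "abcxyz.test"
strict_urlsplit("http://abc\txyz.test/")              # raises ValueError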