Open gh-andre opened 3 months ago
It was pointed out to me in the referenced issue that the spec already has a provision for tracking tabs and newlines as non-terminating validation errors. Here are the related items 2 and 3 from the spec:
https://url.spec.whatwg.org/#concept-basic-url-parser
If input contains any ASCII tab or newline, invalid-URL-unit validation error.
Remove all ASCII tab or newline from input.
Validation errors are indeed not supposed to terminate parsing, but given that in many contexts silently stripping tabs and newlines modifies domains, my suggestion would be to provide an optional parameter in urlsplit that would make it fail early if tabs, newlines, or non-URL code points are encountered in the URL.
In other words, allow callers to request that urlsplit fail on validation errors in this step.
This would keep the existing behavior exactly as-is, but would allow callers operating in contexts where URLs are expected to be syntactically valid to fail early, rather than introduce modified domains and other URL components with stripped tabs and newlines, or spend CPU cycles fully parsing malformed URLs.
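To make the request concrete, here is a minimal sketch of the opt-in behavior written as a wrapper around the existing function. The name urlsplit_strict is hypothetical and not part of urllib.parse. The sketch assumes str input only and checks only ASCII tabs and newlines, not the wider set of non-URL code points.

```python
from urllib.parse import SplitResult, urlsplit

# Characters the WHATWG spec calls "ASCII tab or newline".
_TAB_OR_NEWLINE = ("\t", "\n", "\r")


def urlsplit_strict(url: str, scheme: str = "", allow_fragments: bool = True) -> SplitResult:
    """Hypothetical strict variant: reject the URL instead of stripping it.

    Raises ValueError if the input contains an ASCII tab or newline;
    otherwise defers to urllib.parse.urlsplit unchanged.
    """
    for ch in _TAB_OR_NEWLINE:
        if ch in url:
            raise ValueError(f"URL contains forbidden character {ch!r}")
    return urlsplit(url, scheme, allow_fragments)
```

A caller that expects syntactically valid URLs would use urlsplit_strict and handle ValueError, while existing callers of urlsplit would see no change.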
Bug report
Bug description:
The current urlsplit is implemented according to this spec: https://url.spec.whatwg.org/#concept-basic-url-parser
The spec does say in item 3 to strip tabs, but I believe there's a bug in the specification (perhaps they meant leading/trailing whitespace), because item 7 in host parsing checks for a "forbidden domain code point", and tab is listed as one. If tabs are stripped from the entire input before any other work is done, checking for tabs in host names wouldn't make much sense.
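The behavior in question can be reproduced with a short snippet. This is an illustration only; the example URL is made up, and the result assumes a CPython version that strips tabs and newlines from the whole input before parsing.

```python
from urllib.parse import urlsplit

# A tab inside the host and a newline inside the path.
raw = "https://exa\tmple.com/pa\nth"

parts = urlsplit(raw)

# On versions with the stripping behavior, both characters are removed
# silently, so the host differs from what the caller passed in.
print(repr(parts.hostname))
print(repr(parts.path))
```

No exception is raised, so the caller has no signal that the parsed hostname differs from the one in the original string.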
I created a bug in the specification project, so maybe they will provide some guidance later on.
https://github.com/whatwg/url/issues/829
CPython versions tested on:
3.10
Operating systems tested on:
Linux, Windows