publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0

tool/internal/parser: sanitize input to clean, valid UTF-8 #2005

Closed danderson closed 3 weeks ago

danderson commented 3 weeks ago

The PSL's canonical encoding is valid UTF-8 with no BOM. However, in order to report useful lint errors, the parser tries to detect and normalize all forms of UTF-16, as well as UTF-8 with a BOM. Anything other than the specified canonical encoding is reported as a validation error.


The upcoming PR that implements validation of block ordering needs to compare block names, which requires comparing with a Unicode collation. Collation is already confusing enough by itself, without also feeding it invalid UTF-8 :). So, this is a best-effort attempt to normalize non-compliant inputs into clean UTF-8, so that we report those errors early and the rest of the parser can assume strings don't contain garbage, at least at the encoding level.
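As a rough illustration of the approach described above (not the code in this PR), here is a minimal sketch using only the Go standard library. The names `normalizeUTF8` and `decodeUTF16` are hypothetical, and the sketch makes a simplifying assumption: it only recognizes UTF-16 when a BOM is present, whereas the parser described here tries to detect all forms of UTF-16. Invalid byte sequences are replaced with U+FFFD so that later stages only ever see valid UTF-8.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// Byte order marks recognized by this sketch.
var (
	bomUTF8    = []byte{0xEF, 0xBB, 0xBF}
	bomUTF16LE = []byte{0xFF, 0xFE}
	bomUTF16BE = []byte{0xFE, 0xFF}
)

// normalizeUTF8 is a hypothetical helper, not the PR's implementation.
// It returns in converted to valid UTF-8 without a BOM, plus one
// message for each way the input deviated from that canonical form.
func normalizeUTF8(in []byte) (out []byte, problems []string) {
	const msg = "input is %s, want UTF-8 without BOM"
	switch {
	case bytes.HasPrefix(in, bomUTF8):
		problems = append(problems, fmt.Sprintf(msg, "UTF-8 with BOM"))
		in = in[len(bomUTF8):]
	case bytes.HasPrefix(in, bomUTF16LE):
		problems = append(problems, fmt.Sprintf(msg, "UTF-16 (little-endian)"))
		in = decodeUTF16(in[len(bomUTF16LE):], binary.LittleEndian)
	case bytes.HasPrefix(in, bomUTF16BE):
		problems = append(problems, fmt.Sprintf(msg, "UTF-16 (big-endian)"))
		in = decodeUTF16(in[len(bomUTF16BE):], binary.BigEndian)
	}
	if !utf8.Valid(in) {
		problems = append(problems, "input contains invalid UTF-8 byte sequences")
		// Each run of invalid bytes becomes a single U+FFFD.
		in = bytes.ToValidUTF8(in, []byte("\uFFFD"))
	}
	return in, problems
}

// decodeUTF16 converts UTF-16 code units in the given byte order to
// UTF-8. Unpaired surrogates and an odd trailing byte become U+FFFD.
func decodeUTF16(b []byte, order binary.ByteOrder) []byte {
	units := make([]uint16, 0, len(b)/2+1)
	for ; len(b) >= 2; b = b[2:] {
		units = append(units, order.Uint16(b))
	}
	if len(b) == 1 {
		units = append(units, 0xFFFD)
	}
	return []byte(string(utf16.Decode(units)))
}

func main() {
	out, problems := normalizeUTF8([]byte("\xEF\xBB\xBFcom\n"))
	fmt.Printf("%q %q\n", out, problems)
}
```

Returning the problems alongside the cleaned bytes, rather than failing on the first one, matches the goal stated above: report the encoding errors early, then let the rest of the parser run on clean input.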

simon-friedberger commented 3 weeks ago

Since the goal here is to achieve what the current tool in the linter folder is doing, have you considered reading the test files in there?

danderson commented 3 weeks ago

> Since the goal here is to achieve what the current tool in the linter folder is doing, have you considered reading the test files in there?

Hmm, I didn't think of that. Looking at the test files, it's tricky to check for a 1:1 match of validation results, because the tests check for the specific errors that the Python linter outputs. But this parser definitely needs to cover all of these test inputs, and more. For the "classes" of inputs that this parser already handles, the new tests are more exhaustive in every case. Obviously I still need to cover the remaining classes.

Breakdown of the pslint tests: