tool/internal/parser: sanitize input to clean, valid UTF-8

danderson commented 3 weeks ago

The PSL's canonical encoding is valid UTF-8 with no BOM. However, to report useful lint errors, the parser tries to detect and normalize all forms of UTF-16, as well as UTF-8 with BOM. Anything other than the specified canonical encoding is reported as a validation error.
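To make the detection step concrete, here is a minimal sketch using only the Go standard library. It is not the code from this PR: the names `normalizeEncoding` and `decodeUTF16` are hypothetical, and it only handles inputs that start with a BOM, whereas the description above also mentions BOM-less UTF-16, which would need a heuristic on top.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// Byte-order marks for the non-canonical encodings handled in this sketch.
var (
	bomUTF8    = []byte{0xEF, 0xBB, 0xBF}
	bomUTF16BE = []byte{0xFE, 0xFF}
	bomUTF16LE = []byte{0xFF, 0xFE}
)

// decodeUTF16 converts BOM-less UTF-16 bytes to a UTF-8 string.
// A trailing odd byte is dropped, and unpaired surrogates become U+FFFD.
func decodeUTF16(b []byte, order binary.ByteOrder) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:i+2]))
	}
	return string(utf16.Decode(units))
}

// normalizeEncoding (hypothetical name) returns the input as UTF-8 without
// a BOM, plus a description of the non-canonical encoding it found.
// The description is empty when no BOM was detected.
func normalizeEncoding(in []byte) (out string, problem string) {
	switch {
	case bytes.HasPrefix(in, bomUTF8):
		return string(in[len(bomUTF8):]), "UTF-8 with BOM"
	case bytes.HasPrefix(in, bomUTF16BE):
		return decodeUTF16(in[2:], binary.BigEndian), "UTF-16 big-endian"
	case bytes.HasPrefix(in, bomUTF16LE):
		return decodeUTF16(in[2:], binary.LittleEndian), "UTF-16 little-endian"
	}
	return string(in), ""
}

func main() {
	// "com" encoded as UTF-16 little-endian with a BOM.
	in := []byte{0xFF, 0xFE, 'c', 0, 'o', 0, 'm', 0}
	out, problem := normalizeEncoding(in)
	fmt.Printf("%q %q\n", out, problem)
}
```

Returning the problem description alongside the normalized text lets the caller keep parsing while still recording the encoding as a validation error.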

The upcoming PR that implements validation of block ordering needs to compare block names, which requires comparing with a Unicode collation. Collation is already confusing enough by itself, without also feeding it invalid UTF-8 :). So, this is a best-effort attempt to normalize non-compliant inputs into clean UTF-8, so that we report those errors early and the rest of the parser can assume strings don't contain garbage, at least at the encoding level.
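As an illustration of the best-effort cleanup, the sketch below replaces invalid byte sequences with U+FFFD so that later stages only ever see valid UTF-8. It is an assumption about the general approach, not the PR's implementation, and `sanitizeUTF8` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 (hypothetical name) returns s with each run of invalid
// bytes replaced by a single U+FFFD, and reports whether s contained
// any invalid UTF-8 so the caller can emit a validation error.
func sanitizeUTF8(s string) (clean string, hadInvalid bool) {
	if utf8.ValidString(s) {
		return s, false
	}
	return strings.ToValidUTF8(s, "\uFFFD"), true
}

func main() {
	clean, bad := sanitizeUTF8("co\xffm")
	fmt.Printf("%q %v\n", clean, bad)
}
```

Note that a literal U+FFFD already present in the input is itself valid UTF-8, so this check alone does not flag it.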

simon-friedberger commented 3 weeks ago

Since the goal here is to achieve what the current tool in the linter folder is doing, have you considered reading the test files in there?

danderson commented 3 weeks ago

> Since the goal here is to achieve what the current tool in the linter folder is doing, have you considered reading the test files in there?

Hmm, I didn't think of that. Looking at the test files, it's tricky to check for a 1:1 match of validation, because the tests check for the specific errors the Python linter outputs. But this parser definitely needs to cover all these test inputs, and more. For the "classes" of inputs that this parser already handles, the new tests are more exhaustive in every case. Obviously I still need to cover the remaining classes.

Breakdown of the pslint tests:

test_allowedchars and test_spaces are a subset of what this PR tests.
test_section* is partially implemented: I'm missing an error for duplicate sections, and for missing required sections. Added both to my TODO.
test_{dots,duplicate,exception,punycode,wildcard} are suffix validations, not yet implemented. Added a note to my TODO that when I implement it, tests need to cover a superset of these inputs.
test_NFKC is not covered yet, and it is important for defeating homoglyph attacks. I need to refresh my memory on Unicode normalization forms, IDNA, homoglyph attacks, and how golang.org/x/text handles normalization, to make sure I implement it correctly. Added to my TODOs.

publicsuffix/list pull request #2005