publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0

tools: add a validating parser for PSL files #1987

Closed: danderson closed this 1 month ago

danderson commented 1 month ago

I was looking to contribute some additional automation and validation to help manage the PSL. I started off looking at porting #1953 to Go, to match the gTLD updating code. Then I got carried away and ended up with this parser. As it's getting fairly large I wanted to send a PR now to see if the direction makes sense, and if so get the initial bones merged before continuing in smaller increments.

Some highlights:

Despite the overall PR size, the meat of the parser is only about 700 LoC. The rest is comments, tests, and validation exceptions. To review, I suggest this reading order:

Some notes and caveats:

Future plans, to get an idea of where I'd like to go with this:

danderson commented 1 month ago

Go package docs are tricky to view for internal packages and on PRs, so here's a temporary view of the parser package docs: https://vega.yak-minor.ts.net/pkg/github.com/publicsuffix/list/tools/internal/parser/

simon-friedberger commented 1 month ago

More PRs are generally good if they do distinct things. We squash-and-merge in this repo, so separate PRs for separate work items stay more cleanly separated.

Once you look at creating a GitHub Action for checking PRs, have a look at https://github.com/publicsuffix/list/blob/master/.github/workflows/check_pr.yml. That is basically what I did there, but it doesn't actually work, and I haven't had time to debug why yet. I'm guessing GitHub is somehow automatically rebasing when showing the changes in the web UI, but the CLI doesn't do that.

If it doesn't simplify the code let's not change the format.

simon-friedberger commented 1 month ago

Changes to the .dat should definitely be in a different PR. Like this: https://github.com/publicsuffix/list/pull/1987/commits/9cd01aeb3c9fa80e442a37f6ba19c19bc9fc4a1b

danderson commented 1 month ago

I removed the .dat edits and the file format change. That means the unit tests currently fail, because `of.by` gets flagged with invalid metadata. I'll send a separate PR to fix that in the .dat file.

The parser unit tests don't get run by the current GitHub Actions workflows (added to my todo!), so if you're happy with this, it can be merged without breaking the build and without waiting for the `of.by` change. Your call.

For the record, here are my TODOs for the next PRs once this is merged. Each item is independent, so each will be a separate PR. They are not in priority order; my current abstract goal is "this tool can report issues on PRs without annoying maintainers", and I'm picking items from the todos based on that.

- refactor: make a helper for string twiddling ops, to make the parser business logic easier to follow
- refactor: adjust error reporting to make it easier to custom-format errors (for better display in GitHub Actions)
- refactor: split errors into "hard" parse errors and "lint" validation errors, so the caller can tell if the file is usable to evaluate FQDNs
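
The hard/lint error split in that last refactor item could look roughly like this Go sketch. The type names (`ParseError`, `LintError`) and the `usable` helper are hypothetical, invented for illustration; they are not this parser's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// LintError marks a policy or style finding that does not prevent
// the file from being used to evaluate FQDNs.
type LintError struct{ Msg string }

func (e LintError) Error() string { return "lint: " + e.Msg }

// ParseError marks a hard structural error: the file cannot be
// trusted for FQDN evaluation.
type ParseError struct{ Msg string }

func (e ParseError) Error() string { return "parse: " + e.Msg }

// usable reports whether a parse result with the given errors can
// still be used to evaluate FQDNs: only lint findings are tolerated.
func usable(errs []error) bool {
	for _, err := range errs {
		var lint LintError
		if !errors.As(err, &lint) {
			return false
		}
	}
	return true
}

func main() {
	errs := []error{LintError{"section out of sorted order"}}
	fmt.Println(usable(errs)) // true: lint findings only
	errs = append(errs, ParseError{"unterminated section"})
	fmt.Println(usable(errs)) // false: a hard parse error is present
}
```

With distinct error types, a GitHub Action could fail the build only on parse errors while posting lint findings as review comments.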

- parsing: be more strict with some whitespace/newline/unicode garbage in the input
- parsing: recognize ": " in block metadata
- parsing: recognize "Confirmed by registry" for email contact metadata
- parsing: handle multiple URLs in metadata
- parsing: handle multiple contact emails in metadata
- parsing: try to improve extracted metadata for punycode TLDs (cosmetic only)
- parsing: try to improve extracted metadata for gTLDs (cosmetic only)
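
To illustrate the multiple-URLs and multiple-emails items, here is a hedged Go sketch of pulling every URL and contact email out of a block's header comment. The `Metadata` struct, `parseHeader`, and the regexes are invented for this example; the real parser's representation and matching rules differ:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Metadata collects information extracted from a suffix block's
// header comment. Field names are illustrative only.
type Metadata struct {
	URLs   []string
	Emails []string
}

var (
	urlRe   = regexp.MustCompile(`https?://\S+`)
	emailRe = regexp.MustCompile(`<([^>]+@[^>]+)>`)
)

// parseHeader scans the comment lines of a block and accumulates
// every URL and contact email it finds, so blocks that list several
// of either are handled instead of keeping only the first.
func parseHeader(lines []string) Metadata {
	var m Metadata
	for _, l := range lines {
		l = strings.TrimPrefix(strings.TrimSpace(l), "//")
		m.URLs = append(m.URLs, urlRe.FindAllString(l, -1)...)
		for _, g := range emailRe.FindAllStringSubmatch(l, -1) {
			m.Emails = append(m.Emails, g[1])
		}
	}
	return m
}

func main() {
	m := parseHeader([]string{
		"// Example Registry : https://example.nic/",
		"// Submitted by Jane Doe <jane@example.nic> and Bob <bob@example.nic>",
	})
	fmt.Println(m.URLs)   // one URL found
	fmt.Println(m.Emails) // both contact emails found
}
```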

- validation: parse and validate the suffix lines, including wildcard exceptions
- validation: port all missing validations from pslint.py
- validation: add API to eval FQDNs, so we can run tests/tests.txt
- validation: block sorting check for private domain section (#1953)
- validation: check existence/validity of _psl DNS records
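
For the first two validation items, the PSL's documented wildcard and exception semantics (exception rules win; otherwise the matching rule with the most labels prevails; with no match, the rightmost label is the suffix) can be sketched in Go roughly as follows. This is a minimal illustration of the matching algorithm, not this parser's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// match reports whether the rule's labels match the tail of the
// domain's labels. "*" in a rule matches any single label.
func match(rule, domain []string) bool {
	if len(rule) > len(domain) {
		return false
	}
	for i := 1; i <= len(rule); i++ {
		r, d := rule[len(rule)-i], domain[len(domain)-i]
		if r != "*" && r != d {
			return false
		}
	}
	return true
}

// publicSuffix returns the public suffix of domain under rules,
// following the PSL wildcard and exception semantics.
func publicSuffix(domain string, rules []string) string {
	labels := strings.Split(domain, ".")
	best := 1 // implicit default rule "*": the rightmost label
	for _, r := range rules {
		exception := strings.HasPrefix(r, "!")
		rl := strings.Split(strings.TrimPrefix(r, "!"), ".")
		if !match(rl, labels) {
			continue
		}
		if exception {
			// An exception rule wins outright; the suffix is the
			// rule with its leftmost label removed.
			n := len(rl) - 1
			return strings.Join(labels[len(labels)-n:], ".")
		}
		if len(rl) > best {
			best = len(rl)
		}
	}
	return strings.Join(labels[len(labels)-best:], ".")
}

func main() {
	rules := []string{"com", "co.uk", "*.ck", "!www.ck"}
	fmt.Println(publicSuffix("foo.example.co.uk", rules)) // co.uk
	fmt.Println(publicSuffix("a.b.ck", rules))            // b.ck (wildcard)
	fmt.Println(publicSuffix("www.ck", rules))            // ck (exception)
}
```

An eval API along these lines would let the tool run tests/tests.txt and generate exemplar FQDNs for submitted suffixes.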

- automation: support diff validation (only report issues for changes in a PR)
- automation: make the validator run on PRs and report errors, once it's robust enough
- automation: support for generating exemplars for suffixes ("here is a fqdn under this suffix that evals true, is that what you expected?")
- automation: support for machine editing? (low priority, after everything else)

- tests: increase synthetic test coverage
- tests: run parser unit tests in CI

weppos commented 1 month ago

@danderson thanks for your contribution. Out of curiosity, are you aware of https://github.com/weppos/publicsuffix-go? Admittedly, the library I built is not a linter but rather an SDK to access the list. In fact, it sounds like what you have done may finally replace this 8+ year stub. 😄

I have not looked deeply into the implementation yet. It sounds like it would be extensible enough to run some additional validations, and will happily co-exist with publicsuffix-go, which instead focuses on transforming a well-formed list into a consumable struct.


@simon-friedberger I saw some references to the Python sorting script, which is still open. I imagine at this point we truly need to figure out which language and toolkit we prefer to use consistently for the project.

I heard we wanted to use Python, but here we have a pretty big counterproposal.

I also wonder if we should consider this linter as a replacement for https://github.com/publicsuffix/list/tree/master/linter. If not yet, we likely have to identify the uncovered parts; otherwise, we have two overlapping linters.

danderson commented 1 month ago

> @danderson thanks for your contribution. Out of curiosity, are you aware of https://github.com/weppos/publicsuffix-go? Admittedly, the library I built is not a linter but rather an SDK to access the list. In fact, it sounds like what you have done may finally replace this 8+ year stub. 😄
>
> I have not looked deeply into the implementation yet. It sounds like it would be extensible enough to run some additional validations, and will happily co-exist with publicsuffix-go, which instead focuses on transforming a well-formed list into a consumable struct.

Yes, I think that's right! This parser is a slower, more extensive validator that does much more exhaustive checking than PSL users need. Notably, it keeps source text references for everything and does not construct efficient data structures for evaluation, so it has a higher memory cost and lower performance.

In exchange, once I implement more validations it should be able to enforce ~all PSL submission policies, with the exception of those that are necessarily human driven ("does this submission feel like it's trying to do something sneaky?"). So, I think it makes sense for both implementations to exist.
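
To illustrate the "keeps source text references for everything" trade-off, a parsed entity might carry its source span alongside the parsed value, so validation errors can point at exact lines. This is a hypothetical shape, not the actual types in this PR:

```go
package main

import "fmt"

// Source records where a parsed entity appeared in the input.
// Keeping the raw text and line range for every entity is what makes
// precise error reporting possible, and also what an evaluation-
// oriented library like publicsuffix-go would deliberately skip to
// save memory.
type Source struct {
	StartLine, EndLine int // 1-based, inclusive
	Raw                string
}

// LocString formats a location for use in an error message.
func (s Source) LocString() string {
	return fmt.Sprintf("lines %d-%d", s.StartLine, s.EndLine)
}

// Suffix is a parsed suffix entry that remembers where it came from.
type Suffix struct {
	Source
	Domain string
}

func main() {
	sfx := Suffix{
		Source: Source{StartLine: 42, EndLine: 42, Raw: "co.uk"},
		Domain: "co.uk",
	}
	// A validator can now report against the original file location.
	fmt.Printf("%s: suffix %q\n", sfx.LocString(), sfx.Domain)
}
```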

> I also wonder if we should consider this linter as a replacement for https://github.com/publicsuffix/list/tree/master/linter. If not yet, we likely have to identify the uncovered parts; otherwise, we have two overlapping linters.

That is my current plan. One of my todos is to port the missing Python linter checks to this parser. Right now I'm doing a bit of refactoring, because I discovered a code structure that makes the parser easier to understand, but after that the plan is pretty much to implement all the validations and hook the tool up to automation!