publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0

tools: add a validating parser for PSL files #1987

Closed: danderson closed this 1 month ago

danderson commented 1 month ago

I was looking to contribute some additional automation and validation to help manage the PSL. I started off looking at porting #1953 to Go, to match the gTLD updating code. Then I got carried away and ended up with this parser. As it's getting fairly large I wanted to send a PR now to see if the direction makes sense, and if so get the initial bones merged before continuing in smaller increments.

Some highlights:

Despite the overall PR size, the meat of the parser is only about 700 LoC. The rest is comments, tests, and validation exceptions. To review, I suggest this reading order:

Some notes and caveats:

Future plans, to get an idea of where I'd like to go with this:

danderson commented 1 month ago

Go package docs are tricky to view for internal packages and on PRs, so here's a temporary view of the parser package docs: https://vega.yak-minor.ts.net/pkg/github.com/publicsuffix/list/tools/internal/parser/

simon-friedberger commented 1 month ago

More PRs are generally good if they do distinct things. We squash-and-merge in this repo, so separate PRs for separate work items stay more cleanly separated.

Once you look at creating a GitHub Action for checking PRs, have a look at https://github.com/publicsuffix/list/blob/master/.github/workflows/check_pr.yml. That is basically what I did there, but it doesn't actually work, and I haven't had time to debug why yet. I'm guessing GitHub is somehow automatically rebasing when showing the changes in the web UI, but the CLI doesn't do that.

If it doesn't simplify the code let's not change the format.

simon-friedberger commented 1 month ago

Changes to the .dat should definitely be in a different PR. Like this: https://github.com/publicsuffix/list/pull/1987/commits/9cd01aeb3c9fa80e442a37f6ba19c19bc9fc4a1b

danderson commented 1 month ago

I removed the .dat edits and the file format change. That means the unit tests currently fail, because `of.by` gets flagged with invalid metadata. I'll send a separate PR to fix that in the .dat file.

The parser unit tests don't get run by the current GitHub Actions workflows (added to my todo!), so if you're happy with this, it can be merged without breaking the build and without waiting for the `of.by` change. Your call.

For the record, here are my TODOs for the next PRs once this is merged. Each item is independent, so each will be a separate PR. They are not in priority order; my current abstract goal is "this tool can report issues on PRs without annoying maintainers", and I'm picking items from the todos based on that.

- refactor: make a helper for string twiddling ops, to make the parser business logic easier to follow
- refactor: adjust error reporting to make it easier to custom-format errors (for better display in GitHub Actions)
- refactor: split errors into "hard" parse errors and "lint" validation errors, so the caller can tell if the file is usable to evaluate FQDNs
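
The hard/lint error split in that last refactor item could look roughly like this Go sketch. The type names (`ParseError`, `LintError`) and the `usable` helper are hypothetical, invented for illustration; they are not this parser's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// LintError marks a policy or style finding that does not prevent
// the file from being used to evaluate FQDNs.
type LintError struct{ Msg string }

func (e LintError) Error() string { return "lint: " + e.Msg }

// ParseError marks a hard structural error: the file cannot be
// trusted for FQDN evaluation.
type ParseError struct{ Msg string }

func (e ParseError) Error() string { return "parse: " + e.Msg }

// usable reports whether a parse result with the given errors can
// still be used to evaluate FQDNs: only lint findings are tolerated.
func usable(errs []error) bool {
	for _, err := range errs {
		var lint LintError
		if !errors.As(err, &lint) {
			return false
		}
	}
	return true
}

func main() {
	errs := []error{LintError{"section out of sorted order"}}
	fmt.Println(usable(errs)) // true: lint findings only
	errs = append(errs, ParseError{"unterminated section"})
	fmt.Println(usable(errs)) // false: a hard parse error is present
}
```

With distinct error types, a GitHub Action could fail the build only on parse errors while posting lint findings as review comments.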

- parsing: be more strict with some whitespace/newline/unicode garbage in the input
- parsing: recognize ": " in block metadata
- parsing: recognize "Confirmed by registry" for email contact metadata
- parsing: handle multiple URLs in metadata
- parsing: handle multiple contact emails in metadata
- parsing: try to improve extracted metadata for punycode TLDs (cosmetic only)
- parsing: try to improve extracted metadata for gTLDs (cosmetic only)
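
To illustrate the multiple-URLs and multiple-emails items, here is a hedged Go sketch of pulling every URL and contact email out of a block's header comment. The `Metadata` struct, `parseHeader`, and the regexes are invented for this example; the real parser's representation and matching rules differ:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Metadata collects information extracted from a suffix block's
// header comment. Field names are illustrative only.
type Metadata struct {
	URLs   []string
	Emails []string
}

var (
	urlRe   = regexp.MustCompile(`https?://\S+`)
	emailRe = regexp.MustCompile(`<([^>]+@[^>]+)>`)
)

// parseHeader scans the comment lines of a block and accumulates
// every URL and contact email it finds, so blocks that list several
// of either are handled instead of keeping only the first.
func parseHeader(lines []string) Metadata {
	var m Metadata
	for _, l := range lines {
		l = strings.TrimPrefix(strings.TrimSpace(l), "//")
		m.URLs = append(m.URLs, urlRe.FindAllString(l, -1)...)
		for _, g := range emailRe.FindAllStringSubmatch(l, -1) {
			m.Emails = append(m.Emails, g[1])
		}
	}
	return m
}

func main() {
	m := parseHeader([]string{
		"// Example Registry : https://example.nic/",
		"// Submitted by Jane Doe <jane@example.nic> and Bob <bob@example.nic>",
	})
	fmt.Println(m.URLs)   // one URL found
	fmt.Println(m.Emails) // both contact emails found
}
```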

- validation: parse and validate the suffix lines, including wildcard exceptions
- validation: port all missing validations from pslint.py
- validation: add API to eval FQDNs, so we can run tests/tests.txt
- validation: block sorting check for private domain section (#1953)
- validation: check existence/validity of _psl DNS records
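
For the first two validation items, the PSL's documented wildcard and exception semantics (exception rules win; otherwise the matching rule with the most labels prevails; with no match, the rightmost label is the suffix) can be sketched in Go roughly as follows. This is a minimal illustration of the matching algorithm, not this parser's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// match reports whether the rule's labels match the tail of the
// domain's labels. "*" in a rule matches any single label.
func match(rule, domain []string) bool {
	if len(rule) > len(domain) {
		return false
	}
	for i := 1; i <= len(rule); i++ {
		r, d := rule[len(rule)-i], domain[len(domain)-i]
		if r != "*" && r != d {
			return false
		}
	}
	return true
}

// publicSuffix returns the public suffix of domain under rules,
// following the PSL wildcard and exception semantics.
func publicSuffix(domain string, rules []string) string {
	labels := strings.Split(domain, ".")
	best := 1 // implicit default rule "*": the rightmost label
	for _, r := range rules {
		exception := strings.HasPrefix(r, "!")
		rl := strings.Split(strings.TrimPrefix(r, "!"), ".")
		if !match(rl, labels) {
			continue
		}
		if exception {
			// An exception rule wins outright; the suffix is the
			// rule with its leftmost label removed.
			n := len(rl) - 1
			return strings.Join(labels[len(labels)-n:], ".")
		}
		if len(rl) > best {
			best = len(rl)
		}
	}
	return strings.Join(labels[len(labels)-best:], ".")
}

func main() {
	rules := []string{"com", "co.uk", "*.ck", "!www.ck"}
	fmt.Println(publicSuffix("foo.example.co.uk", rules)) // co.uk
	fmt.Println(publicSuffix("a.b.ck", rules))            // b.ck (wildcard)
	fmt.Println(publicSuffix("www.ck", rules))            // ck (exception)
}
```

An eval API along these lines would let the tool run tests/tests.txt and generate exemplar FQDNs for submitted suffixes.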

- automation: support diff validation (only report issues for changes in a PR)
- automation: make the validator run on PRs and report errors, once it's robust enough
- automation: support for generating exemplars for suffixes ("here is a fqdn under this suffix that evals true, is that what you expected?")
- automation: support for machine editing? (low priority, after everything else)

- tests: increase synthetic test coverage
- tests: run parser unit tests in CI

weppos commented 1 month ago

@danderson thanks for your contribution. Out of curiosity, are you aware of https://github.com/weppos/publicsuffix-go? Admittedly, the library I built is not a linter but rather an SDK to access the list. In fact, it sounds like what you have done may finally replace this 8+ year stub. 😄

I have not looked deeply into the implementation yet. It sounds like it would be extensible enough to run some additional validations, and will happily co-exist with publicsuffix-go, which instead focuses on transforming a well-formed list into a consumable struct.


@simon-friedberger I saw some references to the Python sorting script, which is still open. I imagine at this point we truly need to figure out which language and toolkit we prefer to use consistently for the project.

I heard we wanted to use Python, but here we have a pretty big counterproposal.

I also wonder if we should consider this linter as a replacement for https://github.com/publicsuffix/list/tree/master/linter. If not yet, we likely have to identify the uncovered parts; otherwise, we have two overlapping linters.

danderson commented 1 month ago

> @danderson thanks for your contribution. Out of curiosity, are you aware of https://github.com/weppos/publicsuffix-go? Admittedly, the library I built is not a linter but rather an SDK to access the list. In fact, it sounds like what you have done may finally replace this 8+ year stub. 😄
>
> I have not looked deeply into the implementation yet. It sounds like it would be extensible enough to run some additional validations, and will happily co-exist with publicsuffix-go, which instead focuses on transforming a well-formed list into a consumable struct.

Yes, I think that's right! This parser is a slower, more extensive validator that does much more exhaustive checking than PSL users need. Notably, it keeps source text references for everything and does not construct efficient data structures for evaluation, so it has a higher memory cost and lower performance.

In exchange, once I implement more validations it should be able to enforce ~all PSL submission policies, with the exception of those that are necessarily human driven ("does this submission feel like it's trying to do something sneaky?"). So, I think it makes sense for both implementations to exist.
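
To illustrate the "keeps source text references for everything" trade-off, a parsed entity might carry its source span alongside the parsed value, so validation errors can point at exact lines. This is a hypothetical shape, not the actual types in this PR:

```go
package main

import "fmt"

// Source records where a parsed entity appeared in the input.
// Keeping the raw text and line range for every entity is what makes
// precise error reporting possible, and also what an evaluation-
// oriented library like publicsuffix-go would deliberately skip to
// save memory.
type Source struct {
	StartLine, EndLine int // 1-based, inclusive
	Raw                string
}

// LocString formats a location for use in an error message.
func (s Source) LocString() string {
	return fmt.Sprintf("lines %d-%d", s.StartLine, s.EndLine)
}

// Suffix is a parsed suffix entry that remembers where it came from.
type Suffix struct {
	Source
	Domain string
}

func main() {
	sfx := Suffix{
		Source: Source{StartLine: 42, EndLine: 42, Raw: "co.uk"},
		Domain: "co.uk",
	}
	// A validator can now report against the original file location.
	fmt.Printf("%s: suffix %q\n", sfx.LocString(), sfx.Domain)
}
```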

> I also wonder if we should consider this linter as a replacement for https://github.com/publicsuffix/list/tree/master/linter. If not yet, we likely have to identify the uncovered parts; otherwise, we have two overlapping linters.

That is my current plan. One of my todos is to port the missing Python linter checks to this parser. Right now I'm doing a bit of refactoring, because I discovered a code structure that makes the parser easier to understand, but after that the plan is pretty much to implement all the validations and hook the tool up to automation!