Closed: danderson closed this 1 month ago
Go package docs are tricky to view for internal packages and on PRs, so here's a temporary view of the parser package docs: https://vega.yak-minor.ts.net/pkg/github.com/publicsuffix/list/tools/internal/parser/
More PRs are generally good if they do distinct things. We squash-and-merge in this repo, so separate PRs for separate work items stay more cleanly separated.
Once you look at creating a GitHub Action for checking PRs, have a look at https://github.com/publicsuffix/list/blob/master/.github/workflows/check_pr.yml. That's basically what I did there, but it doesn't actually work, and I haven't had time to debug why yet. I'm guessing GH is somehow automatically rebasing when showing the changes in the web UI, but the CLI doesn't do that.
If it doesn't simplify the code let's not change the format.
Changes to the .dat should definitely be in a different PR. Like this: https://github.com/publicsuffix/list/pull/1987/commits/9cd01aeb3c9fa80e442a37f6ba19c19bc9fc4a1b
I removed the .dat edits and the file format change. That means unit tests currently fail, because `of.by` gets flagged with invalid metadata. I'll send a separate PR to fix that in the .dat.
The parser unit tests don't get run by the current GitHub Actions workflows (added to my todo!), so if you're happy with this, it can be merged without breaking the build, without waiting for the `of.by` change. Your call.
For the record, once this is merged, here are my TODOs for the next PRs. Each item is independent, so each will be a separate PR. They are not in priority order; my current abstract goal is "this tool can report issues on PRs without annoying maintainers", and I'm picking items from the todos based on that.
- refactor: make helper for string twiddling ops, to make the parser business logic easier to follow
- refactor: adjust error reporting to make it easier to custom-format errors (for better display in GitHub Actions)
- refactor: split errors into "hard" parse errors and "lint" validation errors, so the caller can tell if the file is usable to evaluate FQDNs
- parsing: be more strict with some whitespace/newlines/unicode garbage in the input
- parsing: recognize ": " in block metadata
- parsing: recognize "Confirmed by registry" for email contact metadata
- parsing: handle multiple URLs in metadata
- parsing: handle multiple contact emails in metadata
- parsing: try to improve extracted metadata for punycode TLDs (cosmetic only)
- parsing: try to improve extracted metadata for gTLDs (cosmetic only)
- validation: parse and validate the suffix lines, including wildcard exceptions
- validation: port all missing validations from pslint.py
- validation: add API to eval FQDNs, so we can run tests/tests.txt
- validation: block sorting check for the private domains section (#1953)
- validation: check existence/validity of _psl DNS records
- automation: support diff validation (only report issues for changes in a PR)
- automation: make the validator run on PRs and report errors, once it's robust enough
- automation: support for generating exemplars for suffixes ("here is a fqdn under this suffix that evals true, is that what you expected?")
- automation: support for machine editing? (low priority, after everything else)
- tests: increase synthetic test coverage
- tests: run parser unit tests in CI
@danderson thanks for your contribution. Out of curiosity, are you aware of https://github.com/weppos/publicsuffix-go? Admittedly, the library I built is not a linter but rather an SDK to access the list. In fact, it sounds like what you have done may finally replace this 8+ year stub. 😄
I have not yet looked deeply into the implementation. It sounds like it would be extensible enough to run some additional validations, and will happily co-exist with publicsuffix-go, which instead focuses on the transformation of a well-formed list into a consumable struct.
@simon-friedberger I saw some reference to the Python sorting script, which is still open. I imagine at this point we really need to figure out which language and toolkit we prefer to use consistently for the project. I heard we wanted to use Python, but here we have a pretty big counter-proposal.
I also wonder if we should consider this linter as a replacement for https://github.com/publicsuffix/list/tree/master/linter. If not yet, we likely have to find the uncovered parts; otherwise, we have two overlapping linters.
> @danderson thanks for your contribution. Out of curiosity, are you aware of https://github.com/weppos/publicsuffix-go? Admittedly, the library I built is not a linter but rather an SDK to access the list. In fact, it sounds like what you have done may finally replace this 8+ year stub. 😄
>
> I have not yet looked deeply into the implementation. It sounds like it would be extensible enough to run some additional validations, and will happily co-exist with publicsuffix-go, which instead focuses on the transformation of a well-formed list into a consumable struct.
Yes, I think that's right! This parser is a slow(er), more extensive validator that does much more exhaustive checking than PSL users need. Notably, it keeps source text references for everything and does not construct efficient data structures for evaluation, so it has higher memory cost and lower performance.
In exchange, once I implement more validations it should be able to enforce ~all PSL submission policies, with the exception of those that are necessarily human driven ("does this submission feel like it's trying to do something sneaky?"). So, I think it makes sense for both implementations to exist.
> I also wonder if we should consider this linter as a replacement for https://github.com/publicsuffix/list/tree/master/linter. If not yet, we likely have to find the uncovered parts; otherwise, we have two overlapping linters.
That is my current plan. One of my todos is to port the missing Python linter checks to this parser. Right now I'm doing a bit of refactoring, because I discovered a code structure that makes the parser easier to understand, but after that the plan is pretty much: implement all the validations and hook it up to automation!
I was looking to contribute some additional automation and validation to help manage the PSL. I started off looking at porting #1953 to Go, to match the gTLD updating code. Then I got carried away and ended up with this parser. As it's getting fairly large, I wanted to send a PR now to see if the direction makes sense and, if so, get the initial bones merged before continuing in smaller increments.
Some highlights:
Despite the overall PR size, the meat of the parser is only about 700LoC. The rest is comments, tests and validation exceptions. To review, I suggest this reading order:
Some notes and caveats:

- To run: `cd tools && go run ./govalidate ../public_suffix_list.dat`. Add `--with-warnings` to see validation errors that have legacy exceptions.
- `bv : No registrations at this time` instead of just `bv`.
- `Confirmed by registry <email>`, which is not recognized as a contact email, e.g. `.kw`.
- `.fr` and `dyn.com` blocks.
- `Adobe` and `Cloud DNS Ltd` blocks.
- `xn--54b7fta0cc ("Bangla", Bangla) : BD`. The comment blocks for these TLDs have a regular structure that could be used to get a more pleasing result.
- `<tld> : <registry name>`, which are a correct parse but could be improved.

Future plans, to get an idea of where I'd like to go with this:
- `_psl` DNS records and domain expiration dates. The parser's support for exceptions is crucial here, since many entries predate these requirements and will need to be brought into the fold gradually.