publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0
2.03k stars 1.22k forks source link

Automatic PSL formatting: what fixes should be applied? #2028

Closed danderson closed 4 weeks ago

danderson commented 3 months ago

I have a WIP that can automatically fix formatting issues in a PSL file, as well as apply machine edits while preserving correct formatting. It will take a few more PRs to merge it all, but you can peek at https://github.com/danderson/list/tree/danderson/psl-parser-format .

The idea is to give contributors a tool that they run before sending a PR, and it (a) tells them everything that they need to fix, but also (b) just automatically fixes things that don't require human judgement, like "did you sort these strings correctly". This also makes automated editing easier, because scripts can operate on an abstract syntax tree, and not have to worry about exactly how to lay out text bytes. Just adjust AST nodes, then validate+format+write the file back out and get a clean diff.

I'm opening this issue to get feedback from PSL maintainers: what formatting fixes do you want me to implement? @dnsguru @simon-friedberger

I've already implemented/thought of several, listed below. Each one requires a one-time PSL reformat to fix existing issues, with diffs ranging from <10 lines to >1000 lines. After that, we can require new contributions to run the formatter before sending the PR (and enforce it in CI), and there will be no random churn.

Each of the listed things is independent from the others, you can pick any combination. These changes are all purely cosmetic, and do not change the set of public suffixes, or the meaning of any comments (modulo bugs of course, but I'm confident I can reduce the risk of that with tests, pre/post format sanity checks, etc.).

  1. Basic consistency edits: remove leading/trailing whitespace, list wildcard exceptions immediately after corresponding wildcard, consistent inter-block spacing: patch

    • Technically this one does 3 independent things right now. I can separate them if you only want a subset. These are the changes that are more costly to not do, because avoiding this diff means I have to make the parser more complex to preserve the inconsistencies.
  2. Sort blocks in the private section, by company/product/owner name. This automates #1858, or at least one half of it. Somewhat large change, but not unjustified: all those blocks are indeed in the wrong order today. patch

  3. Sort suffixes in the private section in addition to the blocks. This is the other half of #1858. Patch for this one is tiny, 2 suffixes move by a few lines. patch

  4. Sort suffixes in the ICANN section. Large initial diff, due to the older TLD/ccTLD blocks with hand-curated information, and order changes in non-Latin scripts (I spot-checked some diffs with native speakers, they agree the new order is "correct" for their languages). patch

  5. Canonicalize metadata comments in the private section. The format I selected for the diff is "the most popular current format", to get a minimal diff. We can pick any format we like though. patch

    • The formatter only changes a comment when it has high confidence that it's not altering the meaning. This still has bugs in my WIP and some comments get mangled, please assume I will fix them and verify the diff if you want this reformat.
  6. Sort blocks in the ICANN section. I haven't tried this yet, but I assume the diff is going to be massive, and it needs coordination with the GTLD update script so they don't fight each other. But it's possible and pretty easy to implement, if you want it.

FWIW, my personal opinion:

simon-friedberger commented 2 months ago

I agree with what I think you're proposing. Let's do the private section bits first and do all of them. Let's figure out how to integrate this with the already automatic script for the ICANN section later.

danderson commented 2 months ago

Acknowledged. I've set the initial configuration to focus on the private section, and we can look at the icann block later.

dnsguru commented 1 month ago

@danderson looks like there is a strange trailing crlf just before the ICANN section close that is causing the automation to trigger a PR daily.

Were you the last to touch this? Can you look at that?

danderson commented 1 month ago

Likely related to the reformatting change, yes. I'm on it, apologies for the noise.

dnsguru commented 1 month ago

all good. your contributions towards evolving this are app appreciated immensely and things are never perfect in change.

please @ me on merges on the automaion

danderson commented 1 month ago

Looking at the timings on the git history, I believe #2093 fixed this, along with the submission of #2103. Timeline (times in US Pacific timezone because that's what github's giving me):

Bottom line is this mess was caused by my neglecting to adjust the gtld updater's output to match the reformat done in #2088, which caused the cronjob to keep trying to make the diff go away until Simon merged a fix. And then one last noisy change to make that newline go away once and for all. I'll be more careful in future, and will add "check what the automation runs would do" to my PR checklist.

The good news is the noise should be over now. The gtld updater and 'psltool fmt' both agree on what the file layout should be, so there should be no more spurious PRs.

dnsguru commented 1 month ago

ok - another thing to watch is that the date will always be different in ICANN's json file as they generate it daily, so maybe ignore it if it is the sole diff

danderson commented 1 month ago

The gtld updater already handles that gracefully, thankfully. The timestamped header isn't included in the "has this changed" check, so something other than the timestamp has to change for it to write a new file and create a PR.