publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0
1.93k stars 1.18k forks source link

tools/internal/parser: validate the sort order of the private section #2012

Closed danderson closed 2 weeks ago

danderson commented 2 weeks ago

The diffstat is pretty huge, but most of it is tests and exceptions for the entries that are in the wrong order today. The core of the change is unicode.go (huge explanation and 4 lines of code), and validate.go for the ordering check itself.

I added all the currently incorrectly sorted things to the exception list for now, once I implement a pslfmt command we can reorder all of them at once instead of fixing each one manually.

Validation error looks like this with a single wrong block:

$ govalidate public_suffix_list.dat
Suffix block "SpaceKit" is in the wrong place, it should go immediately after block "SourceHut"

And with multiple wrong blocks:

$ govalidate public_suffix_list.dat
3 suffix blocks are in the wrong place, make these changes to fix:
    move block: Sony Interactive Entertainment LLC
         after: Snowplow Analytics
    move block: SourceLair PC
         after: SourceHut
    move block: SpaceKit
         after: SourceLair PC

Note there is also a fixup commit that deletes tests for workarounds that were deleted in 3b1b0d0c5efa278396d976b659e38d8ca346eba0.

simon-friedberger commented 2 weeks ago

Let's use the Basic English Ordering because it seems quite odd that "b-data" goes between "base" and "beagleboard".

simon-friedberger commented 2 weeks ago

Merging this and then looking at a pslfmt(1) is also good.

danderson commented 2 weeks ago

Sounds good, you should have a completed PR for review tomorrow (in your timezone). Thanks!

danderson commented 2 weeks ago

Okay, change is ready for review now, thanks for bearing with me @simon-friedberger. I updated the PR description with sample validation errors and other info.

simon-friedberger commented 2 weeks ago

@danderson Regarding superblocks: Let's simplify the format to simplify the parsing!

Maybe we can generate a JSON version to make the structure we want more explicit and discuss any other data we might want to add?

Since you already are massaging the data into an internal structure and testing that you can write it back, do you think you could make a JSON output as well? Such that this could function as a two way converter?

danderson commented 2 weeks ago

@danderson Regarding superblocks: Let's simplify the format to simplify the parsing!

We can do that! Given the risk of bikeshedding and churn with new formats, I was trying to stick closely to the existing bytes. However, once we have an automatic formatter that does the right thing, we can experiment much easier. I have a vague idea for a convention for metadata in comments that is easy for humans and machines to read, I will propose it once more TODOs are done :)

Maybe we can generate a JSON version to make the structure we want more explicit and discuss any other data we might want to add?

Since you already are massaging the data into an internal structure and testing that you can write it back, do you think you could make a JSON output as well? Such that this could function as a two way converter?

Yes, I think it's possible, with caveats.

The parser needs some changes to generate a tree (I was too optimistic outputting a flat list :sob:). I have those partially done already, they're necessary for precise reformatting anyway. Once we have the parse tree, outputting it in another form is easy.

JSON output specifically: the output in that case will have to look like a "raw" parse tree if we want it to be a bidirectional transform, because we have to preserve the layout and raw text of comments. That's still a useful format to discuss structure changes, but a downstream-friendly format will be a one-way transform from .dat to .json.

Alternatively, we can convert to "JSON With Commas and Comments" (JWCC), aka HuJSON, aka a subset of JSON5. I believe the HuJSON Go parser already preserves enough layout information so the transformation can work both ways, and I think it would be possible to make this into a format that doesn't suck for external use as well, if we want to.

Either way, my TODOs are the same: make the parser generate a real AST not a flat list, then we can do automated edits on the current format, and also look at producing other representations.

simon-friedberger commented 2 weeks ago

JSON output specifically: the output in that case will have to look like a "raw" parse tree if we want it to be a bidirectional transform, because we have to preserve the layout and raw text of comments. That's still a useful format to discuss structure changes, but a downstream-friendly format will be a one-way transform from .dat to .json.

Medium-term, I hope we can streamline the layout to where this can be generated from a tree.