publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0
1.97k stars 1.2k forks source link

Add an alphabetic check for submissions #1953

Closed sjledoux closed 2 months ago

sjledoux commented 4 months ago

This PR attempts to add a check that will assess whether the "business name" of entries in the private domain are in alphabetical order, and that the domains listed under each entry are in the correct order at time of submission.

The check contained here attempts to use a regular expression to identify entries into the private domains list. However, this is not perfect - it seems like there is some inconsistency with the formatting of entries, making it difficult to pick a perfect way of identifying entries in the current list. The current expression identifies the vast majority of entries from what I can tell, but happy to use an alternative method if one exists.

Additionally, while creating this check it became apparent that there are currently many entries in the list that are not sorted correctly according to the existing guidelines. I'm happy to re-sort the list accordingly in this PR or in a preceding PR before this is merged.

anthonyryan1 commented 4 months ago

Not a part of this project, but I think it would be valuable to add this to make test as well, which is part of the submission instructions. It'll allow people to evaluate order problems before submitting to GitHub.

It's beneficial to ensure all tests are run as part of make test in my opinion, or people may mistakenly think they've completed all the checks without running this.

simon-friedberger commented 4 months ago

Domains sorted by the ruby script linked from the docs:

cdn77-ssl.net r.cdn77.net ssl.origin.cdn77-secure.org c.cdn77.org rsc.cdn77.org

Which makes this script complain:

The submission for CDN77.com contains domains in an improper order, which are: ('cdn77-ssl.net', 'r.cdn77.net') ('ssl.origin.cdn77-secure.org', 'c.cdn77.org')

I think the ruby script is wrong. It sorts cdn77-ssl before cdn77 because - < . but that is just an artifact of it concatenating the strings.

You get the same behavior with this Python code:

#!/usr/bin/env python3

import sys
lines = sys.stdin.readlines()
lines = list(map(lambda line: line.strip(), lines))
sorted_lines = sorted(lines, key=lambda x: ".".join(list(reversed(x.split(".")))))
for line in sorted_lines:
    print(line)

while what we want is

sorted_lines = sorted(lines, key=lambda x: list(reversed(x.split("."))))
simon-friedberger commented 3 months ago

@sjledoux Relaying @weppos: Can you please add tests?

simon-friedberger commented 2 months ago

@yahesh Would you maybe be interested in writing some tests for this?

weppos commented 2 months ago

@sjledoux Relaying @weppos: Can you please add tests?

Python is not my primary area of expertise, hence I'll need to first figure out how what is the standard way to unit tests in this language. That will take some extra time.

weppos commented 2 months ago

@simon-friedberger given https://github.com/publicsuffix/list/pull/1987, should we migrate this code to the linter and have a single toolkit?

simon-friedberger commented 2 months ago

@simon-friedberger given #1987, should we migrate this code to the linter and have a single toolkit?

Yes, definitely. Since @danderson already said he will implement this check I will close this.

danderson commented 2 months ago

Confirming, I have it partially implemented, should have a PR ready in a day or two once I've relearned unicode collations properly and made it give good diagnostics.