publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0

Incorrect PSL evaluation rules in the wiki regarding implicit wildcard rules #1989

Closed ko-zu closed 4 weeks ago

ko-zu commented 1 month ago

Hello,

I believe there is confusion regarding the wildcard evaluation rules in the wiki. https://github.com/publicsuffix/list/wiki/Format

In the example,

  1. com
  2. *.foo.com

...

Cookies may be set for foo.com.

This is incorrect. Cookies may NOT be set for foo.com. According to the linter, *.example.com implies example.com.

https://github.com/publicsuffix/list/blame/de747b657fb0f479667015423c12f98fd47ebf1d/linter/pslint.py#L230

The current rules in the wiki state that the domain example.com does not match *.example.com. However, some TLDs (e.g. *.bd) do not have explicit TLD declarations, which are required and checked by the linter. ~~According to the wiki rules, we can set cookies on bd. This is not the expected behavior.~~

kobe.jp does not have an explicit public suffix declaration while *.kobe.jp exists, as the linter requires and checks. According to the wiki rules, we can set cookies on kobe.jp. (Fixed to reflect simon-friedberger's comment. Thanks!)

Implementations should assume example.com is also declared if *.example.com exists.

In the DNS world, this assumption is incorrect. However, the test data includes lines like:

checkPublicSuffix('city.kobe.jp', 'city.kobe.jp');
checkPublicSuffix('www.city.kobe.jp', 'city.kobe.jp');

While the PSL only has:

jp
*.kobe.jp
!city.kobe.jp

If we follow the wiki rules, kobe.jp should be on the right.

Problems in the Wiki

The example above comes from the following evaluation rule definition:

A domain is said to match a rule if and only if all of the following conditions are met:

  • When the domain and rule are split into corresponding labels, that the domain contains as many or more labels than the rule.
  • Beginning with the right-most labels of both the domain and the rule, and continuing for all labels in the rule, one finds that for every pair, either they are identical, or that the label from the rule is "*".

According to this definition, the domain example.com cannot match *.example.com because the domain has fewer labels than the listed rule.
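To make the consequence concrete, here is a minimal sketch of the two conditions quoted above. The function name `rule_matches` is made up for illustration, and exception rules and input normalization are left out; it is not taken from the linter or from any existing library.

```python
def rule_matches(domain: str, rule: str) -> bool:
    """Return True if `domain` matches `rule` under the two quoted conditions."""
    domain_labels = domain.lower().split(".")
    rule_labels = rule.lower().split(".")

    # Condition 1: the domain has at least as many labels as the rule.
    if len(domain_labels) < len(rule_labels):
        return False

    # Condition 2: compare label pairs from the right, for all labels in
    # the rule; a "*" label in the rule matches any domain label.
    pairs = zip(reversed(domain_labels), reversed(rule_labels))
    return all(r == "*" or r == d for d, r in pairs)


print(rule_matches("www.example.com", "*.example.com"))  # True
print(rule_matches("example.com", "*.example.com"))      # False: too few labels
```

The second call fails on the first condition alone, before any labels are compared.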

If we need to maintain this scheme, I believe we should add:

Additionally, the statement "leading and trailing dots are ignored" is also incorrect. The test data explicitly disallows a leading dot.

If you have write permission to the wiki, please look into the issues above.

Thank you.

simon-friedberger commented 1 month ago

Thanks @ko-zu! Let me try to take this step by step...

In the example,

  1. com
  2. *.foo.com

... Cookies may be set for foo.com.

This is incorrect. Cookies may NOT be set for foo.com. According to the linter, *.example.com implies example.com.

Well, whether you can or cannot set a cookie on a public suffix is an issue for the browsers, so I'll translate "Cookies may NOT be set on ..." to "... is a public suffix."

This is a known issue: #694. Firefox and (afaik) Chrome behave the same as the linter. I'll add a note to the wiki. However, both Firefox and Chrome still let you set cookies on public suffixes.

simon-friedberger commented 1 month ago

The current rules in the wiki state that the domain example.com does not match *.example.com. However, some TLDs (e.g. *.bd) do not have explicit TLD declarations, which are required and checked by the linter.

I'm not sure what your question is. https://github.com/publicsuffix/list/wiki/Format#algorithm states that If no rules match, the prevailing rule is "*".

So specific entries, e.g. for bd, are not necessary. This is useful because sometimes new TLDs are created but the PSL used, e.g., in a user's browser doesn't get updated. Does that answer your question? (This has also been a point of contention in the past: https://github.com/rockdaboot/libpsl/issues/48)
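A minimal sketch of that fallback, assuming the matching definition from the wiki and ignoring exception rules. The name `public_suffix_wiki` and the one-rule list are made up for illustration, and "newtld" stands in for a TLD the list has never heard of.

```python
def public_suffix_wiki(domain: str, rules: set[str]) -> str:
    """Public suffix under the wiki algorithm, without exception rules (sketch)."""
    labels = domain.lower().split(".")

    def matches(rule: str) -> bool:
        rule_labels = rule.split(".")
        if len(labels) < len(rule_labels):
            return False
        pairs = zip(reversed(labels), reversed(rule_labels))
        return all(r in ("*", d) for d, r in pairs)

    matching = [rule for rule in rules if matches(rule)]
    # If no rules match, the prevailing rule is "*".
    prevailing = max(matching, key=lambda r: len(r.split(".")), default="*")
    # The suffix is the domain's right-most labels, as many as the rule has.
    return ".".join(labels[-len(prevailing.split(".")):])


rules = {"*.bd"}  # hypothetical list with no explicit "bd" entry
print(public_suffix_wiki("bd", rules))                  # bd (via the "*" default)
print(public_suffix_wiki("www.example.bd", rules))      # example.bd
print(public_suffix_wiki("www.example.newtld", rules))  # newtld
```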

According to the wiki rules, we can set cookies on bd. This is not the expected behavior. Implementations should assume example.com is also declared if *.example.com exists.

In the DNS world, this assumption is incorrect. However, the test data includes lines like:

What assumption? Sorry, I'm not following here.

checkPublicSuffix('city.kobe.jp', 'city.kobe.jp');
checkPublicSuffix('www.city.kobe.jp', 'city.kobe.jp');

While the PSL only has:

jp
*.kobe.jp
!city.kobe.jp

If we follow the wiki rules, kobe.jp should be on the right.

Indeed, that looks like a bug in the test cases. Since !city.kobe.jp is on the list it can hardly be returned as a public suffix.

simon-friedberger commented 1 month ago

Problems in the Wiki

The example above comes from the following evaluation rule definition:

A domain is said to match a rule if and only if all of the following conditions are met:

  • When the domain and rule are split into corresponding labels, that the domain contains as many or more labels than the rule.
  • Beginning with the right-most labels of both the domain and the rule, and continuing for all labels in the rule, one finds that for every pair, either they are identical, or that the label from the rule is "*".

According to this definition, the domain example.com cannot match *.example.com because the domain has fewer labels than the listed rule.

And indeed, by the algorithm given in the wiki - which is, as stated above, contentious and not the algorithm in browsers - example.com should not match *.example.com. I hope that the warning I added is enough to point to the issue for now. When somebody is using the PSL, they probably need to apply some judgement on how to interpret it.

Additionally, the statement "leading and trailing dots are ignored" is also incorrect. The test data explicitly disallows a leading dot.

I don't have a particular opinion here. Is there any case when such inputs might actually occur?

ko-zu commented 1 month ago

I'm not sure what your question is. https://github.com/publicsuffix/list/wiki/Format#algorithm states that If no rules match, the prevailing rule is "*".

I understand. *.bd should not have been used as an example here. Sorry for the confusion. The point was that the linter rejects a list in which *.kobe.jp and kobe.jp coexist, even though kobe.jp should be a public suffix.
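As a rough illustration of the kind of check I mean (this is not the code in pslint.py; the function name `redundant_parent_rules` is hypothetical), a lint pass could flag any rule "x" that is listed next to a wildcard rule "*.x":

```python
def redundant_parent_rules(rules: list[str]) -> list[str]:
    """Return rules "x" that are listed alongside a wildcard rule "*.x"."""
    listed = set(rules)
    return sorted(
        rule[2:]
        for rule in listed
        if rule.startswith("*.") and rule[2:] in listed
    )


# The current entries pass; adding an explicit kobe.jp would be flagged.
print(redundant_parent_rules(["jp", "*.kobe.jp", "!city.kobe.jp"]))
print(redundant_parent_rules(["jp", "kobe.jp", "*.kobe.jp", "!city.kobe.jp"]))
```

Under such a check the list can never declare kobe.jp explicitly, so an implementation can only treat it as a public suffix by reading it out of *.kobe.jp.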

I think the wiki's rule is flawed because:

1. If we strictly follow the wiki, kobe.jp must NOT be a public suffix. However, according to the JPRS site, kobe.jp has been reserved since 2005 and is not registrable, so it should be considered a public suffix. The official declarations can be found in section 2.1, "Prefecture names and ordinance-designated city names" (都道府県名および政令指定都市名), and in the table "Appendix 2. Ordinance-designated city labels" (■付録2. 政令指定都市ラベル): https://jprs.jp/doc/rule/prefecturejp-reserved.html

The wiki's current rule requires kobe.jp to be explicitly declared, but that conflicts with the linter's rule, the existing tests, and JPRS's explanation. Only the wiki states otherwise.

2.

Indeed, that looks like a bug in the test cases. Since !city.kobe.jp is on the list it can hardly be returned as a public suffix.

I think this is a bug in the wiki. #1998 should not alter the test case. The existing test case matches the linter's rule. It is correct under the rule that *.kobe.jp implies kobe.jp, and it also follows JPRS's explanation. Only the wiki states it is incorrect.

3. Additionally, the statement "leading and trailing dots are ignored" is also incorrect. The test data explicitly disallows a leading dot.

    I don't have a particular opinion here. Is there any case when such inputs might actually occur?

I believe this is a bug in the wiki. A leading dot, which implies an empty label, is not valid according to the test cases.
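A minimal sketch of what rejecting such input could look like, assuming invalid input is signalled by returning `None`. The helper name `reject_leading_dot` is made up, and trailing-dot handling is deliberately left out because it is not settled here.

```python
from typing import Optional


def reject_leading_dot(domain: str) -> Optional[str]:
    """Return the domain unchanged, or None if it starts with an empty label."""
    labels = domain.split(".")
    # ".example.com" splits into ["", "example", "com"]; the empty
    # left-most label is what makes the input invalid in this sketch.
    if labels[0] == "":
        return None
    return domain


print(reject_leading_dot(".example.com"))  # None
print(reject_leading_dot("example.com"))   # example.com
```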

simon-friedberger commented 1 month ago

I don't understand your logic here. You're saying something.kobe.jp is not registrable but kobe.jp should still be a public suffix. Why?

ko-zu commented 1 month ago

I don't understand your logic here. You're saying something.kobe.jp is not registrable but kobe.jp should still be a public suffix. Why?

jp, kobe.jp, and something.kobe.jp are public suffixes, which cannot be registered by anyone. This follows from the list: jp is declared explicitly, kobe.jp is implied by *.kobe.jp, and something.kobe.jp matches *.kobe.jp and is not excluded by !city.kobe.jp.

example.something.kobe.jp is the first non-public suffix domain in the hierarchy, which can be registered by an appropriate organization.

The wiki rule (incorrectly, I think) states that only jp and something.kobe.jp are public suffixes and that kobe.jp is registrable.
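The two readings can be compared side by side with a small sketch. It assumes the wiki's matching and exception handling, plus an `implied_parent` switch (my own name, not from any implementation) that treats "*.x" as also declaring "x", which is how I understand the linter and the test data.

```python
RULES = ["jp", "*.kobe.jp", "!city.kobe.jp"]


def matches(labels: list[str], rule_labels: list[str]) -> bool:
    if len(labels) < len(rule_labels):
        return False
    pairs = zip(reversed(labels), reversed(rule_labels))
    return all(r in ("*", d) for d, r in pairs)


def public_suffix(domain: str, rules: list[str], implied_parent: bool) -> str:
    labels = domain.lower().split(".")
    normal, exceptions = [], []
    for rule in rules:
        if rule.startswith("!"):
            exceptions.append(rule[1:].split("."))
            continue
        rule_labels = rule.split(".")
        normal.append(rule_labels)
        # Assumed alternative reading: "*.x" also declares "x".
        if implied_parent and rule_labels[0] == "*":
            normal.append(rule_labels[1:])

    for exc in exceptions:
        if matches(labels, exc):
            # An exception rule prevails; drop its left-most label.
            return ".".join(exc[1:])

    matching = [r for r in normal if matches(labels, r)]
    prevailing = max(matching, key=len, default=["*"])
    return ".".join(labels[-len(prevailing):])


for name in ["jp", "kobe.jp", "something.kobe.jp", "example.something.kobe.jp"]:
    wiki = public_suffix(name, RULES, implied_parent=False)
    implied = public_suffix(name, RULES, implied_parent=True)
    print(f"{name}: wiki={wiki} implied={implied}")
```

For these four names the two readings differ only for kobe.jp: the wiki reading returns jp as its public suffix, while the implied-parent reading returns kobe.jp itself.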

simon-friedberger commented 1 month ago

I agree with you that the algorithm in the wiki says kobe.jp is not a public suffix while a lot of implementations say kobe.jp is a public suffix. I am asking why you are saying

However, according to the JPRS site, kobe.jp has been reserved since 2005 and is not registrable, so it should be considered a public suffix.

While the first sentence on publicsuffix.org is

A "public suffix" is one under which Internet users can (or historically could) directly register names.

Why do you think it should be defined as a public suffix? You're basically saying you want the second algorithm and I would like to know why as input for #694.

ko-zu commented 1 month ago

If you mean that any public suffix must allow its direct child domains to be registrable, I don't think so. Could you clarify your point if this is not what you meant?

The bd TLD would be an example in this case. The children of bd (e.g., example.bd) are not allowed to be registered as declared by *.bd, but grandchildren (e.g., example2.example.bd) are allowed. However, the bd TLD is still considered a public suffix.

A "public suffix" is one under which Internet users can (or historically could) directly register names.

I think this sentence describes a possible situation, not an exhaustive definition. If one strictly follows this sentence as the only definition, then bd must not be a public suffix, and the TLD rule (implied * if absent) must be removed. That is against expectation.

I believe a domain that is reserved to allocate some (or all) of its descendants (including direct children) for registrants is a sufficient condition, but not an exhaustive definition, for being a public suffix. It is not limited to this one criterion. The private section of a public suffix might have a different form of definition. Historical TLD use might have another exceptional rule.

The reason I posted this issue was to resolve the conflicting rules between the wiki and the test case/linters. The self-contradictory definitions in this repository should be resolved either way. I believe the test case and linter are correct, supported by implementations and intended use cases.

Regarding the old issue, it seems some comments were concerned about existing implementations that do not follow the test case. I do not think this is a reason to leave the conflicting rules in place. I believe it would be better to have one self-consistent rule in this repository and put a notice about the possible differences between implementations instead.

If someone needs to follow a definition from a specific revision of the wiki, they can choose such an implementation regardless of what the current rule is.

simon-friedberger commented 4 weeks ago

I've added a warning with a link to #694 to the description in the wiki. I think the rest of this is becoming a discussion that belongs in #694. Please add your arguments there!