Inconsistent leading dot tests for "example.com"

bkirz commented 8 years ago

Hi! My team is building a publicsuffix library for Elixir and has run into an issue with the leading dot tests in the test file being inconsistent with the spec. The spec says "A domain or rule can be split into a list of labels using the separator '.' (dot). The separator is not part of any of the labels. Empty labels are not permitted, meaning that leading and trailing dots are ignored." This seems inconsistent with the following tests, which suggest that a domain with a leading dot should not match any rules:

checkPublicSuffix('.example.com', null);
checkPublicSuffix('.example.example', null);

and

checkPublicSuffix('example.example', 'example.example');

Is there an inconsistency here, or are we understanding the spec and/or tests incorrectly?

rockdaboot commented 8 years ago

The tests seem to be ok, but the spec is not really clear:

"Empty labels are not permitted" -> A leading or trailing dot implies an empty label, thus the check should return NULL.

"meaning that leading and trailing dots are ignored", well... ignored if you check for the string being a public suffix. But 'checkPublicSuffix()' does not do that. The right argument is the 'shortest private suffix' that can be constructed from the left argument. And in that the test should fail if there is a leading or trailing dot.

What we need is a test file with domains and a boolean value that says if the domain is a public suffix or not. IMO, there would be much less confusion.

BTW: AFAIR, there is tests.txt obsoleting test_psl.txt (just easier to parse).

weppos commented 8 years ago

@benkirzhner I think @rockdaboot essentially answered the question. Notice that in the first two lines, the second argument is null which essentially means "the input is invalid or not processable".

How you should specifically handle this, it's an implementation detail that is not enforced by the PSL guidelines. In my Ruby lib I perform some pre-validations and return an error.

PS. As @rockdaboot mentioned, I encourage you to use the new tests.txt file. I'll place a note on the old file, it will eventually be removed at some point.

The tests are currently the same, but the tests.txt file is a more language and implementation independent.

PSS. By coincidence (and I'm kind of shocked) I started working on an Elixir library a few days ago (I already developed a Ruby lib and Go lib. I'd be happy to contribute if you want, I definitely don't want to double the effort if you are at more advanced stage.

bkirz commented 8 years ago

@rockdaboot:

The tests seem to be ok, but the spec is not really clear:

"meaning that leading and trailing dots are ignored", well... ignored if you check for the string being a public suffix. But 'checkPublicSuffix()' does not do that. The right argument is the 'shortest private suffix' that can be constructed from the left argument. And in that the test should fail if there is a leading or trailing dot.

Since the tested behavior appears to be what's supported by existing libraries, we'll change our implementation to conform. Still, the spec's formal algorithm doesn't seem unclear or ambiguous:

"Here is an algorithm for determining the Public Suffix of a domain."
"A domain or rule can be split into a list of labels using the separator "." (dot). The separator is not part of any of the labels. Empty labels are not permitted, meaning that leading and trailing dots are ignored."

"...leading and trailing dots are ignored" sounds like it means that .foo.bar.com should be treated the same as foo.bar.com. I don't know how to read it any other way. Can the language in the spec be changed to improve the clarity?

What we need is a test file with domains and a boolean value that says if the domain is a public suffix or not. IMO, there would be much less confusion.

This sounds like a great idea in addition to the existing tests file. The business reason we have for building the Elixir library is so we can group URLs by shortest private suffix, and having explicit test cases for that behavior is useful.

@weppos:

PS. As @rockdaboot mentioned, I encourage you to use the new tests.txt file. I'll place a note on the old file, it will eventually be removed at some point.

Thanks for the link; we've already switched to the new one. FYI, the website currently links to the old test data file. Are there plans to update the website to link to the new file?

PSS. By coincidence (and I'm kind of shocked) I started working on an Elixir library a few days ago (I already developed a Ruby lib and Go lib. I'd be happy to contribute if you want, I definitely don't want to double the effort if you are at more advanced stage.

Our project is currently a private repo as we build out the initial implementation and as we going through the process of making it public. Once we get the thumbs up to open source, we'd be happy to coordinate and add you as a contributor.

weppos commented 8 years ago

"...leading and trailing dots are ignored" sounds like it means that .foo.bar.com should be treated the same as foo.bar.com. I don't know how to read it any other way. Can the language in the spec be changed to improve the clarity?

We have to distinguish between a rule defined as .foo.bar.com and an input passed as .foo.bar.com.

A rule cannot contain trailing dots. We have a test suite that ensures we don't incorporate such kind of rules by mistake. An input is, of course, dependent by the application itself. The constraint/suggestion is that you should pass an input without traling and leading dots.

I will be happy to try to reformulate the docs.

What we need is a test file with domains and a boolean value that says if the domain is a public suffix or not. IMO, there would be much less confusion.

This sounds like a great idea in addition to the existing tests file. The business reason we have for building the Elixir library is so we can group URLs by shortest private suffix, and having explicit test cases for that behavior is useful.

@rockdaboot @benkirzhner can you elaborate? For the purpose of the algorithm, there is no difference between a private domain or a registry suffix. The process is the same.

The semantic is currently assigned by the position in the list.

FYI, the website currently links to the old test data file. Are there plans to update the website to link to the new file?

Thanks, I'll update it.

FYI, the website currently links to the old test data file. Are there plans to update the website to link to the new file?

👍

rockdaboot commented 8 years ago

@benkirzhner @weppos

What we need is a test file with domains and a boolean value that says if the domain is a public suffix or not. IMO, there would be much less confusion.

This sounds like a great idea in addition to the existing tests file. The business reason we have for building the Elixir library is so we can group URLs by shortest private suffix, and having explicit test cases for that behavior is useful.

libpsl uses [1] a hand-selected list of tests and [2] auto-generated tests from the PSL itself.

While we could put the test data from [1] into a new file + plus your suggestions, [2] maybe should stay more of an algorithm !? I also could provide a script to auto-generate a file with all those tests from [2]. WDYT ?

[1] test-is-public.c [2] test-is-public-all.c

P.S.: the algorithm of [2] in short (not all yet implemented):

take each rule frompublic_suffix_list.dat
if rule is an exception:
- domain is NOT be a PS
- random replacing the left label IS a PS
- domain without left label is a PS (there should be a wildcard matching rule) !?
if rule is a wildcard:
- domain is a PS
- random left label (instead of star) IS a PS
if rule is regular:
- domain is a PS And all that with respect to the section (ICANN/PRIVATE)

WDYT ?

weppos commented 8 years ago

@rockdaboot can you provide a small example (a few lines) of how the test file would look like?

rockdaboot commented 8 years ago

I thought of a very simple format.

Comments have '#' as first character.
Empty lines allowed.
Everything else should be
domain is_psl, with is_psl = 0|1

We could think about leading/trailing dots for domain, I included one example with a dot. Empty domain, NULL domain, etc. should IMO be in the responsibility of the implementor's test suite.

# this is a comment
www.example.com 0
com.ar 1
.com.ar 1

# exception from *.ck
www.ck 0 

# unknown TLD
adfhoweirh 1

publicsuffix / list

Inconsistent leading dot tests for "example.com" #208