google / guava

Google core libraries for Java
Apache License 2.0

Public suffix list is used as though it were designed to be exhaustive, but it's not #1618

Open gissuebot opened 10 years ago

gissuebot commented 10 years ago

Original issue created by cpovirk@google.com on 2013-12-18 at 05:30 PM


""" The issue appears to be in the API of InternetDomainName.findPublicSuffix() - https://github.com/google/guava/blob/ab29b173055a1ff647516848b176265fc6792ba0/guava/src/com/google/common/net/InternetDomainName.java#L167

The issue appears to be that this class is disregarding Step 2 of "The Algorithm", described at http://publicsuffix.org/list/ - that is, "If no rules match, the prevailing rule is *".

In this model, any domain not on the list is assumed to be registerable at the second level. For example, "au" is not included in the PSL. This should cause "foo.au" to fail to match any rules, and thus fall into the default wildcard rule. In the default wildcard rule, the public suffix is ".au" - and CSIRO is treated as a registerable name.

This is especially important with the many new registries that ICANN is approving; a decision has not been made to automatically add them to the PSL, and so I fear this may cause issues for Java applications in validating these domains.

If the goal is to ensure a name is "valid" (that is, assigned/approved by ICANN), then IANA has a data file that is updated twice daily at http://data.iana.org/TLD/tlds-alpha-by-domain.txt that contains all IANA-assigned gTLDs. It may make sense to incorporate this data into the PSL trie to have a proper "fail open" behaviour.

...

For plausibility checks, the IANA list is a much better resource, for sure. For security checks, the PSL is the best source of data.

...

The point of the PSL is not to replace the IANA list but to further reduce the scope of registerable labels.

There would be no benefit to the PSL's including the full IANA list, and real performance harm, since step 2 of the algorithm implicitly covers these domains. """
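To make the quoted "Step 2" concrete, here is a minimal Java sketch of a suffix lookup with and without the default `*` rule. It is not Guava's implementation: the class name `PublicSuffixSketch`, the method `publicSuffix`, and the three-entry rule set are hypothetical, and wildcard (`*.foo`) and exception (`!bar.foo`) rules are deliberately left out. The rule set omits "au" only to mirror the example in the quote.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class PublicSuffixSketch {
    // Hypothetical, tiny rule set standing in for the real list.
    // "au" is intentionally absent, as in the quoted example.
    private static final Set<String> RULES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk"));

    /**
     * Returns the public suffix of {@code domain}.
     * With applyDefaultRule == false, an unlisted TLD yields Optional.empty()
     * (the list is treated as exhaustive). With applyDefaultRule == true,
     * an unlisted TLD falls back to the implicit "*" rule, so the last
     * label alone is the public suffix.
     */
    static Optional<String> publicSuffix(String domain, boolean applyDefaultRule) {
        String[] labels = domain.toLowerCase(Locale.ROOT).split("\\.");
        // Try the longest candidate first so "co.uk" wins over "uk".
        for (int i = 0; i < labels.length; i++) {
            String candidate =
                    String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (RULES.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        // No rule matched: apply the default "*" rule, or give up.
        return applyDefaultRule
                ? Optional.of(labels[labels.length - 1])
                : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(publicSuffix("foo.au", false));
        System.out.println(publicSuffix("foo.au", true));
        System.out.println(publicSuffix("www.example.co.uk", true));
    }
}
```

Under these assumptions, "foo.au" has no public suffix when the list is treated as exhaustive, and has the suffix "au" once the default rule is applied, which is the difference in behaviour the report describes.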

What would change in InternetDomainName? I would want to talk more to the original bug reporter and to others, but here are some guesses:

gissuebot commented 10 years ago

Original comment posted by kevinb@google.com on 2013-12-18 at 06:12 PM


(No comment entered for this change.)


CC: cberry@google.com

cpovirk commented 9 years ago

We got a report internally (just days after I opened this bug) that the TLD list was changing to include all of the IANA Root Zone Database. That seems to be the case, or at least close to it (perhaps the list just lags a little?):

$ wget http://www.iana.org/domains/root/db
...
$ wget http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
...

$ comm -1 -3 <(sed -e 's#//.*##' -e 's/.*[.]//' effective_tld_names.dat\?raw\=1 | sort -u) <(egrep -o '/domains/root/db/\w+.html' db | egrep -o '\w+[.]' | tr -d . | sort)
bl
bq
eh
mf
movie
plus
ss
tech
tickets
um
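The same comparison can be sketched in Java as a set difference. This is only an illustration: the class `TldDiffSketch` and its methods are hypothetical, the sample data is made up, and the input is assumed to be one TLD per line with `#` comment lines (the shape of the IANA `tlds-alpha-by-domain.txt` file mentioned earlier in this thread) rather than the HTML page fetched above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

final class TldDiffSketch {
    /** Parses one-TLD-per-line text, skipping blank lines and '#' comments. */
    static SortedSet<String> parseTlds(List<String> lines) {
        SortedSet<String> tlds = new TreeSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Normalize case so "COM" and "com" compare equal.
            tlds.add(trimmed.toLowerCase(Locale.ROOT));
        }
        return tlds;
    }

    /** Returns the TLDs present in ianaTlds but absent from pslTlds. */
    static SortedSet<String> missingFromPsl(
            SortedSet<String> pslTlds, SortedSet<String> ianaTlds) {
        SortedSet<String> missing = new TreeSet<>(ianaTlds);
        missing.removeAll(pslTlds);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample data, not the real lists.
        SortedSet<String> iana =
                parseTlds(Arrays.asList("# hypothetical sample", "COM", "TECH", "UK"));
        SortedSet<String> psl = parseTlds(Arrays.asList("com", "uk"));
        System.out.println(missingFromPsl(psl, iana)); // prints [tech]
    }
}
```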

There is also a question of whether proposed names should be accepted. I think that "proposed names" may be those at http://icannwiki.com/All_New_gTLD_Applications (that aren't WITHDRAWN?). But I need to look into this.

cpovirk commented 9 years ago

Semi-related: If we ever decide to more heavily design the public-suffix support of InternetDomainName, we should glance at the API used by https://github.com/whois-server-list/public-suffix-list

cpovirk commented 9 years ago

Here's what I've learned:

There are successively more restrictive checks that we could offer for a TLD:

muhammadismailkhan0009 commented 1 year ago

Any updates on this issue? Thank you.