thom4parisot / tld.js

JavaScript API to work easily with complex domain names, subdomains and well-known TLDs.
https://npmjs.com/tldjs
MIT License
460 stars 55 forks source link

tldjs.getDomain('acc1sub1-dot-moisestest.appspot.com') returns 'acc1sub1-dot-moisestest.appspot.com' #120

Open mbelchin opened 6 years ago

mbelchin commented 6 years ago

Hi,

I'm just testing this library and I'm surprised by this results:

tldjs.getDomain('acc1sub1-dot-moisestest.appspot.com') returns 'acc1sub1-dot-moisestest.appspot.com'

Shouldn't return appspot.com instead ?

tldjs.getSubdomain('acc1sub1-dot-moisestest.appspot.com') returns 'acc1sub1-dot-moisestest.appspot.com'

Shouldn't return acc1sub1-dot-moisestest instead ?

Thanks in advance.

remusao commented 6 years ago

Hi @mbelchin,

Thanks for reaching out! I get slightly different results for this domain (with latest version of the rules):

> tldjs.parse('acc1sub1-dot-moisestest.appspot.com')
{ hostname: 'acc1sub1-dot-moisestest.appspot.com',
  isValid: true,
  isIp: false,
  tldExists: true,
  publicSuffix: 'appspot.com',
  domain: 'acc1sub1-dot-moisestest.appspot.com',
  subdomain: '' }
> tldjs.getSubdomain('acc1sub1-dot-moisestest.appspot.com')
''

The result is still surprising, but it can be explained by the fact that appspot.com is a valid public suffix according to the rule-set we use. It's a long standing issue that is not trivial to fix unfortunately. It was recently reported in #117 (and others before). The current state is this:

This is a long-standing and known issue which is not trivial to fix. It stems from the fact that the public suffix list was originally designed to check under which domains, sub-domains can be registered, and cookies can be set. In turn, it can lead to surprising/un-intuitive results such as the ones you encountered.

We've thought about this situation in the past, and I can see a few solutions, none of which is perfect. But maybe it would be "good enough":

  1. One hacky fix you can use right now without any update of TLD is to detect when domain is null, and instead use the value publicSuffix as the domain. This will work for a lot of domains (I will try to investigate more how many domain would return the wrong result with this solution, but I expect not so many).
  2. There are currently two parts in the public suffix list: ICANN and PRIVATE. I suspect that most of the surprising cases come from the PRIVATE part. We could add an option in tld.js to only take ICANN domains into account (this would fix the examples you found and many others).
  3. Combine both 1. and 2. (most of the counter-examples seem to be japan domain, but we need to investigate a bit more to see if there are some non-trivial cases).

None of the solution is perfect as there are known counter-examples. If this is an option for you, I would suggest you give a try to 1. and I will try to implement 2.. Also, as far as I know, this should be a limitation for all libraries using the public suffix lists unfortunately.

I did not have time to experiment with the proposed solution since then unfortunately. But if 1. works for you, that could be a reasonable work-around until then.

mbelchin commented 6 years ago

Hi, thanks for that quick and detailed answer. I'm gonna try the 1. approach. Just one clarification, when you say

One hacky fix you can use right now without any update of TLD is to detect when domain is null, and instead use the value publicSuffix as the domain

you meant subdomain right?

becase

> tldjs.parse('acc1sub1-dot-moisestest.appspot.com')
{ hostname: 'acc1sub1-dot-moisestest.appspot.com',
  isValid: true,
  isIp: false,
  tldExists: true,
  publicSuffix: 'appspot.com',
  domain: 'acc1sub1-dot-moisestest.appspot.com',
  subdomain: '' }

parse will return empty for subdomain

remusao commented 6 years ago

@mbelchin Right, I think the "trick" works for a slightly different use-case. If you would try to parse appspot.com, then publicSuffix would be appspot.com and domain would be empty. In this case you can use publicSuffix as the domain.

On the other hand, in your case, checking if subdomain is empty is not enough information to know what the actual domain is. Which means we will have to implement something to allow retrieving only ICANN domains and not PRIVATE ones, but even that will not work 100% of the time. I will try to find time for that in the next few days.