thom4parisot / tld.js

JavaScript API to work easily with complex domain names, subdomains and well-known TLDs.
https://npmjs.com/tldjs
MIT License

a_b_c.domain.com — Neither domain, nor publicSuffix? (but valid) #73

Open ikari-pl opened 8 years ago

ikari-pl commented 8 years ago

The URL http://wsc4_1.webspectator.com/ returns null for both getDomain and getPublicSuffix. I can't find webspectator.com on the public suffix list, so I assume the correct result would be webspectator.com for the domain and com for the public suffix.

Demo:

var tld = require('tldjs');
tld.getDomain('http://wsc4_1.webspectator.com/'); // null
tld.getDomain('wsc4_1.webspectator.com'); // null
tld.getPublicSuffix('http://wsc4_1.webspectator.com/'); // null
tld.isValid('http://wsc4_1.webspectator.com/'); // true

but:

> tld.getDomain('wsc41.webspectator.com')
'webspectator.com'

So it seems it's all about the _ character. See:

> tld.getDomain('a_b.google.com')
null
> tld.getDomain('a-b.google.com')
'google.com'

ZLightning commented 8 years ago

Technically host names containing an underscore are not RFC compliant (only A-Z, a-z, 0-9, -, and . are allowed), however a newer RFC notes that a DNS server can be used to serve arbitrary data, and no DNS server should refuse to load a zone that contains invalid characters in host names.
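To make the character restriction concrete, here is a minimal sketch that lists the characters in a hostname falling outside the classic hostname alphabet (letters, digits, "-" and "."). The helper name invalidHostnameChars is hypothetical and is not part of tld.js.

```javascript
// Hypothetical helper, not part of tld.js: list the characters in a hostname
// that fall outside the classic hostname alphabet (letters, digits, "-", ".").
function invalidHostnameChars(hostname) {
  const offending = hostname.match(/[^a-z0-9.-]/gi) || [];
  // De-duplicate while keeping first-seen order.
  return [...new Set(offending)];
}

console.log(invalidHostnameChars('a_b.google.com')); // [ '_' ]
console.log(invalidHostnameChars('a-b.google.com')); // []
```

This only inspects the alphabet; it does not check label lengths or hyphen placement.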

thom4parisot commented 8 years ago

Yes, indeed, it is tied to the _ character.

@ZLightning do you have a link to the newer RFC?

A possibility could be to have a strict mode or not (I guess, disabled by default) in order to properly extract domains and such. For cookie creation, we might want to stick to the RFC compliant mode but that's something to discuss later on.

What do you think folks?

ZLightning commented 8 years ago

RFC2181 is only a proposed standard, but I have confirmed subdomains with an _ in them still resolve. I think a strict and sloppy mode would be a great feature. The default being strict is a good idea for backwards compatibility.
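A rough sketch of what a strict/lenient switch could look like, assuming a hypothetical strict option that defaults to true. The function getDomainSketch and the toy suffix set below are illustrations only; the real library works from the full public suffix list and its actual option names may differ.

```javascript
// Illustrative only: a toy suffix list and a hypothetical `strict` option.
const TOY_SUFFIXES = new Set(['com', 'org', 'co.uk']);

// Strict: letters, digits and inner hyphens only, 1 to 63 characters per label.
const STRICT_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;
// Lenient: same shape, but underscores are accepted as well.
const LENIENT_LABEL = /^[a-z0-9_]([a-z0-9_-]{0,61}[a-z0-9_])?$/i;

function getDomainSketch(hostname, { strict = true } = {}) {
  const labels = hostname.toLowerCase().split('.');
  const pattern = strict ? STRICT_LABEL : LENIENT_LABEL;
  if (!labels.every((label) => pattern.test(label))) return null;

  // Starting from the full name finds the longest known suffix first;
  // the domain is that suffix plus one more label to its left.
  for (let i = 0; i < labels.length; i += 1) {
    const candidate = labels.slice(i).join('.');
    if (TOY_SUFFIXES.has(candidate)) {
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null;
}

console.log(getDomainSketch('wsc4_1.webspectator.com')); // null
console.log(getDomainSketch('wsc4_1.webspectator.com', { strict: false })); // webspectator.com
```

Keeping strict as the default preserves the current behavior for existing callers, which is the backwards-compatibility argument made above.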

nebulade commented 6 years ago

Is there any update on this? I just hit it as well, unfortunately.

LesBarstow commented 6 years ago

Note if anyone's still following this: HOSTNAMES cannot contain underscores, but other DNS entries can. e.g. _spf.google.com is a valid DNS name.

$ dig +short TXT _spf.google.com
"v=spf1 include:_netblocks.google.com include:_netblocks2.google.com include:_netblocks3.google.com ~all"

The DNS itself places only one restriction on the particular labels that can be used to identify resource records. That one restriction relates to the length of the label and the full name. [...] Implementations of the DNS protocols must not place any restrictions on the labels that can be used. In particular, DNS servers must not refuse to serve a zone because it contains labels that might not be acceptable to some DNS client programs.

AFAIK, no registrar allows you to register a domain under a TLD with an underscore, but technically that too is allowed.
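The distinction between a DNS name and a hostname can be sketched as two separate checks. Both helpers below are hypothetical and not part of tld.js: isValidDnsName enforces only the length limits the DNS itself imposes (labels of 1 to 63 octets, 255 octets for the full name on the wire), while isStrictHostname additionally restricts the alphabet.

```javascript
// Hypothetical helper: checks only the length limits imposed by the DNS itself.
function isValidDnsName(name) {
  // Allow a single trailing dot denoting the root.
  const trimmed = name.endsWith('.') ? name.slice(0, -1) : name;
  if (trimmed.length === 0) return false;
  const labels = trimmed.split('.');
  const labelsOk = labels.every((label) => {
    const size = Buffer.byteLength(label, 'utf8');
    return size >= 1 && size <= 63;
  });
  // Wire format: one length octet per label, plus a final zero octet for the root.
  const wireLength = labels.reduce(
    (sum, label) => sum + 1 + Buffer.byteLength(label, 'utf8'),
    1
  );
  return labelsOk && wireLength <= 255;
}

// Hypothetical helper: letters, digits and inner hyphens only in every label.
const STRICT_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;
function isStrictHostname(name) {
  return name.split('.').every((label) => STRICT_LABEL.test(label));
}

console.log(isValidDnsName('_spf.google.com')); // true
console.log(isStrictHostname('_spf.google.com')); // false
```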

thom4parisot commented 6 years ago

@LesBarstow I find your comment valuable, but I had not considered hostnames in the context of DNS entries.

There is a proposal in issue #122 to be either strict or lenient on hostnames with underscores.

Do you think it will address what you mention?

LesBarstow commented 6 years ago

My personal opinion: the only calls that should care about character restrictions (aside from length) are isValidHostname() and the isValid property returned by parse(). We use both tldExists() and getDomain(), and those shouldn't care, ever.

For isValidHostname() and parse().isValid: FWIW, the defaults in PHP filtering and Perl Net regex patterns are both lenient, with options for strict. This matches the DNS RFC itself - no restrictions except for proper hostnames, which are limited by RFCs 952 and 1123.

Just my two cents.

LesBarstow commented 6 years ago

Alternately, the code could care about the validity of the publicSuffix in a strict form while the rest of the domain name would be lenient. (No registrar registers domains with an underscore as they can't be used for hostnames at all...) This is more annoying, though, because if someone does want to be lenient on the publicSuffix, now you have to have two flag options: reallyStrict, default, and reallyLenient.
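A minimal sketch of the three levels described above, using the mode names from the comment (reallyStrict, default, reallyLenient). The function isValidHostnameByMode is hypothetical, and it assumes the caller already knows how many labels make up the public suffix (suffixLabelCount), which a real implementation would take from the suffix lookup.

```javascript
const STRICT_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;
const LENIENT_LABEL = /^[a-z0-9_]([a-z0-9_-]{0,61}[a-z0-9_])?$/i;

// Hypothetical helper. `suffixLabelCount` is an assumed input: the number of
// trailing labels that belong to the public suffix (1 for "com", 2 for "co.uk").
function isValidHostnameByMode(hostname, suffixLabelCount, mode = 'default') {
  const labels = hostname.split('.');
  const boundary = Math.max(labels.length - suffixLabelCount, 0);
  return labels.every((label, index) => {
    const inSuffix = index >= boundary;
    // default: strict for the suffix labels, lenient for everything to their left.
    const strict = mode === 'reallyStrict' || (mode === 'default' && inSuffix);
    return (strict ? STRICT_LABEL : LENIENT_LABEL).test(label);
  });
}

console.log(isValidHostnameByMode('a_b.google.com', 1)); // true
console.log(isValidHostnameByMode('a_b.google.com', 1, 'reallyStrict')); // false
console.log(isValidHostnameByMode('foo.c_m', 1)); // false
console.log(isValidHostnameByMode('foo.c_m', 1, 'reallyLenient')); // true
```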

remusao commented 6 years ago

Hi @LesBarstow and thanks for the great feedback! It's really interesting to get another perspective. I would like to add the following, which is just my opinion on the matter. Currently isValid is used for two different purposes internally:

  1. It's used to quickly check if the input to any of the functions is already a valid hostname, in which case we can skip the expensive parsing step. In this case, we could probably use the lenient version of isValid.
  2. It is used to indicate if the input is a valid url/hostname through the two functions: parse, isValid, exposed as part of the public API.

So what we could do perhaps is to use the lenient mode for 1. (as an internal optimization). And for 2. allow an extra parameter to provide options about the behavior of isValid.
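The two roles could be separated roughly as follows. This is a sketch under assumptions, not the library's internals: extractHostnameSketch and isValidSketch are hypothetical names, the strict option is illustrative, and the slow path uses the standard WHATWG URL class rather than whatever parser tld.js uses.

```javascript
const STRICT_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;
const LENIENT_LABEL = /^[a-z0-9_]([a-z0-9_-]{0,61}[a-z0-9_])?$/i;

function matchesAllLabels(hostname, pattern) {
  return hostname.length <= 253 && hostname.split('.').every((l) => pattern.test(l));
}

// Role 1 (internal): lenient fast path that skips URL parsing when the input
// already looks like a bare hostname.
function extractHostnameSketch(input) {
  if (matchesAllLabels(input, LENIENT_LABEL)) return input.toLowerCase();
  try {
    return new URL(input).hostname; // slow path: full URL parsing
  } catch (err) {
    return null; // not a hostname and not a parseable URL
  }
}

// Role 2 (public): validity check whose strictness the caller chooses.
function isValidSketch(input, { strict = true } = {}) {
  const hostname = extractHostnameSketch(input);
  if (hostname === null) return false;
  return matchesAllLabels(hostname, strict ? STRICT_LABEL : LENIENT_LABEL);
}

console.log(extractHostnameSketch('http://spons_700.spns.nrb-apps.com/ajax/footpanel_process.php'));
// spons_700.spns.nrb-apps.com
console.log(isValidSketch('spons_700.spns.nrb-apps.com')); // false
console.log(isValidSketch('spons_700.spns.nrb-apps.com', { strict: false })); // true
```

With this split, hostname extraction never fails because of an underscore; only the public validity answer depends on the chosen strictness.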

Last but not least, we had similar discussions in the past regarding hostname parsing (which is hard, and different libraries have different behaviors). In the end, we made the opinionated choice of using a specific module but gave users of the library the flexibility to provide their own parsing logic. In a way, tldjs is not about validating URLs/hostnames. So maybe it is OK to pick one option (let's say we always validate hostnames in a lenient way), and let users who need stricter checks validate hostnames in whatever way suits their use case.

As was pointed out, tldjs could only care about validating the public suffix part, since it's what the library is about.

We can of course recommend/suggest other libraries which can be used alongside tld.js to do this validation.

7c commented 6 years ago

Hi, I am using the parse() function with real-world URLs from squid logs to determine domain names. I understand that this repo is all about the public suffix, but look at this real-world example:

console.log(parse('http://spons_700.spns.nrb-apps.com/ajax/footpanel_process.php'));
{ hostname: 'spons_700.spns.nrb-apps.com',
  isValid: false,
  isIp: false,
  tldExists: false,
  publicSuffix: null,
  domain: null,
  subdomain: null }

console.log(parse('http://spons700.spns.nrb-apps.com/ajax/footpanel_process.php'));
{ hostname: 'spons700.spns.nrb-apps.com',
  isValid: true,
  isIp: false,
  tldExists: true,
  publicSuffix: 'com',
  domain: 'nrb-apps.com',
  subdomain: 'spons700.spns' }

Many bigger providers have _ in their hostnames, and if the purpose of parse() is to determine the publicSuffix, then this function fails on real-world URLs.
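Until a lenient mode exists, one possible workaround is to wrap whatever strict lookup is in use. The sketch below is an assumption-laden illustration: lenientGetDomain is a hypothetical wrapper, it expects a bare hostname (not a full URL), and strictGetDomain stands for any function that returns a domain string or null. The stub used in the example is a stand-in, not tld.js itself.

```javascript
// Hypothetical wrapper around any strict getDomain-like function.
function lenientGetDomain(strictGetDomain, hostname) {
  // Swap each underscore for a letter so a strict validator accepts the name.
  // Label count and label lengths are unchanged by this substitution.
  const sanitized = hostname.replace(/_/g, 'x');
  const domain = strictGetDomain(sanitized);
  if (domain === null) return null;
  // Map the answer back: take the same number of trailing labels
  // from the original, unsanitized hostname.
  const count = domain.split('.').length;
  return hostname.split('.').slice(-count).join('.');
}

// Stand-in for a strict lookup: rejects underscores, keeps the last two labels.
function strictStub(hostname) {
  if (hostname.includes('_')) return null;
  return hostname.split('.').slice(-2).join('.');
}

console.log(strictStub('spons_700.spns.nrb-apps.com')); // null
console.log(lenientGetDomain(strictStub, 'spons_700.spns.nrb-apps.com')); // nrb-apps.com
```

One caveat: if the substitution happens to turn a label into part of a different public suffix, the result could be wrong, so this is a stopgap rather than a fix.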

remusao commented 5 years ago

Hi @taskinosman, thank you for your input. I proposed a solution a few weeks ago in the form of an option to enable a "lenient mode" for hostname validation in the following PR: #122, but unfortunately the PR has not been reviewed or merged yet. In the meantime I forked and published tldts, which is based on tld.js (but rewritten in TypeScript, plus a few other modifications) and provides a different set of defaults, among which the more permissive hostname validation is enabled by default. Maybe this would solve your problem? Don't hesitate to give me any feedback on it.

7c commented 5 years ago

Thanks, sorry, I should have seen #122. I have commented on that one.