john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Semantics of `registered_domain` property for private domains #138

Open tuler opened 6 years ago

tuler commented 6 years ago

Suppose the following URL: tuler.github.io. Here, github.io is a private domain in the PSL.

When parsed with include_psl_private_domains=True we get subdomain='', domain='tuler', suffix='github.io'.

The registered_domain property just joins domain and suffix, giving me tuler.github.io, but IMHO it still should be github.io, as this is the domain registered with the registrar, and can be found in a whois query.
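To make the two readings concrete, here is a minimal sketch that does not use tldextract itself. The two suffix sets and the helpers `split_host` and `registered_domain` are hypothetical stand-ins for illustration; the real PSL is far larger and also has wildcard and exception rules that this ignores.

```python
# Toy suffix sets (assumption): tiny stand-ins for the ICANN and private
# sections of the PSL. Wildcard and exception rules are not handled.
ICANN_SUFFIXES = {"io", "com"}
PRIVATE_SUFFIXES = {"github.io"}


def split_host(host, suffixes):
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    labels = host.lower().split(".")
    # i = 0 tries the whole host first, so longer suffixes win over shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: max(i - 1, 0)])
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""


def registered_domain(host, suffixes):
    """Join domain and suffix, mirroring the behavior described above."""
    _, domain, suffix = split_host(host, suffixes)
    return f"{domain}.{suffix}" if domain and suffix else ""


# Private suffixes included: the "registered with GitHub" reading.
with_private = registered_domain("tuler.github.io", ICANN_SUFFIXES | PRIVATE_SUFFIXES)

# ICANN suffixes only: the "registered with a registrar" reading.
icann_only = registered_domain("tuler.github.io", ICANN_SUFFIXES)
```

With these toy sets, `with_private` is `tuler.github.io` and `icann_only` is `github.io`, which is the gap between the two interpretations.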

One problem with implementing this is that when a URL is parsed, we can't know whether the matched suffix is a private domain or an ICANN domain, because that distinction is not kept internally when the PSL is read.

Any thoughts?

john-kurkowski commented 6 years ago

(Note to self, if we need to track public vs. private at runtime, #66 is a requirement.)

john-kurkowski commented 6 years ago

Yeah, I bet most will associate it with registrar registration, as you have.

In my mind, tldextract has been consistent, working as designed, via a more abstract interpretation of "registered." Excluding private domains, GitHub registered github.io with a registrar, who controlled the domain. Including private domains, GitHub user tuler "registered" tuler.github.io with GitHub, who controlled the domain.

I have no strong evidence if my interpretation is broadly useful. It was for a very specific case, when I originally wrote this lib. Or maybe both interpretations are useful.

tuler commented 6 years ago

I see your point.

Nonetheless, keeping runtime information about each domain from the PSL would let the application handle this appropriately. Something like an is_private method, or an is_private flag added to the ExtractResult.
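A rough sketch of what such a flag could look like, written without tldextract. `FlaggedResult`, `extract_flagged`, and the two toy suffix sets are hypothetical names made up for this example, not the library's API.

```python
from typing import NamedTuple

# Toy suffix sets (assumption), kept separate so the source of a match is known.
ICANN_SUFFIXES = {"io", "com"}
PRIVATE_SUFFIXES = {"github.io"}


class FlaggedResult(NamedTuple):
    """Hypothetical result type carrying the proposed is_private flag."""

    subdomain: str
    domain: str
    suffix: str
    is_private: bool


def extract_flagged(host):
    """Split a host and record whether the matched suffix was a private one."""
    labels = host.lower().split(".")
    # Longest candidate first, so "github.io" is tried before "io".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PRIVATE_SUFFIXES or candidate in ICANN_SUFFIXES:
            return FlaggedResult(
                subdomain=".".join(labels[: max(i - 1, 0)]),
                domain=labels[i - 1] if i > 0 else "",
                suffix=candidate,
                is_private=candidate in PRIVATE_SUFFIXES,
            )
    # No known suffix: last label becomes the domain, nothing is private.
    return FlaggedResult(".".join(labels[:-1]), labels[-1], "", False)


pages = extract_flagged("tuler.github.io")
plain = extract_flagged("www.example.com")
```

An application could then branch on `pages.is_private` to decide for itself which notion of "registered" it wants.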

john-kurkowski commented 6 years ago

Yes, at the very least we should do #66 and expose is_private.

I'd then consider a new registered_domain-like property that stays constant whether or not private domains are included. Just needs a new name.

Renaming today's registered_domain is also a possibility, but then we're burdened with backwards compat and legacy association with today's wording.

john-kurkowski commented 4 years ago

The PR for #66 currently tracks the source of an extraction, whether the official public suffix list, the private domains in the public suffix list, or user-provided extra suffixes. We haven't figured out how to expose that yet. It's tricky, since it's a namedtuple.
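The comment does not spell out why the namedtuple makes this tricky. One plausible reason (an assumption on my part) is that adding a field changes the tuple's length, which breaks callers that unpack three values. A small sketch with two hypothetical result types, `ThreeField` and `FourField`:

```python
from collections import namedtuple

# Hypothetical stand-ins: the three fields match those listed earlier in this
# thread; the four-field variant adds the proposed flag as a tuple element.
ThreeField = namedtuple("ThreeField", ["subdomain", "domain", "suffix"])
FourField = namedtuple("FourField", ["subdomain", "domain", "suffix", "is_private"])

# Existing caller code that unpacks three values works today.
subdomain, domain, suffix = ThreeField("", "tuler", "github.io")

# The same unpacking fails once a fourth element is added.
try:
    subdomain, domain, suffix = FourField("", "tuler", "github.io", True)
    unpack_ok = True
except ValueError:
    unpack_ok = False
```

Here `unpack_ok` ends up `False`, so exposing the source as an extra tuple element would be a breaking change for anyone relying on three-way unpacking.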