john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Incorrect extraction of domain #278

Closed: aimtsou closed 1 year ago

aimtsou commented 1 year ago

Good morning,

Description:

I use tldextract, but today I found a bug while extracting a URL. Samples are provided below.

Version Tested:

Successfully installed requests-file-1.5.1 tldextract-3.4.0

Samples:

tldextract.extract('http://vic.gov.au/')
tldextract.extract('http://www.vic.gov.au/')

Execution:

ExtractResult(subdomain='', domain='', suffix='vic.gov.au')
ExtractResult(subdomain='', domain='www', suffix='vic.gov.au')

However, vic should be the domain in both cases. As shown in publicsuffixlist gov.au is a valid 2LD.

john-kurkowski commented 1 year ago

As shown in publicsuffixlist gov.au is a valid 2LD.

You're looking at this line, referencing gov.au, declaring it a public suffix, right? Look a few lines down in that same hunk. vic.gov.au is also a public suffix.
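To illustrate why the longer rule wins, here is a minimal sketch of longest-suffix matching. It is not tldextract's actual implementation: the `split_host` helper and the three-entry rule set are assumptions made for illustration, the real Public Suffix List is far larger and also has wildcard and exception rules, and `example.gov.au` is a made-up hostname.

```python
from typing import FrozenSet, NamedTuple

# Illustrative subset only; gov.au and vic.gov.au are the rules discussed above.
PUBLIC_SUFFIXES: FrozenSet[str] = frozenset({"au", "gov.au", "vic.gov.au"})


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str


def split_host(host: str, suffixes: FrozenSet[str] = PUBLIC_SUFFIXES) -> Parts:
    """Split a bare hostname (not a full URL) using the longest matching suffix."""
    labels = host.lower().strip(".").split(".")
    # Candidates run from the whole hostname down to the last label,
    # so the first hit is the longest matching suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return Parts(subdomain, domain, candidate)
    # Simplification for this sketch: with no matching rule, the last label is the domain.
    return Parts(".".join(labels[:-1]), labels[-1], "")


print(split_host("vic.gov.au"))      # the whole hostname is itself a suffix
print(split_host("www.vic.gov.au"))  # "www" is the label left of the suffix
print(split_host("example.gov.au"))  # only gov.au matches here
```

Because vic.gov.au is itself a rule, nothing is left over to act as the domain in the first case, and www becomes the domain in the second, which matches the ExtractResult values reported in the issue.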

aimtsou commented 1 year ago

Hi @john-kurkowski,

Yes, you are right; I did not notice that. It is kind of confusing now, though, because any 3LD could also be a possible domain name. E.g., catholic.edu.au is a 3LD, although it could be a domain too, as seen with vic.gov.au, unless there is a rule I do not know about.

aimtsou commented 1 year ago

Good morning @john-kurkowski,

In the publicsuffix list: krakow.pl

In tldextract it is not extracted as a 2LD but as domain plus public suffix. Is that correct?

Registered Domain: krakow.pl | Domain: krakow | FQDN: www.cm-uj.krakow.pl | Suffix: pl

If yes, why?

PS: The same happens with other suffixes in the list, e.g. ras.ru and url.tw.

elliotwutingfeng commented 1 year ago

Hi @aimtsou,

The suffixes krakow.pl, ras.ru, and url.tw appear after the line // ===BEGIN PRIVATE DOMAINS=== and are considered private domains, which are excluded from extraction by default.

Hence, they are treated differently from vic.gov.au, which appears before // ===BEGIN PRIVATE DOMAINS===.

To include private domain extraction, refer to https://github.com/john-kurkowski/tldextract#public-vs-private-domains.
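The effect of the two sections can be sketched with a small self-contained example. This is an illustration under assumptions, not the library's code: the `split_host` helper and its `include_private` flag are hypothetical names, and the rule sets contain only the suffixes mentioned in this thread, placed in the sections described above. In tldextract itself, the linked README section documents an `include_psl_private_domains` option for this purpose.

```python
# Tiny stand-ins for the two sections of the Public Suffix List (assumed subset).
PUBLIC_RULES = frozenset({"pl", "ru", "tw"})
PRIVATE_RULES = frozenset({"krakow.pl", "ras.ru", "url.tw"})


def split_host(host, include_private=False):
    """Return (subdomain, domain, suffix) for a bare hostname.

    Private-section rules are consulted only when include_private is True.
    """
    rules = (PUBLIC_RULES | PRIVATE_RULES) if include_private else PUBLIC_RULES
    labels = host.lower().strip(".").split(".")
    # The first hit is the longest suffix, since candidates shrink from the left.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # Simplification for this sketch: with no matching rule, the last label is the domain.
    return ".".join(labels[:-1]), labels[-1], ""


# Default: private rules ignored, so "pl" is the suffix and "krakow" the domain.
print(split_host("www.cm-uj.krakow.pl"))
# Opt-in: "krakow.pl" is treated as the suffix, so "cm-uj" becomes the domain.
print(split_host("www.cm-uj.krakow.pl", include_private=True))
```

With the flag off, the result mirrors the Domain: krakow | Suffix: pl output quoted earlier in the thread; with it on, the private rule takes over because it is the longer match.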

john-kurkowski commented 1 year ago

It is kind of confusing now, though, because any 3LD could also be a possible domain name. E.g., catholic.edu.au is a 3LD, although it could be a domain too, as seen with vic.gov.au, unless there is a rule I do not know about.

I can see why that would be confusing, but that is the point of the Public Suffix List (and this library wrapping it), to know the rules/inventory of possible public suffixes, so you don't have to. I think this issue is the library working as intended.