john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

bad parsing #237

Closed. erpatrik closed this issue 2 years ago

erpatrik commented 2 years ago

Hi, I found two issues while parsing domains.

tldextract.extract("hokkaido.jp")
ExtractResult(subdomain='', domain='', suffix='hokkaido.jp')

tldextract.extract("ketrzyn.pl")
ExtractResult(subdomain='', domain='', suffix='ketrzyn.pl')

ShmuelTreiger commented 2 years ago

Having the same issue with ne.jp

Not sure if this is relevant, but ne.jp is actually the wrong host; it should be www.ne.jp. I'm working with a legacy system that strips www from URLs. When run on www.ne.jp the extraction works, but that causes other bugs for me.

ShmuelTreiger commented 2 years ago

These are all suffixes on the Public Suffix List. That's why they are parsed like this.

https://publicsuffix.org/list/public_suffix_list.dat

brycedrennan commented 2 years ago

@erpatrik, @ShmuelTreiger is correct: the domains you're testing are in the Public Suffix List, so this library is correctly returning them as suffixes.
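
To illustrate why an input that is itself a public suffix comes back with an empty domain, here is a minimal sketch of longest-suffix matching. It is not tldextract's implementation: `TOY_SUFFIXES` and `split_host` are hypothetical names, the rule set is a tiny hand-copied subset of the list, and wildcard and exception rules are ignored.

```python
# Hypothetical, hard-coded subset of Public Suffix List rules (illustration only).
# The real list lives at https://publicsuffix.org/list/public_suffix_list.dat
TOY_SUFFIXES = {"jp", "ne.jp", "hokkaido.jp", "pl", "ketrzyn.pl", "com"}


def split_host(host, suffixes=TOY_SUFFIXES):
    """Return (subdomain, domain, suffix) using longest-suffix matching.

    Simplified sketch: no wildcard (*.foo) or exception (!foo) rules.
    """
    labels = host.lower().strip(".").split(".")
    # Starting at i = 0 tries the longest candidate suffix first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # The label just left of the suffix is the domain, if there is one.
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No rule matched: this sketch treats the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""


print(split_host("hokkaido.jp"))  # whole input is a suffix, so domain is empty
print(split_host("www.ne.jp"))    # "ne.jp" is the suffix, "www" becomes the domain
```

Under these assumptions, `hokkaido.jp`, `ketrzyn.pl`, and `ne.jp` each match a rule in full, so no label is left over to serve as the domain. Once `www` is restored, `www.ne.jp` splits into domain `www` and suffix `ne.jp`.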