john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

The TLD .com.ru is handled incorrectly #270

Closed hkopp closed 2 years ago

hkopp commented 2 years ago

Hi, I have encountered a bug when handling .com.ru domains.

In [1]: import tldextract

In [2]: tldextract.extract('http://lamy.com.ru/')
Out[2]: ExtractResult(subdomain='lamy', domain='com', suffix='ru')

I would have expected the following:

Out[2]: ExtractResult(subdomain='', domain='lamy', suffix='com.ru')

.com.ru is in the list of public suffixes, so this is clearly a bug: https://publicsuffix.org/list/public_suffix_list.dat

Thanks for creating the library by the way. I first tried sed with increasingly complex regular expressions, but that quickly grew out of hand. Your library was exactly what I needed.

hkopp commented 2 years ago

And my version number:

In [5]: tldextract.__version__
Out[5]: '3.3.1'

hkopp commented 2 years ago

And troubleshooting of the cache:

(venv) $ pwd
/home/user/.cache/python-tldextract
(venv) $ ls
3.10.5.final__venv__699cb6__tldextract-3.3.1
3.10.6.final__venv__699cb6__tldextract-3.3.1
(venv) $ cat */publicsuffix.org-tlds/*.tldextract.json| jq '.'| grep ".com.ru"
    "com.ru",
    "com.ru",

john-kurkowski commented 2 years ago

Thanks for the thorough diagnostic info! That suffix is pretty far down the list, so it's in the private domains section, which tldextract ignores by default. See the FAQ.
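
To illustrate why the section matters, here is a minimal sketch of longest-suffix matching over a toy list. It is not tldextract's actual implementation: the names `ICANN_SUFFIXES`, `PRIVATE_SUFFIXES`, and `split_host` are made up for this example, the suffix sets are a hand-picked excerpt, and the PSL's wildcard and exception rules are left out.

```python
# Toy excerpt of the Public Suffix List, split into its two sections.
# Hypothetical, hand-picked entries for illustration; not the real list.
ICANN_SUFFIXES = {"ru", "com", "uk", "co.uk"}
PRIVATE_SUFFIXES = {"com.ru", "blogspot.com"}


def split_host(host, include_private=False):
    """Return (subdomain, domain, suffix) using the longest known suffix."""
    if include_private:
        suffixes = ICANN_SUFFIXES | PRIVATE_SUFFIXES
    else:
        suffixes = ICANN_SUFFIXES
    labels = host.lower().strip(".").split(".")
    # Try the longest candidate first, so "com.ru" would win over "ru".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain (a simplification).
    return ".".join(labels[:-1]), labels[-1], ""


# Private section ignored: only "ru" matches, as in the report above.
print(split_host("lamy.com.ru"))
# ('lamy', 'com', 'ru')

# Private section included: "com.ru" is the longest match.
print(split_host("lamy.com.ru", include_private=True))
# ('', 'lamy', 'com.ru')
```

In tldextract itself, the corresponding switch is the `include_psl_private_domains=True` argument to `tldextract.TLDExtract`. With it set, the private section is consulted as well, which should give the result expected in the original report.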