john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Incorrect parsing with domain in cache, but correct parsing when specified as extra suffix #238

Closed. Arraying closed this issue 2 years ago.

Arraying commented 2 years ago

I have a domain ending in .net.ru; let's call it example.net.ru. According to this list, which should be the one this project uses, the net.ru suffix is present.

However, if I run this code snippet:

    import tldextract
    netloc = "example.net.ru"
    netloc_extract = tldextract.extract(netloc)
    print(f"netloc '{netloc}' extracts {netloc_extract}")

It gives the following output:

    netloc 'example.net.ru' extracts ExtractResult(subdomain='example', domain='net', suffix='ru')

This should output ExtractResult(subdomain='', domain='example', suffix='net.ru').

If I check my local cache, I can see net.ru in there. For example, with this setup I can inspect the cache location and see that "net.ru" is present in the JSON:

    foo = tldextract.TLDExtract(
        cache_dir='./superdupercache/',
        fallback_to_snapshot=False
    )

However, this gives the same incorrect result. Disabling the cache entirely (as per the README), like so, does not give the correct result either:

    foo = tldextract.TLDExtract(
        cache_dir=False,
        fallback_to_snapshot=False
    )

Curiously, if I manually specify the suffix as shown, then it does work:

    foo = tldextract.TLDExtract(
        cache_dir=False,
        fallback_to_snapshot=False,
        extra_suffixes=["net.ru"]
    )

Needless to say, I'm quite confused by this behaviour. It seems that although the suffix is present in the list, something is stopping it from being detected. Could someone let me know whether this is a fault of the library? If not, how can I ensure this does not happen with other domains, without manually specifying every suffix in extra_suffixes?

brycedrennan commented 2 years ago

Try this from the docs:

    tldextract.extract("example.net.ru", include_psl_private_domains=True)
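For anyone wondering why that flag matters: the Public Suffix List has an ICANN section and a private section, and a suffix that lives only in the private section is ignored unless private domains are included. The sketch below is a rough, stdlib-only illustration of that idea and not tldextract's actual implementation. The two suffix sets are tiny hand-picked stand-ins (net.ru is placed in the private set on the assumption that this is where the real list keeps it), and `split_host` and `include_private` are hypothetical names.

```python
# Toy stand-ins for the two sections of the Public Suffix List.
# Hand-picked for illustration only; the real list is far larger.
ICANN_SUFFIXES = {"ru", "com"}
PRIVATE_SUFFIXES = {"net.ru"}  # assumed to be a private-section entry


def split_host(host, include_private=False):
    """Return (subdomain, domain, suffix) using the longest known suffix."""
    suffixes = ICANN_SUFFIXES | PRIVATE_SUFFIXES if include_private else ICANN_SUFFIXES
    labels = host.lower().split(".")
    # Try the longest candidate suffix first, then progressively shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""


print(split_host("example.net.ru"))                        # ('example', 'net', 'ru')
print(split_host("example.net.ru", include_private=True))  # ('', 'example', 'net.ru')
```

The first call reproduces the result from the report, because only "ru" is visible. The second matches the expected result once the private entry is considered. This would also be consistent with extra_suffixes=["net.ru"] working: it makes the suffix visible regardless of which section it came from.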