john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Suffix detection broken for private `uk.com` suffix in version 3.4.3 #288

Closed: kevinmarsh closed this issue 1 year ago

kevinmarsh commented 1 year ago

I think the trie-based suffix detection introduced in #285 (released in version 3.4.3) may have broken lookup of the private `uk.com` suffix, which is included in the bundled snapshot: https://github.com/john-kurkowski/tldextract/blob/6f45fed6c56f377e8a9a77ce43c50712281940d8/tldextract/.tld_set_snapshot#L10570

Comparing 3.4.2:

>>> import tldextract
>>> tldextract.__version__
'3.4.2'
>>> extractor = tldextract.TLDExtract(include_psl_private_domains=True)
>>> extractor("foo.uk.com")
ExtractResult(subdomain='', domain='foo', suffix='uk.com')

to 3.4.3:

>>> import tldextract
>>> tldextract.__version__
'3.4.3'
>>> extractor = tldextract.TLDExtract(include_psl_private_domains=True)
>>> extractor("foo.uk.com")
ExtractResult(subdomain='foo', domain='uk', suffix='com')

You can see that the `uk.com` suffix is no longer recognized; instead, `uk` is treated as the domain and `com` as the suffix.

Oddly, though, the `tldextract.extract` wrapper function gives the same (correct) result in both versions when the flag is passed at call time:

>>> import tldextract
>>> tldextract.extract("foo.uk.com", include_psl_private_domains=True)
ExtractResult(subdomain='', domain='foo', suffix='uk.com')
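
To make the discrepancy above concrete, here is a minimal, self-contained sketch of one way a constructor flag can be ignored while the same flag passed at call time still works. This is only an assumption about the class of bug, not tldextract's actual code: `split_host`, `BuggyExtractor`, `FixedExtractor`, and the tiny suffix sets are all hypothetical, and plain set lookups stand in for the trie.

```python
from __future__ import annotations

# Hypothetical suffix tables for illustration; the real PSL is far larger.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk"}
PRIVATE_SUFFIXES = {"uk.com"}


def split_host(host: str, include_private: bool) -> tuple[str, str, str]:
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    suffixes = (PUBLIC_SUFFIXES | PRIVATE_SUFFIXES) if include_private else PUBLIC_SUFFIXES
    labels = host.lower().split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: max(i - 1, 0)])
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""


class BuggyExtractor:
    """Hypothetical extractor that stores the flag but never reads it."""

    def __init__(self, include_private: bool = False) -> None:
        self.include_private = include_private

    def __call__(self, host: str, include_private: bool | None = None) -> tuple[str, str, str]:
        # Bug: only the call-time argument is honored; None becomes False.
        return split_host(host, bool(include_private))


class FixedExtractor(BuggyExtractor):
    """Same extractor, but falls back to the constructor flag."""

    def __call__(self, host: str, include_private: bool | None = None) -> tuple[str, str, str]:
        if include_private is None:
            include_private = self.include_private
        return split_host(host, include_private)


if __name__ == "__main__":
    print(BuggyExtractor(include_private=True)("foo.uk.com"))
    print(BuggyExtractor()("foo.uk.com", include_private=True))
    print(FixedExtractor(include_private=True)("foo.uk.com"))
```

In this sketch the buggy extractor reproduces both observations from the report: the constructor flag yields `uk` as the domain, while the call-time flag yields the `uk.com` suffix. Whether the real regression had this shape is not established by the report itself.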

john-kurkowski commented 1 year ago

I'm looking into this. /cc @elliotwutingfeng

john-kurkowski commented 1 year ago

Fixed in 3.4.4. Thanks for the detailed report! That really eased tracking down the bug.