john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Wildcard support for collisions #315

Closed by rrr2rrr 7 months ago

rrr2rrr commented 7 months ago

According to https://publicsuffix.org/list/public_suffix_list.dat there are domains with wildcards.

It works correctly for *.pg, but it doesn't take collisions into account:

Wildcard: *.cn-north-1.airflow.amazonaws.com.cn
Domain: a.b.c.d.cn-north-1.airflow.amazonaws.com.cn
Expected: ExtractResult(subdomain='a.b', domain='c', suffix='d.cn-north-1.airflow.amazonaws.com.cn')
Result: ExtractResult(subdomain='a.b.c.d.cn-north-1.airflow', domain='amazonaws', suffix='com.cn')

This happens because com.cn is also listed as a suffix. You should check wildcards first, then sort suffixes by length.
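
To make the suggested matching order concrete, here is a minimal Python sketch of "longest matching rule wins, with `*` matching exactly one label". It is not tldextract's implementation. The names `Parts`, `rule_matches`, and `split_host` are hypothetical, the rule list is a hand-picked excerpt, and PSL exception rules (those starting with `!`) are left out for brevity.

```python
from typing import NamedTuple


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str


def rule_matches(rule_labels, host_labels):
    """True if the rule matches the rightmost labels of the host.

    A '*' label in the rule matches any single host label.
    """
    if len(rule_labels) > len(host_labels):
        return False
    tail = host_labels[-len(rule_labels):]
    return all(r == "*" or r == h for r, h in zip(rule_labels, tail))


def split_host(host, rules):
    """Split a hostname using the matching rule with the most labels."""
    host_labels = host.lower().split(".")
    matching = [
        r.split(".") for r in rules if rule_matches(r.split("."), host_labels)
    ]
    # If nothing matches, fall back to the implicit "*" rule (the last label).
    best = max(matching, key=len, default=["*"])
    n = len(best)
    suffix = ".".join(host_labels[-n:])
    rest = host_labels[:-n]
    domain = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return Parts(subdomain, domain, suffix)


# Hand-picked excerpt of rules, for illustration only.
RULES = ["cn", "com.cn", "*.cn-north-1.airflow.amazonaws.com.cn"]
HOST = "a.b.c.d.cn-north-1.airflow.amazonaws.com.cn"

print(split_host(HOST, RULES))
```

With the wildcard rule in the list, the six-label wildcard match beats the two-label `com.cn` match, which gives the expected split. Removing the wildcard rule from `RULES` reproduces the reported result.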

Here is my PostgreSQL implementation https://stackoverflow.com/a/77544774/21920723

john-kurkowski commented 7 months ago

*.amazonaws.com.cn entries are all in the private section of the PSL. See this FAQ entry. When I treat public and private suffixes the same, I get the result you want.

>>> import tldextract
>>> tldextract.extract("a.b.c.d.cn-north-1.airflow.amazonaws.com.cn", include_psl_private_domains=True)
ExtractResult(subdomain='a.b', domain='c', suffix='d.cn-north-1.airflow.amazonaws.com.cn', is_private=True)
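
The difference comes from the PSL file being divided into an ICANN section and a private section by marker comments. As a rough illustration only (not tldextract's code), the sketch below parses a tiny hand-written excerpt and shows how skipping the private section changes the longest matching suffix. `PSL_EXCERPT`, `parse_rules`, and `longest_suffix` are hypothetical names, and exception rules are ignored.

```python
PSL_EXCERPT = """\
// ===BEGIN ICANN DOMAINS===
cn
com.cn
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
*.cn-north-1.airflow.amazonaws.com.cn
// ===END PRIVATE DOMAINS===
"""


def parse_rules(text, include_private):
    """Collect rules, skipping the private section unless include_private is set."""
    rules, in_private = [], False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("// ===BEGIN PRIVATE DOMAINS"):
            in_private = True
        elif line.startswith("// ===END PRIVATE DOMAINS"):
            in_private = False
        elif line and not line.startswith("//"):
            if include_private or not in_private:
                rules.append(line)
    return rules


def longest_suffix(host, rules):
    """Return the longest suffix of host matched by any rule ('*' = one label)."""
    labels = host.split(".")
    best = labels[-1:]  # implicit fallback: the last label
    for rule in rules:
        rule_labels = rule.split(".")
        tail = labels[-len(rule_labels):]
        if (
            len(tail) == len(rule_labels)
            and len(tail) > len(best)
            and all(r in ("*", h) for r, h in zip(rule_labels, tail))
        ):
            best = tail
    return ".".join(best)


host = "a.b.c.d.cn-north-1.airflow.amazonaws.com.cn"
print(longest_suffix(host, parse_rules(PSL_EXCERPT, include_private=False)))
print(longest_suffix(host, parse_rules(PSL_EXCERPT, include_private=True)))
```

The first call sees only `cn` and `com.cn`, so the suffix is `com.cn`. The second call also sees the private wildcard rule, so the suffix grows to `d.cn-north-1.airflow.amazonaws.com.cn`.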

rrr2rrr commented 7 months ago

Thanks!