john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Incorrectly parsing unicode dot #252

Closed jruere closed 2 years ago

jruere commented 2 years ago

Domain angelinablog。com.de is parsed as:

It should be parsed as:

john-kurkowski commented 2 years ago

Confirmed issue! This library has only ever split domain names on the ASCII dot (.), since at least as far back as commit a9694d739038b595e0934e1f1bb5f661c13c8a76. Interesting that this hasn't come up before.
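
To illustrate why splitting on the ASCII dot alone misses this input, here is a minimal sketch using plain Python string operations. It is an illustration only, not tldextract's actual code.

```python
# Illustration only: plain str.split, not tldextract's internals.
# U+3002 (IDEOGRAPHIC FULL STOP) sits between "angelinablog" and "com".
name = "angelinablog\u3002com.de"

# Splitting on the ASCII dot leaves the ideographic full stop
# embedded in the first label instead of treating it as a separator.
labels = name.split(".")
print(labels)
```

The split yields two labels rather than three, because only one ASCII dot is present in the string.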

Here is a workaround. Python's IDNA encoders treat the Unicode dot as a label separator and convert it to the ASCII dot. If you expect to be working with internationalized domain names, preprocess the input before calling tldextract.

>>> import tldextract
>>>
>>> maybe_internationalized = "angelinablog。com.de"
>>>
>>> # Python's built-in IDNA 2003
>>> tldextract.extract(maybe_internationalized.encode("idna").decode("ascii"))
ExtractResult(subdomain='angelinablog', domain='com', suffix='de')
>>>
>>> # IDNA 2008, via the third-party idna package
>>> import idna
>>> tldextract.extract(idna.decode(maybe_internationalized))
ExtractResult(subdomain='angelinablog', domain='com', suffix='de')
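
If you would rather not IDNA-encode the whole name (the built-in codec also converts non-ASCII labels to Punycode), another option is to normalize only the dots. The sketch below assumes the four label separators listed in RFC 3490, section 3.1. `normalize_dots` is a hypothetical helper name, not part of tldextract.

```python
import re

# Label separators recognized by IDNA 2003 (RFC 3490, section 3.1):
# U+002E full stop, U+3002 ideographic full stop,
# U+FF0E fullwidth full stop, U+FF61 halfwidth ideographic full stop.
_IDNA_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def normalize_dots(hostname: str) -> str:
    """Hypothetical helper: replace IDNA dot separators with ASCII '.'."""
    return _IDNA_DOTS.sub(".", hostname)


normalized = normalize_dots("angelinablog\u3002com.de")
print(normalized)  # angelinablog.com.de
```

The normalized string can then be passed to tldextract.extract in place of the raw input. Labels themselves are left untouched, so any non-ASCII characters in them stay as they were.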

john-kurkowski commented 2 years ago

(Very slightly related: #113)

jruere commented 2 years ago

Thank you!