Incorrectly parsing unicode dot

jruere commented 2 years ago

Domain angelinablog。com.de is parsed as:

Domain: angelinablog。com
Suffix: de

It should be parsed as:

Subdomain: angelinablog
Domain: com
Suffix: de

john-kurkowski commented 2 years ago

Confirmed issue! This library has only ever split domain names on the ASCII dot (.), since at least as far back as commit a9694d739038b595e0934e1f1bb5f661c13c8a76. Interesting that this hasn't come up before.
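To make the cause concrete, here is a minimal sketch using plain str.split. It is an illustration only, not tldextract's actual code: splitting on the ASCII dot alone leaves the ideographic full stop (U+3002) inside the first label.

```python
# Illustration only: naive label splitting on the ASCII dot.
hostname = "angelinablog\u3002com.de"  # same string as "angelinablog。com.de"

labels = hostname.split(".")

# The U+3002 separator is not recognized, so only two labels come back
# and the first one still contains the Unicode dot.
print(labels)
```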

Here is a workaround. IDNA encoding libraries convert the Unicode dot to an ASCII dot when they process the input string. If you expect to be working with internationalized domain names, preprocess the input before calling tldextract.

>>> import tldextract
>>>
>>> maybe_internationalized = "angelinablog。com.de"
>>>
>>> # Python's built-in IDNA 2003
>>> tldextract.extract(maybe_internationalized.encode("idna").decode("ascii"))
ExtractResult(subdomain='angelinablog', domain='com', suffix='de')
>>>
>>> # IDNA 2008
>>> import idna
>>> tldextract.extract(idna.decode(maybe_internationalized))
ExtractResult(subdomain='angelinablog', domain='com', suffix='de')
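If you only want to normalize the separators and not run a full IDNA conversion, a small stdlib-only helper may be enough. This is a sketch under assumptions: normalize_dots is a hypothetical name, not part of tldextract, and it only handles the three non-ASCII label separators that RFC 3490 treats as equivalent to the ASCII dot.

```python
import re

# Non-ASCII label separators that IDNA (RFC 3490) treats like ".":
# U+3002 ideographic full stop, U+FF0E fullwidth full stop,
# U+FF61 halfwidth ideographic full stop.
_UNICODE_DOTS = re.compile("[\u3002\uff0e\uff61]")


def normalize_dots(hostname: str) -> str:
    """Hypothetical helper: replace Unicode label separators with ASCII '.'."""
    return _UNICODE_DOTS.sub(".", hostname)


normalized = normalize_dots("angelinablog\u3002com.de")
print(normalized)
```

Passing the normalized string to tldextract.extract should then split the labels the same way as the IDNA workaround above. Unlike the IDNA codecs, this helper leaves non-ASCII labels unencoded.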

john-kurkowski commented 2 years ago

(Very slightly related: #113)

jruere commented 2 years ago

Thank you!

john-kurkowski / tldextract

Incorrectly parsing unicode dot #252