jruere closed this 2 years ago
Confirmed issue! This library has only ever split domain names by the ASCII ".", since at least as far back as a9694d739038b595e0934e1f1bb5f661c13c8a76. Interesting that this hasn't come up before.
Here is a workaround. IDNA encoding libraries treat the Unicode dot as a label separator and convert it to the ASCII dot. If you expect to be working with internationalized domain names, preprocess the input before calling tldextract.
>>> import tldextract
>>>
>>> maybe_internationalized = "angelinablog。com.de"
>>>
>>> # Python's built-in IDNA 2003
>>> tldextract.extract(maybe_internationalized.encode("idna").decode("ascii"))
ExtractResult(subdomain='angelinablog', domain='com', suffix='de')
>>>
>>> # IDNA 2008
>>> import idna
>>> tldextract.extract(idna.decode(maybe_internationalized))
ExtractResult(subdomain='angelinablog', domain='com', suffix='de')
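If you'd rather wrap that preprocessing in one place, here is a minimal sketch using only the standard library. The names `normalize_dots` and `to_ascii_hostname` are hypothetical helpers, not part of tldextract. It assumes the three non-ASCII separators listed in RFC 3490 (U+3002, U+FF0E, U+FF61) are the only ones you care about, and that falling back to a plain dot replacement is acceptable when the built-in codec rejects the input.

```python
import re

# Non-ASCII label separators that RFC 3490 treats as equivalent to ".".
_IDNA_DOTS = re.compile("[\u3002\uff0e\uff61]")


def normalize_dots(hostname: str) -> str:
    """Replace ideographic/fullwidth/halfwidth dots with the ASCII dot."""
    return _IDNA_DOTS.sub(".", hostname)


def to_ascii_hostname(hostname: str) -> str:
    """Hypothetical helper: IDNA-encode a hostname before suffix extraction.

    Uses Python's built-in IDNA 2003 codec. If the codec rejects the input
    (for example an empty or overlong label), fall back to only normalizing
    the dots so the caller still gets ASCII separators.
    """
    try:
        return hostname.encode("idna").decode("ascii")
    except UnicodeError:
        return normalize_dots(hostname)


if __name__ == "__main__":
    # The result is what you would pass to tldextract.extract(...).
    print(to_ascii_hostname("angelinablog。com.de"))
```

The fallback is a design choice, not a requirement: if you would rather surface malformed hostnames, let the `UnicodeError` propagate instead.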
(Very slightly related: #113)
Thank you!
Domain angelinablog。com.de is parsed as:
It should be parsed as: