Closed. 011121 closed this issue 1 year ago.
Whew, that FAQ is interesting but a doozy! If I were to summarize, a lot of the entries are of the form, "Q: Shouldn't ABC be simple? A: No, consider exception XYZ." 😏 Is there a clear-cut mapping and algorithm for unwinding these deviations? Maybe I missed it. Whether the algorithm belongs in tldextract depends on how complicated it is and how many input options it has to handle. Perhaps it belongs in another library.
@john-kurkowski Python 2.7 and above already embed IDNA encoding in the string type (with the caveat that a Python 2 string is a byte string, while a Python 3 string is Unicode). This includes various punycode (homoglyph) conversions in the same IDNA encoding function.
# Python 3
>>> url = 'http://faß.de'
>>> url.encode('idna')
b'http://fass.de'
>>> url.encode('idna').decode('ascii')
'http://fass.de'
>>> punyurl = 'http://ρaypal.com'
>>> punyurl.encode('idna')
b'xn--http://aypal-v3i.com'
>>> punyurl.encode('idna').decode('ascii')
'xn--http://aypal-v3i.com'
# In Python 3, it is crucial to decode the IDNA conversion bytes back to Unicode, to avoid surprises.
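A side note on the odd-looking b'xn--http://aypal-v3i.com' above: the stdlib idna codec treats everything between dots as a label, so the http:// scheme is swallowed into the first label. A small stdlib-only sketch that strips the scheme before encoding (the helper name is mine, not from tldextract):

```python
from urllib.parse import urlsplit

def hostname_to_ascii(url: str) -> str:
    """Reduce the URL to its host name, then IDNA-encode that alone.

    Feeding a whole URL to the 'idna' codec makes it treat 'http://ρaypal'
    as a single label, producing labels like 'xn--http://aypal-v3i'.
    """
    host = urlsplit(url).hostname  # e.g. 'faß.de', scheme dropped
    return host.encode('idna').decode('ascii')

print(hostname_to_ascii('http://faß.de'))     # fass.de
print(hostname_to_ascii('http://ρaypal.com')) # a clean xn-- label + '.com'
```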
Here is the trouble with Python 2:
>>> url = 'http://faß.de'
>>> url.encode('idna')
TypeError: normalize() argument 2 must be unicode, not str
>>> url.decode('utf-8').encode('idna')
'http://fass.de'
>>> punyurl = 'http://ρaypal.com'
>>> punyurl.decode('utf-8').encode('idna')
'xn--http://aypal-v3i.com'
I just found out that tldextract uses the old idna import while diagnosing an error thrown by tldextract. The problem is caused by that old idna import, although the issue can be worked around by normalising the host name or URL at the beginning.
hostname = 'Natürlich!.com'
tldextract.extract(hostname.encode('idna'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/virtualenv/xxx/lib/python3.5/site-packages/tldextract/tldextract.py", line 329, in extract
return TLD_EXTRACTOR(url)
File "/home/user/virtualenv/xxx/lib/python3.5/site-packages/tldextract/tldextract.py", line 186, in __call__
netloc = SCHEME_RE.sub("", url) \
TypeError: cannot use a string pattern on a bytes-like object
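A workaround sketch for the traceback above, assuming Python 3: str.encode('idna') returns bytes, but tldextract's SCHEME_RE is a str pattern, so decode the bytes back to str before extracting. The tldextract call itself is left as a comment so the snippet stays self-contained:

```python
hostname = 'Natürlich!.com'

# encode('idna') yields bytes; tldextract applies a str regex to its input,
# so decode back to an ASCII str first.
ascii_hostname = hostname.encode('idna').decode('ascii')
print(type(ascii_hostname).__name__)  # str
# tldextract.extract(ascii_hostname)  # now receives a str, no TypeError
```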
Quoting the earlier comment: "Python 2.7 and above already embedded IDNA encoding into the string ... I just found out tldextract use the old import idna libraries ..."
Actually the idna package claims it is the newer one:
This acts as a suitable replacement for the “encodings.idna” module that comes with the Python standard library, but only supports the old, deprecated IDNA specification (RFC 3490).
Relevant to the discussion here, the idna package eschews mapping on purpose:
As described in RFC 5895, the IDNA specification no longer normalizes input from different potential ways a user may input a domain name. This functionality, known as a “mapping”, is now considered by the specification to be a local user-interface issue distinct from IDNA conversion functionality.
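The difference is visible with only the standard library: the stdlib codec's nameprep (RFC 3490 era) maps ß away, while under the newer rules ß is a registrable code point in its own right and gets punycoded. The second half below only mimics the newer behaviour with the raw punycode codec; it is an illustration, not the idna package's API:

```python
# Old, mapping behaviour (RFC 3490): the stdlib 'idna' codec folds ß -> ss.
old_style = 'faß.de'.encode('idna').decode('ascii')
print(old_style)  # fass.de

# Newer behaviour (RFC 5890/5891, what the 'idna' package implements):
# ß survives, so the label is punycoded. Mimicked here with the raw
# 'punycode' codec plus the ACE prefix.
new_style = 'xn--' + 'faß'.encode('punycode').decode('ascii') + '.de'
print(new_style)  # xn--fa-hia.de
```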
@011121 I've added the label "low priority: can be solved by pre/post processing". If that's not the case, let me know and I'll remove that label. Also marked with "contributions welcome"
-- while this is some interesting nuts-and-bolts stuff, and was definitely worth discussing ... most users do not need it, seemingly.
Closing stale issue.
There are four characters handled specially by IDN according to unicode.org: ß (U+00DF, sharp s), ς (U+03C2, final sigma), and the zero-width joiner (U+200D) and zero-width non-joiner (U+200C).
My understanding is that handling of these characters differs by domain, at least in some cases. For instance, the sharp-s character only appears to be legal IDN under .de domains; for everyone else it should be converted to 'ss'.
Should this functionality be included in tldextract?
i.e. right now
tldextract.extract('http://faß.de').domain
correctly gives 'faß', but
tldextract.extract('http://faß.au').domain
also gives 'faß', when it probably should give 'fass'.
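A post-processing sketch of what is being asked for, under the assumption stated above that ß is only legal under .de. The helper name and the suffix allow-list are hypothetical; a real fix would key off the suffix tldextract actually extracted rather than a hard-coded set:

```python
# Hypothetical post-processing: apply the transitional sharp-s mapping
# unless the registry (here, just .de) is known to accept ß in labels.
SHARP_S_OK_SUFFIXES = {'de'}  # hypothetical allow-list, not from tldextract

def normalize_domain(domain: str, suffix: str) -> str:
    if suffix in SHARP_S_OK_SUFFIXES:
        return domain                  # .de accepts ß, keep it as-is
    return domain.replace('ß', 'ss')   # everyone else: map ß -> ss

print(normalize_domain('faß', 'de'))  # faß
print(normalize_domain('faß', 'au'))  # fass
```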