john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Special character domains? #113

Closed 011121 closed 1 year ago

011121 commented 7 years ago

There are four characters handled specially by IDN according to unicode.org:

Q: Which four characters are interpreted differently?

A: Four characters can cause an IDNA2008 implementation to go to a different web page than an IDNA2003 implementation, given the same source, such as href="http://faß.de". These four characters include some that are quite common in languages such as German, Greek, Farsi, and Sinhala:

U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER

For the purposes of discussion of differences between IDNA versions, these characters are called "deviations". http://unicode.org/faq/idn.html

My understanding is that the handling of these characters differs between domains, at least in some cases. For instance, the sharp-s character appears to be legal in an IDN only under .de. Under any other suffix it should be converted to 'ss'.

Should this functionality be included in tldextract?

For example, right now tldextract.extract('http://faß.de').domain correctly gives 'faß', but tldextract.extract('http://faß.au').domain also gives 'faß' when it probably should give 'fass'.
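To make the deviation behaviour concrete, here is a minimal sketch using only the Python 3 standard library. The built-in 'idna' codec implements the older IDNA2003 rules (RFC 3490), so it maps all four characters away. The helper name `idna2003_ascii` and the sample host names are illustrative assumptions, not part of tldextract.

```python
def idna2003_ascii(hostname: str) -> str:
    """Encode a bare host name with the stdlib IDNA2003 codec, returning str."""
    return hostname.encode("idna").decode("ascii")


samples = [
    "faß.de",        # sharp s is mapped to "ss"
    "βόλος.com",     # final sigma is mapped to a regular sigma
    "a\u200cb.de",   # ZERO WIDTH NON-JOINER is dropped
    "a\u200db.de",   # ZERO WIDTH JOINER is dropped
]

for sample in samples:
    print(repr(sample), "->", idna2003_ascii(sample))

# Under IDNA2003 the sharp-s form and the "ss" form are the same name.
print(idna2003_ascii("faß.de") == idna2003_ascii("fass.de"))
```

An IDNA2008 implementation would instead keep these characters, which is why the same link can resolve to a different host depending on the implementation.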

john-kurkowski commented 7 years ago

Whew, that FAQ is interesting but a doozy! If I were to summarize, a lot of the entries are of the form, "Q: Shouldn't ABC be simple? A: No, consider exception XYZ." 😏 Is there a clear-cut mapping and algorithm for unwinding these deviations? Maybe I missed it. To include the algorithm in tldextract, it depends on how complicated the algorithm is and how many input options it has to handle. Perhaps it belongs in another library.
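For what it's worth, the mapping for the four deviations themselves is small. The sketch below folds them the way IDNA2003 (and UTS #46 transitional processing) would: ß becomes "ss", ς becomes σ, and the two zero-width characters are removed. The function name `fold_deviations` and the `keep_for_suffixes` parameter are hypothetical, and the per-suffix exemption only illustrates the idea from the original report. Which registries actually accept these characters is a policy question this sketch does not answer.

```python
# Mapping for the four "deviation" characters from the unicode.org FAQ.
DEVIATION_MAP = str.maketrans({
    "\u00df": "ss",       # LATIN SMALL LETTER SHARP S -> "ss"
    "\u03c2": "\u03c3",   # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
    "\u200c": "",         # ZERO WIDTH NON-JOINER -> removed
    "\u200d": "",         # ZERO WIDTH JOINER -> removed
})


def fold_deviations(hostname: str, keep_for_suffixes: tuple = ()) -> str:
    """Fold deviation characters unless the host ends with an exempt suffix.

    keep_for_suffixes is a hypothetical allow-list such as (".de",).
    """
    if keep_for_suffixes and hostname.endswith(tuple(keep_for_suffixes)):
        return hostname
    return hostname.translate(DEVIATION_MAP)


print(fold_deviations("faß.au"))                              # fass.au
print(fold_deviations("faß.de", keep_for_suffixes=(".de",)))  # faß.de
```

Run as a pre-processing step, this would leave tldextract itself unchanged.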

commutecat commented 7 years ago

@john-kurkowski Python 2.7 and above already ship with a built-in IDNA codec for strings (with the caveat that a Python 2 string is a byte string, whereas a Python 3 string is Unicode). The same codec also performs the Punycode conversion, e.g. for homoglyph domains.

# Python 3
>>> url = 'http://faß.de'
>>> url.encode('idna')
b'http://fass.de'

>>> url.encode('idna').decode("ascii")
'http://fass.de'

>>> punyurl = "http://ρaypal.com"
>>> punyurl.encode('idna')
b'xn--http://aypal-v3i.com'

>>> punyurl.encode('idna').decode("ascii")
'xn--http://aypal-v3i.com'

# In python 3, it is crucial to decode idna conversion bytes back to unicode, to avoid surprises. 
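Note that encoding the whole URL lets the scheme leak into the first label, which is where the odd 'xn--http://...' result comes from. A minimal sketch of a safer variant, assuming Python 3: split the URL first, encode only the host, and decode the result back to str. The helper name `idna_host` is made up for illustration.

```python
from urllib.parse import urlsplit


def idna_host(url: str) -> str:
    """Return the URL's host name in IDNA2003 ASCII form, as a str."""
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"no hostname found in {url!r}")
    # The stdlib codec returns bytes; decode so callers get text again.
    return host.encode("idna").decode("ascii")


print(idna_host("http://faß.de"))      # fass.de
print(idna_host("http://ρaypal.com"))  # an xn-- label without the scheme in it
```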

Here is the trouble with Python 2: a plain string literal is a byte string, and str.encode('idna') only accepts unicode.

>>> url = 'http://faß.de'
>>> url.encode('idna')
TypeError: normalize() argument 2 must be unicode, not str

>>> url.decode("utf-8").encode('idna')
'http://fass.de'
>>> punyurl = "http://ρaypal.com"
>>> punyurl.decode("utf-8").encode("idna")
'xn--http://aypal-v3i.com'

I only found out that tldextract uses the old idna library while diagnosing an error thrown by tldextract. The problem is caused by that old idna import, although the issue can be avoided by normalising the host name or URL up front.

hostname = 'Natürlich!.com'
tldextract.extract(hostname.encode('idna'))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/virtualenv/xxx/lib/python3.5/site-packages/tldextract/tldextract.py", line 329, in extract
    return TLD_EXTRACTOR(url)
  File "/home/user/virtualenv/xxx/lib/python3.5/site-packages/tldextract/tldextract.py", line 186, in __call__
    netloc = SCHEME_RE.sub("", url) \
TypeError: cannot use a string pattern on a bytes-like object

john-kurkowski commented 7 years ago

Python 2.7 and above already embedded IDNA encoding into the string ... I just found out tldextract use the old import idna libraries, which is due old import idna

Actually, the idna package claims it is the newer one, and that the standard library's encodings.idna module is the one tied to the old specification:

This acts as a suitable replacement for the “encodings.idna” module that comes with the Python standard library, but only supports the old, deprecated IDNA specification (RFC 3490).

Relevant to the discussion here, the idna package eschews mapping on purpose:

As described in RFC 5895, the IDNA specification no longer normalizes input from different potential ways a user may input a domain name. This functionality, known as a “mapping”, is now considered by the specification to be a local user-interface issue distinct from IDNA conversion functionality.
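As a rough illustration of what such a local "mapping" step can look like, here is a sketch along the lines of the steps RFC 5895 describes: lowercase, fold full-width and half-width forms, normalize to NFC, and treat the ideographic full stop as a dot. It uses only the standard library. The function name `local_mapping` is an assumption, and this is not a complete or authoritative implementation of the RFC. Notably, it leaves ß untouched, so the deviation question from this issue would still need its own decision.

```python
import unicodedata


def local_mapping(name: str) -> str:
    """Apply an RFC 5895 style pre-processing pass to user input."""
    mapped = []
    for ch in name.lower():
        decomp = unicodedata.decomposition(ch)
        # Full-width / half-width characters decompose as "<wide> XXXX"
        # or "<narrow> XXXX"; replace them with their base characters.
        if decomp.startswith(("<wide>", "<narrow>")):
            ch = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
        mapped.append(ch)
    result = unicodedata.normalize("NFC", "".join(mapped))
    # IDEOGRAPHIC FULL STOP is treated as a label separator.
    return result.replace("\u3002", ".")


print(local_mapping("ＥＸＡＭＰＬＥ。com"))  # example.com
print(local_mapping("Faß.DE"))              # faß.de
```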

floer32 commented 5 years ago

@011121 i've added the label "low priority: can be solved by pre/post processing". if that's not the case, let me know and i'll remove that label.

also marked with "contributions welcome". while this is some interesting nuts-and-bolts stuff, and was definitely worth discussing, most users do not need it, seemingly.

john-kurkowski commented 1 year ago

… most users do not need it, seemingly.

Closing stale issue.