kjd / idna

Internationalized Domain Names for Python (IDNA 2008 and UTS #46)
BSD 3-Clause "New" or "Revised" License

Codepoint U+2603 not allowed #136

Closed Gallaecio closed 1 year ago

Gallaecio commented 1 year ago
>>> import idna
>>> idna.encode("☃")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/adrian/temporal/venv/lib/python3.10/site-packages/idna/core.py", line 360, in encode
    s = alabel(label)
  File "/home/adrian/temporal/venv/lib/python3.10/site-packages/idna/core.py", line 269, in alabel
    check_label(label)
  File "/home/adrian/temporal/venv/lib/python3.10/site-packages/idna/core.py", line 250, in check_label
    raise InvalidCodepoint('Codepoint {} at position {} of {} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2603 at position 1 of '☃' not allowed
>>> 

However, the range U+2600..U+2613 is marked as valid in the IDNA mapping table, at least in versions 5.2.0 through 15.0.0.

kjd commented 1 year ago

It is an emoji, so it is not PVALID.

kjd commented 1 year ago

To elaborate on this, from the README:

Emoji. It is an occasional request to support emoji domains in this library. Encoding of symbols like emoji is expressly prohibited by the technical standard IDNA 2008 and emoji domains are broadly phased out across the domain industry due to associated security risks. For now, applications that need to support these non-compliant labels may wish to consider trying the encode/decode operation in this library first, and then falling back to using encodings.idna. See https://github.com/kjd/idna/issues/18 for more discussion.
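
A minimal sketch of that fallback, for anyone landing here. The helper name encode_with_fallback is hypothetical and not part of this library. The guarded import is only there so the snippet runs even where idna is not installed. Note that the standard library codec implements the older IDNA 2003 rules, so it accepts labels that IDNA 2008 rejects.

```python
try:
    import idna  # third-party package discussed in this thread
except ImportError:  # assumption: allow the sketch to run without it
    idna = None


def encode_with_fallback(domain: str) -> bytes:
    """Try IDNA 2008 encoding first, then fall back to the stdlib codec.

    Hypothetical helper for illustration only.
    """
    if idna is not None:
        try:
            return idna.encode(domain)
        except idna.IDNAError:
            # Raised for labels IDNA 2008 disallows, e.g. emoji
            # (InvalidCodepoint is a subclass of IDNAError).
            pass
    # Standard library codec (encodings.idna, IDNA 2003 rules).
    return domain.encode("idna")


print(encode_with_fallback("example.com"))
print(encode_with_fallback("☃.net"))
```

Whether accepting such non-compliant labels is appropriate depends on the application, given the security concerns the README mentions.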

kjd commented 1 year ago

Following up further (apologies), your other recently opened issue directed me to this text from UTS46 that I think is relevant:

Note that this preprocessing allows some characters that are invalid according to IDNA2008. However, the IDNA2008 processing will catch those characters. For example, a Unicode string containing a character listed as DISALLOWED in IDNA2008, such as U+2665 (♥) BLACK HEART SUIT, will pass the preprocessing step without an error, but subsequent application of the IDNA2008 processing will fail with an error, indicating that the string is not a valid IDN according to IDNA2008.

While this applies to a heart, it similarly applies to all emojis. The key thing to note in your supplied link to UTS46 mapping tables is the column that reads NV8, which means that range is Not Valid in IDNA2008.
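
To make the NV8 point concrete, here is a rough sketch of reading one semicolon-separated entry of the kind found in the mapping table. The sample line is illustrative rather than copied from a specific table version, and the assumption that the IDNA2008 status sits in the fourth field reflects the table versions discussed in this thread; other versions may be laid out differently. The names MappingEntry and parse_mapping_line are hypothetical.

```python
from typing import NamedTuple, Optional


class MappingEntry(NamedTuple):
    first: int             # first code point of the range
    last: int              # last code point of the range (inclusive)
    status: str            # UTS46 status, e.g. "valid"
    idna2008_status: str   # e.g. "NV8", or "" when the field is empty


def parse_mapping_line(line: str) -> Optional[MappingEntry]:
    """Parse one table-style line; return None for blank/comment lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    first, _, last = fields[0].partition("..")
    last = last or first  # single code point entries have no ".."
    status = fields[1] if len(fields) > 1 else ""
    idna2008 = fields[3] if len(fields) > 3 else ""  # assumed column position
    return MappingEntry(int(first, 16), int(last, 16), status, idna2008)


# Illustrative line shaped like the entry discussed above (not verbatim).
SAMPLE_LINE = "2600..2613 ; valid ; ; NV8 # illustrative comment"
entry = parse_mapping_line(SAMPLE_LINE)

covers_snowman = entry.first <= 0x2603 <= entry.last
print(covers_snowman, entry.status, entry.idna2008_status)
```

An entry can therefore be "valid" for the UTS46 preprocessing step and still carry NV8, which is why the IDNA 2008 check in this library rejects it.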

Gallaecio commented 1 year ago

Thanks. I will use the workaround for now.

j-bernard commented 1 year ago

FYI, here is an explanation of why emojis are prohibited: https://www.icann.org/en/system/files/files/idn-emojis-domain-names-13feb19-en.pdf. This may not help with your issue but it is useful to understand why those choices have been made.