Generating the Punycode representation fails for a specific Cherokee string of three symbols, b'\\u13e3\\u13b3\\u13a9'.
What steps will reproduce the problem?
Execute 'ꮳꮃꭹ'.encode('idna') or, more reliably, '\u13e3\u13b3\u13a9'.encode('idna').
What is the expected result?
'xn--f9dt7l'
What happens instead?
'xn--tz9ata7l'
Version affected.
Tested on Python 3.8.3 Windows and Python 3.6.8 CentOS.
Other information.
I was testing whether our product supports internationalized domain names. I wrote a Python script that generates a DNS zone file with Punycode-encoded names and a JavaScript file that makes a browser send requests to URLs containing internationalized domain names. The strings were taken from the Common Locale Data Repository: 193 URLs, one per language.
When executed in Google Chrome, Mozilla Firefox and Microsoft Edge, the domain name 'ꮳꮃꭹ.myhost.local' is converted to 'xn--f9dt7l.myhost.local', but the DNS zone file contains 'xn--tz9ata7l.myhost.local', which is how I found the bug. For the 192 other languages I tested, everything works fine. These are Afrikaans, Aghem, Akan, Amharic, Arabic, Assamese, Asu, Asturian, Azerbaijani, Basaa, Belarusian, Bemba, Bena, Bulgarian, Bambara, Bangla, Tibetan, Breton, Bodo, Bosnian, Catalan, Chakma, Chechen, Cebuano, Chiga, Czech, Church Slavic, Welsh, Danish, Taita, German, Zarma, Lower Sorbian, Duala, Jola-Fonyi, Dzongkha, Embu, Ewe, Greek, English, Esperanto, Spanish, Estonian, Basque, Ewondo, Persian, Fulah, Finnish, Filipino, Faroese, French, Friulian, Western Frisian, Irish, Scottish Gaelic, Galician, Swiss German, Gujarati, Gusii, Manx, Hausa, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Interlingua, Indonesian, Sichuan Yi, Icelandic, Italian, Japanese, Ngomba, Machame, Javanese, Georgian, Kabyle, Kamba, Makonde, Kabuverdianu, Kikuyu, Kako, Kalaallisut, Kalenjin, Khmer, Kannada, Korean, Konkani, Kashmiri, Shambala, Bafia, Colognian, Kurdish, Cornish, Kyrgyz, Langi, Luxembourgish, Ganda, Lakota, Lingala, Lao, Lithuanian, Luba-Katanga, Luo, Luyia, Latvian, Maithili, Masai, Meru, Malagasy, Makhuwa-Meetto, Metaʼ, Maori, Macedonian, Malayalam, Mongolian, Manipuri, Marathi, Malay, Maltese, Mundang, Burmese, Mazanderani, Nama, North Ndebele, Low German, Nepali, Dutch, Kwasio, Norwegian Nynorsk, Nyankole, Oromo, Odia, Ossetic, Punjabi, Polish, Prussian, Pashto, Portuguese, Quechua, Romansh, Rundi, Romanian, Rombo, Russian, Kinyarwanda, Rwa, Samburu, Santali, Sangu, Sindhi, Northern Sami, Sena, Sango, Tachelhit, Sinhala, Slovak, Slovenian, Inari Sami, Shona, Somali, Albanian, Serbian, Swedish, Swahili, Tamil, Telugu, Teso, Tajik, Thai, Tigrinya, Turkish, Tatar, Uyghur, Ukrainian, Urdu, Uzbek, Vai, Volapük, Vunjo, Walser, Wolof, Xhosa, Soga, Yangben, Yiddish, Cantonese, Standard Moroccan Tamazight, Chinese, Traditional Chinese, Zulu.
Somehow specifically Cherokee code points trigger the bug.
On top of that, https://www.punycoder.com/ converts 'ꮳꮃꭹ' into 'xn--f9dt7l' and back. However, 'xn--tz9ata7l' is reported as invalid Punycode.
For the record:
>>> 'ꮳꮃꭹ'.encode('punycode')
b'tz9ata7l'
>>> '\u13e3\u13b3\u13a9'.encode('punycode')
b'f9dt7l'
Also, your unicode-escaped string is an upper-cased version of the first string.
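To make the case relationship between the two strings easier to see, here is a minimal sketch that lists the code points of each string and Punycode-encodes both. The variable names `lower` and `upper` are only illustrative; the escapes are the ones quoted in this thread.

```python
import unicodedata

# Both strings are taken from this thread, written as escapes so that
# copy/paste cannot silently change their case.
lower = '\uabb3\uab83\uab79'  # the pasted string 'ꮳꮃꭹ'
upper = '\u13e3\u13b3\u13a9'  # the unicode-escaped string

for label in (lower, upper):
    for ch in label:
        # Show the code point and its Unicode character name.
        print(f'U+{ord(ch):04X} {unicodedata.name(ch)}')
    # The punycode codec encodes the string as-is, without any case mapping.
    print(label.encode('punycode'))

# The escaped string is the upper-cased form of the pasted one.
print(lower.upper() == upper)
```

The two Punycode results are the ones shown in the session above, so the two candidate ACE labels differ only in which case form was encoded.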
This is how I extract the data from Common Locale Data Repository v37. The script assumes common\main is the working directory.
from os import walk
from xml.etree import ElementTree

# English display names, used to identify each locale's language.
en_root = ElementTree.parse('en.xml')
for (dirpath, dirnames, filenames) in walk('.'):
    for filename in filenames:
        if filename.endswith('.xml'):
            # The locale code is the file name without the '.xml' suffix.
            code = filename[:-4]
            xx_root = ElementTree.parse(filename)
            xx_lang = xx_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')
            en_lang = en_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')
            # find() returns None when en.xml has no entry for this locale code.
            if en_lang is not None and en_lang.text == 'Cherokee':
                print(en_lang.text)
                print(xx_lang.text)
                print(xx_lang.text.encode("unicode_escape"))
                print(xx_lang.text.encode('idna'))
                print(ord(xx_lang.text[0]))
                print(ord(xx_lang.text[1]))
                print(ord(xx_lang.text[2]))
script outputs
Cherokee
ᏣᎳᎩ
b'\\u13e3\\u13b3\\u13a9'
b'xn--tz9ata7l'
5091
5043
5033
If I change text to lower case
print(en_lang.text.lower())
print(xx_lang.text.lower())
print(xx_lang.text.lower().encode("unicode_escape"))
print(xx_lang.text.lower().encode('idna'))
print(ord(xx_lang.text.lower()[0]))
print(ord(xx_lang.text.lower()[1]))
print(ord(xx_lang.text.lower()[2]))
then script outputs
cherokee
ꮳꮃꭹ
b'\\uabb3\\uab83\\uab79'
b'xn--tz9ata7l'
43955
43907
43897
I am not sure where you got the '\u13e3\u13b3\u13a9' string from. '\u13e3\u13b3\u13a9'.lower().encode('unicode_escape') gives b'\\uabb3\\uab83\\uab79'.
I took it from your msg370615:
of even more reliable Execute '\u13e3\u13b3\u13a9'.encode('idna')
There are two IDNA standards, IDNA 2003 and IDNA 2008. Python's standard library only provides IDNA 2003 and does not support IDNA 2008.
# IDNA 2003
>>> '\u13e3\u13b3\u13a9'.encode('idna')
b'xn--tz9ata7l'
# idna package with IDNA 2008
>>> idna.encode('\u13e3\u13b3\u13a9')
b'xn--f9dt7l'
The bug report is a duplicate of bpo-17305.
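Going by the outputs in this thread, the standard library's result matches the Punycode of the lower-case form, while the idna package's result matches the Punycode of the upper-case form. The sketch below makes that comparison explicit. `ace_label` is a hypothetical helper written for this illustration, not a standard-library or idna-package API, and it is not an IDNA implementation: it performs no mapping, normalization or validation and only adds the `xn--` prefix to the raw Punycode.

```python
def ace_label(label: str) -> bytes:
    """Hypothetical helper: encode one label with no case mapping at all."""
    try:
        # Pure ASCII labels are passed through unchanged.
        return label.encode('ascii')
    except UnicodeEncodeError:
        # Non-ASCII labels get the ACE prefix plus raw Punycode.
        return b'xn--' + label.encode('punycode')


upper = '\u13e3\u13b3\u13a9'  # escaped string from this thread
lower = upper.lower()         # the pasted form, per the messages above

print(ace_label(upper))       # label built from the upper-case form
print(ace_label(lower))       # label built from the lower-case form

# Standard-library IDNA 2003 codec; this thread reports b'xn--tz9ata7l'
# on Python 3.6 and 3.8, other versions were not checked here.
print(upper.encode('idna'))
```

Comparing the last line against the first two shows which case form a given Python build ends up encoding. For real zone files, a full IDNA 2008 implementation such as the idna package used above is the safer choice.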
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['3.8', 'type-bug', '3.7', 'expert-unicode']
title = 'idna encoding fails for Cherokee symbols'
updated_at =
user = 'https://bugs.python.org/RomanAkopov'
```
bugs.python.org fields:
```python
activity =
actor = 'christian.heimes'
assignee = 'none'
closed = True
closed_date =
closer = 'christian.heimes'
components = ['Unicode']
creation =
creator = 'Roman Akopov'
dependencies = []
files = []
hgrepos = []
issue_num = 40845
keywords = []
message_count = 5.0
messages = ['370615', '370617', '370628', '370629', '370634']
nosy_count = 5.0
nosy_names = ['vstinner', 'christian.heimes', 'ezio.melotti', 'SilentGhost', 'Roman Akopov']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40845'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']
```