python / cpython

The Python programming language
https://www.python.org

unicodedata.name() doesn't have names for control characters #71683

Open f996885c-52e8-4df9-9d2c-007fec093be8 opened 8 years ago

f996885c-52e8-4df9-9d2c-007fec093be8 commented 8 years ago
BPO 27496
Nosy @ezio-melotti, @bitdancer, @eryksun

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', '3.8', '3.9', '3.10', 'library', 'expert-unicode']
title = "unicodedata.name() doesn't have names for control characters"
updated_at =
user = 'https://bugs.python.org/zwol'
```

bugs.python.org fields:

```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Unicode']
creation =
creator = 'zwol'
dependencies = []
files = []
hgrepos = []
issue_num = 27496
keywords = []
message_count = 4.0
messages = ['270242', '270245', '270247', '270254']
nosy_count = 4.0
nosy_names = ['ezio.melotti', 'r.david.murray', 'eryksun', 'zwol']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue27496'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']
```

f996885c-52e8-4df9-9d2c-007fec093be8 commented 8 years ago

unicodedata.name() doesn't have name information for the C0 and C1 control characters. To see this, run

import pprint, unicodedata
pprint.pprint(["U+{:04X} {}".format(n, unicodedata.name(chr(n), "<missing>")) for n in range(256)])

and you will observe <missing> printed for U+0000 through U+001F and U+007F through U+009F. These characters do have official Unicode names and they should be known to name().
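
The same check can be condensed so that it reports only the code points without a name. This is a minimal sketch: the helper name `unnamed_code_points` is made up for illustration, and the result depends on the Unicode database shipped with the interpreter.

```python
import unicodedata


def unnamed_code_points(limit=256):
    """Return the code points below `limit` for which name() has no entry."""
    sentinel = "<missing>"
    return [n for n in range(limit)
            if unicodedata.name(chr(n), sentinel) == sentinel]


missing = unnamed_code_points()
print(["U+{:04X}".format(n) for n in missing])
```

On an interpreter affected by this issue, the printed list should cover exactly the C0 and C1 ranges mentioned above.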

I may see if I can come up with a patch for this one, in my copious free time.

bitdancer commented 8 years ago

That information is programmatically generated from data files obtained from the Unicode project, as far as I know.

eryksun commented 8 years ago

Character names are in field 1 of UnicodeData.txt. For controls the name is just "<control>". In Tools/unicode/makeunicodedata.py, the makeunicodename function skips names that start with "<". Instead of skipping the character, it could fall back on the Unicode 1.0 name (field 10), if it's defined. For controls, this is the ISO 6429 name:

(10) Old name as published in Unicode 1.0 or ISO 6429 names 
for control functions. This field is empty unless it is 
significantly different from the current name for the 
character. No longer used in code chart production. See 
Name_Alias. 
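
The fallback described above could look roughly like the following. It is only a sketch of the idea, not the code in the generator script: `pick_name` is a hypothetical helper, and the sample records are written out by hand in the UnicodeData.txt layout rather than read from the real file.

```python
# Sample records in UnicodeData.txt format (15 semicolon-separated fields).
# They are written out by hand for illustration, not read from the real file.
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0080;<control>;Cc;0;BN;;;;;N;;;;;
"""


def pick_name(record):
    """Return field 1, or field 10 when field 1 is a "<...>" placeholder."""
    fields = record.split(";")
    name, old_name = fields[1], fields[10]
    if name.startswith("<"):
        # Hypothetical fallback: use the Unicode 1.0 / ISO 6429 name if present.
        return old_name or None
    return name


names = {int(line.split(";")[0], 16): pick_name(line)
         for line in SAMPLE.splitlines()}
print(names)
```

In this sample U+0009 picks up its field 10 name, while U+0080 still ends up without one because its field 10 is empty.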

The names of control characters are also in NameAliases.txt, which gets processed as the unicode.aliases list of (name, char) tuples.
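
For context, name aliases are already usable in the other direction: `unicodedata.lookup()` has accepted them since Python 3.3, so the gap is only in `name()`. A short sketch of that asymmetry:

```python
import unicodedata

# lookup() resolves a name alias to the character.
tab = unicodedata.lookup("CHARACTER TABULATION")
print(repr(tab))

# On interpreters affected by this issue, name() has no entry for the
# same character and falls back to the supplied default.
print(unicodedata.name(tab, "<missing>"))
```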

f996885c-52e8-4df9-9d2c-007fec093be8 commented 8 years ago

It looks to me as if NameAliases.txt is the better reference for the C0 and C1 controls. It matches the UnicodeData.txt field 10 names for most entries where the field 1 name is "<control>", but it has names for U+0080, U+0081, U+0084, and U+0099, which have no field 10 name. The only catch is that NameAliases may have *several* names for the same character, with the same category tag, e.g.

0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control

It probably makes sense to consistently use the first listed.
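
Picking the first listed alias could be sketched as below. This is an illustration under stated assumptions, not a patch: `first_control_aliases` is a hypothetical helper, it only considers aliases tagged "control", and the third sample line is added purely to show that other alias types are ignored.

```python
# Lines in NameAliases.txt format: code point; alias; type.
# The first two come from the comment above; the third is illustrative.
SAMPLE = """\
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;TAB;abbreviation
"""


def first_control_aliases(text):
    """Map each code point to its first alias of type "control"."""
    chosen = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        if kind == "control":
            chosen.setdefault(int(code, 16), alias)  # keep the first one seen
    return chosen


print(first_control_aliases(SAMPLE))
```

Using `setdefault` keeps whichever alias appears first in file order, so the outcome follows the ordering of the data file rather than any sorting done in code.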