Open f996885c-52e8-4df9-9d2c-007fec093be8 opened 8 years ago
unicodedata.name() doesn't have name information for the C0 and C1 control characters. To see this, run
import pprint, unicodedata
pprint.pprint(["U+{:04X} {}".format(n, unicodedata.name(chr(n), "<missing>")) for n in range(256)])
and you will observe \<missing> printed for U+0000 through U+001F and U+007F through U+009F. These characters do have official Unicode names and they should be known to name().
I may see if I can come up with a patch for this one, in my copious free time.
That information is programmatically generated from data files obtained from the Unicode project, as far as I know.
Character names are in field 1 of UnicodeData.txt. For controls the name is just "\<control>". In Tools/unicode/makeunicodedata.py, the makeunicodename function skips names that start with "\<". Instead of skipping the character, it could fall back on the Unicode 1.0 name (field 10), if it's defined. For controls, this is the ISO 6429 name:
(10) Old name as published in Unicode 1.0 or ISO 6429 names
for control functions. This field is empty unless it is
significantly different from the current name for the
character. No longer used in code chart production. See
Name_Alias.
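The fallback suggested above could look roughly like this. It is a minimal standalone sketch, not the actual code in makeunicodedata.py: `pick_name` and the `SAMPLE` records are hypothetical, and the sample lines are only written in the UnicodeData.txt layout (semicolon-separated, field 1 is the name, field 10 is the Unicode 1.0 name).

```python
# Illustrative records in UnicodeData.txt layout (assumed sample data).
SAMPLE = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0080;<control>;Cc;0;BN;;;;;N;;;;;
"""


def pick_name(record):
    """Return the name to store for one record, or None if there is none.

    Hypothetical helper: instead of skipping names that start with "<",
    fall back on field 10 when it is non-empty.
    """
    fields = record.split(";")
    name, old_name = fields[1], fields[10]
    if name.startswith("<"):
        # "<control>" and similar placeholders are not real names.
        return old_name or None
    return name


# Map code point -> chosen name for every sample record.
names = {
    int(line.split(";")[0], 16): pick_name(line)
    for line in SAMPLE.splitlines()
}

for cp, name in sorted(names.items()):
    print("U+{:04X} {}".format(cp, name))
```

In this sketch U+0080 still ends up without a name, because its field 10 is empty.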
The names of control characters are also in NameAliases.txt, which gets processed as the unicode.aliases list of (name, char) tuples.
It looks to me as if NameAliases.txt is the better reference for the C0 and C1 controls. It matches the UnicodeData.txt field 10 names for most entries where the field 1 name is "\<control>", but it has names for U+0080, U+0081, U+0084, and U+0099, which have no field 10 name. The only catch is that NameAliases may have *several* names for the same character, with the same category tag, e.g.
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
It probably makes sense to consistently use the first listed.
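A rough sketch of the "use the first listed" rule, assuming NameAliases.txt lines have the form `code;alias;type` with `#` comments. `first_control_aliases` and `SAMPLE_ALIASES` are hypothetical names for illustration, and the third sample line is an assumed entry added to show that other alias types are ignored. This is not how makeunicodedata.py currently processes the file.

```python
# Illustrative lines in NameAliases.txt layout (assumed sample data).
SAMPLE_ALIASES = """\
# code;alias;type
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
"""


def first_control_aliases(text):
    """Map each code point to its first alias whose type is "control"."""
    chosen = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace, skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, alias, kind = line.split(";")
        if kind == "control":
            # setdefault keeps the first alias seen for a code point.
            chosen.setdefault(int(code, 16), alias)
    return chosen


print(first_control_aliases(SAMPLE_ALIASES))
```

Relying on file order keeps the choice deterministic, at the cost of depending on how the Unicode project happens to order the entries.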
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', '3.8', '3.9', '3.10', 'library', 'expert-unicode']
title = "unicodedata.name() doesn't have names for control characters"
updated_at =
user = 'https://bugs.python.org/zwol'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Unicode']
creation =
creator = 'zwol'
dependencies = []
files = []
hgrepos = []
issue_num = 27496
keywords = []
message_count = 4.0
messages = ['270242', '270245', '270247', '270254']
nosy_count = 4.0
nosy_names = ['ezio.melotti', 'r.david.murray', 'eryksun', 'zwol']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue27496'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']
```