Closed 10 years ago
Python's definition of a character does not match that of Unicode. Python's documentation should, at a minimum, explain how Python's definition compares to Unicode's definitions of a code unit, code point, glyph, grapheme cluster, and character.
Unicode's definition of a character can be found here: http://unicode.org/reports/tr17/
Python seems to use the Code Units option given here: http://www.unicode.org/faq/char_combmark.html#7
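To make the mismatch concrete, here is a minimal sketch, assuming Python 3.3 or later, where `len()` on a `str` counts code points (the narrow builds of the versions this report was filed against counted UTF-16 code units instead). The variable names are illustrative only. A base letter plus a combining mark is one user-perceived character but two code points:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: a single grapheme cluster
# made of two code points.
decomposed = "e\u0301"

# NFC normalization folds the pair into the precomposed U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

decomposed_length = len(decomposed)  # counts code points, not graphemes
composed_length = len(composed)

print(decomposed_length, composed_length)
```

Both strings render the same way, yet their lengths differ, so neither value of `len()` corresponds to a "character" in the user-perceived sense.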
Logged In: YES user_id=21627
The Python string type is not at all Unicode compliant, so I don't see a need to use Unicode terminology to explain it.
Logged In: YES user_id=12364
Sorry, I wasn't clear. I only intended this to be about the unicode type.
Logged In: YES user_id=21627
Ok. Can you come up with a patch?
Logged In: YES user_id=12364
Not at the moment.
Anyone brave enough can find the mentioned definitions in the thread below. Reading all of it is necessary, as there are some contradictory quotes and interpretations before an agreement is (sort of) achieved.
http://mail.python.org/pipermail/python-dev/2008-July/080886.html
See this talk for an explanation of the various Unicode terms and how they map to Python's implementation:
http://www.egenix.com/library/presentations/#PythonAndUnicode
Also note that the Unicode standard has evolved a lot since Unicode support was added to Python in late 1999. Some terms used in Python differ from those used in Unicode 5.0, or have since been defined more strictly than was common at the time.
And finally: don't forget that Python provides ways of *working* with Unicode, i.e. it does not guarantee that a Python Unicode string always contains all code points required for e.g. UTF-16. It is entirely possible to store lone surrogates and invalid or unassigned code points in a Python Unicode string.
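A short sketch of that last point, assuming Python 3 (variable names are illustrative only): a lone surrogate and a noncharacter are both accepted inside a string, and the surrogate is only rejected when the string is encoded strictly.

```python
import unicodedata

# A lone high surrogate is accepted as string content.
lone_surrogate = "\ud800"

# Strict UTF-8 encoding refuses it.
try:
    lone_surrogate.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The "surrogatepass" error handler encodes it anyway.
passed_through = lone_surrogate.encode("utf-8", "surrogatepass")

# U+FFFF is a permanently unassigned noncharacter; it is stored and
# encoded without complaint.
noncharacter = "\uffff"
noncharacter_category = unicodedata.category(noncharacter)
noncharacter_bytes = noncharacter.encode("utf-8")

print(strict_encode_ok, passed_through, noncharacter_category)
```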
Without a patch, I don't see how this issue can be moved forward.
Adding a list of such Unicode term definitions would at best cause additional confusion and only address people knowledgeable in the Unicode field.
Note that Python's use of code units and code points matches that of the Unicode standard in most respects. Glyphs and all higher-level definitions are out-of-scope for Python.
Can this be tied in with the work being done on the Unicode HOWTO in bpo-20906?
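As an illustration of the code unit versus code point distinction, here is a minimal sketch assuming Python 3.3 or later, where a `str` is indexed by code point (names are illustrative only). The same single code point occupies a different number of code units depending on the encoding form:

```python
# U+1F600 lies outside the Basic Multilingual Plane.
text = "\U0001F600"

code_points = len(text)

# UTF-16 code units are 16 bits wide, so divide the byte length by 2.
utf16_code_units = len(text.encode("utf-16-le")) // 2

# UTF-8 code units are single bytes.
utf8_code_units = len(text.encode("utf-8"))

print(code_points, utf16_code_units, utf8_code_units)
```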
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['type-feature', 'expert-unicode', 'docs']
title = 'Definition of a "character" is wrong'
updated_at =
user = 'https://bugs.python.org/Rhamphoryncus'
```
bugs.python.org fields:
```python
activity =
actor = 'ezio.melotti'
assignee = 'docs@python'
closed = True
closed_date =
closer = 'ezio.melotti'
components = ['Documentation', 'Unicode']
creation =
creator = 'Rhamphoryncus'
dependencies = []
files = []
hgrepos = []
issue_num = 1581182
keywords = []
message_count = 9.0
messages = ['61023', '61024', '61025', '61026', '61027', '84524', '84554', '112466', '214532']
nosy_count = 8.0
nosy_names = ['lemburg', 'loewis', 'georg.brandl', 'Rhamphoryncus', 'ajaksu2', 'ezio.melotti', 'docs@python', 'BreamoreBoy']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = '20906'
type = 'enhancement'
url = 'https://bugs.python.org/issue1581182'
versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2']
```