Definition of a "character" is wrong

8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago

BPO	1581182
Nosy	@malemburg, @loewis, @birkenfeld, @devdanzin, @ezio-melotti
Superseder	bpo-20906: Issues in Unicode HOWTO

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['type-feature', 'expert-unicode', 'docs']
title = 'Definition of a "character" is wrong'
updated_at =
user = 'https://bugs.python.org/Rhamphoryncus'
```

bugs.python.org fields:

```python
activity =
actor = 'ezio.melotti'
assignee = 'docs@python'
closed = True
closed_date =
closer = 'ezio.melotti'
components = ['Documentation', 'Unicode']
creation =
creator = 'Rhamphoryncus'
dependencies = []
files = []
hgrepos = []
issue_num = 1581182
keywords = []
message_count = 9.0
messages = ['61023', '61024', '61025', '61026', '61027', '84524', '84554', '112466', '214532']
nosy_count = 8.0
nosy_names = ['lemburg', 'loewis', 'georg.brandl', 'Rhamphoryncus', 'ajaksu2', 'ezio.melotti', 'docs@python', 'BreamoreBoy']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = '20906'
type = 'enhancement'
url = 'https://bugs.python.org/issue1581182'
versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2']
```

8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago

Python's definition of a character does not match that of Unicode. Python's documentation should, at a minimum, explain how Python's definition compares to Unicode's definitions of a code unit, code point, glyph, grapheme cluster, and character.

Unicode's definition of a character can be found here: http://unicode.org/reports/tr17/

Python seems to use the Code Units option given here: http://www.unicode.org/faq/char_combmark.html#7
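
To make the distinction concrete, here is a minimal sketch of how the counts can differ for the same text. It assumes CPython 3.3 or later, where `str` is indexed by code point. On the narrow builds of Python 2 that were common when this issue was filed, `len()` on a `unicode` object reported UTF-16 code units instead. The variable names are only illustrative.

```python
# Assumption: CPython 3.3+, where len(str) counts code points.
s = "\U0001F600"  # a single code point outside the Basic Multilingual Plane

code_points = len(s)                           # code points in the string
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units in UTF-16
utf8_units = len(s.encode("utf-8"))            # 8-bit code units in UTF-8

# One user-perceived character built from two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
combining = "e\u0301"
combining_code_points = len(combining)

print(code_points, utf16_units, utf8_units, combining_code_points)
```

None of these counts is a grapheme cluster count. The last value is 2 even though a reader would see a single accented letter.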

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

Logged In: YES user_id=21627

The Python string type is not at all Unicode compliant, so I don't see a need to use Unicode terminology to explain it.

8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago

Logged In: YES user_id=12364

Sorry, I wasn't clear. I only intended this to be about the unicode type.

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

Logged In: YES user_id=21627

Ok. Can you come up with a patch?

8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago

Logged In: YES user_id=12364

Not at the moment.

90baf024-6604-450d-8341-d796fe6858f3 commented 15 years ago

Anyone brave enough can find the mentioned definitions in the thread below. Reading all of it is necessary, as there are some contradictory quotes and interpretations before an agreement is (sort of) achieved.

http://mail.python.org/pipermail/python-dev/2008-July/080886.html

malemburg commented 15 years ago

See this talk for an explanation of the various Unicode terms and how they map to Python's implementation:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Also note that the Unicode standard has evolved a lot since Unicode support was added to Python in late 1999. Some terms used in Python differ from those used in Unicode 5.0 or have been defined in more strict ways than were common at the time.

And finally: don't forget that Python provides ways of *working* with Unicode, i.e. it does not guarantee that a Python Unicode string always contains all code points required for e.g. UTF-16. It is entirely possible to store lone surrogates and invalid or unassigned code points in a Python Unicode string.
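
A short sketch of that last point, assuming Python 3 (the thread itself predates it, so treat this as an illustration rather than the behaviour of the versions listed on this issue). A lone surrogate and a noncharacter are both accepted by `str`, and only the strict UTF-8 encoder rejects the surrogate.

```python
import unicodedata

# Assumption: Python 3 semantics for str and the codecs.
lone = "\ud800"  # a lone high surrogate, not a valid Unicode scalar value
lone_length = len(lone)  # the string stores it as one code point

try:
    lone.encode("utf-8")  # the strict encoder refuses surrogates
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The "surrogatepass" error handler encodes the surrogate anyway.
passed_through = lone.encode("utf-8", "surrogatepass")

# U+FFFF is a noncharacter; str stores it and reports category "Cn".
nonchar = "\uffff"
nonchar_category = unicodedata.category(nonchar)
nonchar_bytes = nonchar.encode("utf-8")

print(lone_length, strict_encode_ok, passed_through, nonchar_category)
```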

malemburg commented 14 years ago

Without a patch, I don't see how this issue can be moved forward.

Adding a list of such Unicode term definitions would at best cause additional confusion and only address people knowledgeable in the Unicode field.

Note that Python's use of code units and code points matches those of the Unicode standard in most respects. Glyphs and all higher-level definitions are out-of-scope for Python.
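
As a rough illustration of where that boundary sits, the sketch below (assuming Python 3 and the standard `unicodedata` module) shows that string comparison works on code points, so two sequences that render as the same accented letter compare unequal until they are normalized. Anything beyond that, such as grapheme cluster segmentation, is left to the application.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT (two code points)
precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (one code point)

# Comparison is code point by code point, so these differ.
equal_as_stored = decomposed == precomposed

# Normalizing to NFC composes the pair into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
equal_after_nfc = nfc == precomposed

# Normalizing to NFD decomposes the precomposed form again.
nfd = unicodedata.normalize("NFD", precomposed)
equal_after_nfd = nfd == decomposed

print(equal_as_stored, equal_after_nfc, equal_after_nfd, len(nfc), len(nfd))
```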

83d2e70e-e599-4a04-b820-3814bbdb9bef commented 10 years ago

Can this be tied in with the work being done on the Unicode HOWTO in bpo-20906?

python / cpython

Definition of a "character" is wrong #44150