python / cpython

The Python programming language
https://www.python.org

Definition of a "character" is wrong #44150

Closed 8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 closed 10 years ago

8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago
BPO 1581182
Nosy @malemburg, @loewis, @birkenfeld, @devdanzin, @ezio-melotti
Superseder
  • bpo-20906: Issues in Unicode HOWTO
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['type-feature', 'expert-unicode', 'docs']
    title = 'Definition of a "character" is wrong'
    updated_at =
    user = 'https://bugs.python.org/Rhamphoryncus'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'ezio.melotti'
    assignee = 'docs@python'
    closed = True
    closed_date =
    closer = 'ezio.melotti'
    components = ['Documentation', 'Unicode']
    creation =
    creator = 'Rhamphoryncus'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 1581182
    keywords = []
    message_count = 9.0
    messages = ['61023', '61024', '61025', '61026', '61027', '84524', '84554', '112466', '214532']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'georg.brandl', 'Rhamphoryncus', 'ajaksu2', 'ezio.melotti', 'docs@python', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '20906'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1581182'
    versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2']
    ```

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago

    Python's definition of a character does not match that of Unicode. Python's documentation should, at a minimum, explain how Python's definition compares to Unicode's definitions of a code unit, code point, glyph, grapheme cluster, and character.

    Unicode's definition of a character can be found here: http://unicode.org/reports/tr17/

    Python seems to use the Code Units option given here: http://www.unicode.org/faq/char_combmark.html#7
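To make the distinction in the report concrete, here is a minimal sketch that counts code points and UTF-16 code units for a few strings. It assumes Python 3.3 or later, where `len()` on a `str` counts code points; on the narrow builds that were common when this issue was filed, `len()` could instead report UTF-16 code units. The helper `utf16_code_units` is a hypothetical name used only for this illustration.

```python
import unicodedata


def utf16_code_units(text):
    """Return how many 16-bit code units are needed to encode text as UTF-16."""
    # Encode without a byte order mark, then divide the byte count by two.
    return len(text.encode("utf-16-le")) // 2


# One user-perceived character written as two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# The same text after NFC normalization: the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# A code point outside the Basic Multilingual Plane.
astral = "\U0001F600"

for label, s in [("decomposed", decomposed),
                 ("composed", composed),
                 ("astral", astral)]:
    print(label, "code points:", len(s), "UTF-16 code units:", utf16_code_units(s))

# On Python 3.3+ this prints:
# decomposed code points: 2 UTF-16 code units: 2
# composed code points: 1 UTF-16 code units: 1
# astral code points: 1 UTF-16 code units: 2
```

None of these counts corresponds to a grapheme cluster: the decomposed and composed strings look identical when rendered but have different lengths.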

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

    The Python string type is not at all Unicode compliant, so I don't see a need to use Unicode terminology to explain it.

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago

    Sorry, I wasn't clear. I only intended this to be about the unicode type.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

    Ok. Can you come up with a patch?

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 17 years ago

    Not at the moment.

    90baf024-6604-450d-8341-d796fe6858f3 commented 15 years ago

    Anyone brave enough can find the mentioned definitions in the thread below. Reading all of it is necessary, as there are some contradictory quotes and interpretations before an agreement is (sort of) achieved.

    http://mail.python.org/pipermail/python-dev/2008-July/080886.html

    malemburg commented 15 years ago

    See this talk for an explanation of the various Unicode terms and how they map to Python's implementation:

    http://www.egenix.com/library/presentations/#PythonAndUnicode

    Also note that the Unicode standard has evolved a lot since Unicode support was added to Python in late 1999. Some terms used in Python differ from those used in Unicode 5.0, or have since been defined more strictly than was common at the time.

    And finally: don't forget that Python provides ways of *working* with Unicode. It does not guarantee that a Python Unicode string is always well-formed, e.g. that it contains every code point required to form valid UTF-16. It is entirely possible to store lone surrogates and invalid or unassigned code points in a Python Unicode string.
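The point about lone surrogates can be checked directly. The following is a small sketch, assuming Python 3 (the `surrogatepass` error handler is a standard codec error handler there). U+FFFF is used as the second example because it is a noncharacter that `unicodedata` reports with the unassigned category `Cn`.

```python
import unicodedata

# A high surrogate with no matching low surrogate. Python stores it as an
# ordinary one-element string even though it is not valid on its own.
lone = "\ud800"
print(len(lone))                   # 1
print(unicodedata.category(lone))  # Cs (surrogate)

# The strict UTF-8 encoder refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)               # True

# The "surrogatepass" error handler encodes it anyway.
raw = lone.encode("utf-8", "surrogatepass")
print(raw)                         # b'\xed\xa0\x80'

# A noncharacter can be stored as well; its general category is Cn.
nonchar = "\uffff"
print(unicodedata.category(nonchar))  # Cn
```

So a string being accepted by Python says nothing about whether it can be encoded or exchanged as well-formed Unicode text.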

    malemburg commented 14 years ago

    Without a patch, I don't see how this issue can be moved forward.

    Adding a list of such Unicode term definitions would at best cause additional confusion and only address people knowledgeable in the Unicode field.

    Note that Python's use of code units and code points matches those of the Unicode standard in most respects. Glyphs and all higher-level definitions are out-of-scope for Python.
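Because higher-level units are out of scope for the string type, code that needs them has to build them on top of code points. The sketch below is a deliberately naive approximation that attaches combining marks to the preceding code point using `unicodedata.combining`. It is not an implementation of the Unicode text segmentation rules, and the function name `naive_clusters` is hypothetical.

```python
import unicodedata


def naive_clusters(text):
    """Group each combining mark with the code point that precedes it.

    This is only an approximation: it ignores the full grapheme cluster
    rules (for example Hangul syllables or emoji sequences).
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # "café" with a decomposed final letter

print(len(word))                  # 5 code points
print(len(naive_clusters(word)))  # 4 groups
```

Indexing or slicing `word` still operates on code points, so `word[:4]` silently drops the accent.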

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 10 years ago

    Can this be tied in with the work being done on the Unicode HOWTO in bpo-20906?