python / cpython

The Python programming language
https://www.python.org

Enhanced \N{} escapes for Unicode strings #62814

Open stevendaprano opened 11 years ago

stevendaprano commented 11 years ago
BPO 18614
Nosy @terryjreedy, @ezio-melotti, @stevendaprano
Files
  • issue18614.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature', 'expert-unicode']
    title = 'Enhanced \\N{} escapes for Unicode strings'
    updated_at =
    user = 'https://github.com/stevendaprano'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'terry.reedy'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation =
    creator = 'steven.daprano'
    dependencies = []
    files = ['31112']
    hgrepos = []
    issue_num = 18614
    keywords = ['patch']
    message_count = 3.0
    messages = ['194075', '194087', '194123']
    nosy_count = 4.0
    nosy_names = ['terry.reedy', 'ezio.melotti', 'mrabarnett', 'steven.daprano']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18614'
    versions = ['Python 3.4']
    ```

    As per the discussion here:

    http://mail.python.org/pipermail/python-ideas/2013-July/022419.html

    \N{} escapes should support the Unicode code point notation U+xxxx (where there are four, five or six hex digits after the U+).

    E.g. '\N{U+03BB}' => 'λ'

    unicodedata.lookup should also support such numeric names, e.g.:

    unicodedata.lookup('U+03BB') => 'λ'

    As '+' is otherwise prohibited in Unicode character names, there should never be ambiguity between 'U+xxxx' as a code point and an actual name, and a single lookup function can handle both.

    (See http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G39 for details on characters allowed in names.)
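
A minimal pure-Python sketch of the single lookup function described above. The name `lookup_extended` and the regular expression are hypothetical and are not part of `unicodedata`. Requiring an uppercase `U` while accepting hex digits in either case is an assumption, not something the proposal specifies.

```python
import re
import unicodedata

# Hypothetical pattern: "U+" followed by four, five or six hex digits.
_U_NOTATION = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

def lookup_extended(name):
    """Resolve either 'U+xxxx' code point notation or a regular character name."""
    match = _U_NOTATION.fullmatch(name)
    if match is None:
        # Not code point notation: defer to the existing name-based lookup,
        # which raises KeyError for unknown names.
        return unicodedata.lookup(name)
    code = int(match.group(1), 16)
    if code > 0x10FFFF:
        # Mirror unicodedata.lookup by signalling failure with KeyError.
        raise KeyError(f"code point out of range: {name!r}")
    return chr(code)

print(lookup_extended('U+03BB'))                    # λ
print(lookup_extended('GREEK SMALL LETTER LAMDA'))  # λ
```

Because no character name contains '+', the regular expression can never shadow a real name, so the two branches do not overlap.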

    Also add a function for the reverse mapping, from a character to its code point notation:

    unicodedata.codepoint('λ') => 'U+03BB'

    def codepoint(c):
        return 'U+{:04X}'.format(ord(c))
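
As a quick check of the helper above, the snippet below repeats the proposed one-liner as a standalone function (it is not an existing `unicodedata` API) and runs it on a BMP character and a character outside the BMP.

```python
import unicodedata

def codepoint(c):
    # Same one-liner as proposed; '04X' pads to at least four hex digits
    # and lets longer values through unchanged.
    return 'U+{:04X}'.format(ord(c))

print(codepoint('λ'))           # U+03BB
print(codepoint('\U0001F600'))  # U+1F600

# Round trip through the existing name-based lookup.
assert codepoint(unicodedata.lookup('GREEK SMALL LETTER LAMDA')) == 'U+03BB'
```

Since `ord` only accepts a single character, passing a longer string raises `TypeError`.
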

    39d85a87-36ea-41b2-b2bb-2be43abb500e commented 11 years ago

    I've attached a patch for this.

    terryjreedy commented 11 years ago

    I agree with the proposal.

    Some of the code seems redundant with code we already have. In Python, I would write

    def codepoint_from_U_notation(name, namelen):
        if not (4 <= namelen <= 6):
            # placeholder error; the exact exception was left unspecified
            raise ValueError('wrong length')
        return chr(int(name, 16))

    maybe with try-except to re-write error messages like

    ValueError: invalid literal for int() with base 16: '99x3'
    ValueError: chr() arg not in range(0x110000)

    My point is that we already have code to convert hex strings to int; I presume PyUnicode_FromOrdinal(code) is the C version of 'chr' that already checks the max value.
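
The try-except variant suggested above might look roughly like this in Python. It is a sketch only: the function name, the single `digits` parameter (the text after `U+`) and the error wording are all assumptions, and the attached C patch may be organised differently.

```python
import string

def codepoint_from_u_notation(digits):
    """Convert the hex digits that follow 'U+' into a one-character string."""
    if not 4 <= len(digits) <= 6:
        raise ValueError(f"expected 4 to 6 hex digits, got {len(digits)}")
    # int() alone also accepts forms such as '+3BB' or '0x3B',
    # so reject anything that is not a plain hex digit first.
    if not all(ch in string.hexdigits for ch in digits):
        raise ValueError(f"invalid hex digits in code point: {digits!r}")
    try:
        return chr(int(digits, 16))
    except ValueError:
        # chr() rejects values above 0x10FFFF; restate that in code point terms.
        raise ValueError(f"code point out of range: U+{digits}") from None

print(codepoint_from_u_notation('03BB'))  # λ
```

The explicit digit check is the one thing the existing conversion code does not provide for free, because `int(..., 16)` is more permissive than the `U+xxxx` notation.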