Enhanced \N{} escapes for Unicode strings

python / cpython


Enhanced \N{} escapes for Unicode strings #62814

Open stevendaprano opened 11 years ago

stevendaprano commented 11 years ago

BPO	18614
Nosy	@terryjreedy, @ezio-melotti, @stevendaprano
Files	issue18614.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'expert-unicode']
title = 'Enhanced \\N{} escapes for Unicode strings'
updated_at =
user = 'https://github.com/stevendaprano'
```

bugs.python.org fields:

```python
activity =
actor = 'terry.reedy'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation =
creator = 'steven.daprano'
dependencies = []
files = ['31112']
hgrepos = []
issue_num = 18614
keywords = ['patch']
message_count = 3.0
messages = ['194075', '194087', '194123']
nosy_count = 4.0
nosy_names = ['terry.reedy', 'ezio.melotti', 'mrabarnett', 'steven.daprano']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue18614'
versions = ['Python 3.4']
```

stevendaprano commented 11 years ago

As per the discussion here:

http://mail.python.org/pipermail/python-ideas/2013-July/022419.html

\N{} escapes should support the Unicode code point notation U+xxxx (where there are four, five or six hex digits after the U+).

E.g. '\N{U+03BB}' => 'λ'

unicodedata.lookup should also support such numeric names, e.g.:

unicodedata.lookup('U+03BB') => 'λ'

As '+' is otherwise prohibited in Unicode character names, there should never be ambiguity between 'U+xxxx' as a code point and an actual name, and a single lookup function can handle both.

(See http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G39 for details on characters allowed in names.)
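To illustrate how a single lookup function could handle both forms, here is a minimal pure-Python sketch. It is not the attached patch: the name `lookup_extended` and the regular expression are hypothetical, and accepting lowercase hex digits and raising `KeyError` for out-of-range values (to match what `unicodedata.lookup` raises for unknown names) are assumptions.

```python
import re
import unicodedata

# Hypothetical helper for illustration; not part of the unicodedata module.
# Matches 'U+' followed by four, five or six hex digits and nothing else.
_U_NOTATION = re.compile(r'U\+([0-9A-Fa-f]{4,6})')

def lookup_extended(name):
    """Return the character for a Unicode name or a 'U+xxxx' code point."""
    match = _U_NOTATION.fullmatch(name)
    if match is None:
        # Not U+ notation: defer to the existing name lookup,
        # which raises KeyError for unknown names.
        return unicodedata.lookup(name)
    code = int(match.group(1), 16)
    if code > 0x10FFFF:
        # Six hex digits can exceed the Unicode range.
        raise KeyError('code point out of range: {!r}'.format(name))
    return chr(code)

print(lookup_extended('U+03BB'))                    # λ
print(lookup_extended('GREEK SMALL LETTER LAMDA'))  # λ
```

Because no character name contains '+', the pattern check can run first without shadowing any real name.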

Also add a function for the reverse direction:

unicodedata.codepoint('λ') => 'U+03BB'

def codepoint(c):
    return 'U+{:04X}'.format(ord(c))
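A short usage sketch of the function above, showing that the `:04X` format pads to at least four digits but grows to five or six for code points outside the Basic Multilingual Plane. The round-trip check at the end is only an illustration and assumes the caller strips the `U+` prefix by hand.

```python
def codepoint(c):
    # Format the code point of a single character as U+XXXX,
    # zero-padded to at least four uppercase hex digits.
    return 'U+{:04X}'.format(ord(c))

print(codepoint('a'))           # U+0061
print(codepoint('λ'))           # U+03BB
print(codepoint('\U0001F600'))  # U+1F600

# Round trip: drop the 'U+' prefix and convert the hex digits back.
assert chr(int(codepoint('λ')[2:], 16)) == 'λ'
```

Note that `ord()` raises TypeError when given a string whose length is not 1, so this only works for single characters.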

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 11 years ago

I've attached a patch for this.

terryjreedy commented 11 years ago

I agree with the proposal.

Some of the code seems redundant with code we already have. In Python, I would write

def codepoint_from_U_notation(name, namelen):
    # Assumes name holds only the hex digits after 'U+'.
    if not (4 <= namelen <= 6):
        raise ValueError('expected 4 to 6 hex digits')
    return chr(int(name, 16))

maybe with try-except to rewrite error messages like

ValueError: invalid literal for int() with base 16: '99x3'
ValueError: chr() arg not in range(0x110000)

My point is that we already have code to convert hex strings to int; I presume PyUnicode_FromOrdinal(code) is the C version of 'chr' that already checks the max value.
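A minimal sketch of the try-except idea, assuming the input is just the hex digits after `U+`. The function name `char_from_hex_digits` and the choice of `KeyError` (mirroring `unicodedata.lookup`) are hypothetical. One caveat worth noting: `int(s, 16)` is more lenient than strict hex notation, since it also accepts a sign, surrounding whitespace, a `0x` prefix and underscores, so the sketch checks the characters explicitly before converting.

```python
import string

def char_from_hex_digits(digits):
    """Convert 4 to 6 hex digits (no 'U+' prefix) to a character."""
    if not (4 <= len(digits) <= 6):
        raise KeyError('expected 4 to 6 hex digits, got {!r}'.format(digits))
    # int(..., 16) would also accept '+3BB', '0x3BB' or '3_BB',
    # so reject anything that is not a plain hex digit first.
    if not all(ch in string.hexdigits for ch in digits):
        raise KeyError('invalid hex digits: {!r}'.format(digits))
    try:
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(digits, 16))
    except ValueError:
        # Replace the low-level message with one about the notation.
        raise KeyError('code point out of range: U+{}'.format(digits)) from None

print(char_from_hex_digits('03BB'))  # λ
```

With this shape, both '99x3' and '110000' produce an error that mentions the notation rather than `int()` or `chr()`.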