python / cpython

The Python programming language
https://www.python.org

Request for grapheme support in Python re lib #56942

Open 5c59cbd7-8186-4351-8391-b403f3a3a73f opened 13 years ago

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago
BPO 12733
Nosy @gvanrossum, @mcepl, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['expert-regex', 'type-feature']
title = 'Request for grapheme support in Python re lib'
updated_at =
user = 'https://bugs.python.org/tchrist'
```

bugs.python.org fields:
```python
activity =
actor = 'mcepl'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation =
creator = 'tchrist'
dependencies = []
files = []
hgrepos = []
issue_num = 12733
keywords = []
message_count = 3.0
messages = ['141924', '142114', '143041']
nosy_count = 7.0
nosy_names = ['gvanrossum', 'mcepl', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'tchrist', 'Socob']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue12733'
versions = ['Python 3.4']
```

5c59cbd7-8186-4351-8391-b403f3a3a73f commented 13 years ago

Without proper grapheme support in the regular expression library, it is impossible to process Unicode correctly. At the very least, one needs the \X escape supported, which matches an extended grapheme cluster per UTS#18. This escape is supported by many regex libraries, including Perl's own and of course PCRE (and thence PHP), the standard ICU library, and Matthew Barnett's replacement regex library for Python.

How do you process a string by graphemes if you cannot split on \X? How can you avoid splitting a grapheme into silly pieces if you cannot match one? How do I match the letter O no matter what diacritics have been applied to it otherwise? A match of (?=O)\X against an NFD string is by far the simplest and best way.

This is necessary for a wide variety of reasons. Adding \pM and \PM goes a little way, but not far enough, because that is not how grapheme clusters are defined. You need a proper \X.
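
To illustrate the gap, here is a minimal sketch of the \PM\pM* approximation using only the standard library's unicodedata module. The helper names `approx_clusters` and `has_base` are hypothetical, made up for this example. It groups each non-mark character with the combining marks that follow it, which is enough to find the letter O under any diacritics in an NFD string, but it is not the extended grapheme cluster definition that \X implements.

```python
import unicodedata


def approx_clusters(text):
    """Approximate \\PM\\pM*: a character plus any marks that follow it.

    Hypothetical helper. This is NOT the extended grapheme cluster
    algorithm; it only groups by the Mark (Mn, Mc, Me) general categories.
    """
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        # General categories starting with "M" correspond to \pM.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def has_base(cluster, base):
    """Rough analogue of (?=O)\\X: does the cluster start with `base`?"""
    return cluster[:1] == base


# U+00D5 (O with tilde), then o + circumflex + dot below, then x.
word = "\u00d5o\u0302\u0323x"
clusters = approx_clusters(word)
o_clusters = [c for c in clusters if has_base(c, "O")]

# Where the approximation falls short: CR LF is a single extended
# grapheme cluster, but neither character is a mark, so it is split.
crlf = approx_clusters("\r\n")
```

Here `clusters` has three entries and `o_clusters` holds the decomposed O with tilde, while `crlf` comes back as two separate pieces. Hangul jamo sequences and other non-mark cases are split in the same way, which is why the mark properties alone do not go far enough.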

ezio-melotti commented 13 years ago

As I said on bpo-12734 and bpo-12731, if the 'regex' module addresses this issue, we should just wait until we include it in the stdlib.

gvanrossum commented 13 years ago

Again, I would be disappointed if the re (_sre) module could not be fixed. It is a reasonable feature request.