python / cpython

The Python programming language
https://www.python.org

Support Unicode line boundaries in regular expression #66681

Open serhiy-storchaka opened 9 years ago

serhiy-storchaka commented 9 years ago
BPO 22491
Nosy @pitrou, @ezio-melotti, @serhiy-storchaka, @ZackerySpytz, @LewisGaul

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['extension-modules', 'expert-regex', 'type-feature']
title = 'Support Unicode line boundaries in regular expression'
updated_at =
user = 'https://github.com/serhiy-storchaka'
```

bugs.python.org fields:
```python
activity =
actor = 'LewisGaul'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules', 'Regular Expressions']
creation =
creator = 'serhiy.storchaka'
dependencies = []
files = []
hgrepos = []
issue_num = 22491
keywords = []
message_count = 4.0
messages = ['227508', '227523', '348310', '355473']
nosy_count = 6.0
nosy_names = ['pitrou', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'ZackerySpytz', 'LewisGaul']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue22491'
versions = ['Python 3.5']
```

serhiy-storchaka commented 9 years ago

Currently regular expressions support only '\n' as a line boundary. To meet Unicode standard requirement RL1.6 [1], all Unicode line separators should be supported: '\n', '\r', '\v', '\f', '\x85', '\u2028', '\u2029' and the two-character '\r\n'. It is also recommended that '.' in "dotall" mode matches '\r\n' as a single unit, and strongly recommended to support the '\R' pattern, which matches all line separators (equivalent to '(?:\r\n|(?!\r\n)[\n\v\f\r\x85\u2028\u2029])').
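
Until '\R' is supported, the equivalence quoted from RL1.6 can be written out by hand with syntax the re module already accepts. The following is only a sketch of that workaround, not the proposed implementation; the names `UNICODE_LINE_BREAK` and `find_line_breaks` are made up for illustration.

```python
import re

# Hypothetical name: the RL1.6 expansion of \R, written with literal
# characters (note the non-raw string, so Python itself decodes the escapes).
UNICODE_LINE_BREAK = re.compile(
    "(?:\r\n|(?!\r\n)[\n\v\f\r\x85\u2028\u2029])"
)


def find_line_breaks(text):
    """Return every line separator in text, treating '\\r\\n' as one unit."""
    return [m.group() for m in UNICODE_LINE_BREAK.finditer(text)]


print(find_line_breaks("\r\n\n\r"))  # ['\r\n', '\n', '\r']
```

Trying '\r\n' first is what keeps the pair from being reported as two separate separators.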

>>> [m.start() for m in re.finditer('$', '\r\n\n\r', re.M)]
[1, 2, 4]  # should be [0, 2, 3, 4]
>>> [m.start() for m in re.finditer('^', '\r\n\n\r', re.M)]
[0, 2, 3]  # should be [0, 2, 3, 4]
>>> [m.group() for m in re.finditer('.', '\r\n\n\r', re.M|re.S)]
['\r', '\n', '\n', '\r']  # should be ['\r\n', '\n', '\r']
>>> [m.group() for m in re.finditer(r'\R', '\r\n\n\r')]
[]  # should be ['\r\n', '\n', '\r']
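
The "should be" positions for '$' and '^' above can be reproduced today with explicit lookarounds. This is a rough emulation under the assumption that '\r\n' counts as a single boundary; it says nothing about how Modules/_sre.c would implement it, and `LINE_END`, `LINE_START`, `line_ends` and `line_starts` are hypothetical names.

```python
import re

# Separators that always end a line on their own. '\n' and '\r' are handled
# separately so that '\r\n' is never split in the middle.
_SINGLE = "\v\f\x85\u2028\u2029"

# End of line: just before '\r\n', a '\n' not preceded by '\r', a '\r',
# or any other single separator; also at the very end of the string.
LINE_END = re.compile("(?=\r\n|(?<!\r)\n|[\r" + _SINGLE + "])|\\Z")

# Start of line: at the very start, after '\n' or another single separator,
# or after a '\r' that is not followed by '\n'.
LINE_START = re.compile("\\A|(?<=[\n" + _SINGLE + "])|(?<=\r)(?!\n)")


def line_ends(text):
    """Positions where a Unicode-aware '$' (with re.M) would match."""
    return [m.start() for m in LINE_END.finditer(text)]


def line_starts(text):
    """Positions where a Unicode-aware '^' (with re.M) would match."""
    return [m.start() for m in LINE_START.finditer(text)]


print(line_ends("\r\n\n\r"))    # [0, 2, 3, 4]
print(line_starts("\r\n\n\r"))  # [0, 2, 3, 4]
```

Neither pattern matches at position 1, between '\r' and '\n', which is the difference from the current re.M behaviour.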

[1] http://www.unicode.org/reports/tr18/#RL1.6

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 9 years ago

For reference, the regex module normally considers the line ending to be '\n', but it has a WORD flag ('(?w)') that turns on the Unicode definition of a 'word' character as well as the Unicode line separators.

2776f601-9573-4690-ab86-59139fdf3c89 commented 5 years ago

> To meet Unicode standard requirement RL1.6 [1] all Unicode line separators should be supported:

It seems that large portions of Modules/_sre.c would have to be rewritten in order to do this.

2e277fb8-21c6-426d-9603-d20fcf12b091 commented 4 years ago

Hi there, I'm running 'EnHackathon' in a couple of weeks, and was wondering if this could be a good issue for a small team of first-time contributors with experience in C to work on.

Would anyone be able to offer any guidance for where to start in Modules/_sre.c?