Closed ebd80758-eb43-4a34-86bb-a2b2a2196e69 closed 22 years ago
(this may be a feature request-- but it is annoying enough that I filed it as a bug)
Python's named subexpressions within regular expressions are an incredibly valuable feature; combined with the ability to write multiline regexes with comments, they make for very readable regexes.
However, there is an annoyance in named subexpressions that has bitten me several times.
Namely, suppose a particular token must be parsed out of the input by one of two (or more) alternative expressions, and the pattern cannot be written without giving a named subexpression more than one possible place to match. The named subexpression will then be non-None only some of the time (depending on expression order and what was matched).
That is, given:
`(?:(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+))|(?:(?P<Tok1>[a-z]+)\t(?P<Tok2>[a-z]+))`
In this case, Tok1 and Tok2 will be None if the first expression matches...
(Yes, this is a contrived example that could be refactored to not use multiple \<Tok1>/\<Tok2> references-- however, more complex expressions do not always enable easy refactoring.)
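For the contrived pattern above, one possible refactoring can be sketched as follows. It is written with Python's `(?P<name>...)` syntax, and the `pattern` variable and the sample strings are made up for illustration. The alternation is pushed down to the separator, so each name is defined only once:

```python
import re

# The alternation now covers only the separator, so Tok1 and Tok2
# each appear exactly once in the pattern.
pattern = re.compile(r"(?P<Tok1>[a-z]+)(?:\s|\t)(?P<Tok2>[a-z]+)")

space_match = pattern.fullmatch("foo bar")
tab_match = pattern.fullmatch("foo\tbar")

print(space_match.groupdict())
print(tab_match.groupdict())
```

This only works because the two branches differ in a single spot; it does not address the more complex expressions the report is really about.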
Logged In: YES user_id=31435
Since symbolic names are names *of* integer group numbers, the regexp compiler should really raise an exception when seeing a given symbolic name defined more than once in a regexp.
Logged In: YES user_id=103811
While I agree that the proposed solution of raising an exception would certainly be more acceptable behavior than what is occurring now, doing away with support for multiple subexpressions with the same name would be undesirable.
In particular, named subexpressions free the developer from counting groups. They also keep the developer from having to write a few lines of if/else statements to get the value when it might be in either expression A or expression B.
I would rather an error be raised only if two separate instances of named expression A both matched. As long as only one matches, it shouldn't matter that the name appears twice.
The goal is to be able to do this|that where this and that both define the same set of named subexpressions. By definition, only one of this or that will match and, therefore, only one value could be had for a named expression that appears in both this and that.
(As it stands, I have numerous lines of if/else 'this or that' code that generally causes clutter. It means that the groupdict() cannot be treated as a pure result-- I often have to go through the this/that logic to normalize the groupdict into something that actually represents the results I desired).
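The normalization step described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `_a`/`_b` suffix convention, the `normalized_groupdict` helper, and the two example branches are hypothetical and are not taken from the reporter's code.

```python
import re

# Hypothetical convention: every branch repeats the same logical names
# with a branch-specific suffix (_a, _b), so each real name is unique.
pattern = re.compile(
    r"""
    (?: (?P<Tok1_a>[a-z]+) \s (?P<Tok2_a>[a-z]+) )   # branch A: word, whitespace, word
  | (?: (?P<Tok1_b>[a-z]+) ,  (?P<Tok2_b>[0-9]+) )   # branch B: word, comma, digits
    """,
    re.VERBOSE,
)


def normalized_groupdict(match):
    """Collapse Tok1_a / Tok1_b style keys into a single Tok1 key."""
    result = {}
    for name, value in match.groupdict().items():
        if value is None:
            continue  # this group belongs to the branch that did not match
        base = name.rsplit("_", 1)[0]  # strip the branch suffix
        result[base] = value
    return result


print(normalized_groupdict(pattern.fullmatch("foo bar")))
print(normalized_groupdict(pattern.fullmatch("foo,42")))
```

The helper replaces the per-pattern if/else logic with one generic pass, at the cost of imposing a naming convention on the pattern author.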
Logged In: YES user_id=31435
Bill, you misunderstand my comment: I'm not trying to solve your problem \<wink>. Named groups were my idea to begin with (years ago), and what you want of them is both unclear and beyond their intended use.
I'm not suggesting to take *away* "support for multiple subexpressions with the same name": there is no such support, only the illusion of support due to the regexp compiler failing to raise an exception when a name is redefined (that's an old bug, btw: it's persisted across three generations of underlying regexp engine).
Group names are nothing but synonyms for numbered groups; they add no power, just convenience. If you want more than that, that's fine, but then you need to specify exactly what happens in all cases, and get that implemented. The semantics of named groups right now are defined in terms of a trivial bijection with numbered groups, and all you're seeing when you repeat a name is implementation accidents due to a failure to enforce that there *is* a bijection.
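The name-to-number correspondence can be observed directly with the standard `re` module. This is a small sketch; the pattern and input string are made up for illustration:

```python
import re

pattern = re.compile(r"(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+)")
match = pattern.match("foo bar")

# groupindex maps each symbolic name to the number of the group it labels.
name_to_number = dict(pattern.groupindex)
print(name_to_number)

# Looking a group up by name or by number yields the same substring.
for name, number in name_to_number.items():
    assert match.group(name) == match.group(number)
```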
Logged In: YES user_id=38376
This will be fixed (as in "explicitly disallowed") in 2.2b2.
(but I guess it's time to start thinking about building a better framework on top of SRE. after all, the engine itself can do what Bill wants...)
\</F>
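The "explicitly disallowed" behavior can be checked with a short sketch like the one below. It assumes a current CPython interpreter, where compiling a pattern that defines the same group name twice raises `re.error`; the exact message wording varies between versions, so the code only checks that an error is raised.

```python
import re

try:
    # Tok1 is defined in both branches of the alternation.
    re.compile(r"(?P<Tok1>[a-z]+)\s|(?P<Tok1>[a-z]+)\t")
except re.error as exc:
    duplicate_rejected = True
    print("rejected:", exc)
else:
    duplicate_rejected = False
    print("pattern compiled without error")
```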
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['expert-regex']
title = 'regex annoyance'
updated_at =
user = 'https://bugs.python.org/bbum'
```
bugs.python.org fields:
```python
activity =
actor = 'effbot'
assignee = 'effbot'
closed = True
closed_date = None
closer = None
components = ['Regular Expressions']
creation =
creator = 'bbum'
dependencies = []
files = []
hgrepos = []
issue_num = 476912
keywords = []
message_count = 5.0
messages = ['7301', '7302', '7303', '7304', '7305']
nosy_count = 3.0
nosy_names = ['tim.peters', 'effbot', 'bbum']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue476912'
versions = []
```