python / cpython

The Python programming language
https://www.python.org/

regex annoyance #35441

Closed ebd80758-eb43-4a34-86bb-a2b2a2196e69 closed 22 years ago

ebd80758-eb43-4a34-86bb-a2b2a2196e69 commented 22 years ago
BPO 476912
Nosy @tim-one

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['expert-regex']
title = 'regex annoyance'
updated_at =
user = 'https://bugs.python.org/bbum'
```

bugs.python.org fields:

```python
activity =
actor = 'effbot'
assignee = 'effbot'
closed = True
closed_date = None
closer = None
components = ['Regular Expressions']
creation =
creator = 'bbum'
dependencies = []
files = []
hgrepos = []
issue_num = 476912
keywords = []
message_count = 5.0
messages = ['7301', '7302', '7303', '7304', '7305']
nosy_count = 3.0
nosy_names = ['tim.peters', 'effbot', 'bbum']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue476912'
versions = []
```

ebd80758-eb43-4a34-86bb-a2b2a2196e69 commented 22 years ago

(this may be a feature request-- but it is annoying enough that I filed it as a bug)

Python's named subexpressions within regular expressions are an incredibly valuable feature;
together with the ability to automatically collapse multiline regexes with comments, they lead to very readable regexes.

However, there is an annoyance in named subexpressions that has bitten me several times.

Namely, suppose a particular token must be parsed out of the input by one of two (or more) alternative expressions, and the pattern cannot be written without repeating the same named subexpression in each alternative. In that situation the named subexpression will only be non-None intermittently (depending on expression order and what was matched).

That is, given:

(?:(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+))|(?:(?P<Tok1>[a-z]+)\t(?P<Tok2>[a-z]+))

In this case, Tok1 and Tok2 will be None if the first expression matches...

(Yes, this is a contrived example that could be refactored to not use multiple <Tok1>/<Tok2> references-- however, more complex expressions do not always enable easy refactoring.)
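
For reference, one way the contrived example could be refactored is to move the alternation inside the pattern so that each name is defined exactly once. This is only a sketch of that idea, using Python's `(?P<name>...)` syntax; the sample inputs are made up for illustration.

```python
import re

# Each name is defined once; only the separator alternates.
# (\s already matches a tab, so the alternation is redundant here;
# it is kept only to mirror the two branches of the example above.)
pattern = re.compile(r"(?P<Tok1>[a-z]+)(?:\s|\t)(?P<Tok2>[a-z]+)")

space_match = pattern.fullmatch("foo bar")
tab_match = pattern.fullmatch("foo\tbar")

# Both inputs populate the same two names.
print(space_match.groupdict())
print(tab_match.groupdict())
```

As the comment above says, this kind of refactoring is not always practical for more complex expressions.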

tim-one commented 22 years ago

Since symbolic names are names *of* integer group numbers, the regexp compiler should really raise an exception when seeing a given symbolic name defined more than once in a regexp.

ebd80758-eb43-4a34-86bb-a2b2a2196e69 commented 22 years ago

While I agree that the proposed solution of raising an exception would certainly be more acceptable behavior than what is occurring now, doing away with support for multiple subexpressions with the same name would be undesirable.

In particular, named subexpressions free the developer from counting groups. They also keep the developer out of situations where a few lines of if/else statements are needed to get the value when it might be in either expression A or expression B.

I would rather an error be raised only if two separate instances of named expression A both ended up with a value. As long as only one matches, it shouldn't matter that the name appears twice.

The goal is to be able to do this|that where this and that both define the same set of named subexpressions. By definition, only one of this or that will match and, therefore, only one value could be had for a named expression that appears in both this and that.

(As it stands, I have numerous lines of if/else 'this or that' code that generally causes clutter. It means that the groupdict() cannot be treated as a pure result-- I often have to go through the this/that logic to normalize the groupdict into something that actually represents the results I desired).
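
The normalization described above can be pulled into one helper rather than repeated as if/else blocks. The sketch below is an assumption-laden illustration, not part of the `re` module: the `__a`/`__b` suffix convention and the `merged_groupdict` helper are hypothetical names invented here. Each alternative uses a distinct group name, and the helper folds them back into one logical name.

```python
import re

# Hypothetical convention: "<name>__<branch>" marks the same logical
# token in different alternatives, so every group name stays unique.
PATTERN = re.compile(
    r"(?:(?P<Tok1__a>[a-z]+) (?P<Tok2__a>[a-z]+))"
    r"|(?:(?P<Tok1__b>[a-z]+)\t(?P<Tok2__b>[a-z]+))"
)


def merged_groupdict(match, sep="__"):
    """Collapse suffixed group names into one entry per logical name."""
    merged = {}
    for name, value in match.groupdict().items():
        base = name.split(sep, 1)[0]
        if value is None:
            # Remember the name, but never overwrite a real value.
            merged.setdefault(base, None)
        elif merged.get(base) is not None:
            # Two branches both supplied a value for the same name.
            raise ValueError(f"more than one value for group {base!r}")
        else:
            merged[base] = value
    return merged


space_result = merged_groupdict(PATTERN.fullmatch("foo bar"))
tab_result = merged_groupdict(PATTERN.fullmatch("foo\tbar"))
print(space_result)
print(tab_result)
```

Raising when two branches both carry a value mirrors the behaviour asked for above: a repeated name is fine as long as only one instance matches.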

tim-one commented 22 years ago

Bill, you misunderstand my comment: I'm not trying to solve your problem <wink>. Named groups were my idea to begin with (years ago), and what you want of them is both unclear and beyond their intended use.

I'm not suggesting to take *away* "support for multiple subexpressions with the same name": there is no such support, only the illusion of support due to the regexp compiler failing to raise an exception when a name is redefined (that's an old bug, btw: it's persisted across three generations of underlying regexp engine).

Group names are nothing but synonyms for numbered groups; they add no power, just convenience. If you want more than that, that's fine, but then you need to specify exactly what happens in all cases, and get that implemented. The semantics of named groups right now are defined in terms of a trivial bijection with numbered groups, and all you're seeing when you repeat a name is implementation accidents due to a failure to enforce that there *is* a bijection.
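
The name-to-number correspondence can be inspected directly in current Python. A minimal sketch, with a made-up pattern and input: `groupindex` on a compiled pattern maps each group name to its group number, and looking a group up by name or by number returns the same text.

```python
import re

pattern = re.compile(r"(?P<first>[a-z]+)-(?P<second>[0-9]+)")

# groupindex maps each symbolic name to its integer group number.
name_to_number = dict(pattern.groupindex)
print(name_to_number)

match = pattern.fullmatch("abc-123")

# A name is just another way to spell the group number.
for name, number in name_to_number.items():
    assert match.group(name) == match.group(number)
```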

b7a711ff-d634-47b2-ad1b-41e5ae806c8b commented 22 years ago

This will be fixed (as in "explicitly disallowed") in 2.2b2.
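
In current Python versions the "explicitly disallowed" behaviour can be observed as below. This is a sketch that assumes the pattern from the original report, written in `(?P<name>...)` syntax; the exact wording of the error message may vary between versions, so it is printed rather than compared.

```python
import re

# The pattern from the report: Tok1 and Tok2 are each defined twice.
DUPLICATE = (
    r"(?:(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+))"
    r"|(?:(?P<Tok1>[a-z]+)\t(?P<Tok2>[a-z]+))"
)

try:
    re.compile(DUPLICATE)
except re.error as exc:
    # Compilation is rejected before any matching takes place.
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```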

(but I guess it's time to start thinking about building a better framework on top of SRE. after all, the engine itself can do what Bill wants...)

</F>