jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

support concatenation of compiled patterns -- feature #15

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
With the new named lists feature, a compiled pattern may contain both the 
pattern string, and references to one or more lists.  This means that the 
standard way of composing patterns, concatenating the .pattern attribute of 
compiled patterns with other text, won't "just work".

Let me suggest an API modification to get around this.  Instead of the 
"pattern" parameter to regex.compile() being a string, allow it to be either a 
string, or a sequence of objects, each of which must be either a string or a 
compiled pattern.  The elements of the sequence are concatenated to generate 
the new pattern string, and list references in the compiled pattern elements of 
the sequence are transferred into the new compiled pattern.  Duplicate names 
for list references could be an error; or, they could be handled automatically 
by name-mangling the list names within the aggregate compiled pattern.
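
A minimal sketch of how the proposed sequence form might behave, assuming duplicate list names are treated as an error rather than name-mangled. `merge_parts` is a hypothetical helper, not part of `regex`; it reads list references through a `named_lists` attribute, which is an assumption here, and uses the standard-library `re` module as a stand-in so it runs without extra installs. Wrapping each compiled part in `(?:...)` is also a choice made for this sketch.

```python
import re

def merge_parts(parts):
    """Hypothetical helper: combine strings and compiled patterns into
    (pattern_string, flags, named_lists) for a later compile() call."""
    pieces, flags, named_lists = [], None, {}
    for part in parts:
        if isinstance(part, str):
            pieces.append(part)
            continue
        # Compiled pattern: every compiled part must share the same flags.
        if flags is not None and part.flags != flags:
            raise ValueError("compiled parts were built with different flags")
        flags = part.flags
        # Assumed attribute; stdlib `re` patterns have no named lists.
        for name, items in getattr(part, "named_lists", {}).items():
            if name in named_lists and named_lists[name] != items:
                raise ValueError(f"duplicate list name: {name!r}")
            named_lists[name] = items
        # Non-capturing group keeps the part's alternation contained.
        pieces.append(f"(?:{part.pattern})")
    return "".join(pieces), flags or 0, named_lists

which = re.compile("first|second")
pattern, flags, lists = merge_parts([which, r"\s+(\w+)"])
combined = re.compile(pattern, flags)
```

With `regex`, the returned dictionary would presumably be passed on as keyword arguments, e.g. `regex.compile(pattern, flags, **lists)`.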

Original issue reported on code.google.com by Bill.Jan...@gmail.com on 29 Jun 2011 at 4:24

GoogleCodeExporter commented 9 years ago
Concatenating the .pattern attribute of compiled patterns doesn't "just work", 
because the regexes may have been provided with different 'flags' arguments, 
even assuming that they don't have conflicting named groups.
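
A small illustration of the flags problem, using the standard-library `re` module as a stand-in (the same calls are available in `regex`). The variable names are only for this example.

```python
import re

# Compiled with a flags argument; the flag is not part of the pattern text.
p = re.compile("cat", re.IGNORECASE)

# Rebuilding from .pattern alone silently drops IGNORECASE.
q = re.compile(p.pattern + "s")
lost = q.match("CATS") is None

# Passing the original flags along keeps the behaviour.
r = re.compile(p.pattern + "s", p.flags)
kept = r.match("CATS") is not None
```

Here `lost` and `kept` are both true: the flag survives only when it is carried over explicitly.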

I'll need to consider the alternatives (concatenating compiled pattern objects 
with "+"?) and implications of this feature.

Original comment by re...@mrabarnett.plus.com on 30 Jun 2011 at 2:03

GoogleCodeExporter commented 9 years ago
I'm not sure about "concatenation with +".  I was really just suggesting that 
the first parameter to "compile" could be either a pattern string, or a 
sequence of pattern strings and/or compiled patterns.  As you say, flags would 
have to match, named groups would have to satisfy the standard constraint of 
being unique within selector branches, etc.

Original comment by Bill.Jan...@gmail.com on 7 Jul 2011 at 7:28

GoogleCodeExporter commented 9 years ago
Here's an illustration of the kind of thing I'm doing now.

Original comment by Bill.Jan...@gmail.com on 7 Jul 2011 at 7:33

Attachments:

GoogleCodeExporter commented 9 years ago
One thought I had was that, in some ways, it's like the named list feature, 
except that it's a subpattern, which suggests:

    which_regex = regex.compile("first|second")
    which_item_regex = regex.compile(r"\L<which>\s+(\w+)", which=which_regex)

Original comment by re...@mrabarnett.plus.com on 7 Jul 2011 at 8:00

GoogleCodeExporter commented 9 years ago
Ah, interesting idea.

Though I suspect a different flag character (instead of 'L') might be a good 
idea.  It's not really a list.  Is 'R' taken?

Hmmm...  One issue I see is that it introduces another level of naming (the 
name of the group) which kind of detracts from the "direct manipulation" aspect 
of being able to use the Python variable directly:

    which_regex = regex.compile("first|second")
    which_item_regex = regex.compile((which_regex, r"\s+(\w+)"))

On the other hand, it's more regex-syntax-friendly.

Original comment by Bill.Jan...@gmail.com on 8 Jul 2011 at 6:43

GoogleCodeExporter commented 9 years ago
One feature which is currently missing is the attribute "named_lists":

    >>> which = regex.compile(r"\L<options>", options="first second".split())
    >>> which.pattern
    '\\L<options>'
    >>> which.named_lists
    {'options': frozenset({'second', 'first'})}

You can then say:

    >>> which_item_regex = regex.compile(which.pattern + r"\s+(\w+)", **which.named_lists)
    >>> which_item_regex.pattern
    '\\L<options>\\s+(\\w+)'
    >>> which_item_regex.named_lists
    {'options': frozenset({'second', 'first'})}

That will be in the next release.

Original comment by re...@mrabarnett.plus.com on 8 Jul 2011 at 7:20

GoogleCodeExporter commented 9 years ago
Re \R, some regex implementations use that to match various line endings, 
something like r"\r\n|\n" (possibly Unicode newline as well), and I don't want 
to preclude that in the future.
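
For reference, a rough stand-in for what such a `\R` would match, written with the standard-library `re` module. `LINE_ENDING` is a name chosen for this sketch, and Unicode line separators are left out.

```python
import re

# "\r\n" is tried first so a Windows line ending is consumed as one unit.
LINE_ENDING = re.compile(r"\r\n|\n")

parts = LINE_ENDING.split("one\r\ntwo\nthree")
```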

Original comment by re...@mrabarnett.plus.com on 8 Jul 2011 at 8:13

GoogleCodeExporter commented 9 years ago
Inserting a pre-compiled regex into a regex pattern is not the same as 
inserting a regex pattern into a regex pattern.

For example, this:

    p = regex.compile("cat")
    q = regex.compile("(?i)" + p.pattern)

is the same as this:

    q = regex.compile("(?i)cat")

which will match "CAT", but this:

    p = regex.compile("cat")
    q = regex.compile(r"(?i)\I<rgx>", rgx=p)

won't match "CAT", because p has already been compiled as case-sensitive.

Similar remarks apply to DOTALL and the other flags, and also fuzzy matching.
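
The text-insertion half of this can be checked with the standard-library `re` module, used here as a stand-in for `regex`. The `\I<rgx>` form is only a proposal in this thread, so it is not shown.

```python
import re

# Inline flags in the outer pattern apply to the concatenated text.
p = re.compile("cat")
q = re.compile("(?i)" + p.pattern)      # same as "(?i)cat"

# The same holds for DOTALL.
d = re.compile("a.b")
e = re.compile("(?s)" + d.pattern)      # same as "(?s)a.b"

results = {
    "p matches CAT": p.match("CAT") is not None,
    "q matches CAT": q.match("CAT") is not None,
    "d matches a\\nb": d.match("a\nb") is not None,
    "e matches a\\nb": e.match("a\nb") is not None,
}
```

Only the `q` and `e` entries are true, since the originals were compiled without those flags.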

This:

    p = regex.compile("cat")
    q = regex.compile("(?:" + p.pattern + "){e<=1}")

is the same as this:

    q = regex.compile("(?:cat){e<=1}")

and is a fuzzy regex, but this:

    p = regex.compile("cat")
    q = regex.compile(r"\I<rgx>{e<=1}", rgx=p)

isn't. (It should probably raise an exception.)

So the question is: should inserting a pre-existing regex actually use the 
pre-compiled regex as-is as shown above, or should it use that regex's pattern 
with an implicit (?:...) around it?
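
A short sketch of why the (?:...) wrapper matters when the inserted pattern contains alternation, using the standard-library `re` module as a stand-in. The variable names are only for illustration.

```python
import re

p = re.compile("first|second")

# Without a group, "|" splits the whole combined pattern:
# "first" OR "second\s+(\w+)".
bare = re.compile(p.pattern + r"\s+(\w+)")

# With a non-capturing group, the alternation stays inside the inserted part.
grouped = re.compile("(?:" + p.pattern + ")" + r"\s+(\w+)")

bare_result = bare.fullmatch("first item")
grouped_result = grouped.fullmatch("first item")
```

`bare_result` is `None`, while `grouped_result` matches and captures "item" in group 1.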

If it uses the pre-compiled regex as-is, should that regex be atomic (no 
backtracking into it after it has matched)?

Should there be both forms of insertion, r"\I<rgx>" and r"\i<rgx>"? (That may 
be confusing!)

Original comment by re...@mrabarnett.plus.com on 11 Jul 2011 at 11:24

GoogleCodeExporter commented 9 years ago

Original comment by re...@mrabarnett.plus.com on 6 Aug 2011 at 3:29