Adding a new regex module (compatible with re)

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

BPO	2636
Nosy	@loewis, @birkenfeld, @gpshead, @amauryfa, @ncoghlan, @abalkin, @pitrou, @vstinner, @giampaolo, @mark-summerfield, @ssbr, @ezio-melotti, @akitada, @stevendaprano, @alex, @bitdancer, @sandrotosi, @sorcio, @ericsnowcurrently, @Fak3, @serhiy-storchaka, @jayvdb
Files	regex_test-20100316: Python 2.6.5 re test run against regex-20100305 issue2636-20101230.zip remove_guards.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['expert-regex', 'type-feature', 'library'] title = 'Adding a new regex module (compatible with re)' updated_at = user = 'https://bugs.python.org/timehorse' ``` bugs.python.org fields: ```python activity = actor = 'vstinner' assignee = 'none' closed = True closed_date = closer = 'vstinner' components = ['Library (Lib)', 'Regular Expressions'] creation = creator = 'timehorse' dependencies = [] files = ['16563', '20192', '20203'] hgrepos = ['108'] issue_num = 2636 keywords = ['patch'] message_count = 334.0 messages = ['65513', '65593', '65613', '65614', '65617', '65725', '65726', '65727', '65734', '65838', '65841', '66033', '67309', '67447', '67448', '68336', '68339', '68358', '68399', '68409', '73185', '73295', '73714', '73717', '73721', '73730', '73752', '73766', '73779', '73780', '73782', '73791', '73794', '73798', '73801', '73803', '73805', '73827', '73848', '73853', '73854', '73855', '73861', '73875', '73955', '74025', '74026', '74058', '74104', '74174', '74203', '74204', '74904', '80916', '81112', '81236', '81238', '81239', '81240', '81359', '81473', '81475', '82673', '82739', '82950', '83271', '83277', '83390', '83411', '83427', '83428', '83429', '83988', '83989', '83993', '84350', '86004', '86032', '89632', '89634', '89643', '90954', '90961', '90985', '90986', '90989', '91028', '91035', '91038', '91245', '91250', '91437', '91439', '91448', '91450', '91460', '91462', '91463', '91473', '91474', '91490', '91495', '91496', '91497', '91500', '91535', '91598', '91607', '91610', '91671', '91917', '97860', '98809', '99072', '99132', '99148', '99186', '99190', '99462', '99470', '99479', '99481', '99494', '99548', '99552', '99665', '99668', '99835', '99863', '99872', '99888', '99890', '99892', '100066', '100076', '100080', '100134', '100152', '100359', '100362', '100370', '100452', '101172', '101181', '101193', '101557', '102042', '103003', '103060', '103064', '103078', '103095', '103096', '103097', '109358', '109363', '109372', '109384', '109401', '109403', '109404', '109405', '109406', '109407', '109408', '109409', '109410', '109413', '109447', '109460', '109461', '109463', '109474', '109657', '110233', '110237', '110701', '110704', '110761', '111519', '111531', '111643', '111652', '111656', '111660', '111921', '113927', '113931', '114034', '114766', '116133', '116151', '116223', '116227', '116229', '116231', '116238', '116248', '116252', '116276', '116749', '117008', '117046', '117050', '118243', '118631', '118636', '118640', '118674', '118682', '119887', '119930', '119947', '119951', '119956', '119958', '120013', '120037', '120038', '120164', '120202', '120203', '120204', '120206', '120215', '120216', '120243', '120571', '120969', '120976', '120984', '120986', '121136', '121145', '121149', '121589', '121832', '122221', '122225', '122228', '122880', '123518', '123527', '123747', '124581', '124582', '124585', '124614', '124626', '124627', '124746', '124750', '124759', '124816', '124821', '124833', '124834', '124891', '124900', '124904', '124905', '124906', '124909', '124912', '124929', '124931', '124936', '124950', '124959', '124971', '124988', '124990', '125291', '126294', '126372', '127045', '130886', '130905', '130906', '130999', '135700', '135703', '135704', '140102', '140152', '140154', '143090', '143333', '143334', '143337', '143340', '143343', '143350', '143352', '143355', '143366', '143367', '143374', '143377', '143389', '143423', '143442', '143445', '143447', '143448', '143467', '143471', '143619', '144110', '152210', '152211', '152212', '152214', '152215', '152217', '152218', '152246', '157445', '174888', '221629', '221657', '230846', '230862', '230866', '230867', '230873', '230877', '230885', '230887', '231137', '231141', '231159', '231165', '385674'] nosy_count = 43.0 nosy_names = ['loewis', 'georg.brandl', 'gregory.p.smith', 'jimjjewett', 'sjmachin', 'amaury.forgeotdarc', 'ncoghlan', 'belopolsky', 'pitrou', 'vstinner', 'nneonneo', 'giampaolo.rodola', 'rsc', 'timehorse', 'mark', 'vbr', 'Devin Jeanpierre', 'ezio.melotti', 'mrabarnett', 'jaylogan', 'akitada', 'moreati', 'steven.daprano', 'alex', 'r.david.murray', 'jacques', 'zdwiel', 'THRlWiTi', 'sandro.tosi', 'abacabadabacaba', 'stiv', 'davide.rizzo', 'mattchaput', 'ronnix', 'tshepang', 'eric.snow', 'akoumjian', 'Roman.Evstifeev', 'serhiy.storchaka', 'Mateon1', 'jayvdb', 'Socob', 'petros'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue2636' versions = [] ```

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

I am working on adding features to the current Regexp implementation, which is now set to 2.2.2. These features are to bring the Regexp code closer in line with Perl 5.10 as well as add a few python-specific niceties and potential speed-ups and clean-ups.

I will be posting regular patch updates to this thread when major milestones have been reach with a description of the feature(s) added. Currently, the list of proposed changes are (in no particular order):

1) Fix \<a href="http://bugs.python.org/issue433030"\>bpo-433030\</a> by adding support for Atomic Grouping and Possessive Qualifiers

2) Make named matches direct attributes of the match object; i.e. instead of m.group('foo'), one will be able to write simply m.foo.

3) (maybe) make Match objects subscriptable, such that m[n] is equivalent to m.group(n) and allow slicing.

4) Implement Perl-style back-references including relative back-references.

5) Add a well-formed, python-specific comment modifier, e.g. (?P#...); the difference between (?P#...) and Perl/Python's (?#...) is that the former will allow nested parentheses as well as parenthetical escaping, so that patterns of the form '(?P# Evaluate (the following) expression, 3\) using some other technique)'. The (?P#...) will interpret this entire expression as a comment, where as with (?#...) only, everything following ' expression...' would be considered part of the match. (?P#...) will necessarily be slower than (?#...) and so only should be used if richer commenting style is required but the verbose mode is not desired.

6) Add official support for fast, non-repeating capture groups with the Template option. Template is unofficially supported and disables all repeat operators (*, + and ?). This would mainly consist of documenting its behavior.

7) Modify the re compiled expression cache to better handle the thrashing condition. Currently, when regular expressions are compiled, the result is cached so that if the same expression is compiled again, it is retrieved from the cache and no extra work has to be done. This cache supports up to 100 entries. Once the 100th entry is reached, the cache is cleared and a new compile must occur. The danger, all be it rare, is that one may compile the 100th expression only to find that one recompiles it and has to do the same work all over again when it may have been done 3 expressions ago. By modifying this logic slightly, it is possible to establish an arbitrary counter that gives a time stamp to each compiled entry and instead of clearing the entire cache when it reaches capacity, only eliminate the oldest half of the cache, keeping the half that is more recent. This should limit the possibility of thrashing to cases where a very large number of Regular Expressions are continually recompiled. In addition to this, I will update the limit to 256 entries, meaning that the 128 most recent are kept.

8) Emacs/Perl style character classes, e.g. [:alphanum:]. For instance, :alphanum: would not include the '_' in the character class.

9) C-Engine speed-ups. I commenting and cleaning up the _sre.c Regexp engine to make it flow more linearly, rather than with all the current gotos and replace the switch-case statements with lookup tables, which in tests have shown to be faster. This will also include adding many more comments to the C code in order to make it easier for future developers to follow. These changes are subject to testing and some modifications may not be included in the final release if they are shown to be slower than the existing code. Also, a number of Macros are being eliminated where appropriate.

10) Export any (not already) shared value between the Python Code and the C code, e.g. the default Maximum Repeat count (65536); this will allow those constants to be changed in 1 central place.

11) Various other Perl 5.10 conformance modifications, TBD.

More items may come and suggestions are welcome.

-----

Currently, I have code which implements 5) and 7), have done some work on 10) and am almost 9). When 9) is complete, I will work on 1), some of which, such as parsing, is already done, then probably 8) and 4) because they should not require too much work -- 4) is parser-only AFAICT. Then, I will attempt 2) and 3), though those will require changes at the C-Code level. Then I will investigate what additional elements of 11) I can easily implement. Finally, I will write documentation for all of these features, including 6).

In a few days, I will provide a patch with my interim results and will update the patches with regular updates when Milestones are reached.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

I am very sorry to report (at least for me) that as of this moment, item 9), although not yet complete, is stable and able to pass all the existing python regexp tests. Because these tests are timed, I am using the timings from the first suite of tests to perform a benchmark of performance between old and new code. Based on discussion with Andrew Kuchling, I have decided for the sake of simplicity, the "timing" of each version is to be calculated by the absolute minimum time to execute observed because it is believed this execution would have had the most continuous CPU cycles and thus most closely represents the true execution time.

It is this current conclusion that greatly saddens me, not that the effort has not been valuable in understanding the current engine.
Indeed, I understand the current engine now well enough that I could proceed with the other modifications as-is rather than implementing them with the new engine. Mind you, I will likely not bring over the copious
comments that the new engine received when I translated it to a form without C_Macros and gotos, as that would require too much effort IMHO.

Anyway, all that being said, and keeping in mind that I am not 100% satisfied with the new engine and may still be able to wring some timing out of it -- not that I will spend much more time on this -- here is where we currently stand:

Old Engine: 6.574s New Engine: 7.239s

This makes the old Engine 665ms faster over the entire first test_re.py suite, or 9% faster than the New Engine.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Here are the modification so far for item 9) in _sre.c plus some small modifications to sre_constants.h which are only to get _sre.c to compile; normally sre_constants.h is generated by sre_constants.py, so this is not the final version of that file. I also would have intended to make SRE_CHARSET and SRE_COUNT use lookup tables, as well as maybe others, but not likely any other lookup tables. I also want to remove alloc_pos out of the self object and make it a parameter to the ALLOC parameter and probably get rid of the op_code attribute since it is only used in 1 place to save one subtract in a very rare case. But I want to resolve the 10% problem first, so would appreciate it if people could look at the REMOVE_SRE_MATCH_MACROS section of code and compare it to the non-REMOVE_SRE_MATCH_MACROS version of SRE_MATCH and see if you can suggest anything to make the former (new code) faster to get me that elusive 10%.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Here is a patch to implement item 7)

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

This simple patch adds (?P#...)-style comment support.

74c4563b-ab1c-43d8-9219-30c4eca796bc commented 16 years ago

These features are to bring the Regexp code closer in line with Perl 5.10

Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?

as well as add a few python-specific

because this also adds to the scope.

2) Make named matches direct attributes of the match object; i.e. instead of m.group('foo'), one will be able to write simply m.foo.

3) (maybe) make Match objects subscriptable, such that m[n] is equivalent to m.group(n) and allow slicing.

(2) and (3) would both be nice, but I'm not sure it makes sense to do *both* instead of picking one.

5) Add a well-formed, python-specific comment modifier, e.g. (?P#...);

[handles parens in comments without turning on verbose, but is slower]

Why? It adds another incompatibility, so it has to be very useful or clear. What exactly is the advantage over just turning on verbose?

9) C-Engine speed-ups. ... a number of Macros are being eliminated where appropriate.

Be careful on those, particular on str/unicode and different compile options.

amauryfa commented 16 years ago

> These features are to bring the Regexp code closer in line > with Perl 5.10

Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?

5.10.0 comes after 5.8 and is the latest version (2007/12/18)! Yes it is confusing.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Thanks Jim for your thoughts!

Armaury has already explained about Perl 5.10.0. I suppose it's like Macintosh version numbering, since Mac Tiger went from version 10.4.9 to 10.4.10 and 10.4.11 a few years ago. Maybe we should call Python 2.6 Python 2.06 just in case. But 2.6 is the known last in the 2 series so it's not a problem for us! :)

> as well as add a few python-specific

because this also adds to the scope.

At this point the only python-specific changes I am proposing would be items 2, 3 (discussed below), 5 (discussed below), 6 and 7. 6 is only a documentation change, the code is already implemented. 7 is just a better behavior. I think it is RARE one compiles more than 100 unique regular expressions, but you never know as projects tend to grow over time, and in the old code the 101st would be recompiled even if it was just compiled 2 minutes ago. The patch is available so I leave it to the community to judge for themselves whether it is worth it, but as you can see, it's not a very large change.

> 2) Make named matches direct attributes > of the match object; i.e. instead of m.group('foo'), > one will be able to write simply m.foo.

> 3) (maybe) make Match objects subscriptable, such > that m[n] is equivalent to m.group(n) and allow slicing.

(2) and (3) would both be nice, but I'm not sure it makes sense to do *both* instead of picking one.

Well, I think named matches are better than numbered ones, so I'd definitely go with 2. The problem with 2, though, is that it still leaves the rather typographically intense m.group(n), since I cannot write m.3. However, since capture groups are always numbered sequentially, it models a list very nicely. So I think for indexing by group number, the subscripting operator makes sense. I was not originally suggesting m['foo'] be supported, but I can see how that may come out of 3. But there is a restriction on python named matches that they have to be valid python and that strikes me as 2 more than 3 because 3 would not require such a restriction but 2 would. So at least I want 2, but it seems IMHO m[1] is better than m.group(1) and not in the least hard or a confusing way of retrieving the given group. Mind you, the Match object is a C-struct with python binding and I'm not exactly sure how to add either feature to it, but I'm sure the C-API manual will help with that.

> 5) Add a well-formed, python-specific comment modifier, > e.g. (?P#...);

[handles parens in comments without turning on verbose, but is slower]

Why? It adds another incompatibility, so it has to be very useful or clear. What exactly is the advantage over just turning on verbose?

Well, Larry Wall and Guido agreed long ago that we, the python community, own all expressions of the form (?P...) and although I'd be my preference to make (?#...) more in conformance with understanding parenthesis nesting, changing the logic behind THAT would make python non-standard. So as far as any conflicting design, we needn't worry.

As for speed, the this all occurs in the parser and does not effect the compiler or engine. It occurs only after a (?P has been read and then only as the last check before failure, so it should not be much slower except when the expression is invalid. The actual execution time to find the closing brace of (?P#...) is a bit slower than that for (?#...) but not by much.

Verbose is generally a good idea for anything more than a trivial Regular Expression. However, it can have overhead if not included as the first flag: an expression is always checked for verbose post-compilation and if it is encountered, the expression is compiled a second time, which is somewhat wasteful. But the reason I like the (?P#...) over (?#...) is because I think people would more tend to assume:

r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

That expression only matches "He ls)llo", so I created the (?P#...) to make the comment match type more intuitive:

r'He(?P# 2 (TWO) ls)llo' matches "Hello".

> 9) C-Engine speed-ups. ... > a number of Macros are being eliminated where appropriate.

Be careful on those, particular on str/unicode and different compile options.

Will do; thanks for the advice! I have only observed the UNICODE flag controlling whether certain code is used (besides the ones I've added) and have tried to stay true to that when I encounter it. Mind you, unless I can get my extra 10% it's unlikely I'd actually go with item 9 here, even if it is easier to read IMHO. However, I want to run the new engine proposal through gprof to see if I can track down some bottlenecks.

At some point, I hope to get my current changes on Launchpad if I can get that working. If I do, I'll give a link to how people can check out my working code here as well.

74c4563b-ab1c-43d8-9219-30c4eca796bc commented 16 years ago

Python 2.6 isn't the last, but Guido has said that there won't be a 2.10.

Match object is a C-struct with python binding and I'm not exactly sure how to add either feature to it

I may be misunderstanding -- isn't this just a matter of writing the function and setting it in the tp_as_sequence and tp_as_mapping slots?

Larry Wall and Guido agreed long ago that we, the python community, own all expressions of the form (?P...)

Cool -- that reference should probably be added to the docs. For someone trying to learn or translate regular expressions, it helps to know that (?P ...) is explicitly a python extension (even if Perl adopts it later).

Definately put the example in the doc.

r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.  Maybe

even without the change, as doco on the current situation.

Does VERBOSE really have to be the first flag, or does it just have to be on the whole pattern instead of an internal switch?

I'm not sure I fully understand what you said about template. Is this a special undocumented switch, or just an internal optimization mode that should be triggered whenever the repeat operators don't happen to occur?

pitrou commented 16 years ago

I don't know anything about regexp implementation, but if you replace a switch-case with a function lookup table, it isn't surprising that the new version ends up slower. A local jump is always faster than a function call, because of the setup overhead and stack manipulation the latter involves.

So you might try to do the cleanup while keeping the switch-case structure, if possible.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Thank you and Merci Antoine!

That is a good point. It is clearly specific to the compiler whether a switch-case will be turned into a series of conditional branches or simply creating an internal jump table with lookup. And it is true that most compilers, if I understand correctly, use the jump-table approach for any switch-case over 2 or 3 entries when the cases are tightly grouped and near 0. That is probably why the original code worked so fast. I'll see if I can combine the best of both approaches. Thanks again!

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

I am making my changes in a Bazaar branch hosted on Launchpad. It took me quite a while to get things set up more-or-less logically but there they are and I'm currently trying to re-apply my local changes up to today into the various branches I have. Each of the 11 issues I outlined originally has its own branch, with a root branch from which all these branches are derived to serve as a place for a) merging in python 2.6 alpha concurrent development (merges) and to apply any additional re changes that don't fall into any of the other categories, of which I have so far found only 2 small ones.

Anyway, if anyone is interested in monitoring my progress, it is available at:

https://code.launchpad.net/~timehorse/

I will still post major milestones here, but one can monitory day-to-day progress on Launchpad. Also on launchpad you will find more detail on the plans for each of the 11 modifications, for the curious.

Thanks again for all the advice!

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

I am finally making progress again, after a month of changing my patches from my local svn repository to bazaar hosted on launchpad.net, as stated in my last update. I also have more or less finished the probably easiest item, #5, so I have a full patch for that available now. First, though, I want to update my "No matter what" patch, which is to say these are the changes I want to make if any changes are made to the Regexp code.

a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 16 years ago

AFAIK if you have a regex with named capture groups there is no direct way to relate them to the capture group numbers. You could do (untested; Python 3 syntax):

    d = {v: k for k, v in match.groupdict()}
    for i in range(match.lastindex):
         print(i, match.group(i), d[match.group(i)])

One possible solution would be a grouptuples() function that returned a tuple of 3-tuples (index, name, captured_text) with the name being None for unnamed groups.

Anyway, good luck with all your improvements, I will be especially glad if you manage to do (2) and (8) (and maybe (3)).

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Mark scribbled:

One possible solution would be a grouptuples() function that returned a tuple of 3-tuples (index, name, captured_text) with the name being None for unnamed groups.

Hmm. Well, that's not a bad idea at all IMHO and would, AFAICT probably be easier to do than (2) but I would still do (2) but will try to add that to one of the existing items or spawn another item for it since it is kind of a distinct feature.

My preference right now is to finish off the test cases for (7) because it is already coded, then finish the work on (1) as that was the original reason for modification then on to (2) then (3) as they are related and then I don't mind tackling (8) because I think that one shouldn't be too hard. Interestingly, the existing engine code (sre_parse.py) has a place-holder, commented out, for character classes but it was never properly implemented. And I will warn that with Unicode, I THINK all the character classes exist as unicode functions or can be implemented as multiple unicode functions, but I'm not 100% sure so if I run into that problem, some character classes may initially be left out while I work on another item.

Anyway, thanks for the input, Mark!

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Well, it's time for another update on my progress...

Some good news first: Atomic Grouping is now completed, tested and documented, and as stated above, is classified as bpo-2636-01 and related patches. Secondly, with caveats listed below, Named Match Group Attributes on a match object (item 2) is also more or less complete at bpo-2636-02 -- it only lacks documentation.

Now, I want to also update my list of items. We left off at 11: Other Perl-specific modifications. Since that time, I have spawned a number of other branches, the first of which (bpo-2636-12) I am happy to announce is also complete!

12) Implement the changes to the documentation of re as per Jim J. Jewett suggestion from 2008-04-24 14:09. Again, this has been done.

13) Implement a grouptuples(...) method as per Mark Summerfield's suggest on 2008-05-28 09:38. grouptuples would take the same filtering parameters as the other group* functions, and would return a list of 3- tuples (unless only 1 group was requested). It should default to all match groups (1..n, not group 0, the matching string).

14) As per PEP-3131 and the move to Python 3.0, python will begin to allow full UNICODE-compliant identifier names. Correspondingly, it would be the responsibility of this item to allow UNICODE names for match groups. This would allow retrieval of UNICODE names via the group* functions or when combined with Item 3, the getitem handler (m[u'...']) (03+14) and the attribute name itself (e.g. getattr(m, u'...')) when combined with item 2 (02+14).

15) Change the Pattern_Type, Match_Type and ScannerType (experimental) to become richer Python Types. Specifically, add \_doc__ strings to each of these types' methods and members.

16) Implement various FIXMEs.

16-1) Implement the FIXME such that if m is a MatchObject, del m.string will disassociate the original matched string from the match object; string would be the only member that would allow modification or deletion and you will not be able to modify the m.string value, only delete it.

-----

Finally, I want to say a couple notes about Item 2:

Firstly, as noted in Item 14, I wish to add support for UNICODE match group names, and the current version of the C-code would not allow that; it would only make sense to add UNICODE support if 14 is implemented, so adding support for UNICODE match object attributes would depend on both items 2 and 14. Thus, that would be implemented in bpo-2636-02+14.

Secondly, there is a FIXME which I discussed in Item 16; I gave that problem it's own item and branch. Also, as stated in Item 15, I would like to add more robust help code to the Match object and bind __doc__ strings to the fixed attributes. Although this would not directly effect the Item 2 implementation, it would probably involve moving some code around in its vicinity.

Finally, I would like suggestions on how to handle name collisions when match group names are provided as attributes. For instance, an expression like '(?P\<pos>.*)' would match more or less any string and assign it to the name "pos". But "pos" is already an attribute of the Match object, and therefore pos cannot be exposed as a named match group
attribute, since match.pos will return the usual meaning of pos for a match object, not the value of the capture group names "pos".

I have 3 proposals as to how to handle this:

a) Simply disallow the exposure of match group name attributes if the names collide with an existing member of the basic Match Object interface.

b) Expose the reserved names through a special prefix notation, and for forward compatibility, expose all names via this prefix notation. In other words, if the prefix was 'k', match.kpos could be used to access pos; if it was '_', match._pos would be used. If Item 3 is implemented, it may be sufficient to allow access via match['pos'] as the canonical way of handling match group names using reserved words.

c) Don't expose the names directly; only expose them through a prefixed name, e.g. match._pos or match.kpos.

Personally, I like a because if Item 3 is implemented, it makes a fairly useful shorthand for retrieving keyword names when a keyword is used for a name. Also, we could put a deprecation warning in to inform users that eventually match groups names that are keywords in the Match Object will eventually be disallowed. However, I don't support restricting the match group names any more than they already are (they must be a valid python identifier only) so again I would go with a) and nothing more and that's what's implemented in bpo-2636-02.patch.

-----

Now, rather than posting umteen patch files I am posting one bz2- compressed tar of ALL patch files for all threads, where each file is of the form:

bpo-2636(-\d\d|+\d\d)*(-only)?.patch

For instance,

bpo-2636-01.patch is the p1 patch that is a difference between the current Python trunk and all that would need to be implemented to support Atomic Grouping / Possessive Qualifiers. Combined branches are combined with a PLUS ('+') and sub-branches concatenated with a DASH ('- '). Thus, "bpo-2636-01+09-01-01+10.patch" is a patch which combines the work from Item 1: Atomic Grouping / Possessive Qualifiers, the sub- sub branch of Item 9: Engine Cleanups and Item 10: Shared Constants.
Item 9 has both a child and a grandchild. The Child (09-01) is my proposed engine redesign with the single loop; the grandchild (09-01-01) is the redesign with the triple loop. Finally the optional "-only" flag means that the diff is against the core SRE modifications branch and thus does not include the core branch changes.

As noted above, Items 01, 02, 05, 07 and 12 should be considered more or less complete and ready for merging assuming I don't identify in my implementation of the other items that I neglected something in these.
The rest, including the combined items, are all provided in the given tarball.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Sorry, as I stated in the last post, I generated the patches then realized that I was missing the documentation for Item 2, so I have updated the bpo-2636-02.patch file and am attaching that separately until the next release of the patch tarball. bpo-2636-02-only.patch should be ignored and I will only regenerate it with the correct documentation in the next tarball release so I can move on to either Character Classes or Relative Back-references. I wanna pause Item 3 for the moment because 2, 3, 13, 14, 15 and 16 all seem closely related and I need a break to allow my mind to wrap around the big picture before I try and tackle each one.

a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 16 years ago

[snip]

13) Implement a grouptuples(...) method as per Mark Summerfield's suggest on 2008-05-28 09:38. grouptuples would take the same filtering parameters as the other group* functions, and would return a list of 3- tuples (unless only 1 group was requested). It should default to all match groups (1..n, not group 0, the matching string).

:-)

[snip]

Finally, I would like suggestions on how to handle name collisions when match group names are provided as attributes. For instance, an expression like '(?P\<pos>.*)' would match more or less any string and assign it to the name "pos". But "pos" is already an attribute of the Match object, and therefore pos cannot be exposed as a named match group attribute, since match.pos will return the usual meaning of pos for a match object, not the value of the capture group names "pos".

I have 3 proposals as to how to handle this:

a) Simply disallow the exposure of match group name attributes if the names collide with an existing member of the basic Match Object interface.

I don't like the prefix ideas and now that you've spelt it out I don't like the sometimes m.foo will work and sometimes it won't. So I prefer m['foo'] to be the canonical way because that guarantees your code is always consistent.

------------------------------------------------------------ BTW I wanted to do a simple regex to match a string that might or might not be quoted, and that could contain quotes (but not those used to delimit it). My first attempt was illegal:

(?P<quote>['"])?([^(?=quote)])+(?(quote)(?=quote))

It isn't hard to work round but it did highlight the fact that you can't use captures inside character classes. I don't know if Perl allows this; I guess if it doesn't then Python shouldn't either since GvR wants the engine to be Perl compatible.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Thanks for weighing in Mark! Actually, your point is valid and quite fair, though I would not assume that Item 3 would be included just because Item 2 isn't. I will do my best to develop both, but I do not make the final decision as to what python includes. That having been said, 3 seems very likely at this point so we may be okay, but let me give this one more try as I think I have a better solution to make Item 2 more palatable. Let's say we have 5 choices here:

a) Simply disallow the exposure of match group name attributes if the names collide with an existing member of the basic Match Object interface.

b) Expose the reserved names through a special prefix notation, and for forward compatibility, expose all names via this prefix notation. In other words, if the prefix was 'k', match.kpos could be used to access pos; if it was '_', match._pos would be used. If Item 3 is implemented, it may be sufficient to allow access via match['pos'] as the canonical way of handling match group names using reserved words.

c) Don't expose the names directly; only expose them through a prefixed name, e.g. match._pos or match.kpos.

d) (As Mark suggested) we drop Item 2 completely. I have not invested much work in this so that would not bother me, but IMHO I actually prefer Item 2 to 3 so I would really like to see it preserved in some form.

e) Add an option, re.MATCH_ATTRIBUTES, that is used as a Match Creation flag. When the re.MATCHATTRIBUTES or re.A flag is included in the compile, or (?a) is included in the pattern, it will do 2 things.
First, it will raise an exception if either a) there exists an unnamed capture group or b) the capture group name is a reserved keyword. In addition to this, I would put in a hook to support a from \_future so that any post 2.6 changes to the match object type can be smoothly integrated a version early to allow programmers to change when any future changes come. Secondly, I would *conditionally* allow arbitrary capture group name via the __getattr handler IFF that flag was present; otherwise you could not access Capture Groups by name via match.foo.

I really like the idea of e) so I'm taking Item 2 out of the "ready for merge" category and going to put it in the queue for the modifications spelled out above. I'm not too worried about our flags differing from Perl too much as we did base our first 4 on Perl (x, s, m, i), but subsequently added Unicode and Locale, which Perl does not have, and never implemented o (since our caching semantic already pretty much gives every expression that), e (which is specific to Perl syntax AFAICT) and g (which can be simulated via re.split). So I propose we take A and implement it as I've specified and that is the current goal of Item 2. Once this is done and working, we can decide whether it should be included in the python trunk.

How does that sound to you, Mark and anyone else who wishes to weigh in?

a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 16 years ago

[snip]

It seems to me that both using a special prefix or adding an option are adding a lot of baggage and will increase the learning curve.

The nice thing about (3) (even without slicing) is that it seems a v. natural extension. But (2) seems magical (i.e., Perl-like rather than Pythonic) which I really don't like.

BTW I just noticed this:

'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!r}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!s}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!a}".format(rx)

That's fair enough, but maybe for !s the output should be rx.pattern?

pitrou commented 16 years ago

See also bpo-3825.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Update 16 Sep 2008:

Based on the work for issue bpo-3825, I would like to simply update the item list as follows:

1) Atomic Grouping / Possessive Qualifiers (See also Issue bpo-433030) [Complete]

2) Match group names as attributes (e.g. match.foo) [Complete save issues outlined above]

3) Match group indexing (e.g. match['foo'], match[3])

4) Perl-style back-references (e.g. compile(r'(a)\g{-1}'), and possibly adding the r'\k' escape sequence for keywords.

5) Parenthesis-Aware Python Comment (e.g. r'(?P#...)') [Complete]

6) Expose support for Template expressions (expressions without repeat operators), adding test cases and documentation for existing code.

7) Larger compiled Regexp cache (256 vs. 100) and reduced thrashing risk. [Complete]

8) Character Classes (e.g. r'[:alphanum:]')

9) Proposed Engine redesigns and cleanups (core item only contains cleanups and comments to the current design but does not modify the design).

9-1) Single-loop Engine redesign that runs 8% slower than current. [Complete]

9-1-1) 3-loop Engine redesign that runs 10% slower than current. [Complete]

9-2) Matthew Bernett's Engine redesign as per issue bpo-3825

10) Have all C-Python shared constants stored in 1 place (sre_constants.py) and generated by that into C constants (sre_constants.h). [Complete AFAICT]

11) Scan Perl 5.10.0 for other potential additions that could be implemented for Python.

12) Documentation suggestions by Jim J. Jewett [Complete]

13) Add grouptuples method to the Match object (i.e. match.grouptuples() returns (\<index>, \<name or None>, \<value>) ) suitable for iteration.

14) UNICODE match group names, as per PEP-3131.

15) Add __doc__ strings and other Python niceties to the Pattern_Type, Match_Type and Scanner_Type (experimental).

16) Implement any remaining TODOs and FIXMEs in the Regexp modules.

16-1) Allow for the disassociation of a source string from a Match_Type, assuming this will still leave the object in a "reasonable" state.

17) Variable-length [Positive and Negative] Look-behind assertions, as described and implemented in Issue bpo-3825.

---

Now, we have a combination of Items 1, 9-2 and 17 available in issue bpo-3825, so for now, refer to that issue for the 01+09-02+17 combined solution. Eventually, I hope to merge the work between this and that issue.

I sadly admit I have made not progress on this since June because managing 30 some lines of development, some of which having complex diamond branching, e.g.:

01 is the child of bpo-2636 09 is the child of bpo-2636 10 is the child of bpo-2636 09-01 is the child of 09 09-01-01 is the child of 09-01 01+09 is the child of 01 and 09 01+10 is the child of 01 and 10 09+10 is the child of 09 and 10 01+09-01 is the child of 01 and 09-01 01+09-01-01 is the child of 01 and 09-01-01 09-01+10 is the child of 09-01 and 10 09-01-01+10 is the child of 09-01-01 and 10

Which all seems rather simple until you wrap your head around:

01+09+10 is the child of 01, 09, 10, 01+09, 01+10 AND 09+10!

Keep in mind the reason for all this complex numbering is because many issues cannot be implemented in a vacuum: If you want Atomic Grouping, that's 1 implementation, if you want Shared Constants, that's a different implementation. but if you want BOTH Atomic Grouping and Shared Constants, that is a wholly other implementation because each implementation affects the other. Thus, I end up with a plethora of branches and a nightmare when it comes to merging which is why I've been so slow in making progress. Bazaar seems to be very confused when it comes to a merge in 6 parts between, for example 01, 09, 10, 01+09, 01+10 and 09+10, as above. It gets confused when it sees the same changes applied in a previous merge applied again, instead of simply realizing that the change in one since last merge is EXACTLY the same change in the other since last merge so effectively there is nothing to do; instead, Bazaar gets confused and starts treating code that did NOT change since last merge as if it was changed and thus tries to role back the 01+09+10-specific changes rather than doing nothing and generates a conflict. Oh, that I could only have a version control system that understood the kind of complex branching that I require!

Anyway, that's the state of things; this is me, signing out!

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

Comparing item 2 and item 3, I think that item 3 is the Pythonic choice and item 2 is a bad idea.

Item 4: back-references in the pattern are like \1 and (?P=name), not \g\<1> or \g\<name>, and in the replacement string are like \g\<1> and \g\<name>, not \1 (or (?P=name)). I'd like to suggest that back-references in the pattern also include \g\<1> and \g\<name> and \g\<-1> for relative back-references. Interestingly, Perl names groups with (?\<name>...) whereas Python uses (?P\<name>...). A permissible alternative?

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Thanks for weighing in Matthew!

Yeah, I do get some flack for item 2 because originally item 3 wasn't supposed to cover named groups but on investigation it made sense that it should. I still prefer 2 over-all but the nice thing about them being separate items is that we can accept 2 or 3 or both or neither, and for the most part development for the first phase of 2 is complete though there is still IMHO the issue of UNICODE name groups (visa-vi item 14) and the name collision problem which I propose fixing with an Attribute / re.A flag. So, I think it may end up that we could support both 3 by default and 2 via a flag or maybe 3 and 2 both but with 2 as is, with name collisions hidden (i.e. if you have r'(?P\<string>...)' as your capture group, typing m.string will still give you the original comparison string, as per the current python documentation) but have collision-checking via the Attribute flag so that with r'(?A)(?P\<string>...)' would not compile because string is a reserved word.

Your interpretation of 4 matches mine, though, and I would definitely suggest using Perl's \g\<-n> notation for relative back-references, but further, I was thinking, if not part of 4, part of the catch-all item 11 to add support for Perl's (?\<name>...) as a synonym for Python's (?P\<name>...) and Perl's \k\<name> for Python's (?P=name) notation. The evolution of Perl's name group is actually interesting. Years ago, Guido had a conversation with Larry Wall about using the (?P...) capture sequence for python-specific Regular Expression blocks. So Python went ahead and implemented named capture groups. Years later, the Perl folks thought named capture groups were a neat idea and adapted them in the (?\<...>...) form because Python had restricted the (?P...) notation to themselves so they couldn't use our even if they wanted to. Now, though, with Perl adapting (?\<...>...), I think it inevitable that Java and even C++ may see this as the defacto standard. So I 100% agree, we should consider supporting (?\<name>...) in the parser.

Oh, and as I suggested in bpo-3825, I have these new item proposals:

Item 18: Add a re.REVERSE, re.R (?r) flag for reversing the direction of the String Evaluation against a given Regular Expression pattern. See bpo-516762, as implemented in bpo-3825.

Item 19: Make various in-line flags positionally dependant, for example (?i) makes the pattern before this case-sensitive but after it case-insensitive. See bpo-433024, as implemented in bpo-3825.

Item 20: All the negation of in-line flags to cancel their effect in conditionally flagged expressions for example (?-i). See bpo-433027, as implemented in bpo-3825.

Item 21: Allow for scoped flagged expressions, i.e. (?i:...), where the flag(s) is applied to the expression within the parenthesis. See Issue 433028, as implemented in bpo-3825.

Item 22: Zero-width regular expression split: when splitting via a regular expression of Zero-length, this should return an expression equivalent to splitting at each character boundary, with a null string at the beginning and end representing the space before the first and after the last character. See bpo-3262.

Item 23: Character class ranges over case-insensitive matches, i.e. does "(?i)[9-A]" contain '_' , whose ord is greater than the ord of 'A' and less than the ord of 'a'. See bpo-5311.

And I shall create a bazaar repository for your current development line with the unfortunately unwieldy name of lp:~timehorse/python/issue2636-01+09-02+17+18+19+20+21 as that would, AFAICT, cover all the items you've fixed in your latest patch.

Anyway, great work Matthew and I look forward to working with you on Regexp 2.7 as you do great work!

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

Regarding item 22: there's also bpo-1647489 ("zero-length match confuses re.finditer()").

This had me stumped for a while, but I might have a solution. I'll see whether it'll fix item 22 too.

I wasn't planning on doing any more major changes on my branch, just tweaking and commenting and seeing whether I've missed any tricks in the speed stakes. Half the task is finding out what's achievable, and how!

birkenfeld commented 16 years ago

Though I can't look at the code at this time, I just want to express how good it feels that you both are doing these great things for regular expressions in Python! Especially atomic grouping is something I've often wished for when writing lexers for Pygments... Keep up the good work!

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Good catch on bpo-1647489 Matthew; it looks like this is where that bug fix will end up going. But, I am unsure if the solution for this issue is going to be the same as for 3262. I think the solution here is to add an internal flag that will keep track of whether the current character had previously participated in a Zero-Width match and thus not allow any subsequent zero-width matches associated beyond the first, and at the same time not consuming any characters in a Zero-width match.

Thus, I have allocated this fix as Item 24, but it may be later merged with 22 if the solutions turn out to be more or less the same, likely via a 22+24 thread. The main difference, though, as I see it, is that the change in 24 may be considered a bug where the general consensus of 22 is that it is more of a feature request and given Guido's acceptance of a flag-based approach, I suggest we allocate re.ZEROWIDTH, re.Z and (?z) flags to turn on the behaviour you and I expect, but still think that be best as a 2.7 / 3.1 solution. I would also like to add a from __futurue__ import ZeroWidthRegularExpressions or some such to make this the default behaviour so that by version 3.2 it may indeed be considered the default.

Anyway, I've allocated all the new items in the launchpad repository so feel free to go to http://www.bazaar-vcs.org/ and install Bazaar for windows so you can download any of the individual item development threads and try them out for yourself. Also, please consider setting up a free launchpad account of your very own so that I can perhaps create a group that would allow us to better share development.

Thanks again Matthew for all your greatly appreciated contributions!

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

I've moved all the development branches to the \~pythonregexp2.7 team so that we can work collaboratively. You just need to install Bazaar, join www.launchpad.net, upload your public SSH key and then request to be added to the pythonregexp2.7 team. At that point, you can check out any code via:

bzr co lp:~pythonregexp2.7/python/issue2636-*

This should make co-operative development easier.

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

Just out of interest, is there any plan to include bpo-1160 while we're at it?

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

I've enumerated the current list of Item Numbers at the official Launchpad page for this issue:

https://launchpad.net/~pythonregexp2.7

There you will find links to each development branch associated with each item, where a broader description of each issue may be found.

I will no longer enumerate the entire list here as it has grown too long to keep repeating; please consult that web page for the most up-to-date list of items we will try to tackle in the Python Regexp 2.7 update.

Also, anyone wanting to join the development team who already has a Launchpad account can just go to the Python Regexp 2.7 web site above and request to join. You will need Bazaar to check out, pull or branch code from the repository, which is available at www.bazaar-vcs.org.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Good catch, Matthew, and if you spot any other outstanding Regular Expression issues feel free to mention them here.

I'll give bpo-1160 an item number of 25 and think all we need to do here is change SRE_CODE to be typedefed to an unsigned long and change the repeat count constants (which would be easier if we assume item 10: shared constants).

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

For reference, these are all the regex-related issues that I've found (including this one!):

id : activity : title bpo-2636 : 25/09/08 : Regexp 2.7 (modifications to current re 2.2.2) bpo-1160 : 25/09/08 : Medium size regexp crashes python bpo-1647489 : 24/09/08 : zero-length match confuses re.finditer() bpo-3511 : 24/09/08 : Incorrect charset range handling with ignore case flag? bpo-3825 : 24/09/08 : Major reworking of Python 2.5.2 re module bpo-433028 : 24/09/08 : SRE: (?flag:...) is not supported bpo-433027 : 24/09/08 : SRE: (?-flag) is not supported. bpo-433024 : 24/09/08 : SRE: (?flag) isn't properly scoped bpo-3262 : 22/09/08 : re.split doesn't split with zero-width regex bpo-3299 : 17/09/08 : invalid object destruction in re.finditer() bpo-3665 : 24/08/08 : Support \u and \U escapes in regexes bpo-3482 : 15/08/08 : re.split, re.sub and re.subn should support flags bpo-1519638 : 11/07/08 : Unmatched Group issue - workaround bpo-1662581 : 09/07/08 : the re module can perform poorly: O(2**n) versus O(n**2) bpo-3255 : 02/07/08 : [proposal] alternative for re.sub bpo-2650 : 28/06/08 : re.escape should not escape underscore bpo-433030 : 17/06/08 : SRE: Atomic Grouping (?>...) is not supported bpo-1721518 : 24/04/08 : Small case which hangs bpo-1693050 : 24/04/08 : \w not helpful for non-Roman scripts bpo-2537 : 24/04/08 : re.compile(r'((x|y+))') should fail bpo-1633953 : 23/02/08 : re.compile("(.*$){1,4}", re.MULTILINE) fails bpo-1282 : 06/01/08 : re module needs to support bytes / memoryview well bpo-814253 : 11/09/07 : Grouprefs in lookbehind assertions bpo-214033 : 10/09/07 : re incompatibility in sre bpo-1708652 : 01/05/07 : Exact matching bpo-694374 : 28/06/03 : Recursive regular expressions bpo-433029 : 14/06/01 : SRE: posix classes aren't supported

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Hmmm. Well, some of those are already covered:

bpo-2636 : self bpo-1160 : Item 25 bpo-1647489 : Item 24 bpo-3511 : Item 23 bpo-3825 : Item 9-2 bpo-433028 : Item 21 bpo-433027 : Item 20 bpo-433024 : Item 19 bpo-3262 : Item 22 bpo-3299 : TBD bpo-3665 : TBD bpo-3482 : TBD bpo-1519638 : TBD bpo-1662581 : TBD bpo-3255 : TBD bpo-2650 : TBD bpo-433030 : Item 1 bpo-1721518 : TBD bpo-1693050 : TBD bpo-2537 : TBD bpo-1633953 : TBD bpo-1282 : TBD bpo-814253 : TBD (but I think you implemented this, didn't you Matthew?) bpo-214033 : TBD bpo-1708652 : TBD bpo-694374 : TBD bpo-433029 : Item 8

I'll have to get nosy and go over the rest of these to see if any of them have already been solved, like the duplicate test case issue from a while ago, but someone forgot to close them. I'm thinking specifically the '\u' escape sequence one.

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

bpo-814253 is part of the fix for variable-width lookbehind.

BTW, I've just tried a second time to register with Launchpad, but still no reply. :-(

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Yes, I see in you rc2+2 diff it was added into that. I will have to allocate a new number for that fix though, as technically it's a different feature than variable-length look-behind.

For now I'm having a hard time merging your diffs in with my code base. Lots and lots of conflicts, alas.

BTW, what UID did you try to register under at Launchpad? Maybe I can see if it's registered but just forgetting to send you e-mail.

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

Tried bazaar@mrabarnett.plus.com twice, no reply. Succeeded with mrabarnett@freeuk.com.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Thanks Matthew. You are now part of the pythonregexp2.7 team. I want to handle integrating Branch 01+09-02+17 myself for now and the other branches will need to be renamed because I need to add Item 26: Capture Groups in Look-Behind expressions, which would mean the order of your patches are:

01+09-02+17:

regex_2.6rc2.diff regex_2.6rc2+1.diff

01+09-02+17+26:

regex_2.6rc2+2.diff

01+09-02+17+18+26:

regex_2.6rc2+3.diff regex_2.6rc2+4.diff

01+09-02+17+18+19+20+21+26:

regex_2.6rc2+5 regex_2.6rc2+6

It is my intention, therefore, to check a version of each of these patches in to their corresponding repository, sequentially, starting with 0, which is what I am working on now.

I am worried about a straight copy to each thread though, as there are some basic cleanups provided through the core bpo-2636 patch, the item 1 patch and the item 9 patch. The best way to see what these changes are is to download http://bugs.python.org/file10645/issue2636-patches.tar.bz2 and look at the bpo-2636-01+09.patch file or, by typing the following into bazaar:

bzr diff --old lp:~pythonregexp2.7/python/base --new lp:~pythonregexp2.7/python/issue2636+01+09

Which is more up-to-date than my June patches -- I really need to regenerate those!

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

I've been completely unable to get Bazaar to work with Launchpad: authentication errors and bzrlib.errors.TooManyConcurrentRequests.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Matthew,

Did you upload a public SSH key to your Launchpad account?

You're on MS Windows, right? I can try and do an install on an MS Windows XP box or 2 I have lying around and see how that works, but we should try and solve this vexing thing I've noticed about Windows development, which is that Windows cannot understand Unix-style file permissions, and so when I check out Python on Windows and then check it back in, I've noticed that EVERY python and C file is "changed" by virtue of its permissions having changed. I would hope there's some way to tell Bazaar to ignore 'permissions' changes because I know our edits really have nothing to do with that.

Anyway, I'll try a few things visa-vi Windows to see if I get a similar problem; there's also the https://answers.launchpad.net/bazaar forum where you can post your Bazaar issues and see if the community can help. Search previous questions or click the "Ask a question" button and type your subject. Launchpad's UI is even smart enough to scan your question title for similar ones so you may be able to find a solution right away that way. I use the Launchpad Answers section all the time and have found it usually is a great way of getting help.

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

I have it working finally!

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Great, Matthew!!

Now, I'm still in the process of setting up branches related to your work; generally they should be created from a core and set of features implemented for example:

To get from Version 2 to Version 3 of your Engine, I had to first check out lp:~pythonregexp2.7/python/issue2636-01+09-02+17 and then "push" it back onto launchpad as lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26. This way the check-in logs become coherent.

So, please hold off on checking your code in until I have your current patch-set checked in, which I should finish by today; I also need to rename some of the projects based on the fact that you also implemented item 26 in most of your patches. Actually, I keep a general To-Do list of what I am up to on the https://code.launchpad.net/~pythonregexp2.7/python/issue2636 whiteboard, which you can also edit, if you want to see what I'm up to. But I'll try to have that list complete by today, fingers crossed! In the mean time, would you mind seeing if you are getting the file permissions issue by doing a checkout or pull or branch and then calling "bzr stat" to see if this caused Bazaar to add your entire project for checkin because the permissions changed. Thanks and congratulations again!

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

I did a search on the permissions problem: https://answers.launchpad.net/bzr/+question/34332.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Thanks, Matthew. My reading of that Answer is that you should be okay because you, I assume, installed the Windows-Native package rather than the cygwin that I first tested. I think the problem is specific to Cygwin as well as the circumstances described in the article. Still, it should be quite easy to verify if you just check out python and then do a stat, as this will show all files whose permissions have changed as well as general changes. Unfortunately, I am still working on setting up those branches, but once I finish documenting each of the branches, I should proceed more rapidly.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Phew! Okay, all you patches have been applied as I said in a previous message, and you should now be able to check out lp:~pythonregexp2.7/python/issue2636+01+09-02+17+18+19+20+21+24+26 where you can then apply your latest known patch (rc2+7) to add a fix for the findall / finditer bug.

However, please review my changes to:

a) lp:~pythonregexp2.7/python/issue2636-01+09-02+17 b) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26 c) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+26 d) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+19+20+21+26

To make sure my mergers are what your code snapshots should be. I did get one conflict with patch 5 IIRC where a reverse attribute was added to the SRE_STATE struct, and get a weird grouping error when running the tests for (a) and (b), which I think is a typo; a compile error regarding the afore mentioned missing reverse attribute from patch 3 or 4 in (c) and the SRE_FLAG_REVERSE seems to have been lost in (d) for some reason.

Also, if you feel like tackling any other issues, whether they have numbers or not, and implementing them in your current development line, please let me know so I can get all the documentation and development branches set up. Thanks and good luck!

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

I haven't yet found out how to turn on compression when getting the branches, so I've only looked at lp:~pythonregexp2.7/python/issue2636+01+09-02+17+18+19+20+21+24+26. I did see that the SRE_FLAG_REVERSE flag was missing.

BTW, I ran re.findall(r"(?m)^(.*re\..+\\m)$", text) where text was 67MB of emails. Python v2.5.2 took 2.4secs and the new version 5.6secs. Ouch! I added 4 lines to _sre.c and tried again. 1.1secs. Nice! :-)

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Good work, Matthew. Now, another bazaar hint, IMHO, is once of my favourite commands: switch. I generally develop all in one directory, rather than getting a new directory for each branch. Once does have to be VERY careful to type "bzr info" to make sure the branch you're editing is the one you think it is! but with "bzr switch", you do a differential branch switch that allows you to change your development branch quickly and painlessly. This assumes you did a "bzr checkout" and not a "bzr pull". If you did a pull, you can still turn this into a "checkout", where all VCS actions are mirrored on the server, by using the 'bind' command. Make sure you push your branch first. You don't need to worry about all this "bind"ing, "push"ing and "pull"ing if you choose checkout, but OTOH, if your connection is over-all very slow, you may still be better off with a "pull"ed branch rather than a "checkout"ed one.

Anyway, good catch on those 4 lines and I'll see if I can get your earlier branches up to date.

f0c60cc3-3318-4c1a-b471-b3af1d161d69 commented 16 years ago

Matthew, I've traced down the patch failures in my merges and now each of the 4 versions of code on Launchpad should compile, though the first 2 do not pass all the negative look-behind tests, though your later 2 do. Any chance you could back-port that fix to the lp:~pythonregexp2.7/python/issue2636-01+09-02+17 branch? If you can, I can propagate that fix to the higher levels pretty quickly.

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

bpo-2636-01+09-02+17_backport.diff is the backport fix.

Still unable to compress the download, so that's >200MB each time!

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

The explanation of the zero-width bug is incorrect. What happens is this:

The functions for finditer(), findall(), etc, perform searches and want the next one to continue from where the previous match ended. However, if the match was actually zero-width then that would've made it search from where the previous search _started_, and it would be stuck forever. Therefore, after a zero-width match the caller of the search consumes a character. Unfortunately, that can result a character being 'missed'.

The bug in re.split() is also the result of an incorrect fix to this zero-width problem.

I suggest that the regex code should include the fix for the zero-width split bug; we can have code to turn it off unless a re.ZEROWIDTH flag is present, if that's the decision.

The patch bpo-2636+01+09-02+17+18+19+20+21+24+26_speedup.diff includes some speedups.

39d85a87-36ea-41b2-b2bb-2be43abb500e commented 16 years ago

I've found an interesting difference between Python and Perl regular expressions:

In Python:

    \Z matches at the end of the string

In Perl:

    \\Z matches at the end of the string or before a newline at the

end of the string

    \z matches at the end of the string

python / cpython

Adding a new regex module (compatible with re) #46888