pfalcon / re1.5

re1, the unbloated regexp engine by Russ Cox, elaborated to be useful for real-world applications
BSD 3-Clause "New" or "Revised" License
42 stars 4 forks source link

Contributing to your re1.5 project #9

Open ampli opened 9 years ago

ampli commented 9 years ago

Hi Paul, For another project that I contribute to, I need a flexible regexp package that I can easily change for my needs (natural-language sentence tokenizing). It looks like "re1" is appropriate for that. Since I liked your modified VM code more than the original one, I plan to base my modifications on your re1.5.

Before I add the very specific features that I need, I plan to add some more general ones that I also need. Some of my changes may seem to be too specific for a your purpose, but maybe some are not. While preparing my specific version, I can send you pull requests for features you think that are appropriate (or may be appropriate) for your re1.5.

Here is what I plan to do (some I already have done, as indicated):

  1. Add flags to main() [done]
usage: re [-hmdv] [-e ENGINE] <regexp> <string>...
-h:        Print help message and exit
-m:        String is anchored
-e ENGINE: Specify one of: recursive recursiveloop backtrack thompson pike
-d:        Print debug messages
-t:        Print VM trace messages

The rational was to ease the debug cycle.

  1. Add CFLAGS for debug. Especially, enable ASAN checks. [done]
  2. Remove re1_5_sizecode() altogether. [done] Instead, I used _compilecode() also for code sizing. The rational was to prevent the need to make parallel changes in re1_5_sizecode() when features are added in _compilecode(), to keep both synchronized regarding the generated code size.
  3. Change the API of _compilecode() to return the code (or NULL in case of a failure). [done] The idea is to be similar to existing regex packages, i which you can pre-compile a list of regexp strings, and use the compiled ones whenever needed, like objects. But I guess such changes may be inappropriate for your project.
  4. Add the ability to generate absolutely all matches, not only the first one. [partially done] Such a feature is needed for my said other project. Most (if not all) of the existing free regex packages lack it (although it can be done, with some effort, using PCRE.) For now I implemented it only for "backtrack" , and I still need to write an API for it (the implementation for the other methods is similar.)
  5. Add non-capture groups (?:). [done]
  6. Add the ability to keep track on all group captures, not only the last capture for each group. [in work] This is also a feature I must have for my other project. AFAIK, only .NET has such a feature (see: http://www.rexegg.com/regex-csharp.html#quantgroups)
  7. From your TODO: Support for repetition operator {n} and {n,m}. I already figured out a way to implement that, but I'm not sure it is the ideal one. The idea is to add count "registers" (similar to the "sub" implementation) and new VM commands to inspect and jump on it.
  8. From your TODO: Support for Unicode (UTF-8). This is also a must-have feature for my said project.
  9. Support using ] as a first character in a class. [done]
  10. Support magic removal for character x by \x, in any place. [partial done]
  11. Support 0xHH / 0x{HHHH}.
  12. A more flexible design for named classes. My idea is to use a table like:
'd'   "0-9"
'w'   "A-Za-z0-9_"
's'    "\x32\x09\x0A\x0C"
't'     "\t"
...

that will be used by the code generator, also for things like [\d\s], instead of hard-coding all classes in _re1_5_namedclassmatch(). This way it will be easy to add any additional desired class. (Such single-char definitions, like \t, will of course generate char and not class.)

BTW, currently \s in re1.5 is missing 0x0C (formfeed).

  1. Remove the original regex parse and code generation. (For not including obsolete code, or have a need to maintain it for new functionality that is added.) BTW, in the current program ?: anywhere in the regexp (when compiled with DEBUG) gives an error due to the old code which has an incomplete support for it.
  2. Free all memory which is allocated. [partially done] Currently some memory remained allocated after functions are called.
  3. Make it thread-safe, at least while matching. (This is a must for my said other project.) Currently the global variable Sub *freesub is problematic in this regard.

If you indicate which feature is desired also for your code, I will try to isolate it and send you a pull request.

pfalcon commented 9 years ago

Amir, I should start with saying that I'd appreciate contributions to this project. At the same time, the reason it exists is foremost because we needed very lean regexp implementation for https://github.com/micropython/micropython . So, the idea was to take re1, which is mostly demo/teaching/experimentation project, and make it a bit more suitable for real-world usage, while still keeping it as minimal as reasonably possible (I do accept that code size would grow, just the growth should be controlled). So, if you think you appreciate that aim, and other traits of re1.5 look good for you, getting patches would be very nice.

pfalcon commented 9 years ago

Let's go over specific points you list:

Add flags to main() [done]

Well, I'm indifferent to this, as long as it doesn't affect library size. If you implemented them, would be nice to merge.

Add CFLAGS for debug. Especially, enable ASAN checks. [done]

If these are optional (require uncommenting/special make target), sounds good.

Remove re1_5_sizecode() altogether. [done]

If this decrease, not increase code size, sounds good.

Change the API of _compilecode() to return the code (or NULL in case of a failure). [done]

Would need to look into this. But for MicroPython, we need to have dependency injection for memory allocation, i.e. be able to pre-size area needed for compiled regex, allocate it in client, and then pass pointer to fill in to compile API.

Add the ability to generate absolutely all matches, not only the first one. [partially done]

Well, I guess this can go in "level 2 patches" category, to be done after "level 1" ;-). I'd be glad to make re1.5 be better in some aspects than other engines, but if that increases code size, I'd ask it to be #ifdef'able.

For now I implemented it only for "backtrack" , and I still need to write an API for it (the implementation for the other methods is similar.)

Well, multiple backends is a nice feature inherited from re1, and I'd like to maintain it, as that's clearly a differentiating point. But of course, it requires extra maintenance effort.

pfalcon commented 9 years ago

Add non-capture groups (?:). [done]

That's very nice, feel free to submit first thing ;-).

Add the ability to keep track on all group captures, not only the last capture for each group. [in work]

Level 2 feature.

From your TODO: Support for repetition operator {n} and {n,m}. I already figured out a way to implement that, but I'm not sure it is the ideal one. The idea is to add count "registers" (similar to the "sub" implementation) and new VM commands to inspect and jump on it.

Yes, apparently that's the way to do it. Key point is that these "counters" should be part of data state, not code state, and should be maintained per-match (otherwise it's not thread-safe, etc.). Other aspects need experimenting, like if it would make sense to have common data "area" and common indexes for sub's and counters, or they should be separate.

From your TODO: Support for Unicode (UTF-8).

Would be very nice to implement, but would need to be lean and optional (i.e. make "nextchar" function to be either a define/inline function/normal function).

Support using ] as a first character in a class. [done]

Funny corner case, didn't even think about it ;-). Are you sure empty charclasses are banned?

Support magic removal for character x by \x, in any place. [partial done]

Not sure I got this, do you mean treating unknown escape sequences as a normal char?

Support 0xHH / 0x{HHHH}.

Would be nice to have this #ifdef'able

A more flexible design for named classes. a table [...], instead of hard-coding all classes

Well, that will cost us precious bytes in MicroPython. I have a magic spell for all such cases: #ifdef ;-)

Remove the original regex parse and code generation.

Well, initial parser was hand-written to be as small as possible, without too much attention to error handling. It might improve since that, but I still guess it may miss to report errors or even crash. Original parser was kinda left for reference, and indeed I'm not sure I updated it consistently. So, well, if your independent review shows that hand-written parser is good enough up to dropping original one, we can do it ;-).

Free all memory which is allocated. [partially done]. Currently some memory remained allocated after functions are called.

Again, our idea is that memory is dependency-injected into all functions. Would need to look at specific cases to see what you mean.

Make it thread-safe, at least while matching. (This is a must for my said other project.) Currently the global variable Sub *freesub is problematic in this regard.

I'm positive about this, and if it's only a single global var which is culprit, it doesn't even require #ifdef'ing.

pfalcon commented 9 years ago

Summing up, I'm positive about almost all changes you propose, but they would have different priorities for merging, and some may need more detailed look to see if they worth to have them in re1.5 upstream or if they can be specific to your project.

Thanks!

ampli commented 9 years ago

Just some concentrated comments on some of your points:

Change the API of _compilecode() to return the code (or NULL in case of a failure). [done] Would need to look into this. But for MicroPython, we need to have dependency injection for memory allocation, i.e. be able to pre-size area needed for compiled regex, allocate it in client, and then pass pointer to fill in to compile API.

I will check if I can reimplement it in a compatible way.

Well, multiple backends is a nice feature inherited from re1, and I'd like to maintain it, as that's clearly a >differentiating point. But of course, it requires extra maintenance effort.

BTW, I found that the code in thompson.c is redundant. The code of pick.c is essentially the same, plus the added saved submatches. I say "essentially" because there are two minor differences:

  1. An off-by-one bug in the new sp when creating a thread after a character match. It leads to an incorrect offset in Save. But Save is not actually implemented in the code, so this bug dowsn't manifest itself.
  2. In the addthread() of pick.c, the new thread is not created if its first command is not going to be immediately handled. It looks like this is an optimization since as far as I could check it works fine either way.

So to ease future maintenance, thompson.c can be removed, and a comment added to pike.c, like: /* This is the Thompson implementation + saving submatches */

Support using ] as a first character in a class. [done]

Funny corner case, didn't even think about it ;-). Are you sure empty charclasses are banned?

Empty [] is an error in Python. Currently there is no way in re1.5 to specify ] inside [], since there are no \ escapes yet inside [] and no \xHH escapes. So this fix enables representing ] in a class.

I have also implemented a fix to support 'metcharacter escape inside[]. BTW, traling-` currently produce an error and should be fixed to represent the character itself - I will prepare a patch too.

Support magic removal for character x by \x, in any place. [partial done]

Not sure I got this, do you mean treating unknown escape sequences as a normal char?

I mean being able to put \ before regexp metacharacter, anywere, like [[\]]. The yacc code doesn't support this even in more simple cases, so I propose not to maintain it any more.

A more flexible design for named classes. a table [...], instead of hard-coding all classes

Well, that will cost us precious bytes in MicroPython. I have a magic spell for all such cases: #ifdef ;-)

I think such an implementation has the potential to save code (no need to implement named class in the VM)... I may try it to see.

BTW, in main.c the ifdef's are indented, and in compile.c it is not. Can I use non-indented ifdef's?

Free all memory which is allocated. [partially done]. Currently some memory remained allocated after >>functions are called.

Again, our idea is that memory is dependency-injected into all functions. Would need to look at specific >cases to see what you mean.

In MicroPython uses only recursiveloop.c, which doesn't contain memory allocation, so it doesn't matter to it. On the other hand, pike.c malloc's memory, some of it is not freed (or reused on later invocations).

main.c also alloc's memory that is not getting freed. Of course this doesn't really matter,. But if CFLAGS=-fsanitized=address is used, freeing everything makes it easy to validate there is no memory leak anywhere (i.e. in the library).

pfalcon commented 9 years ago

BTW, in main.c the ifdef's are indented, and in compile.c it is not. Can I use non-indented ifdef's?

Yes. Indented #ifdef likely leaked from micropython, where we have many ifdef's, and having them indented really helps readability.

(will answer rest of question as time permits)

pfalcon commented 9 years ago

BTW, traling - currently produce an error and should be fixed to represent the character itself - I will prepare a patch too.

Well, about such case I know, but always took care to allow it to be specified at the beginning.

I think such an implementation has the potential to save code (no need to implement named class in the VM)... I may try it to see.

So, as our discussion in the latest PRs showed, it may be not immediately obvious how this or that change affects code size, especially if you're not get used to think in terms of mere bytes. But rule of thumb is that data-driven approach makes it more easier to extend, but likely code size bigger. Hardcoding simple cases makes them smaller and also more performant (CPU branch predictor works better). @dpgeorge implemented that code, see #2 and #3, and as you can see, my intuitive assessment of what would be shorter faulted there.

All in all, I wrote the above not to block you to add table-driven, extensible implementation if you really need extensibility, just please check code sizes, and add #ifdef between old and new code then if warranted.

pfalcon commented 9 years ago

BTW, I found that the code in thompson.c is redundant. The code of pick.c is essentially the same, plus the added saved submatches. I say "essentially" because there are two minor differences:

Sorry for the delay with looking into this matter, I pooled time to look into code myself to skip asking "dumb" question, but so far wasn't able, so let's discuss it speculatively. First of all, I hope you read Russ Cox' articles which discuss regex engines implementation choices/strategies, they talk about Thompson algo, and then about improvement over it, Pike algo. re1 is demonstration of all these algos, and by default I wanted to preserve it while makes sense. Then, how much "they're are essentially the same"? Can for example pike.c be made to produce the same code as thompson.c with simple, clean #ifdef's? If so, I'm in favor of merging them too. But just dropping thompson.c doesn't seem too good - it's simpler and uses less resources. Granted, most high-level APIs requires not just boolean match/nomatch status, but captures as result, but there're still good usages of yes/no algo, and then it's usually high-volume data, where getting highest performance is important.

That's ideas I have about it, if you have other arguments, please share them.

ampli commented 9 years ago

First of all, I hope you read Russ Cox' articles which discuss regex engines implementation choices/strategies, they talk about Thompson algo, and then about improvement over it, Pike algo.

Of course.

Then, how much "they're are essentially the same"?

I realized they are essentially the same by adding, as an exercise, sub capturing to thompson.c. I then rearranged the few subp initialization lines at the start of re15(thompsonvm|pike)vm() to be exactly the same (because it is actually the same).

When I then did a diff (ignoring whitespace), to my surprise the diff was very small, and it was in addthread() in these lines:

    l->t[l->n] = t;
    l->n++;

In pike.c they are under default:, while in thompson.c they are just before the switch. Because its exact position is not an essential part of the implementation, it turned out these files are exactly the same, beside of the sub capturing code.

Can for example pike.c be made to produce the same code as thompson.c with simple, clean #ifdef's?

Yes. The sub capturing just needs to be ifdef'ed. However, this will make it looking somewhat inflated and cumbersome. For example, the line addthread(nlist, thread(pc, sub), input, sp+1); would become:

#ifdef CAPTURE
       addthread(nlist, thread(pc, sub), input, sp+1);
#else
        addthread(nlist, thread(pc, input, sp+1);
#endif

There are 10 such lines in pike.c. There are also 8 decref() lines that need to be ifdef'ed, even though in this case, but in this case it can be done using:

#ifndef CAPTURE
#define decref()
#endif 

Several other lines need to be ifdef'ed too. I'm not sure you would like the result... But if you indeed like this idea I can implement it. To be really useful, it should be implemented in the whole code (i,.e. also in the dumpcode and compile code).

But just dropping thompson.c doesn't seem too good - it's simpler and uses less resources.

It uses less resources only because it doesn't do sub capturing, and because it does a capturing of the start of the whole match in a buggy way (that fixing it will add more code).

Note that any of the other VM implementations can be fixed to take less resources this way, by just omitting capturing!

Granted, most high-level APIs requires not just boolean match/nomatch status, but captures as result, but there're still good usages of yes/no algo, and then it's usually high-volume data, where getting highest performance is important.

BTW, one of the differences between those VM implementations is the amount of stack space that each uses. Hence I wonder why recursiveloop() was chosen for MicroPython(), since in the pikevm the needed "repetition" space is allocated on the heap. However, for simple regexes it indeed doesn't matter.

pfalcon commented 9 years ago
#ifndef CAPTURE
#define decref()
#endif

I'd think that addthread() example above could be massaged into such form too (i.e. not ifdef each invocation, but the declaration).

Note that any of the other VM implementations can be fixed to take less resources this way, by just omitting capturing!

Yes, except that yet needs to be done, while for thompson/pike case it's already done, and as I mentioned above, it's all above preserving somebody's (hard) work while it makes sense. Amir, you now know code very well, so I'll trust your opinion if you think it makes sense to just drop thompson.c.

pfalcon commented 9 years ago

Hence I wonder why recursiveloop() was chosen for MicroPython(), since in the pikevm the needed "repetition" space is allocated on the heap.

Deciding factor for initial code-drop into uPy then was pure code size. That's something to revisit in due time for sure, along with other pending issues on uPy side.

ampli commented 9 years ago

Yes, except that yet needs to be done, while for thompson/pike case it's already done, and as I mentioned above, it's all above preserving somebody's (hard) work while it makes sense. Amir, you now know code very well, so I'll trust your opinion if you think it makes sense to just drop thompson.c.

Since the placement of the cleanmarks() code has been resolved, I think that for now there is no need to do any extra work regarding the thompson.c. file (even not fixing the known bug in the reported match start position). This means not removing it, and only maybe eventually adding it a comment saying that since it is similar to pike.c without capturing, further work has not been done to add fixes and features to it (e.g vm-trace and {n,m}).