refi64 / rejit

A work-in-progress JIT-powered regex engine
Mozilla Public License 2.0
110 stars 4 forks source link

Implement lookbehinds #4

Open refi64 opened 8 years ago

refi64 commented 8 years ago

There are two ways to approach this:

  1. Figure out the length N of the lookbehind assertion, and check for it exactly N characters back in the string. This is what Python does. I personally find it slightly restrictive... 2. Allow any expression. Just step back through the string one character at a time and test if the lookbehind matches AND ends right at the previous position. This is obviously very slow but much more flexible. One optimization here could be to calculate the minimum length of the match and start there. Doing the same thing with a max length may not be worth it, since stuff like (?<=a{2,5}) isn't that common.

My idea is to combine both: when the first approach can be used, then use it! When the given regex wouldn't work with it (e.g. (?<=a+)b), then use method 2. Using method 2 in practice utterly failed. Not even Perl allows that. I just went with method 1.

refi64 commented 8 years ago

Positive lookbehinds now work according to method 1.

refi64 commented 8 years ago

Should be finished now!

vendethiel commented 8 years ago

Although it'd be really amazing to have variable-length look{ahead,behind} ... I had some uses for 'em :)

refi64 commented 8 years ago

@vendethiel Variable-length lookaheads work, though! Lookbehinds are a little trickier, and I'm trying to think of not-insanely-inefficient ways to do it...but ATM it's not that high-priority.

vendethiel commented 8 years ago

Okay, well, I'll be interested to know if you ever implement this. Allows some crazy obscure tricks.

refi64 commented 8 years ago

Y'know what screw it. I think I know how to implement this more efficiently than I thought. (Or at least on X64; not sure what to do on X86 yet, since I'm pretty much out of registers...).

vendethiel commented 8 years ago

Cool :D. Feel free to page me here when you commit an implementation, or if you wanna write down technical details.

pyos commented 8 years ago

I'm pretty sure it should be possible to implement reasonably fast variable-length lookbehinds, so long as the lookbehind expression only matches words of a regular language (so no nested backreferences or lookaheads or lookbehinds in a lookbehind). If that's the case, you could compile the subexpression into a separate DFA/NFA with all concatenations reversed, and attempt to match the reverse of string[:current_position].

(That doesn't sound particularly easy, though.)

refi64 commented 8 years ago

@pyos That's actually a really good idea...thanks!