nitely / nim-regex

Pure Nim regex engine. Guarantees linear time matching
https://nitely.github.io/nim-regex/
MIT License
227 stars 21 forks source link

Support look-around assertions #6

Closed nitely closed 3 years ago

nitely commented 6 years ago

Docs say:

but lacks a few features that can not be implemented while keeping the space/time complexity guarantees, i.e.: backreferences and look-around assertions.

However that's not entirely true. Look-ahead assertions limited to a single symbol or set are easy to implement, it's actually being done in order to support assertions (i.e: \b). I'm not sure about look-behind assertions. All of it while keeping the space/time complexity guarantees.

Bounded look-ahead assertions (i.e: without +, * and {n,}) could also be implemented but I'm not sure how much it would complicate things. I know a queue would have to be used when peeking following chars. Bounded backreferences could also be implemented somehow. Both features would take O(m) space, where m is the length of the regex (not the input). wait, we just need to pass the input string around, so it's not very complicated, the queue would be needed if we supported streams. This two will probably never happen, though, at least if it's up to me, since I never need them.

haltcase commented 6 years ago

Hey @nitely — I've used this library to implement glob, which has gone super well so far. Hoping to start up a chat about (?!...) groups as it's needed to support the !(...) glob pattern match form and it doesn't seem to work here yet.

import regex
echo "foo.nim".contains re"(?!boo)[^\/]*\.nim"
Invalid group. Unknown group type `!`

(?!boo)[^\/]*\.txt
^

Sounds like you're not too keen on adding this?

nitely commented 6 years ago

Hey @citycide, yep, I was hoping to never have to implement this :sweat_smile: . I think it can be done efficiently with a ring buffer (note for future me), but it'll take me a while to get to it. If you really need it then your best bet at the moment is PCRE :disappointed:

haltcase commented 6 years ago

It's cool, I think I'll leave it out for now and wait for it here since I'd rather stick with pure Nim. I'd also really like to keep usage of this library up because I think having a pure Nim regex engine is important.

I'll keep watch but thanks for everything so far, this is a great piece of work :+1:

timotheecour commented 3 years ago

@nitely this is one of the biggest usability limitations with pkg/regex at the moment; I often have to resort to ugly workarounds due to lack of lookaround assertions.

Can D's std/regex implementation help?

implementation: https://github.com/dlang/phobos/blob/d64b78e27d838110f2f27cc4e6a6bb6866bada25/std/regex/internal/backtracking.d#L605

docs https://dlang.org/library/std/regex.html

?=regex) | Zero-width lookahead assertion. Matches at a point where the subexpression regex could be matched starting from the current position.
-- | --
(?!regex) | Zero-width negative lookahead assertion. Matches at a point where the subexpression regex could not be matched starting from the current position.
(?<=regex) | Zero-width lookbehind assertion. Matches at a point where the subexpression regex could be matched ending at the current position (matching goes backwards).
(?<!regex) | Zero-width negative lookbehind assertion. Matches at a point where the subexpression regex could not be matched ending at the current position (matching goes backwards).
nitely commented 3 years ago

This is implemented, but limited to a single char (and set? idk). Implementing full support while keeping it linear time is too complicated. D implements a second regex engine --a backtracker that runs in exponential time-- to support this (+backreferences), and I don't want to do that.

I often have to resort to ugly workarounds due to lack of lookaround assertions.

Do you really need full lookaround support? supporting +, *, x|y, and {n,} within the lookaround expression is complicated, but multiple chars/sets is not.

timotheecour commented 3 years ago

Do you really need full lookaround support?

yes otherwise I wouldn't have emphasized it :) . I do realize this may not be simple to implement, but this is definitely useful; single char/set often falls short. Backtracking may not be so bad, if you don't pay for what you don't use; existing queries will remain as performant, some queries might require backtracking and be less performant, and that's ok.