iter-tools / regex

A streaming regex evaluation engine
MIT License
11 stars 1 forks source link

Feature: lookahead (?=) (?!) #11

Open conartist6 opened 3 years ago

conartist6 commented 3 years ago

Support lookahead assertions. Given a lookahead expression and a contingent sequence, here's what has to happen:

Make a lookahead state at the engine level. It will essentially be a normal expression state of [lookahead, contingent] with a bit of special sauce in handling success and failure.

conartist6 commented 2 years ago

I'm going to build this. I hadn't looked at the code in long enough that I forgot how some of it works. I took it as an opportunity to improve the docs a bit as I rediscovered where I put the important logic. I'm confident once again that I know how to do what this issue describes, and I'll do it just as soon as the macrome generated artifacts are stable again (shifted from timestamps to hashes)

conartist6 commented 2 years ago

This will be significantly simpler once #35 lands I think.

conartist6 commented 2 years ago

I'm ready to implement this, and I'm less likely to mess it up since I just spent so much time rewriting and debugging the engine.

conartist6 commented 2 years ago

The architectural question is how to look at expressions when succeeding or failing. Currently success and failure are the responsibility of sequences, and expressions are assumed to be safe to prune when they no longer are needed to hold a list.

Currently there are two places in the code that are relevant. In remove a failing sequence may replace its parent if it has no siblings, which would be incorrect if its parent is a lookahead. https://github.com/iter-tools/regex/blob/58a456b495f6d308b19a2341e1cdc28134722356/lib/internal/engine.ts#L122 A succeeding sequence is promoted, which may allow it to eliminate expressions: https://github.com/iter-tools/regex/blob/58a456b495f6d308b19a2341e1cdc28134722356/lib/internal/engine.ts#L273-L287

conartist6 commented 2 years ago

The question will be, do I want a recursive solution, which would allow me to program the behavior as a a method overload, or should I try to avoid recursion, as I am unsure how deep the state tree may be?

A middle ground may be for a lookahead to be a subclass of match, say SpeculativeMatch extends Match. This would make it easier to quickly check for lookaheads, as matches have their own linkages. I think that's probably the thing to do.

conartist6 commented 2 years ago

If I do use a SpeculativeMatch it will need some way to know that the lookahead has succeeded. This will require support in regex.ts which currently would only ever return a success result when the entire pattern is terminated. If lookahead is a kind of match though, it would need a success result to be returned or else it would not be possible for the lookahead to succeed!

conartist6 commented 2 years ago

I've been assuming I need engine-level support for this, but is that really true?

It's not too hard to create an endLookahead matcher which will fail when its contents succeed. What I guess isn't possible (?) is to do the other thing, to fail the whole expression if the sequence fails. When a sequence fails the chain of matchers is broken, and regex.ts has no more control over what happens. So yes, I think it is true.

conartist6 commented 2 years ago

if I make implement a SpeculativeMatch type in engine how will that impact the chaining of matchers? What is next in the chain after a lookahead, and how is that going to be called?

conartist6 commented 2 years ago

I'll have a next -- a reference to the thing that follows the sequence. But that will already be used immediately because it'll be one of the two sequences held (initially) by the speculative match. So it will indeed be safe to just ignore the next reference when causing a speculative match to succeed

conartist6 commented 2 years ago

A failure of the contingent sequence will also need to erase the speculative sequence. How will we know when to do this? The conditions are slightly different -- the lookahead sequence may still be active, which is to say that a SpeculativeMatch fails when it has one remaining sequence vs a normal Match which fails when it has 0

conartist6 commented 2 years ago

I think I want a SpeculativeSequence and a SpeculativeExpression. This way when a speculative match fails it will be a sequence failing that can overload methods to do the correct work.

conartist6 commented 2 years ago

It's time! I need this feature myself, and am building it.

conartist6 commented 2 years ago

Current problem: captures. First, I don't really remember entirely how what I built works, and I'm not 100% sure it's right.

Second: lookahead has some really weird cases. Evaluating something like /(a)(?=(b))/.exec('abc') I actually need to be able to insert the capture group 2 result (b) into the captures after the rest of the pattern has already succeeded!

conartist6 commented 2 years ago

For an execution like /(a)(?=.(c))(b)/.exec('abc') I need the captures to be a, c, b, which is tricky because the order we'll actually capture in is a, b, c.

conartist6 commented 2 years ago

I think I've duplicated part of the engine's data structure in the stack-of-lists data structure I use for captures.

conartist6 commented 2 years ago

OK, that's fixed. Captures themselves no longer contain parentList, which is now part of captureStack