iter-tools / regex

A streaming regex evaluation engine
MIT License
11 stars 1 forks source link

Global execution is broken #3

Closed conartist6 closed 3 years ago

conartist6 commented 3 years ago

I forgot the /ab|a/g problem: The first iteration of a pattern may consume characters that are not part of its match, and those characters must be available to participate in the next match. I think the fix will be relatively easy in my current setup.

One way to think about it (and potentially implement it) is to change the non-capturing root of the pattern from .*?(pattern) to (.*?(pattern))+. The engine can already do this so it already has a solution to the problem, I just need to figure out how to apply it correctly so that I get the results as they match.

conartist6 commented 3 years ago

I definitely think I need to make that change to the root of the pattern, because .*? terminates the states we will need when any match is made.

conartist6 commented 3 years ago

Also any strategy other than changing the pattern will probably fail to terminate matches that aught to be terminated, i.e. matches that would contain some characters that were part of the previous match.

conartist6 commented 3 years ago

My brain hurts so I'm going to think in specifics. We can toss the .* for a second, since we're only considering cases where matches overlap and it would match nothing. (ab|a)+ works because a match on a creates a terminal state for ab|a. Since the pattern is its own next state, we queue a state that starts matching the pattern anew. When a second a is encountered b is rejected, the first pattern succeeds as a whole, and the second a also continues in the second match.

So this is all about what happens at pattern termination (of course it is). What we really want is to queue the pattern itself up as its own next state. This would mean that a success state must be able to hold an expression reference, and it must be traversable. It must also still have the tendency to "bubble" upwards through the tree as worse alternatives fail. Then when finally it reaches the root of the tree and is accepted as a match, it can be replaced with the expression.

I think this means it makes sense not to change the root pattern, just to implement these changes to the success state. It would also be ideal if non-global execution didn't trigger this behavior.

conartist6 commented 3 years ago

Currently I've dropped the global flag in favor of an execGlobal method, but it's obvious that it would be simpler to implement this logic in a way that is applied only when necessary if the pattern itself could have a global flag. But the reasons for separate exec and execGlobal methods still exist too -- I don't want the pattern to change the shape of the output (so I would always need an iterator of results) but I also want to make sure there's a version that doesn't force the user to deal in iterators (particularly because that is the api of the native exec).

Does this mean I should leave both mechanisms? The worse of it would be when a non-global pattern was passed to execGlobal. In this case I could:

I'd say there is precedent for the "quietly yield only one" approach, as that's how non-global regexes work for splitting or replacing. We'll want to offer those methods eventually, and the non-global regexes are potentially useful there: while it's easy enough to take only one result from an iterator provided by exec there will be no such facility inside a split or replace method. (And we will want to offer those methods)