This issue is actually deeper than just a miscompilation of a regex to an NFA.

Currently, if the state machine is in an accepting state but it is also possible to transition to another state with the next character in the input, we make the transition and ignore the match.

In the repro above, after seeing 'z' we're in an accepting state, but we can also make progress with the next character (i.e. we don't get stuck on it), so to implement longest match we ignore the accepting state and move on to the next state using the second 'x'.

Instead, when we're in a state (or set of states, in the NFA), we need to know about all the possible matches that lead us to the current state, and if we cannot make any more progress (the state machine is stuck), we pick one of the matches from the list and run semantic actions.
Conceptually, the list of matches will look like this:

```rust
struct Match {
    match_start: usize,
    match_end: usize,
    semantic_action_idx: usize,
}
```

A value of this type represents a match in `input[match_start..match_end]`, and the semantic action that we need to run on this match if we decide to go with it.
Then, for a state in the DFA, we will need:

```rust
type MatchStack = Vec<Match>;
```
Finally, for an accepting state, we will need an ε transition to the beginning of the current user state (e.g. the initial state of the `Init` rule set).
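As a rough sketch of that last part, here's what adding these restart edges could look like; the `Nfa` representation and `add_restart_edges` below are hypothetical, just to make the construction concrete:

```rust
// Hypothetical NFA representation, only to illustrate the construction.
struct Nfa {
    initial: usize,
    accepting: Vec<Option<usize>>,  // Some(semantic_action_idx) if the state accepts
    epsilon_edges: Vec<Vec<usize>>, // edges taken without consuming input
}

// Add an ε edge from every accepting state back to the initial state of the
// current rule set, so that a new match can start right where one ends.
fn add_restart_edges(nfa: &mut Nfa) {
    for state in 0..nfa.accepting.len() {
        if nfa.accepting[state].is_some() {
            nfa.epsilon_edges[state].push(nfa.initial);
        }
    }
}
```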
After these changes, when we see 'z', we will have a match stack like:

```
[ Match { match_start: 0, match_end: 3, semantic_action_idx: 1 } ]
```

and the NFA states will be `{ initial state, state after 'z' in first regex }`.
When we see 'a', the match stack will look like:

```
[ Match { match_start: 0, match_end: 3, semantic_action_idx: 1 },
  Match { match_start: 3, match_end: 6, semantic_action_idx: 2 } ]
```
Since we've reached end-of-input (i.e. we can't make progress), we need to scan the match stack, find matches that connect (if we pick a match `N..M`, the next match needs to be `M..`), and run the semantic actions.
I'm not sure if searching the matches will be efficient. The match stack will be sorted on `match_start`, so we could do a binary search to find the next potential match, but there will still be some backtracking involved (e.g. when multiple matches have the same `match_start`).
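To make the search concrete, here's a minimal sketch assuming the `Match` struct above; the `find_chain` helper and its signature are hypothetical and only illustrate the binary search plus the backtracking step:

```rust
// `Match` as defined above.
struct Match {
    match_start: usize,
    match_end: usize,
    semantic_action_idx: usize,
}

// Try to find a sequence of matches (returned as indices into `stack`) that
// covers `input[start..end]`, where each match starts where the previous one
// ended. `stack` is assumed to be sorted on `match_start`, and matches are
// assumed to be non-empty.
fn find_chain(stack: &[Match], start: usize, end: usize) -> Option<Vec<usize>> {
    if start == end {
        return Some(Vec::new());
    }
    // Binary search for the first match with `match_start == start`.
    let first = stack.partition_point(|m| m.match_start < start);
    for (offset, m) in stack[first..].iter().enumerate() {
        if m.match_start != start {
            break;
        }
        // Backtracking: if extending the chain from `m.match_end` fails, try
        // the next match with the same `match_start`.
        if let Some(mut rest) = find_chain(stack, m.match_end, end) {
            rest.insert(0, first + offset);
            return Some(rest);
        }
    }
    None
}
```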
Secondly, there's a weird interaction between semantic actions and this behavior of maintaining a list of matches: in the original repro, if I had `"xyz" => |lexer| lexer.switch_(...)` and the input was still "xyzxya", the switch would happen after scanning the whole string, and in the new state we would have to backtrack in the input to scan the suffix "xya" again.

Given that semantic actions are Turing-complete programs and we cannot analyse them to find which states they can switch to, we can't do anything about this.
Here's an ocamllex program with this behavior:
```ocaml
{ }

rule rule1 = parse
| "xyzxyz" { print_string "1\n"; rule1 lexbuf }
| "xyz" { print_string "2\n"; rule2 lexbuf }
| "xya" { print_string "3\n"; rule1 lexbuf }
| eof { () }

and rule2 = parse
| "xya" { print_string "4\n"; exit 0 }

{
  rule1 (Lexing.from_string "xyzxya")
}
```
Here's another repro:
```rust
// copy helpers from original repro

lexer! {
    Lexer -> &'input str;

    'a'+ 'b' => return_match,
    'a' => return_match,
}

fn main() {
    let mut lexer = Lexer::new("aaaab");
    assert_eq!(next(&mut lexer), Some(Ok("aaaab"))); // OK
    assert_eq!(next(&mut lexer), None); // OK

    let mut lexer = Lexer::new("aaaa");
    assert_eq!(next(&mut lexer), Some(Ok("a"))); // Fails, lexer returns error
}
```
Interestingly, I tried this with logos and it returns an error in the second case above. I reported it in https://github.com/maciejhirsz/logos/issues/227.
ocamllex handles this as expected:
```ocaml
{ }

rule rule1 = parse
| 'a'+ 'b' { print_string "1\n"; rule1 lexbuf }
| 'a' { print_string "2\n"; rule1 lexbuf }
| eof { () }

{
  rule1 (Lexing.from_string "aaaaaa");
  print_string "---\n";
  rule1 (Lexing.from_string "aaaaab");
}
```
alex also handles it as expected:
```
-- Lexer.x
{
{-# LANGUAGE ScopedTypeVariables #-}
module Lexer where

import Debug.Trace
}

%wrapper "basic"

tokens :-

a+ b { Token1 }
a    { Token2 }

{
data Token
  = Token1 String
  | Token2 String
  deriving (Eq, Show)
}
```

```haskell
-- Main.hs
import Lexer

main = do
  print (alexScanTokens "aaaaab")
  putStrLn "---------"
  print (alexScanTokens "aaaaaa")
```
It seems like ocamllex implements this with backtracking. For the program above, if I add some prints to the generated code and try it with "aaaaaaaa", I see this output:
```
state 4 - a seen, curr=3, last=1
state 4 - a seen, curr=4, last=1
state 4 - a seen, curr=5, last=1
state 4 - a seen, curr=6, last=1
state 4 - a seen, curr=7, last=1
state 4 - a seen, curr=8, last=1
EOF reached in state 4, curr=1, last=1
state 4 - a seen, curr=4, last=2
state 4 - a seen, curr=5, last=2
state 4 - a seen, curr=6, last=2
state 4 - a seen, curr=7, last=2
state 4 - a seen, curr=8, last=2
EOF reached in state 4, curr=2, last=2
state 4 - a seen, curr=5, last=3
state 4 - a seen, curr=6, last=3
state 4 - a seen, curr=7, last=3
state 4 - a seen, curr=8, last=3
EOF reached in state 4, curr=3, last=3
state 4 - a seen, curr=6, last=4
state 4 - a seen, curr=7, last=4
state 4 - a seen, curr=8, last=4
EOF reached in state 4, curr=4, last=4
state 4 - a seen, curr=7, last=5
state 4 - a seen, curr=8, last=5
EOF reached in state 4, curr=5, last=5
state 4 - a seen, curr=8, last=6
EOF reached in state 4, curr=6, last=6
EOF reached in state 4, curr=7, last=7
```
So it handles one match at a time, and scans the prefix again every time we see EOF instead of 'b'.
Output when I run it on "aaaaaaab":
```
state 4 - a seen, curr=3, last=1
state 4 - a seen, curr=4, last=1
state 4 - a seen, curr=5, last=1
state 4 - a seen, curr=6, last=1
state 4 - a seen, curr=7, last=1
state 4 - b seen, curr=8, last=1
```
I started fixing this in the `backtracking` branch. So far I've implemented backtracking NFA simulation. Added the `'a'+ 'b'` / `'a'` example above as a test and it works.
I'm not sure whether we want to do backtracking in the generated code, but it's good enough for the NFA and DFA simulations. I'm guessing that if it's good enough for ocamllex it could be good enough for us too, so maybe I will also implement backtracking in the generated code.
Here's an observation: we only ever need to backtrack one step, to the most recent accepting state; we never need to keep more than one backtrack point around.
To see why, we need to think in terms of state machines (NFA or DFA, it doesn't matter) instead of regexes or rule sets. When we are in an accepting state but can make progress with the next character, we have two options: accept and reset the machine to the initial state, or continue. To implement the "longest match" rule, we need to continue. If we continue and lexing fails, we need to run the semantic action of the accepting state we skipped and reset the state machine. At that point we've yielded a token and we're ready to continue lexing, and there's no more backtracking to do.

So in the cases where we skip an accepting state because we can potentially return a longer match, we will either see a new accepting state, or run the semantic action of the previous accepting state. If we see a new accepting state we do the same (skip it, return to it in case of a failure). If we don't see a new accepting state, then we return to the previous one.
What this means is we don't need a stack; we just need a variable/argument for the "last accepting state". It needs to be optional, and to hold the matched substring and the semantic action function reference/index.
"semantic action function" we need #8 for this.
Interestingly, we have very similar examples that work fine. For example, the Lua lexer has regexes for "elseif" and "else".