osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Regex compilation bug when prefix is shared by regexes #16

Closed osa1 closed 2 years ago

osa1 commented 3 years ago
use lexgen::lexer;

fn ignore_pos<A, E>(ret: Option<Result<(usize, A, usize), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

fn next<A, E>(
    iter: &mut dyn Iterator<Item = Result<(usize, A, usize), E>>,
) -> Option<Result<A, E>> {
    ignore_pos(iter.next())
}

fn return_match<'lexer, 'input>(lexer: LexerHandle<'lexer, 'input>) -> LexerAction<&'input str> {
    let match_ = lexer.match_();
    lexer.return_(match_)
}

lexer! {
    Lexer -> &'input str;

    "xyzxyz" => return_match,
    "xyz" => return_match,
    "xya" => return_match,
}

fn main() {
    let mut lexer = Lexer::new("xyzxya");
    assert_eq!(next(&mut lexer), Some(Ok("xyz")));  // fails
    assert_eq!(next(&mut lexer), Some(Ok("xya")));
    assert_eq!(next(&mut lexer), None);
}

Interestingly, we have very similar examples that work fine. For example, the Lua lexer has regexes for "elseif" and "else".

osa1 commented 3 years ago

This issue is actually deeper than just a miscompilation of regex to NFA.

Currently, if the state machine is in an accepting state but it's also possible to transition to another state from the current state with the next character in the input, we do the transition and ignore the match.

In the repro above, after seeing 'z', we're in an accepting state, but we can also make progress with the next character (i.e. we don't get stuck with the next character), so to implement longest match, we ignore the accepting state, and move on to the next state using the second 'x'.

Instead, when we're in a state (or a set of states, in the NFA), we need to know about all the possible matches that lead us to the current state, and if we cannot make any more progress (the state machine is stuck), we pick one of the matches from the list and run its semantic action.

Conceptually, the list of matches will look like this:

struct Match {
    match_start: usize,
    match_end: usize,
    semantic_action_idx: usize,
}

A value of this type represents a match in input[match_start..match_end], and the semantic action to run on this match if we decide to go with it.

Then, for a state in the DFA, we will need:

type MatchStack = Vec<Match>;

Finally, for an accepting state, we will need an ε-transition to the beginning of the current user state (e.g. the initial state of the Init rule set).

After these changes, when we see 'z', we will have a match stack like:

[ Match { match_start: 0, match_end: 3, semantic_action_idx: 1 } ]

and the NFA states will be { initial state, state after 'z' in first regex }.

When we see 'a', the match stack will look like:

[ Match { match_start: 0, match_end: 3, semantic_action_idx: 1 }, Match { match_start: 3, match_end: 6, semantic_action_idx: 2 } ]

Since we've reached end-of-input (i.e. we can't make any more progress), we need to scan the match stack, find matches that connect (if we pick a match N..M, the next match needs to start at M), and run the semantic actions.
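
The scan described above can be sketched as follows. This is a standalone illustration with hypothetical names (`find_chain` is my own helper, not lexgen code): given a match stack sorted on match_start, find a chain of matches where each one starts where the previous one ended, backtracking when several matches share a start position.

```rust
// Illustration only, not lexgen's actual code.
#[derive(Debug, Clone, Copy)]
struct Match {
    match_start: usize,
    match_end: usize,
    semantic_action_idx: usize,
}

/// Returns the semantic action indices of a chain of matches covering
/// `0..len`, or `None` if no such chain exists. `matches` is assumed to
/// be sorted on `match_start`.
fn find_chain(matches: &[Match], len: usize) -> Option<Vec<usize>> {
    fn go(matches: &[Match], pos: usize, len: usize, acc: &mut Vec<usize>) -> bool {
        if pos == len {
            return true;
        }
        // Backtracking point: multiple matches may share the same match_start.
        for m in matches.iter().filter(|m| m.match_start == pos) {
            acc.push(m.semantic_action_idx);
            if go(matches, m.match_end, len, acc) {
                return true;
            }
            acc.pop();
        }
        false
    }
    let mut acc = Vec::new();
    go(matches, 0, len, &mut acc).then(|| acc)
}

fn main() {
    // The match stack from the "xyzxya" example above.
    let stack = [
        Match { match_start: 0, match_end: 3, semantic_action_idx: 1 },
        Match { match_start: 3, match_end: 6, semantic_action_idx: 2 },
    ];
    assert_eq!(find_chain(&stack, 6), Some(vec![1, 2]));
    assert_eq!(find_chain(&stack, 5), None); // no chain ends at position 5
}
```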

I'm not sure if searching in matches will be efficient. The match stack will be sorted on match_start, so we could do a binary search to find the next potential match, but there will still be some backtracking involved (e.g. when multiple matches have the same match_start).

Secondly, there's a weird interaction between semantic actions and this approach of maintaining a list of matches: in the original repro, if I had "xyz" => |lexer| lexer.switch_(...) and the input was still "xyzxya", the switch would happen after scanning the whole string, and in the new state we would have to backtrack in the input to scan the suffix "xya" again.

Given that semantic actions are Turing-complete programs and we cannot analyse them to find which states they can switch to, we can't do anything about this.

Here's an ocamllex program with this behavior:

{ }

rule rule1 = parse
  | "xyzxyz" { print_string "1\n"; rule1 lexbuf }
  | "xyz"    { print_string "2\n"; rule2 lexbuf }
  | "xya"    { print_string "3\n"; rule1 lexbuf }
  | eof      { () }

and rule2 = parse
  | "xya"    { print_string "4\n"; exit 0 }

{
  rule1 (Lexing.from_string "xyzxya")
}
osa1 commented 3 years ago

Here's another repro:

// copy helpers from original repro

lexer! {
    Lexer -> &'input str;

    'a'+ 'b' => return_match,
    'a' => return_match,
}

fn main() {
    let mut lexer = Lexer::new("aaaab");
    assert_eq!(next(&mut lexer), Some(Ok("aaaab"))); // OK
    assert_eq!(next(&mut lexer), None); // OK

    let mut lexer = Lexer::new("aaaa");
    assert_eq!(next(&mut lexer), Some(Ok("a"))); // Fails, lexer returns error
}

Interestingly, I tried this with logos and it returns an error in the second case above. I reported it in https://github.com/maciejhirsz/logos/issues/227.

ocamllex handles this as expected:

{ }

rule rule1 = parse
  | 'a'+ 'b' { print_string "1\n"; rule1 lexbuf }
  | 'a'      { print_string "2\n"; rule1 lexbuf }
  | eof      { () }

{
  rule1 (Lexing.from_string "aaaaaa");
  print_string "---\n";
  rule1 (Lexing.from_string "aaaaab");
}

alex also handles it as expected:

-- Lexer.x
{
{-# LANGUAGE ScopedTypeVariables #-}

module Lexer where

import Debug.Trace
}

%wrapper "basic"

tokens :-

  a+ b  { Token1 }
  a     { Token2 }

{
data Token
 = Token1 String
 | Token2 String
 deriving (Eq, Show)
}

-- Main.hs
import Lexer

main = do
    print (alexScanTokens "aaaaab")
    putStrLn "---------"
    print (alexScanTokens "aaaaaa")
osa1 commented 3 years ago

It seems like ocamllex implements this with backtracking. For the program above, if I add some prints to the generated code and try it with "aaaaaaaa", I see this output:

state 4 - a seen, curr=3, last=1
state 4 - a seen, curr=4, last=1
state 4 - a seen, curr=5, last=1
state 4 - a seen, curr=6, last=1
state 4 - a seen, curr=7, last=1
state 4 - a seen, curr=8, last=1
EOF reached in state 4, curr=1, last=1
state 4 - a seen, curr=4, last=2
state 4 - a seen, curr=5, last=2
state 4 - a seen, curr=6, last=2
state 4 - a seen, curr=7, last=2
state 4 - a seen, curr=8, last=2
EOF reached in state 4, curr=2, last=2
state 4 - a seen, curr=5, last=3
state 4 - a seen, curr=6, last=3
state 4 - a seen, curr=7, last=3
state 4 - a seen, curr=8, last=3
EOF reached in state 4, curr=3, last=3
state 4 - a seen, curr=6, last=4
state 4 - a seen, curr=7, last=4
state 4 - a seen, curr=8, last=4
EOF reached in state 4, curr=4, last=4
state 4 - a seen, curr=7, last=5
state 4 - a seen, curr=8, last=5
EOF reached in state 4, curr=5, last=5
state 4 - a seen, curr=8, last=6
EOF reached in state 4, curr=6, last=6
EOF reached in state 4, curr=7, last=7

So it handles one match at a time, and scans the prefix again every time we see EOF instead of 'b'.
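
A back-of-the-envelope consequence of this strategy (my own arithmetic, not anything ocamllex documents): if the input is n 'a's and, as in the trace above, each token forces a rescan from its start position to end-of-input, the total number of characters scanned is quadratic in n.

```rust
// Sketch of the scan count under one-match-at-a-time backtracking
// (my own arithmetic, not ocamllex code): token i (0-based) starts at
// offset i and is rescanned to end-of-input, so the total is
// n + (n-1) + ... + 1 = n * (n + 1) / 2.
fn chars_scanned(n: usize) -> usize {
    (0..n).map(|i| n - i).sum()
}

fn main() {
    assert_eq!(chars_scanned(3), 3 + 2 + 1);
    assert_eq!(chars_scanned(8), 36); // the 8-character trace above
    // Quadratic growth: doubling the input roughly quadruples the work.
    assert!(chars_scanned(32_000) / chars_scanned(16_000) >= 3);
}
```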

Output when I run it on "aaaaaaab":

state 4 - a seen, curr=3, last=1
state 4 - a seen, curr=4, last=1
state 4 - a seen, curr=5, last=1
state 4 - a seen, curr=6, last=1
state 4 - a seen, curr=7, last=1
state 4 - b seen, curr=8, last=1
ocamllex-generated code with prints:

```ocaml
# 1 "lexer.mll"

# 4 "lexer.ml"
let rec __ocaml_lex_refill_buf lexbuf _buf _len _curr _last =
  if lexbuf.Lexing.lex_eof_reached then
    256, _buf, _len, _curr, _last
  else begin
    lexbuf.Lexing.lex_curr_pos <- _curr;
    lexbuf.Lexing.lex_last_pos <- _last;
    lexbuf.Lexing.refill_buff lexbuf;
    let _curr = lexbuf.Lexing.lex_curr_pos in
    let _last = lexbuf.Lexing.lex_last_pos in
    let _len = lexbuf.Lexing.lex_buffer_len in
    let _buf = lexbuf.Lexing.lex_buffer in
    if _curr < _len then
      Char.code (Bytes.unsafe_get _buf _curr), _buf, _len, (_curr + 1), _last
    else
      __ocaml_lex_refill_buf lexbuf _buf _len _curr _last
  end

let rec __ocaml_lex_state4 lexbuf _last_action _buf _len _curr _last =
  let next_char, _buf, _len, _curr, _last =
    if _curr >= _len then
      __ocaml_lex_refill_buf lexbuf _buf _len _curr _last
    else
      Char.code (Bytes.unsafe_get _buf _curr), _buf, _len, (_curr + 1), _last
  in
  begin match next_char with
  (* |'a' *)
  | 97 ->
      Printf.printf "state 4 - a seen, curr=%d, last=%d\n" _curr _last;
      __ocaml_lex_state4 lexbuf _last_action _buf _len _curr _last
  (* |'b' *)
  | 98 ->
      Printf.printf "state 4 - b seen, curr=%d, last=%d\n" _curr _last;
      (* *)
      lexbuf.Lexing.lex_curr_pos <- _curr;
      lexbuf.Lexing.lex_last_pos <- _last;
      0
  | _ ->
      let _curr = _last in
      Printf.printf "EOF reached in state 4, curr=%d, last=%d\n" _curr _last;
      lexbuf.Lexing.lex_curr_pos <- _curr;
      lexbuf.Lexing.lex_last_pos <- _last;
      _last_action
  end

let rec rule1 lexbuf =
  let __ocaml_lex_result =
    let _curr = lexbuf.Lexing.lex_curr_pos in
    let _last = _curr in
    let _len = lexbuf.Lexing.lex_buffer_len in
    let _buf = lexbuf.Lexing.lex_buffer in
    let _last_action = -1 in
    lexbuf.Lexing.lex_start_pos <- _curr;
    let next_char, _buf, _len, _curr, _last =
      if _curr >= _len then
        __ocaml_lex_refill_buf lexbuf _buf _len _curr _last
      else
        Char.code (Bytes.unsafe_get _buf _curr), _buf, _len, (_curr + 1), _last
    in
    begin match next_char with
    (* |'a' *)
    | 97 ->
        (* *)
        let _last = _curr in
        let _last_action = 1 in
        let next_char, _buf, _len, _curr, _last =
          if _curr >= _len then
            __ocaml_lex_refill_buf lexbuf _buf _len _curr _last
          else
            Char.code (Bytes.unsafe_get _buf _curr), _buf, _len, (_curr + 1), _last
        in
        begin match next_char with
        (* |'a' *)
        | 97 ->
            __ocaml_lex_state4 lexbuf 1 (* = last_action *) _buf _len _curr _last
        (* |'b' *)
        | 98 ->
            (* *)
            lexbuf.Lexing.lex_curr_pos <- _curr;
            lexbuf.Lexing.lex_last_pos <- _last;
            0
        | _ ->
            let _curr = _last in
            lexbuf.Lexing.lex_curr_pos <- _curr;
            lexbuf.Lexing.lex_last_pos <- _last;
            1 (* = last_action *)
        end
    (* |eof *)
    | 256 ->
        (* *)
        lexbuf.Lexing.lex_curr_pos <- _curr;
        lexbuf.Lexing.lex_last_pos <- _last;
        2
    | _ ->
        let _curr = _last in
        lexbuf.Lexing.lex_curr_pos <- _curr;
        lexbuf.Lexing.lex_last_pos <- _last;
        _last_action
    end
  in
  begin
    let _curr_p = lexbuf.Lexing.lex_curr_p in
    if _curr_p != Lexing.dummy_pos then begin
      lexbuf.Lexing.lex_start_p <- _curr_p;
      lexbuf.Lexing.lex_curr_p <-
        {_curr_p with Lexing.pos_cnum = lexbuf.Lexing.lex_abs_pos + lexbuf.Lexing.lex_curr_pos}
    end
  end;
  match __ocaml_lex_result with
  | 0 ->
# 4 "lexer.mll"
      ( rule1 lexbuf )
# 119 "lexer.ml"
  | 1 ->
# 5 "lexer.mll"
      ( rule1 lexbuf )
# 124 "lexer.ml"
  | 2 ->
# 6 "lexer.mll"
      ( () )
# 129 "lexer.ml"
  | _ -> raise (Failure "lexing: empty token")
;;

# 8 "lexer.mll"
(*
let time f x =
  let start = Unix.gettimeofday () in
  let res = f x in
  let stop = Unix.gettimeofday () in
  let () = Printf.printf "%fs\n%!" (stop -. start) in
  res
in
let str0 = String.make 10000 'a' in
let str1 = String.make 20000 'a' in
let str2 = String.make 40000 'a' in
let str3 = String.make 80000 'a' in
let str4 = String.make 160000 'a' in
let str5 = String.make 320000 'a' in
time (fun () -> rule1 (Lexing.from_string str0)) ();
time (fun () -> rule1 (Lexing.from_string str1)) ();
time (fun () -> rule1 (Lexing.from_string str2)) ();
time (fun () -> rule1 (Lexing.from_string str3)) ();
time (fun () -> rule1 (Lexing.from_string str4)) ();
time (fun () -> rule1 (Lexing.from_string str5)) ()
*)
rule1 (Lexing.from_string "aaaaaaab")
# 164 "lexer.ml"
```
osa1 commented 3 years ago

I started fixing this in the backtracking branch. So far I've implemented backtracking NFA simulation. I added the 'a'+ 'b' / 'a' example above as a test and it works.

I'm not sure yet whether we want to do backtracking in the generated code, but it's good enough for the NFA and DFA simulations. I'm guessing that if it's good enough for ocamllex, it could be good enough for us too, so maybe I will also implement backtracking in the generated code.

osa1 commented 2 years ago

Here's an observation: we only ever need one pending backtrack; we never need to backtrack more than once before yielding a token.

To see why, we need to think in terms of state machines (NFA or DFA, it doesn't matter) instead of regexes or rule sets. When we reach an accepting state but can still make progress with the next character, we have two options: accept and reset the machine to the initial state, or continue. To implement the "longest match" rule, we need to continue. If we continue and lexing fails, we run the semantic action of the accepting state we skipped and reset the state machine. At that point we've yielded a token and we're ready to continue lexing, and there's no more backtracking to do.

So in the cases where we skip an accepting state because we could potentially return a longer match, we will either see a new accepting state, or run the semantic action of the previous accepting state. If we see a new accepting state, we do the same (skip it, and return to it in case of a failure). If we don't see a new accepting state, then we return to the previous one.

What this means is that we don't need a stack; we just need a variable/argument for the "last accepting state". It needs to be optional, and it needs to hold the matched substring and the semantic action function reference/index.
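
This idea can be sketched with a minimal hand-rolled lexer for the 'a'+ 'b' / 'a' rules from the earlier repro. This is an illustration only (my own code, not lexgen's generated code): a single optional last_match variable replaces the match stack, and we backtrack at most once per token.

```rust
// Illustration of the "last accepting state" scheme, not lexgen output.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    AsB(&'a str), // matched 'a'+ 'b'
    A(&'a str),   // matched 'a'
}

/// Lexes `input` with rules `'a'+ 'b'` and `'a'`, returning the tokens
/// or the offset where lexing got stuck with no pending match.
fn tokenize(input: &str) -> Result<Vec<Token<'_>>, usize> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < bytes.len() {
        // Last accepting state seen: (match end, rule index), if any.
        let mut last_match: Option<(usize, u8)> = None;
        let mut pos = start;
        // State 0: initial; state 1: seen 'a'+.
        let mut state = 0u8;
        loop {
            match (state, bytes.get(pos).copied()) {
                (0, Some(b'a')) => {
                    pos += 1;
                    state = 1;
                    last_match = Some((pos, 1)); // rule 'a' accepts here
                }
                (1, Some(b'a')) => pos += 1, // keep going, hoping for 'b'
                (1, Some(b'b')) => {
                    pos += 1;
                    last_match = Some((pos, 0)); // rule 'a'+ 'b' accepts
                    break;
                }
                // Stuck or end-of-input: fall back to the last match.
                _ => break,
            }
        }
        match last_match {
            Some((end, 0)) => {
                tokens.push(Token::AsB(&input[start..end]));
                start = end;
            }
            Some((end, _)) => {
                // The single backtrack: resume right after the match,
                // rescanning any 'a's consumed past it.
                tokens.push(Token::A(&input[start..end]));
                start = end;
            }
            None => return Err(start),
        }
    }
    Ok(tokens)
}

fn main() {
    assert_eq!(tokenize("aaab"), Ok(vec![Token::AsB("aaab")]));
    assert_eq!(
        tokenize("aaa"),
        Ok(vec![Token::A("a"), Token::A("a"), Token::A("a")])
    );
    assert_eq!(tokenize("ba"), Err(0)); // no rule matches 'b' first
}
```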

osa1 commented 2 years ago

> It needs to be optional, and hold the matched substring and semantic action function reference/index.

For the "semantic action function" part, we need #8.