osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License
63 stars 7 forks source link

Implement right contexts (lookahead) #37

Closed osa1 closed 2 years ago

osa1 commented 2 years ago

Fixes #29

osa1 commented 2 years ago

I think I will have to generate separate NFA/DFAs for right contexts. Currently in DFA simulation I use this to run right contexts:

// Similar to `simulate`, but does not keep track of the last match as we don't need "longest
// match" semantics and backtracking
fn simulate_right_ctx<A>(
    dfa: &DFA<StateIdx, A>,
    init: StateIdx,
    accept: StateIdx,
    mut char_indices: std::str::CharIndices,
) -> bool {
    if init == accept {
        return true;
    }

    let mut state = init;

    while let Some((_, char)) = char_indices.next() {
        match next(dfa, state, char) {
            None => {
                // Stuck
                return false;
            }
            Some(next_state) => {
                if next_state == accept {
                    return true;
                }

                state = next_state;
            }
        }
    }

    match next_end_of_input(dfa, state) {
        None => false,
        Some(next_state) => next_state == accept,
    }
}

In the generated code, for next above we will have a match state { ... } where in each alternative we will have a match char { ... }. These matches will duplicate the code for the main DFA next method, for each right context. That's a lot of duplication.

If we maintain separate DFAs for right contexts we can generate smaller code for next that doesn't have states of the main DFA.

osa1 commented 2 years ago

So one of the tests I'd added last time I worked on this is this:

lexer! {
    Lexer -> u32;

    // Per longest match we "a" should return 2, not 1
    'a' = 1,
    'a' > $ = 2,
}

let mut lexer = Lexer::new("a");
assert_eq!(next(&mut lexer), Some(Ok(2)));
assert_eq!(next(&mut lexer), None);

However as I think about this again now I realize that this is not a good idea. For this semantics we need to run all right contexts of a state, even after finding on that matches. I'm not sure if this semantics is useful, and it can certainly be slower than needed. Instead I think it should be good enough to try the rules in order and run semantic action of the first one that matches.

This means if there's a rule without a right context in the same state, then the ones with the right context will never be tried. We should probably start generating warnings in these cases, but maybe not in this PR.