osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License
63 stars 7 forks source link

Eliminate peeking in generated code #20

Closed osa1 closed 2 years ago

osa1 commented 2 years ago

I don't remember why I needed peeking, but DFA simulation does not need peeking, so generated code also shouldn't.

osa1 commented 2 years ago

Peekable is used to provide peek method on lexer handlers. However we should be able to eliminate uses of it in the generated code.

osa1 commented 2 years ago

Here's a test that fails if I use next instead of peek:

#[test]
fn failure_confusion_3_2() {
    lexer! {
        Lexer -> usize;

        $$ascii_lowercase+ = 1,
        ',' = 2,
    }

    let mut lexer = Lexer::new("f,");
    assert_eq!(lexer.next(), Some(Ok((0, 1, 1))));
    assert_eq!(lexer.next(), Some(Ok((1, 2, 2))));
    assert_eq!(lexer.next(), None);
}

The relevant part is handling of the input after seeing an ASCII lowercase:

self.__last_match = Some((self.__current_match_start, SEMANTIC_ACTION_1, self.__current_match_end));
match self.__iter.peek().copied() {
    None => match SEMANTIC_ACTION_1(self) {
        LexerAction::Continue => {
            self.__state = self.__initial_state;
        }
        LexerAction::Return(res) => {
            self.__state = self.__initial_state;
            let match_start = self.__current_match_start;
            self.__current_match_start = self.__current_match_end;
            return Some(Ok((match_start, res, self.__current_match_end)));
        }
    },
    Some((char_idx, char)) => match char {
        x if (x >= 'a' && x <= 'z') => {
            self.__current_match_end += x.len_utf8();
            let _ = self.__iter.next();
            self.__state = 1usize;
        }
        _ => match SEMANTIC_ACTION_1(self) {
            LexerAction::Continue => {
                self.__state = self.__initial_state;
            }
            LexerAction::Return(res) => {
                self.__state = self.__initial_state;
                let match_start = self.__current_match_start;
                self.__current_match_start = self.__current_match_end;
                return Some(Ok((match_start, res, self.__current_match_end)));
            }
        },
    },
}

Here we avoid advancing the iterator if we see anything other than $$ascii_lowercase and run the semantic action for the rule $$ascii_lowercase+ = 1.

If we use next instead of peek, we will have to use __last_match in the "stuck" case above to revert the iterator and run the semantic action, which is exactly what DFA simulation does.

I will try removing the peek now and run my benchmarks (not open source yet, sorry).