osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Allow initializing lexers with a char iterator #42

Closed by osa1 2 years ago

osa1 commented 2 years ago

This fixes #41 in an almost backwards-compatible way. Generated lexers now have an extra constructor:

impl<I: Iterator<Item = char> + Clone> Lexer<'static, I> {
    fn new_from_iter(iter: I) -> Self {
        Lexer(::lexgen_util::Lexer::new_from_iter(iter))
    }
}

The API of the generated lexers is otherwise exactly the same. However, if a lexer is constructed with new_from_iter instead of new or new_with_state, the match_ method will panic at runtime. This is because a lexer constructed with new_from_iter does not have the input string, so it cannot return a slice of it. Use match_loc instead to get the start and end locations of the current match.
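To illustrate why match_ can only work when the input string is available, here is a minimal toy model. It is not lexgen's generated code: ToyLexer, next_word, and the byte-offset return type of match_loc are all made up for this sketch. Only the constructor names and the match_ / match_loc distinction come from the description above.

```rust
use std::str::Chars;

// Toy stand-in for a generated lexer. NOT lexgen's implementation; all
// fields and the `next_word` method are hypothetical.
struct ToyLexer<'input, I: Iterator<Item = char> + Clone> {
    // `Some` only when the lexer was constructed from a string slice.
    input: Option<&'input str>,
    iter: I,
    // Byte offsets of the current match.
    match_start: usize,
    match_end: usize,
}

impl<'input> ToyLexer<'input, Chars<'input>> {
    fn new(input: &'input str) -> Self {
        ToyLexer { input: Some(input), iter: input.chars(), match_start: 0, match_end: 0 }
    }
}

impl<I: Iterator<Item = char> + Clone> ToyLexer<'static, I> {
    fn new_from_iter(iter: I) -> Self {
        ToyLexer { input: None, iter, match_start: 0, match_end: 0 }
    }
}

impl<'input, I: Iterator<Item = char> + Clone> ToyLexer<'input, I> {
    // Consume characters while `pred` holds, peeking via a clone of the iterator.
    fn eat_while(&mut self, pred: impl Fn(char) -> bool) {
        loop {
            let mut peek = self.iter.clone();
            match peek.next() {
                Some(c) if pred(c) => {
                    self.match_end += c.len_utf8();
                    self.iter = peek;
                }
                _ => break,
            }
        }
    }

    // Skip whitespace, then match a run of non-whitespace characters.
    // Returns false when there is nothing left to match.
    fn next_word(&mut self) -> bool {
        self.eat_while(|c| c.is_whitespace());
        self.match_start = self.match_end;
        self.eat_while(|c| !c.is_whitespace());
        self.match_end > self.match_start
    }

    // Always available: only needs the offsets tracked while iterating.
    fn match_loc(&self) -> (usize, usize) {
        (self.match_start, self.match_end)
    }

    // Needs the original string, so it panics for iterator-backed lexers.
    fn match_(&self) -> &'input str {
        let input = self
            .input
            .expect("match_ called on a lexer constructed with new_from_iter");
        &input[self.match_start..self.match_end]
    }
}
```

In this sketch, a lexer built with ToyLexer::new can use both match_ and match_loc, while one built with ToyLexer::new_from_iter can only use match_loc.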

The only breaking change is that the generated types now have one more generic argument, for the iterator type. So for a lexer like:

lexer! {
    MyLexer -> MyToken;
    ...
}

Instead of

struct MyLexer<'input>(...);

we now generate

struct MyLexer<'input, I: Iterator<Item = char> + Clone>(...);

So any code that refers to the lexer type will break.
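As a rough sketch of how downstream code might adapt, the example below uses a hand-written stand-in for the generated type. The struct body, the StrLexer alias, and both helper functions are hypothetical, and the use of std::str::Chars as the iterator type of string-backed lexers is an assumption, not something stated in this PR.

```rust
use std::str::Chars;

// Hypothetical stand-in for the generated `MyLexer` type.
#[allow(dead_code)]
struct MyLexer<'input, I: Iterator<Item = char> + Clone> {
    input: Option<&'input str>,
    iter: I,
}

// Assumption: string-backed lexers use `Chars<'input>` as the iterator type.
impl<'input> MyLexer<'input, Chars<'input>> {
    fn new(input: &'input str) -> Self {
        MyLexer { input: Some(input), iter: input.chars() }
    }
}

impl<I: Iterator<Item = char> + Clone> MyLexer<'static, I> {
    fn new_from_iter(iter: I) -> Self {
        MyLexer { input: None, iter }
    }
}

// Code that used to name the type as `MyLexer<'input>` no longer compiles.
// Option 1: a type alias that pins the iterator type for string inputs.
type StrLexer<'input> = MyLexer<'input, Chars<'input>>;

fn remaining_chars_str(lexer: StrLexer<'_>) -> usize {
    lexer.iter.count()
}

// Option 2: make the function generic over the iterator type.
fn remaining_chars<I: Iterator<Item = char> + Clone>(lexer: MyLexer<'_, I>) -> usize {
    lexer.iter.count()
}
```

The alias keeps existing signatures short when only string input is needed. The generic version accepts lexers from either constructor.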

Other than this the changes should be backwards compatible.

Fixes #41

osa1 commented 2 years ago

Performance seems to regress a little. My Lua lexer benchmark reports +3% compared to the main branch. My guess is that the cause is the cloning for __last_match, as that should be the only difference in the generated code.
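For readers unfamiliar with the mechanism, the toy model below shows why recording a last match could involve a clone. It assumes that the last match is remembered by saving a copy of the iterator so that backtracking can restore it. This is an assumption for illustration, not lexgen_util's actual code: Backtracker, next_char, and the field layout are made up.

```rust
// Toy model of backtracking over a cloneable char iterator.
// NOT lexgen_util's implementation; all names here are hypothetical.
struct Backtracker<I: Iterator<Item = char> + Clone> {
    iter: I,
    // Number of characters consumed so far.
    pos: usize,
    // Saved iterator and position of the most recent accepting state.
    last_match: Option<(I, usize)>,
}

impl<I: Iterator<Item = char> + Clone> Backtracker<I> {
    fn new(iter: I) -> Self {
        Backtracker { iter, pos: 0, last_match: None }
    }

    fn next_char(&mut self) -> Option<char> {
        let c = self.iter.next();
        if c.is_some() {
            self.pos += 1;
        }
        c
    }

    // Every call clones the iterator, even if the saved value ends up
    // being discarded by `reset_accepting_state` right afterwards.
    fn set_accepting_state(&mut self) {
        self.last_match = Some((self.iter.clone(), self.pos));
    }

    fn reset_accepting_state(&mut self) {
        self.last_match = None;
    }

    // Restore the saved iterator, if any. Returns false when there is
    // no recorded match to go back to.
    fn backtrack(&mut self) -> bool {
        match self.last_match.take() {
            Some((iter, pos)) => {
                self.iter = iter;
                self.pos = pos;
                true
            }
            None => false,
        }
    }
}
```

With a string slice, remembering a match can be as cheap as storing an index. With an arbitrary iterator, the state has to be captured some other way, and a clone on every accepting state is one plausible source of overhead.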

osa1 commented 2 years ago

I think we should be able to optimize __last_match updates in code like this:

'>' => {
    self.0.set_accepting_state(Lexer_ACTION_13);          // 2
    match self.0.next() {
        None => {
            self.0.__done = true;
            match self.0.backtrack() {                    // 6
                ...
            }
        }
        Some(char) => match char {
            '>' => {
                self.0.reset_accepting_state();           // 12
                match Lexer_ACTION_31(self) {
                    ...
                }
            }
            '=' => {
                self.0.reset_accepting_state();           // 18
                match Lexer_ACTION_11(self) {
                    ...
                }
            }
            _ => match self.0.backtrack() {               // 23
                ...
            },
        },
    }
}

In the code above we set __last_match on line 2. However, in the continuation we either use the value we set directly (the backtrack calls on lines 6 and 23) or reset it (lines 12 and 18).

osa1 commented 2 years ago

The perf issue above is reported as #43.