osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

`switch_and_return` does not consume the returned token? #11

Closed cr1901 closed 3 years ago

cr1901 commented 3 years ago

First, I want to thank you for writing lexgen. I find it incredibly useful when paired with LALRPOP (and would at some point like to contribute some integration to it, though I'm not sure exactly what that should look like right now).

I have a lexer with two rules: Init and NoteBody. The purpose of NoteBody is to consume characters until a newline is found, and the switch between rules happens when an = (Tok::Equal) is detected and certain conditions (the in_note_name state) are met. The logic looks like this:

"=" => |mut lexer| {
    if lexer.state().in_note_name {
        lexer.switch_and_return(LexerRule::NoteBody, Tok::Equal)
    } else {
        lexer.return_(Tok::Equal)
    }
},

When I've detected a newline in the NoteBody rule by peek()ing, I want to return all the characters I've found after the equals sign and up to (but not including) the newline. I would expect the logic for this to look something like:

let match_ = lexer.match_();
...
lexer.switch_and_return(LexerRule::Init, Tok::NoteBody(&match_))

However, it turns out that this returns the = prepended to the text I actually want to match. To return just the text after the =, I need to do:

let match_ = lexer.match_();
...
lexer.switch_and_return(LexerRule::Init, Tok::NoteBody(&match_[1..=match_end]))

My question is: Is this a bug, or intended behavior? If the latter, is my workaround to strip the first character from the match correct, or can you suggest a method for removing the = sign before the transition from the Init to NoteBody rule?
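(For what it's worth, a slightly more robust form of that slicing workaround would be to strip the known leading character rather than hard-code byte indices. A sketch; strip_leading_equals is a hypothetical helper, not part of lexgen:)

```rust
// Hypothetical helper: drop a leading '=' from the matched text, if present,
// instead of slicing by hard-coded indices.
fn strip_leading_equals(match_: &str) -> &str {
    match_.strip_prefix('=').unwrap_or(match_)
}

fn main() {
    assert_eq!(strip_leading_equals("=note body"), "note body");
    // Leaves the input unchanged when there is no leading '='.
    assert_eq!(strip_leading_equals("note body"), "note body");
}
```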

osa1 commented 3 years ago

Hi @cr1901. Thanks for reporting this, and I'm glad you find lexgen useful.

I think this is a bug, yes.

lexgen lexers internally maintain two indices, for the start and end of the current match. The match_ method returns a slice of the input using these indices.

When switching states we don't reset the start index, so the character(s) you just matched in your current state will be included in the current match in the next state.
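The two paragraphs above can be sketched as follows. This is not lexgen's actual implementation; MatchTracker, consume, and reset_match are hypothetical names for illustration only:

```rust
// Minimal sketch of the two-index scheme: `match_start`/`match_end` delimit
// the current match, and `match_` slices the input between them.
struct MatchTracker<'input> {
    input: &'input str,
    match_start: usize, // start of the current match
    match_end: usize,   // one past the last matched byte
}

impl<'input> MatchTracker<'input> {
    fn new(input: &'input str) -> Self {
        MatchTracker { input, match_start: 0, match_end: 0 }
    }

    // Advance the end index as the lexer consumes input.
    fn consume(&mut self, n_bytes: usize) {
        self.match_end += n_bytes;
    }

    // `match_` returns the slice between the two indices.
    fn match_(&self) -> &'input str {
        &self.input[self.match_start..self.match_end]
    }

    // Resetting makes the next match start fresh.
    fn reset_match(&mut self) {
        self.match_start = self.match_end;
    }
}

fn main() {
    let input = "=body";

    // Buggy behavior: switching rules without resetting the start index,
    // so the "=" matched in Init leaks into NoteBody's match.
    let mut buggy = MatchTracker::new(input);
    buggy.consume(1); // Init matches "="
    buggy.consume(4); // NoteBody matches "body"
    assert_eq!(buggy.match_(), "=body");

    // Expected behavior: reset the match when returning the "=" token.
    let mut fixed = MatchTracker::new(input);
    fixed.consume(1);
    assert_eq!(fixed.match_(), "=");
    fixed.reset_match();
    fixed.consume(4);
    assert_eq!(fixed.match_(), "body");
}
```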

You can see an example of this in my Lua lexer: https://github.com/osa1/lexgen/blob/700700dc81b22246241ef7288f771e3a14929d9e/tests/lua_5_1.rs#L262

In this code, lexing a "long string" uses three states: LongStringBracketLeft, LongString, and LongStringBracketRight. When returning a token in the last rule (LongStringBracketRight) we use match_, but we take the opening brackets (matched in the LongStringBracketLeft rule) and the closing brackets (matched in the LongStringBracketRight rule) into account and return a smaller slice, with the expression &lexer.match_()[left_eqs + 2..lexer.match_().len() - right_eqs - 2]. The final string includes all of the characters matched in the middle rule (LongString).

However, when we return a token we should reset the current match, because the match has already been used for the returned token. So in your code, since you use switch_and_return rather than just switch, we should have reset the match before switching to the next state (NoteBody).

I think the reason I didn't catch this bug so far is that another bug hides it. In the Lua lexer I use switch_and_return in three places, and it works fine, because of a second bug: we always reset the current match when switching to the initial rule. It so happens that whenever I use switch_and_return in the Lua lexer, I switch to the initial rule, so it accidentally works.

Here's a regression test:

#[test]
fn return_should_reset_match() {
    lexer! {
        Lexer -> &'input str;

        rule Init {
            "aaa" => |lexer| {
                let match_ = lexer.match_();
                lexer.switch_and_return(LexerRule::State1, match_)
            },
        }

        rule State1 {
            "bbb" => |lexer| {
                let match_ = lexer.match_();
                lexer.switch_and_return(LexerRule::Init, match_)
            },
        }
    }

    let mut lexer = Lexer::new("aaabbb");
    assert_eq!(ignore_pos(lexer.next()), Some(Ok("aaa")));
    assert_eq!(ignore_pos(lexer.next()), Some(Ok("bbb"))); // this assertion currently fails
    assert_eq!(ignore_pos(lexer.next()), None);
}

I'll fix this.

cr1901 commented 3 years ago

Well, using switch to move between states and then returning a substring of the currently-matched data is indeed another solution to my problem. It just felt like switch_and_return was more appropriate for my use case, since = is not the beginning of a comment, but rather a token used/expected by the parser.

switch is good to know as an example of how to handle "raw strings" in various programming languages, however :).

osa1 commented 3 years ago

I agree that your use of switch_and_return is a good one. I fixed this issue so switch_and_return should work as you expect.

I didn't fix the issue with resetting the current match when switching to the initial rule, though. That problem is a bit tricky, because apparently we rely on it to skip whitespace in the initial rule without including the skipped whitespace in the current match. I'll have to think about this more.
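(To illustrate the whitespace point with the same two-index picture as before; this is my reading of the behavior, not lexgen's actual code. If the initial rule skips whitespace by consuming it and re-entering itself, the match must be reset on entry, or the skipped whitespace would appear in the next token's match:)

```rust
fn main() {
    let input = "  abc";
    let mut start = 0usize;
    let mut end = 0usize;

    end += 2;    // Init consumes the two spaces, then switches back to Init
    start = end; // the reset-on-switch-to-initial-rule keeps "  " out
    end += 3;    // Init then matches "abc"

    assert_eq!(&input[start..end], "abc");
}
```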

osa1 commented 3 years ago

I also released 0.5.0 on crates.io, with the fix.

cr1901 commented 3 years ago

I didn't fix the issue with resetting the current match when we switch to the initial rule though.

So to summarize your fix:

Do I understand this correctly? Perhaps document switch resetting the match when going back to Init as a feature for now :)?

osa1 commented 3 years ago

Yes, you got it. I'm not sure whether we want to document this behavior of switch, or fix it. I'll need to think more on this.