osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

continue_ doesn't consume input in non-initial rule #62

Closed glennschmidt closed 1 year ago

glennschmidt commented 1 year ago

In the following example, the pattern in the Secondary rule is intended to skip whitespace, but the whitespace instead ends up as part of the following match.

use lexgen::lexer;

#[derive(Debug, PartialEq)]
enum Token<'input> {
    Keyword,
    Identifier(&'input str),
}

lexer! {
    Lexer -> Token<'input>;

    rule Init {
        "do" => |lexer| {
            lexer.switch_and_return(LexerRule::Secondary, Token::Keyword)
        },
    },

    rule Secondary {
        $$ascii_whitespace,
        $$ascii_alphanumeric+ => |lexer| {
            lexer.return_(Token::Identifier(lexer.match_()))
        },
    },
}

#[test]
fn test() {
    let mut lexer = Lexer::new("do thing");
    assert_eq!(lexer.next().unwrap().unwrap().1, Token::Keyword);
    assert_eq!(lexer.next().unwrap().unwrap().1, Token::Identifier("thing"));
}

Expected result: Test passes

Actual result:

assertion failed: `(left == right)`
  left: `Identifier(" thing")`,
 right: `Identifier("thing")`

Note: This only happens in a non-initial rule. If I redefine the lexer as follows, the test will pass:

lexer! {
    Lexer -> Token<'input>;

    rule Init {
        $$ascii_whitespace,
        "do" = Token::Keyword,
        $$ascii_alphanumeric+ => |lexer| {
            lexer.return_(Token::Identifier(lexer.match_()))
        },
    },
}

Am I misunderstanding something or is this a bug?

osa1 commented 1 year ago

Thanks for reporting. This is an existing bug, previously reported as #12. In short, a rule without a right-hand side ($$ascii_whitespace, in your examples) only resets the current match in the Init rule set; in the other rule sets it doesn't reset the match. This is working as intended (usually the initial rule set handles whitespace, while the other rule sets handle strings, comments, etc.), but the intended behavior is confusing and we should fix it.

In https://github.com/osa1/lexgen/issues/12#issuecomment-940707685 I describe two possible ways to fix this problem. Feedback on those would be helpful.

Closing this one as duplicate.