osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro

Invalid token doesn't consume input from match_ in non-initial rule #48

Closed MiSawa closed 2 years ago

MiSawa commented 2 years ago

This works as I expected:

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' = "s",
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

let input = "sxa";
let mut lexer = Lexer::new(input);
assert_eq!(lexer.next(), Some(Ok((loc(0, 0, 0), "s", loc(0, 1, 1)))));
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 1, 1), kind: crate::LexerErrorKind::InvalidToken })));
assert_eq!(lexer.next(), Some(Ok((loc(0, 2, 2), "a", loc(0, 3, 3)))));
assert_eq!(lexer.next(), None);

However, if I move "a" to a different rule and have "s" switch into that rule, it starts behaving differently, while I expected exactly the same output as above:

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' => |lexer| lexer.switch_and_return(LexerRule::InString, "s"),
    }
    rule InString {
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

let input = "sxa";
let mut lexer = Lexer::new(input);
assert_eq!(lexer.next(), Some(Ok((loc(0, 0, 0), "s", loc(0, 1, 1)))));
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 1, 1), kind: crate::LexerErrorKind::InvalidToken })));
// Why "x" is included here???
assert_eq!(lexer.next(), Some(Ok((loc(0, 1, 1), "xa", loc(0, 3, 3)))));
// Perhaps this is because we don't switch back to the Init rule?
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 3, 3), kind: crate::LexerErrorKind::InvalidToken })));
assert_eq!(lexer.next(), None);

It'd be great if both versions behaved the same.

osa1 commented 2 years ago

Thanks for reporting this.

The fix is easy, but I'm afraid it won't work the way you might expect. Basically the rule is that when the lexer fails, it also consumes the character. Here's an example to demonstrate:

use lexgen::lexer;

lexer! {
    Lexer -> &'input str;

    rule Init {
        "aaa" = "a",
        "b" = "b",
    }
}

fn main() {
    let input = "aab";
    let mut lexer = Lexer::new(input);
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
}

Output:

Some(Err(LexerError { location: Loc { line: 0, col: 0, byte_idx: 0 }, kind: InvalidToken }))
None

Since the lexer only realizes that the input is invalid when it sees the b, the b is also consumed, so the next next() call returns None.

With the bug that caused the inconsistent behavior in your example fixed, this is the output for your program:

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' => |lexer| lexer.switch_and_return(LexerRule::InString, "s"),
    }
    rule InString {
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

fn main() {
    let input = "sxa";
    let mut lexer = Lexer::new(input);
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
}

Output:

Some(Ok((Loc { line: 0, col: 0, byte_idx: 0 }, "s", Loc { line: 0, col: 1, byte_idx: 1 })))
Some(Err(LexerError { location: Loc { line: 0, col: 1, byte_idx: 1 }, kind: InvalidToken }))
Some(Err(LexerError { location: Loc { line: 0, col: 2, byte_idx: 2 }, kind: InvalidToken }))
None

We could make the lexers not consume the character when they return an error, but I'm not sure that would be any more useful than the current behavior. In general, recovery from a lexing error will depend on the current lexer state (i.e. the current rule) and on the user state. I'm not sure what a good interface would be for letting users recover from lexing errors.

In my use case, I try not to fail in the lexer, but rather mark the returned tokens/lexemes with error information. For example, when lexing character literals, an unterminated literal like 'a is lexed as CharLiteral { terminated: false, ... }. If the lexer does fail, I skip characters until the next whitespace and restart lexing from there. This is of course specific to my application and won't work in general.
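
A minimal sketch of that approach for character literals (the Token type and rules here are illustrative, not my actual lexer):

use lexgen::lexer;

// Illustrative token type: the lexeme carries an error flag instead of
// the lexer failing on bad input.
#[derive(Debug, PartialEq, Eq)]
enum Token {
    CharLiteral { terminated: bool },
}

lexer! {
    Lexer -> Token;

    rule Init {
        // A terminated literal like 'a'.
        "'" _ "'" => |lexer| lexer.return_(Token::CharLiteral { terminated: true }),
        // An unterminated literal like 'a: the longest match wins, so this
        // rule only fires when the closing quote is missing.
        "'" _ => |lexer| lexer.return_(Token::CharLiteral { terminated: false }),
    }
}

With the input 'a, next() yields CharLiteral { terminated: false } rather than an InvalidToken error.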

MiSawa commented 2 years ago

Thank you for the swift response and fix!

Basically the rule is that when the lexer fails, it also consumes the character.

Sorry that I didn't include this in the first place, but that would work fine for me! Actually, I'm not trying to recover from lexer errors (at least for now); my problem was that semantic actions were being invoked with a match_() that had an unexpected prefix, which caused a panic.

"\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? |lexer| {
    // Ooops, unwrap cause a panic if there were characters left from InvalidToken!
    let value = u32::from_str_radix(&lexer.match_()[2..], 16).unwrap();
    // This works. (But what to do if the rule didn't have a fixed length like this example?)
    let value = u32::from_str_radix(&lexer.match_()[lexer.match_().len()-4..], 16).unwrap();
    // do something with value and return
    todo!()
}

As for error recovery, I have a question: I somehow assumed that the rule with the longest match is preferred, and that ties are broken by the order of appearance in the rule { ... } block. Is this a correct assumption? I think the current interface is fine as long as there's a fixed precedence over rules, so that we can write a "fallback case". If a user wants to recover from errors, they can write a fallback case explicitly, like

rule Init {
    's' => |lexer| lexer.switch_and_return(LexerRule::InString, lexer.match_()),
}
rule InString {
    'a' => |lexer| lexer.return_(lexer.match_()),
    _ =? |lexer| lexer.return_(Err(lexer.match_())),
}

(or enter a new InFailure rule and consume until a whitespace, or whatever; see the sketch below), so users should already be able to achieve whatever they want (assuming 'a' takes precedence over _ in the example above).
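
Such an InFailure rule could look roughly like this (an untested sketch; the recovery policy is just one possibility):

rule Init {
    's' => |lexer| lexer.switch_and_return(LexerRule::InString, lexer.match_()),
}
rule InString {
    'a' => |lexer| lexer.return_(lexer.match_()),
    // On an unexpected character, enter recovery instead of failing.
    _ => |lexer| lexer.switch(LexerRule::InFailure),
}
rule InFailure {
    // Skip up to the next whitespace, then resume normal lexing.
    [' ' '\t' '\n'] => |lexer| lexer.switch(LexerRule::Init),
    _ => |lexer| lexer.continue_(),
}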

osa1 commented 2 years ago

Sorry that I didn't include this in the first place, but that would work fine for me! Actually, I'm not trying to recover from lexer errors (at least for now); my problem was that semantic actions were being invoked with a match_() that had an unexpected prefix, which caused a panic.

Have you seen the reset_match() method (documented in the README, added with #18)? Currently, the match (the string returned by match_()) is only reset when you switch to the Init rule. If that's not what you need, you can call reset_match() in a semantic action, which resets the current match. In your case, if you call reset_match() before switching to the state with the "\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? ... rule, you will only get the "\u..." part in match_().
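
For example, something like this (a hypothetical rule, just to illustrate the call):

rule Init {
    '"' => |lexer| {
        // Reset the accumulated match before entering InString, so that
        // match_() inside InString starts at the current position.
        lexer.reset_match();
        lexer.switch(LexerRule::InString)
    },
}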

As for error recovery, I have a question: I somehow assumed that the rule with the longest match is preferred, and that ties are broken by the order of appearance in the rule { ... } block. Is this a correct assumption?

Correct.

I think the current interface is fine as long as there's a fixed precedence over rules, so that we can write a "fallback case". If a user wants to recover from errors, they can write a fallback case explicitly, like

I haven't experimented with this idea myself, but it looks like a good way to handle errors.

MiSawa commented 2 years ago

Have you seen the reset_match() method (documented in the README, added with https://github.com/osa1/lexgen/pull/18)? Currently, the match (the string returned by match_()) is only reset when you switch to the Init rule. If that's not what you need, you can call reset_match() in a semantic action, which resets the current match. In your case, if you call reset_match() before switching to the state with the "\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? ... rule, you will only get the "\u..." part in match_().

Yes, I knew the function existed, but in my case the leftover was coming from InvalidToken errors, so there was no semantic action in which to call reset_match(). An example input would be \uxyzw\u1234. Here, since xyzw is not valid as hex characters, no rule handled it, so I got an InvalidToken error. But that doesn't stop lexing, and by the time \u1234 was being lexed, match_() was \uxyzw\u1234, which made the action call u32::from_str_radix("xyzw\u1234", 16).unwrap().

In my case, I was able to resolve the issue by adding '\\' _ =? |lexer| lexer.return_(Err(...)) (or _ =? |lexer| lexer.return_(Err(...)) would also have worked), so that there are no match failures causing InvalidToken; instead we consume the string fragment by manually returning an error with return_. And since I use return_ in all the other rules, which I believe consumes the match, match_() in this unicode-escape rule always returns a string that matches the regex. But with your fix, I don't need this hack anymore!
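
For reference, the workaround looked roughly like this (a sketch; the hex_digit binding and the error value are placeholders):

let hex_digit = ['0'-'9' 'a'-'f' 'A'-'F'];

rule InString {
    "\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? |lexer| {
        // With the fallback rule below in place, match_() here is exactly
        // \uXXXX (all other rules use return_, which consumes the match),
        // so the slice and unwrap are safe.
        let value = u32::from_str_radix(&lexer.match_()[2..], 16).unwrap();
        // ... turn `value` into a token ...
        todo!()
    },
    // Fallback: consume any other backslash escape and return an error
    // explicitly, so no InvalidToken (and no leftover match) is produced.
    '\\' _ =? |lexer| lexer.return_(Err(lexer.match_())),
}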

osa1 commented 2 years ago

Here, since xyzw is not valid as hex characters, no rule handled it, so I got an InvalidToken error. But that doesn't stop lexing, and by the time \u1234 was being lexed, match_() was \uxyzw\u1234

Maybe we should reset the match on InvalidToken. Any thoughts on this? Other kinds of errors already reset the match, as they're raised in a semantic action.

osa1 commented 2 years ago

Maybe we should reset the match on InvalidToken

Sorry for the confusion... This is already what we do in #49.

MiSawa commented 2 years ago

Maybe we should reset the match on InvalidToken. Any thoughts on this? Other kinds of errors already reset the match, as they're raised in a semantic action.

Yeah, that would be ideal for me!

MiSawa commented 2 years ago

Thank you for the swift fix!!