Closed cr1901 closed 3 years ago
Hi @cr1901. Thanks for reporting this, and I'm glad you find lexgen useful.
I think this is a bug, yes.
lexgen lexers internally maintain two indices: the start and the end of the current match. The `match_` method returns a slice of the input between these indices.

When switching states we don't reset the start index, so the character(s) you just matched in the current state will be included in the current match in the next state.
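A toy model may make this concrete. This is plain Rust, not lexgen's actual internals; the struct and method names are invented for illustration. The point is that advancing only the end index while leaving the start index alone makes the previous state's characters leak into the next match:

```rust
// Toy model (NOT lexgen's real code) of the two-index match bookkeeping.
struct ToyLexer<'a> {
    input: &'a str,
    match_start: usize, // start of the current match
    match_end: usize,   // one past the last matched byte
}

impl<'a> ToyLexer<'a> {
    fn match_(&self) -> &'a str {
        &self.input[self.match_start..self.match_end]
    }

    // Matching more input in the next state advances only `match_end`;
    // `match_start` is untouched, so earlier characters stay in the match.
    fn consume(&mut self, n: usize) {
        self.match_end += n;
    }
}

fn main() {
    // Suppose the first state matched "=" and then switched states.
    let mut t = ToyLexer { input: "=body", match_start: 0, match_end: 1 };
    t.consume(4); // the next state matches "body"
    assert_eq!(t.match_(), "=body"); // the "=" is still included
}
```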
You can see an example of this in my Lua lexer: https://github.com/osa1/lexgen/blob/700700dc81b22246241ef7288f771e3a14929d9e/tests/lua_5_1.rs#L262
In this code, lexing a "long string" is done using 3 states: `LongStringBracketLeft`, `LongString`, and `LongStringBracketRight`. When returning a token in the last rule (`LongStringBracketRight`) we use `match_`, but we take the opening brackets (matched in the `LongStringBracketLeft` rule) and closing brackets (matched in the `LongStringBracketRight` rule) into account and return a smaller slice, via the expression `&lexer.match_[left_eqs + 2..lexer.match_.len() - right_eqs - 2]`. The final string includes all of the characters matched in the middle rule (`LongString`).
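To illustrate that slice expression in isolation, here is a small standalone sketch (plain Rust, with a hypothetical helper name) showing how it trims the long-string brackets from a match like Lua's `[==[hello]==]`:

```rust
// Illustrative only: the slice expression from the Lua lexer, extracted
// into a free function. `left_eqs`/`right_eqs` count the '=' characters
// in the opening and closing long brackets.
fn trim_long_string(match_: &str, left_eqs: usize, right_eqs: usize) -> &str {
    // Drop the opening bracket (left_eqs '=' plus the two '[' characters)
    // and the closing bracket (right_eqs '=' plus the two ']' characters).
    &match_[left_eqs + 2..match_.len() - right_eqs - 2]
}

fn main() {
    // "[==[hello]==]" -> "hello"
    assert_eq!(trim_long_string("[==[hello]==]", 2, 2), "hello");
    // "[[x]]" -> "x" (no '=' characters at all)
    assert_eq!(trim_long_string("[[x]]", 0, 0), "x");
}
```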
However, when we return a token, we should reset the current match, because the match has already been consumed to produce the token. So in your code, since you use `switch_and_return` rather than just `switch`, I think we should have reset the match before switching to the next state (`NoteBody`).
I think the reason I didn't catch this bug so far is that another bug hides it: we always reset the current match when switching to the initial rule. In the Lua lexer I use `switch_and_return` in 3 places, and it works fine, because it so happens that every one of those places switches to the initial rule. So it accidentally works.
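A toy model of the masking effect, in plain Rust (again not lexgen's actual code, and the names are invented): assume the old behavior, where returning a token does not reset the match but switching to the initial rule does.

```rust
#[derive(PartialEq)]
enum State { Init, Other }

struct ToyLexer {
    state: State,
    match_start: usize,
    match_end: usize,
}

impl ToyLexer {
    // Old (buggy) behavior: the match start is reset only when the
    // target state is the initial one, not when a token is returned.
    fn switch_and_return(&mut self, to: State) {
        if to == State::Init {
            self.match_start = self.match_end; // accidental reset
        }
        self.state = to;
    }
}

fn main() {
    // Switching back to Init (as the Lua lexer always happens to do)
    // resets the match, so the missing reset-on-return is invisible...
    let mut a = ToyLexer { state: State::Other, match_start: 0, match_end: 3 };
    a.switch_and_return(State::Init);
    assert_eq!(a.match_start, 3);

    // ...but switching to any other rule leaves the old start index in
    // place, so the previous match leaks into the next token.
    let mut b = ToyLexer { state: State::Init, match_start: 0, match_end: 3 };
    b.switch_and_return(State::Other);
    assert_eq!(b.match_start, 0);
}
```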
Here's a regression test:
```rust
#[test]
fn return_should_reset_match() {
    lexer! {
        Lexer -> &'input str;

        rule Init {
            "aaa" => |lexer| {
                let match_ = lexer.match_();
                lexer.switch_and_return(LexerRule::State1, match_)
            },
        }

        rule State1 {
            "bbb" => |lexer| {
                let match_ = lexer.match_();
                lexer.switch_and_return(LexerRule::Init, match_)
            },
        }
    }

    let mut lexer = Lexer::new("aaabbb");
    assert_eq!(ignore_pos(lexer.next()), Some(Ok("aaa")));
    assert_eq!(ignore_pos(lexer.next()), Some(Ok("bbb"))); // this assertion currently fails
    assert_eq!(ignore_pos(lexer.next()), None);
}
```
I'll fix this.
Well, using `switch` to go between states and then returning a substring of the currently-matched data is indeed another solution to my problem. It just felt like `switch_and_return` was more appropriate for my use case, since `=` is not the beginning of a comment, but rather a token used/expected by the parser. `switch` is good to know as an example of how to handle "raw strings" in various programming languages, however :).
I agree that your use of `switch_and_return` is a good one. I fixed this issue, so `switch_and_return` should now work as you expect.
I didn't fix the issue with resetting the current match when we switch to the initial rule, though. That problem is a bit tricky: apparently we rely on it to skip whitespace in the initial rule without including the skipped whitespace in the current match. I'll have to think about this more.
I also released 0.5.0 on crates.io, with the fix.
> I didn't fix the issue with resetting the current match when we switch to the initial rule though.
So to summarize your fix:

- `switch_and_return` now resets the current match when switching from "any rule to any rule", not just "any rule to `Init`".
- `switch` resetting the match when going from "any rule to `Init`" is technically a bug, but it's invasive to fix, so right now it's a feature.

Do I understand this correctly? Perhaps document that `switch` resetting the match when going back to `Init` is a feature for now :)?
Yes, you got it. I'm not sure whether we want to document this behavior of `switch`, or fix it. I'll need to think more on this.
First, I want to thank you for writing lexgen. I find it incredibly useful when paired with LALRPOP (and would at some point like to contribute some integration to it, though I'm not sure exactly what that should look like right now).

I have a lexer with two rules: `Init` and `NoteBody`. The purpose of `NoteBody` is to consume characters until a newline is found, and the change between rules happens when an `=` (`Tok::Equal`) is detected and certain conditions (`in_note_name` state) are met. The logic looks like this:

When I've detected a newline in the `NoteBody` rule by `peek()`ing, I want to return all the characters I've found after the equals sign and up to (but not including) the newline. I would expect the logic for this to look something like:

However, it turns out that this returns the `=` token prepended to the text I actually want to match. In order to return just the text after the `=` token, I need to do:

My question is: is this a bug, or intended behavior? If the latter, is my workaround of stripping the first character from the match correct, or can you suggest a method for removing the `=` sign before the transition from the `Init` to the `NoteBody` rule?
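The workaround described (stripping the first character of the match) can be sketched in plain Rust. `strip_leading_equals` is a hypothetical helper written for illustration, not part of lexgen:

```rust
// Hypothetical helper: drop the leading '=' that leaked into the match
// from the Init rule before the switch to NoteBody.
fn strip_leading_equals(match_: &str) -> &str {
    match_.strip_prefix('=').unwrap_or(match_)
}

fn main() {
    // A match that starts with the leaked '=' token:
    assert_eq!(strip_leading_equals("=note body"), "note body");
    // A match without it is returned unchanged:
    assert_eq!(strip_leading_equals("note body"), "note body");
}
```

Note that with the fix described above (`switch_and_return` resetting the match), this stripping should no longer be necessary.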