osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License
63 stars 7 forks source link

Empty rule fails or loops depending on other rules #27

Closed osa1 closed 2 years ago

osa1 commented 2 years ago
use lexgen::lexer;

lexer! {
    Lexer -> &'input str;

    let whitespace =
        ['\t' '\n' '\u{B}' '\u{C}' '\r' ' ' '\u{85}' '\u{200E}' '\u{200F}' '\u{2028}' '\u{2029}'];

    let dec_digit = ['0'-'9'];

    let hex_digit = ['0'-'9' 'a'-'f' 'A'-'F'];

    let int_suffix = ('u' | 'i') '8' | ('u' | 'i') "16" | ('u' | 'i') "32" |
            ('u' | 'i') "64" | ('u' | 'i') "128" | ('u' | 'i') "size";

    rule Init {
        "0x" => |lexer| lexer.switch(LexerRule::HexInt),

        $dec_digit => |lexer| lexer.switch(LexerRule::DecInt),
    }

    rule DecInt {
        ($dec_digit | '_')* $int_suffix?,

        $ => |lexer| {
            let match_ = lexer.match_();
            lexer.return_(match_)
        },

        $whitespace => |lexer| {
            let match_ = lexer.match_();
            lexer.return_(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()])
        },
    }

    rule HexInt {

    }
}

fn ignore_loc<A, E, L>(ret: Option<Result<(L, A, L), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

fn main() {
    let mut lexer = Lexer::new("0xff");
    assert_eq!(ignore_loc(lexer.next()), Some(Ok("0xff")));
}

The repro above loops, which is a bug. An empty rule should always fail.

Interestingly, if I remove the contents of the rule DecInt, it starts to fail, but I think the error value is not right:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `Some(Err(InvalidToken { location: Loc { line: 0, col: 1, byte_idx: 1 } }))`,
 right: `Some(Ok("0xff"))`', src/main.rs:37:5

I would expect the error location to be column 2 as 0x part should be consumed for the rule starting with "0x".

osa1 commented 2 years ago

Here's a smaller repro:

use lexgen::lexer;

lexer! {
    Lexer -> &'input str;

    rule Init {
        'a' => |lexer| lexer.switch(LexerRule::Rule1),
    }

    rule Rule1 {
        'b'*,
    }
}

fn ignore_loc<A, E, L>(ret: Option<Result<(L, A, L), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

fn main() {
    let mut lexer = Lexer::new("ax");
    assert_eq!(ignore_loc(lexer.next()), Some(Ok("a")));
}

What happens is we switch to Rule1, Rule1 accepts empty string and the next character (in out case end-of-input) is not an expected character in Rule1, so we keep looping skipping the empty string forever.

osa1 commented 2 years ago

This is an interesting case. I think we should handle a substring in the input only once, empty or not. For non-empty substrings that's already the case. But apparently not for the empty substrings.

This is somewhat similar to end-of-input handling in that handling end-of-input also does not update the current location or increment that Chars iterator.

osa1 commented 2 years ago

Here's a simpler version of the original repro:

use lexgen::lexer;

lexer! {
    Lexer -> &'input str;

    rule Init {
        "0x" => |lexer| lexer.switch(LexerRule::HexInt),

        '0' => |lexer| lexer.switch(LexerRule::DecInt),
    }

    rule DecInt {
        '0'*,
    }

    rule HexInt {

    }
}

fn ignore_loc<A, E, L>(ret: Option<Result<(L, A, L), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

fn main() {
    let mut lexer = Lexer::new("0xff");
    assert_eq!(ignore_loc(lexer.next()), Some(Ok("0xff")));
}

The problem here is commenting-out contents of DecInt changes the behavior, but DecInt rule should not be used on this input. 0x part should switch to HexInt rule, which should loop as shown in my previous comment above.

osa1 commented 2 years ago

Final DFAs before codegen look as expected so I'm guessing this could a bug in the DFA simplifier.

osa1 commented 2 years ago

Switch method generated with empty DecInt and HexInt rules is bizarre:

    fn switch<A>(&mut self, rule: LexerRule) -> ::lexgen_util::SemanticActionResult<A> {
        match rule {
            LexerRule::HexInt => self.0.__state = 1usize,
            LexerRule::Init => self.0.__state = 0usize,
            LexerRule::DecInt => self.0.__state = 1usize,
        }
        self.0.__initial_state = self.0.__state;
        ::lexgen_util::SemanticActionResult::Continue
    }

So HexInt and DecInt have the same initial state.

Also, the main loop only has one state, so state 1 is not really handled.