maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.94k stars 126 forks source link

Runtime stack overflow when lexing certain strings #424

Open Philogy opened 2 months ago

Philogy commented 2 months ago

I was trying to get C-style multiline comment working in logos, the state machine for it is quite simple but I can't seem to get it working in logos. This regex seemed to work but it causes a panic when parsing certain things.

The code for report:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f\r]+")] // Ignore this regex pattern between tokens
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("#define")]
    Define,

    #[token("macro")]
    Macro,

    #[regex("//[^\n]*\n?", logos::skip)]
    LineComment,

    #[regex("/\\*([^\\*]*(\\*[^/])?)*\\*/", logos::skip)]
    MultiLineComment,

    // Or regular expressions.
    #[regex("0x[0-9a-fA-F]+")]
    HexLiteral,

    #[regex("[a-zA-Z_]\\w*:")]
    Label,

    #[regex("[a-zA-Z_]\\w*")]
    Ident,
}

fn main() {
    let src = "
/*wow amazing!!!!!*** /*  **/
// wow very nice
#define macro hi: very nice

";

    let mut lexer = Token::lexer(src);
    while let Some(token) = lexer.next() {
        println!("{:?} {}", token, lexer.slice());
    }
}

The error I'm getting:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1]    5609 abort      cargo run
jeertmans commented 2 months ago

Hello @Philogy, I currently haven't much time to invest in this issue, but I would recommend to you the same thing as I do for all comment-style lexing: just create a token that matches the start of a comment, and then process the comment with a callback. This is usually much better, as comments can contain almost any characters, like escaped /, which makes it super hard to write a regex that handles all specific cases.

See https://github.com/maciejhirsz/logos/issues/421#issuecomment-2348285243 for an example on XML comments, which is very similar to multiline strings.

jeertmans commented 2 months ago

Looks like this is a duplicate of #400, so closing this anyway :-)

conradludgate commented 1 month ago

Looks like this is a duplicate of #400, so closing this anyway :-)

Looks different to me. #400 seems to error in the derive, whereas this errors at runtime.

For what it's worth, I also encountered a stack overflow/infinite loop at runtime with a small test case:

#[derive(Logos, Debug, PartialEq)]
enum TestToken {
    #[regex("c(a*b?)*c")]
    Token
}

#[cfg(test)]
mod logos_test {
    use logos::Logos;

    use crate::TestToken;

    #[test]
    fn overflow() {
        let _ = TestToken::lexer("c").next();
    }
}
Melyodas commented 1 week ago

I toyed a bit with adding a Mermaid output from the lexer.

Looking at the graph from the test above, the issue seems to be that the graph does not handle a loop between nodes when it misses.

flowchart LR
1("::Token")
3("rope#3")
  3 -- "a" --> 5
  3 --x 6
4("rope#4")
  4 -- "b" --> 3
  4 --x 3
5("rope#5")
  5 -- "a" --> 5
  5 --x 4
6("fork#6")
  6 -- "b" --> 3
  6 -- "c" --> 1
  6 --x 3
8("Start")
  8 -- "c" --> 3