maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

Cannot Tokenize on "\"\"\".*\"\"\"" #246

Open almathaler opened 2 years ago

almathaler commented 2 years ago

The regex "\"\"\".*\"\"\"" cannot match "\"\"\"abc\"\"\"" in the following code snippet, though Rust's regex crate does match it:

use logos::Logos;
use regex::Regex;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r#"""".*""""#)]
    Triple,
    #[error]
    Error,
}
fn main() {
    let s = r#""""abc""""#;
    let mut lex = Token::lexer(s);
    let mut tp = lex.next();
    let mut i = 0;
    while tp != Some(Token::Error) {
        i += 1;
        tp = lex.next();
    }
    assert!(i == 0);
    let re = Regex::new(r#"""".*""""#).unwrap();
    assert!(re.is_match(r#""""abc""""#)); // the regex matches
    println!("error type of tp: {:?}", tp.unwrap()); // but logos can't recognize the match
}

maciejhirsz commented 2 years ago

This is working as intended for Logos, although it needs to be documented better. While Logos uses regex syntax, it diverges in behavior: it is always greedy and it almost never backtracks.
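To make the difference concrete, here is a toy model in plain Rust (no Logos or regex crate involved). It is only a sketch of the two matching strategies for this one pattern, not a description of Logos internals; the function names are made up for illustration, and the model assumes the usual regex convention that `.` does not match a newline.

```rust
/// Toy model of matching `""".*"""` greedily with no backtracking.
/// Returns the length of the match in bytes, or None if there is no match.
/// Hypothetical helper for illustration; not part of Logos.
fn match_without_backtracking(input: &str) -> Option<usize> {
    const QUOTES: &str = "\"\"\"";
    // Opening delimiter.
    let rest = input.strip_prefix(QUOTES)?;
    // `.*` greedily takes everything up to the next newline or end of input.
    let dot_star_len = rest.find('\n').unwrap_or(rest.len());
    let after = &rest[dot_star_len..];
    // The closing delimiter must come next; `.*` never gives characters back,
    // so if it already swallowed the closing quotes, the match fails here.
    after.strip_prefix(QUOTES)?;
    Some(QUOTES.len() + dot_star_len + QUOTES.len())
}

/// The same pattern with backtracking: `.*` is shortened until the closing
/// delimiter fits, which amounts to finding the last `"""` on the line.
fn match_with_backtracking(input: &str) -> Option<usize> {
    const QUOTES: &str = "\"\"\"";
    let rest = input.strip_prefix(QUOTES)?;
    let line_len = rest.find('\n').unwrap_or(rest.len());
    let line = &rest[..line_len];
    // Longest `.*` that still leaves room for the closing delimiter.
    let body_len = line.rfind(QUOTES)?;
    Some(QUOTES.len() + body_len + QUOTES.len())
}
```

On the input `"""abc"""` from the report, the first function fails because `.*` has already consumed the closing quotes, while the second one matches all nine bytes.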

For this specific use case you'd want r#""""[^"]*""""#. If quotes are to be allowed internally it might be easier to implement this as a """ token with a callback that finds the closing """.
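The core of such a callback could look roughly like the sketch below. It is plain Rust and the function name is hypothetical; the assumption is that the callback would apply it to the input following the opening """ (what the lexer's `remainder()` returns) and then advance the lexer by the returned length with `bump()`. The exact wiring depends on the Logos version and is not shown here.

```rust
/// Given the input that follows an opening `"""`, returns how many bytes
/// to consume so that the token ends right after the first closing `"""`.
/// Returns None when the string is unterminated.
/// Hypothetical helper for illustration; not part of the Logos API.
fn closing_triple_quote_len(remainder: &str) -> Option<usize> {
    const QUOTES: &str = "\"\"\"";
    // Single quotes inside the body are fine; only a run of three ends it.
    remainder
        .find(QUOTES)
        .map(|start| start + QUOTES.len())
}
```

Unlike the `[^"]*` pattern, this accepts single or double quote characters inside the body and stops at the first run of three.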

I'll leave this open because .* is a common enough use case that some special-casing in code generation could be done for it; as of right now it will always consume all input until the end.