maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

Bug - Lexer matches incorrect token #279

Open jeertmans opened 1 year ago

jeertmans commented 1 year ago

Hello! For context, I am writing a TeX file parser, and I use Logos 0.12.1 to create a lexer.

Below is a (reduced) version of my Token implementation:

use logos::Logos;

#[derive(Logos, Debug)]
pub enum Token {
    #[token(r"\")]
    Backslash,
    #[token(r"\\")]
    DoubleBackslash,
    #[token(r"\begin")]
    EnvironmentBegin,
    #[token(r"\end")]
    EnvironmentEnd,
    //#[token(r"\begin{document}")] // <- the part that creates problems
    DocumentBegin,                  // <-
    #[regex(r"\\[a-zA-Z]+")]
    MacroName,
    #[error]
    Error,
}

When I run my test suite (see the reduced version below), everything passes. However, when I add the enum variant DocumentBegin, which should not be matched by anything within the test suite, I get a failure saying that \begin now matches the token Backslash. This does not make sense to me, since Backslash should only match \, and nothing else.

If I comment out the Backslash token, then \begin matches MacroName, but not EnvironmentBegin.

After reading the documentation and testing different versions of Logos, I still cannot understand why this does not work as expected.

Can someone help me with this?

Test suite

#[cfg(test)]
mod tests {
    use super::*;
    use logos::Logos;

    macro_rules! assert_token_positions {
        ($source:expr, $token:pat, $($pos:expr),+ $(,)?) => {
            let source = $source;

            let positions: Vec<std::ops::Range<usize>> = vec![$($pos),*];
            let spanned_token: Vec<_> = Token::lexer(source)
                .spanned()
                .filter(|(token, _)| matches!(token, $token))
                .collect();

            let strs: Vec<_> = Token::lexer(source)
                .spanned()
                .map(|(token, span)| (token, source[span].to_string()))
                .collect();

            assert_eq!(
                spanned_token.len(), positions.len(),
                "The number of tokens found did not match the expected number of positions {strs:?}"
            );

            for (pos, (token, span)) in positions.into_iter().zip(spanned_token) {
                assert_eq!(
                    pos,
                    span,
                    "Token {token:#?} was found, but expected at {pos:?}"
                );
            }
        };
    }

    #[test]
    fn token_backslash() {
        assert_token_positions!(r"Should match \+, but not \\+", Token::Backslash, 13..14,);
    }
    #[test]
    fn token_double_backslash() {
        assert_token_positions!(
            r"Should match \\, but not \",
            Token::DoubleBackslash,
            13..15,
        );
    }
    #[test]
    fn token_environment_begin() {
        assert_token_positions!(r"\begin{equation}", Token::EnvironmentBegin, 0..6,);
    }
    #[test]
    fn token_environment_end() {
        assert_token_positions!(r"\end{equation}", Token::EnvironmentEnd, 0..4,);
    }
    #[test]
    fn token_macro_name() {
        assert_token_positions!(
            r"\sin\cos\text{some text}\alpha1234",
            Token::MacroName,
            0..4,
            4..8,
            8..13,
            24..30,
        );
    }
}

Test outputs

Without DocumentBegin variant

running 5 tests
test tests::token_environment_begin ... ok
test tests::token_backslash ... ok
test tests::token_environment_end ... ok
test tests::token_double_backslash ... ok
test tests::token_macro_name ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

With DocumentBegin variant

running 5 tests
test tests::token_double_backslash ... ok
test tests::token_backslash ... ok
test tests::token_environment_end ... ok
test tests::token_macro_name ... ok
test tests::token_environment_begin ... FAILED

failures:

---- tests::token_environment_begin stdout ----
thread 'tests::token_environment_begin' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`: The number of tokens found did not match the expected number of positions [(Backslash, "\\begin"), (Error, "{"), (Error, "e"), (Error, "q"), (Error, "u"), (Error, "a"), (Error, "t"), (Error, "i"), (Error, "o"), (Error, "n"), (Error, "}")]', src/main.rs:80:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    tests::token_environment_begin

test result: FAILED. 4 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

With DocumentBegin variant, without Backslash variant

NOTE: token_backslash test is now ignored (since we removed the variant)

running 5 tests
test tests::token_backslash ... ignored
test tests::token_environment_end ... ok
test tests::token_macro_name ... ok
test tests::token_environment_begin ... FAILED
test tests::token_double_backslash ... ok

failures:

---- tests::token_environment_begin stdout ----
thread 'tests::token_environment_begin' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`: The number of tokens found did not match the expected number of positions [(MacroName, "\\begin"), (Error, "{"), (Error, "e"), (Error, "q"), (Error, "u"), (Error, "a"), (Error, "t"), (Error, "i"), (Error, "o"), (Error, "n")]', src/main.rs:81:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    tests::token_environment_begin

test result: FAILED. 3 passed; 1 failed; 1 ignored; 0 measured; 0 filtered out; finished in 0.00s
maciejhirsz commented 1 year ago

Thanks for the very detailed report. I think I have some idea of what's happening here, namely:

  1. The lexer reads \, which is a potential match.
  2. The lexer reads b and continues matching until it has consumed the whole of \begin, which is also a potential match.
  3. The lexer checks for {, doesn't find it, and settles on the last correctly matched pattern, which it incorrectly assumes to be the match from step 1 instead of step 2.
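The fallback in step 3 can be sketched in plain Rust. This is not logos internals, just an illustration of the maximal-munch behavior the lexer should have; the Tok enum and longest_match helper are hypothetical names for this sketch:

```rust
// Plain-Rust sketch (NOT logos internals) of maximal-munch fallback:
// the lexer must remember the LAST accepted match, so after failing to
// extend \begin into \begin{document} it should fall back to the
// 6-byte \begin, not the 1-byte \.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Tok {
    Backslash,
    EnvironmentBegin,
    DocumentBegin,
}

// Hypothetical helper: the longest literal prefix of `input` wins.
fn longest_match(input: &str) -> Option<(Tok, usize)> {
    let patterns = [
        ("\\", Tok::Backslash),
        ("\\begin", Tok::EnvironmentBegin),
        ("\\begin{document}", Tok::DocumentBegin),
    ];
    patterns
        .iter()
        .filter(|(p, _)| input.starts_with(p))
        .max_by_key(|(p, _)| p.len())
        .map(|&(p, t)| (t, p.len()))
}

fn main() {
    // \begin{equation} cannot extend to \begin{document}, so the correct
    // fallback is EnvironmentBegin (6 bytes), not Backslash (1 byte).
    assert_eq!(
        longest_match("\\begin{equation}"),
        Some((Tok::EnvironmentBegin, 6))
    );
}
```

The bug report is consistent with the fallback resetting to the first accepted state (step 1) rather than the last one (step 2).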

I'll see what I can do about it today.

jeertmans commented 1 year ago

Happy to see you back @maciejhirsz :D

I hope you find a way to solve this! Don't hesitate to ask if I can help!