maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

Strange behaviour when matching 'else' / 'else if' #160

Open irh opened 4 years ago

irh commented 4 years ago

I'm working on a lexer for a language where I'd like to have else and else if lexed as separate tokens, but I'm running into surprising behaviour.

In the following example you can see that else has been lexed as Other:

mod else_if {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex(r"[ ]+", logos::skip)]
        #[error]
        Error,
        #[token("else")]
        Else,
        #[token("else if")]
        ElseIf,
        #[regex(r"[a-z]*")]
        Other,
    }

    #[test]
    fn else_x_else_if_y() {
        let mut lexer = Token::lexer("else x else if y");

        // Expected: assert_eq!(lexer.next().unwrap(), Token::Else);
        assert_eq!(lexer.next().unwrap(), Token::Other);

        assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::ElseIf);
        assert_eq!(lexer.next().unwrap(), Token::Other);
    }
}

Removing the space from else if allows else to be parsed as Else:

mod else_if_2 {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex(r"[ ]+", logos::skip)]
        #[error]
        Error,
        #[token("else")]
        Else,
        #[token("elseif")]
        ElseIf,
        #[regex(r"[a-z]*")]
        Other,
    }

    #[test]
    fn else_x_else_if_y() {
        let mut lexer = Token::lexer("else x elseif y");

        assert_eq!(lexer.next().unwrap(), Token::Else);
        assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::ElseIf);
        assert_eq!(lexer.next().unwrap(), Token::Other);
    }
}

Keeping the space in else if, but shortening the Else token to "e", causes Else to be matched unexpectedly:

mod else_if_3 {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex(r"[ ]+", logos::skip)]
        #[error]
        Error,
        #[token("e")]
        Else,
        #[token("else if")]
        ElseIf,
        #[regex(r"[a-z]*")]
        Other,
    }

    #[test]
    fn else_x_else_if_y() {
        let mut lexer = Token::lexer("else x else if y");

        // Expected: assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::Else);

        assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::ElseIf);
        assert_eq!(lexer.next().unwrap(), Token::Other);
    }
}

My understanding of the token disambiguation documentation is that the first example should work as I'd expect, with Else and ElseIf being matched independently, with higher priority than Other. Do I have that wrong? And is the last example exposing a bug?
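For reference, the behaviour expected here can be sketched in plain Rust without Logos. This is only an illustration of a "longest match wins, literals beat the fallback on a tie" rule, not how Logos is implemented. The names `next_token`, `word_len`, and `lex` are hypothetical, and the fallback is assumed to be `[a-z]+` (one or more letters) rather than the `[a-z]*` used in the examples above.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token {
    Else,
    ElseIf,
    Other,
    Error,
}

/// Length of the leading run of ASCII lowercase letters (the `[a-z]+` fallback).
fn word_len(input: &str) -> usize {
    input.bytes().take_while(|b| b.is_ascii_lowercase()).count()
}

/// Lex one token from the start of a non-empty `input`.
/// Returns the token and the number of bytes it consumed.
fn next_token(input: &str) -> (Token, usize) {
    // Candidates in priority order: literals first, then the fallback.
    let candidates = [
        (Token::ElseIf, if input.starts_with("else if") { 7 } else { 0 }),
        (Token::Else, if input.starts_with("else") { 4 } else { 0 }),
        (Token::Other, word_len(input)),
    ];

    let mut best = (Token::Error, 0);
    for (token, len) in candidates {
        // Only a strictly longer match replaces the current best,
        // so an earlier (higher-priority) candidate wins a tie.
        if len > best.1 {
            best = (token, len);
        }
    }

    if best.1 == 0 {
        // Nothing matched: consume a single character as an error.
        let len = input.chars().next().map_or(1, |c| c.len_utf8());
        return (Token::Error, len);
    }
    best
}

/// Lex the whole input, skipping spaces between tokens.
fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = input;
    loop {
        rest = rest.trim_start_matches(' ');
        if rest.is_empty() {
            break;
        }
        let (token, len) = next_token(rest);
        tokens.push(token);
        rest = &rest[len..];
    }
    tokens
}
```

Under this rule, "else x else if y" comes out as Else, Other, ElseIf, Other, while a longer word such as "elsewhere" stays a single Other because the fallback match is longer than the literal.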

Thanks for your time and the great library!

maciejhirsz commented 4 years ago

This definitely looks like a bug, will have a look as soon as I can, thanks for reporting!

Zenthial commented 1 year ago

I'm currently running into this. Was a solution ever found?