maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

Regex seems to always take priority #377

Open afreeland opened 4 months ago

afreeland commented 4 months ago

I'm new to Rust and new to Logos, so this could just be me, but when I use a regex it seems to always stomp on the other tokens. The snippet below defines three tokens: one represents the action (alert), another the protocol (tls), and the third the network information.

Here is the code with a regex

use logos::Logos;

#[derive(Debug, Logos, PartialEq)]
enum Token {
    #[token("alert", priority = 2500)]
    Action,

    #[token("tls", priority = 200)]
    Protocol,

    #[regex(r"([^\s]+) ([^\s]+) (->|<-) ([^\s]+) ([^\s]+)", priority = 0)]
    NetworkInfo,

    Error,
}

struct SuricataLexer<'a> {
    lexer: logos::Lexer<'a, Token>,
}

impl<'a> SuricataLexer<'a> {
    fn new(input: &'a str) -> Self {
        SuricataLexer {
            lexer: Token::lexer(input),
        }
    }

    fn next_token(&mut self) -> Token {
        self.lexer.next().unwrap().unwrap_or(Token::Error)
    }
}

fn main() {
    // Sample Suricata rule
    let input = "alert tls $HOME_NET any -> $EXTERNAL_NET any (msg:\"some bs\")";

    // Create a SuricataLexer instance
    let mut lexer = SuricataLexer::new(input);

    // Tokenize the input and print the results
    while lexer.lexer.span().end < input.len() {
        let token = lexer.next_token();
        println!("{:?}", token);
    }
}

This outputs:

Error
Error
NetworkInfo
Error
Error

However, if I comment out the NetworkInfo section, my Action and Protocol will work just fine. Output:

Action
Error
Protocol
Error
Error
....

The $HOME_NET any -> $EXTERNAL_NET any part of the input represents a source host, source port, direction, destination host, and destination port. These values are fairly free-form, so I'm not sure how I would match them without a regex.

Is there a way to keep the regex from overpowering everything around it, or am I doing something incorrectly? I read the token disambiguation documentation but couldn't find a way to lower the regex's priority.

jeertmans commented 4 months ago

Hello, thanks for sharing your issue!

First, let me suggest this simpler MWE, so it is easier to debug:

use logos::Logos;

#[derive(Debug, Logos)]
enum Token {
    #[token("alert")]
    Action,

    #[token("tls")]
    Protocol,

    #[regex(r"([^\s]+) ([^\s]+) (->|<-) ([^\s]+) ([^\s]+)")]
    NetworkInfo,
}

fn main() {
    let input = "alert tls $HOME_NET any -> $EXTERNAL_NET any (msg:\"some bs\")";

    let mut lexer = Token::lexer(input);
    while let Some(token) = lexer.next() {
        println!("{:?}", token);
    }
}

Second, I think this is a duplicate of #358, and maybe #265. Hopefully, the bug fix mentioned in #265 by @jameshurt might solve this, but I am waiting for a reply :-)
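
For reference, here is a hand-written, std-only sketch of the tokenization the report appears to expect: try the long network-info pattern first and, if it does not match at the current position, fall back to the plain alert and tls keywords. It does not use Logos and says nothing about how Logos works internally. The helper names (word, match_network_info, lex) and the Other catch-all variant are hypothetical and exist only for this illustration. It also matches whole whitespace-delimited words, which is an assumption that differs from Logos's prefix matching of literal tokens.

```rust
/// Token kinds mirroring the enum in the report. `Other` is a hypothetical
/// catch-all for words that match nothing else.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Action,
    Protocol,
    NetworkInfo(&'a str),
    Other(&'a str),
}

/// Split off a leading run of non-whitespace characters.
/// Returns `(word, remainder)`, or `None` if `s` does not start with one.
fn word(s: &str) -> Option<(&str, &str)> {
    let end = s.find(char::is_whitespace).unwrap_or(s.len());
    if end == 0 {
        None
    } else {
        Some(s.split_at(end))
    }
}

/// Hand-rolled equivalent of
/// `([^\s]+) ([^\s]+) (->|<-) ([^\s]+) ([^\s]+)` anchored at the start of `s`.
/// Returns the byte length of the match, if any.
fn match_network_info(s: &str) -> Option<usize> {
    let mut rest = s;
    for i in 0..5 {
        let (field, tail) = word(rest)?;
        // The third field must be a direction arrow.
        if i == 2 && field != "->" && field != "<-" {
            return None;
        }
        rest = tail;
        // Fields are separated by exactly one literal space.
        if i < 4 {
            rest = rest.strip_prefix(' ')?;
        }
    }
    Some(s.len() - rest.len())
}

/// Tokenize `input`, skipping whitespace between tokens.
fn lex(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut rest = input.trim_start();
    while !rest.is_empty() {
        let (token, len) = if let Some(len) = match_network_info(rest) {
            // Longest candidate first.
            (Token::NetworkInfo(&rest[..len]), len)
        } else {
            // Fall back to the shorter keyword tokens.
            let (first, _) = word(rest).expect("rest starts with a non-whitespace char");
            let token = match first {
                "alert" => Token::Action,
                "tls" => Token::Protocol,
                other => Token::Other(other),
            };
            (token, first.len())
        };
        tokens.push(token);
        rest = rest[len..].trim_start();
    }
    tokens
}

fn main() {
    let input = "alert tls $HOME_NET any -> $EXTERNAL_NET any (msg:\"some bs\")";
    for token in lex(input) {
        println!("{:?}", token);
    }
}
```

On the sample rule this yields Action, Protocol, and then a single NetworkInfo covering $HOME_NET any -> $EXTERNAL_NET any, followed by two catch-all words for the trailing option text. The fallback step is the key point: when the five-field pattern fails partway through "alert tls $HOME_NET", the shorter alert match is still accepted instead of being reported as an error.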