maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.85k stars 115 forks source link

[BUG] Logos is unable to recover because of whitespace #358

Open Rafaeltheraven opened 8 months ago

Rafaeltheraven commented 8 months ago

Not exactly sure how to title this issue or how to describe this bug. I'll present you the following enum:

#[logos(skip r"/\*([^*]|\*[^/])+\*/|[ \t\n\f]+")]
pub enum Metadata {
    #[regex(r#"[a-zA-Z][a-zA-Z0-9]*[\s]*:[\s]*[^:;][^;]*"#, |lex| split_to_twople(lex.slice(), ':'))]
    Map((String, String)),

    #[token("include")]
    Include,

        #[token(";")]
        Semi,

        #[regex(r"[a-zA-Z]*(::[a-zA-Z]*)+", |lex| lex.slice().to_string())]
        Path(String),
}

Given the input string: "include some::path" this will make the lexer return an error.

Some explanation of the enum. The regex for Map describes a key: value pair. The key must be alphanumeric. The value can be anything. Each pair is terminated with a semicolon. The paths are Rust style paths import paths. You can ignore the split_to_twople method. It returns a specific error in case of failure, as opposed to the generic error that is being returned by the issue. We can be sure that's not where the error lies.

Through experimenting, I have found that removing the [\s]* to the left of the : in the Map regex fixes the issue, but as a result, you are no longer able to do key : value, which I would like to still lex.

It seems like somewhere, after it has tried the whitespace and then fails on the : it is unable to recover and try other valid lexes (in this case, include).

jeertmans commented 8 months ago

Hello!

Did you try without the whitespace? Your skip regex is quite complicated, but I guess it should match white spaces. Maybe you can try with a simple skip whitespace regex (so remove other tokens) and see if it works.

In general, you regexes seem overly complicated for what they should achieve. Also, I am not sure you should put \s inside [] for white spaces matching. Maybe simplifying your regexes can first help.

Rafaeltheraven commented 8 months ago

The skip regex matches white space, as well as comments (// and /* */ style). IIRC I got it from one of the issues on this repo. Simplifying it down to just [ \t\n\f]+ does not resolve the problem.

I agree that my regexes are much too complicated, and I've resolved to doing more simple lexing and solving this problem on the parser level. However, the issue still stands that (at least it seems to me) this should be possible and there is some bug somewhere that causes this to fail.

Finally, yeah you don't have to put \s inside [], but I also don't think it changes anything (as opposed to just matching \s it matches anything in the set [\s] which is just \s). I think I started doing that because of the alternative [:space:] notation which does require [].

Rafaeltheraven commented 8 months ago

Update, simplifying the regex to [a-zA-Z][a-zA-Z0-9]*\s*: gives the exact same issues.