maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.88k stars 116 forks source link

No longest match for similar regexs #75

Closed shlnz closed 4 years ago

shlnz commented 5 years ago

Hi,

while playing with Logos, I noticed that for similar regex's (which share the same start, for example) the longest match is not necessarily chosen. I don't know if this is the same issue as #18, because in this case two regexs create the problem.

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[end]
    End,
    #[error]
    Error,
    #[regex = "[a-zA-Z]+"]
    Text,
    #[regex = r"[a-zA-Z0-9+/]+[=]*"]
    Base64,
}

fn main() {
    let lexer = Token::lexer("SGVsbG8gV29ybGQ=");
    assert_eq!(lexer.token, Token::Base64);
}

I tried the code with version 0.9.7 and version 0.10.0-rc2, but both versions behave the same.

maciejhirsz commented 5 years ago

Thanks for reporting this!

maciejhirsz commented 4 years ago

This should be already fixed in 0.10.0-rc3, however there might still be disambiguation issues if you produce a string that matches either Text or Base64, since the new disambiguation rules would give both regexes priority 1. I'll add a priority flag that you can attach to declarations, to make it possible to bump it up.

pascalkuthe commented 4 years ago

Hi I encountered a bug using the newest (rc5) version. I am unsure if it is realted so I am posting it here for now. The following tokens show some weird bevaior

    #[regex = r"[0-9][0-9_]*"]
    LiteralUnsignedNumber,
    #[regex = r"[0-9][0-9_]*\.[0-9][0-9_]*[TGMKkmupfa]"]
    LiteralRealNumberDotScaleChar,
    #[regex = r"[0-9][0-9_]*\.[0-9][0-9_]*[eE][+-]?[0-9][0-9_]*"]
    LiteralRealNumberDotExp,
     #[regex = r"[0-9][0-9_]*[TGMKkmupfa]"]
    LiteralRealNumberScaleChar,
     #[regex = r"[0-9][0-9_]*[eE][+-]?[0-9][0-9_]*"]
    LiteralRealNumberExp,
    #[regex = r"[0-9][0-9_]*\.[0-9][0-9_]*"]
    LiteralRealNumberDot,

All strings that should match LiteralRealNumberDot (0.1, 3.141) match LiteralUnsignedNumber but the rest matches correctly. When I remove LiteralUnsignedNumber. Strings that should match LiteralRealNumberDot don't match at all. But when I removeLiteralRealNumberScaleChar or LiteralRealNumberExp the problems disappears and I can even add LiteralUnsignedNumber without the previous issue reappearing.

maciejhirsz commented 4 years ago

That is definitely a bug, can reproduce, let me see if I can fix it quickly...

maciejhirsz commented 4 years ago

Well, I know where the bug in the code is, but I can't seem to fix it right now without breaking other tests. In short, the graph produced for relevant nodes looks as follows:

(...)
               40: {
                   [0-9] ⇒ 40,
                   E ⇒ 12,
                   G ⇒ 3,
                   K ⇒ 3,
                   M ⇒ 3,
                   T ⇒ 3,
                   _ ⇒ 40,
                   a ⇒ 3,
                   e ⇒ 12,
                   f ⇒ 3,
                   k ⇒ 3,
                   m ⇒ 3,
                   p ⇒ 3,
                   u ⇒ 3,
               },
               44: [0-9] ⇒ 40,
               47: {
                   . ⇒ 44,
                   [0-9] ⇒ 47,
                   E ⇒ 25,
                   G ⇒ 18,
                   K ⇒ 18,
                   M ⇒ 18,
                   T ⇒ 18,
                   _ ⇒ 47,
                   a ⇒ 18,
                   e ⇒ 25,
                   f ⇒ 18,
                   k ⇒ 18,
                   m ⇒ 18,
                   p ⇒ 18,
                   u ⇒ 18,
                   _ ⇒ 48,
               },
(...)

47 is a loop for integers without a fraction, on . it jumps to 44, expects one digit, then jumps to 40. 40 is a loop, and somewhere along the way some node merging happened that removed the escape branch from 40, which means it will loop and, given absence of a definite error, backtrack.

So there are actually two separate bugs here:

  1. In the generated code: after bumping the lexer past . backtracking should be disabled, because the produced token will be incorrect, so this should be a hard error.
  2. The missing branch.

First is easy, looking into 2.

maciejhirsz commented 4 years ago

@DSPOM2 I've narrowed it down, but still need to spend some time to figure out a proper solution to this, so will get back to this tomorrow morning.

In the mean time, if you move LiteralRealNumberDot up so that it's first in that list of variants, the branch merging works as intended, so that can be a temporary solution till I have this worked out.

pascalkuthe commented 4 years ago

Thank you so much for the quick response! I already tried to move LiteralRealNumberDot but apperently I missed most obvious solution

maciejhirsz commented 4 years ago

So I found the bug, my merging strategy was conservative and I was only merging forks with forks (think: match), skipping ropes (think: if let Some(...) for more than one byte), which caused one of the exit branches to be dropped.

The fix was to actually be more aggressive with the merges, which also produces a more compact graph, which in turn, for a reason that is beyond me right now, sped up my new strings benchmark by some 40%... I added more tests to make sure the benchmarks are actually running correctly, they are. Gonna release rc6 in a moment.