maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

Stack overflow at ~20,000 characters #384

Open rscarson opened 6 months ago

rscarson commented 6 months ago

I have a StringLiteral token with the following expressions on it:

    #[regex(r#"(?:/(?:\\.|[^\\/])+/[a-zA-Z]*)"#)] // Regex literal
    #[regex(r#"(?:"(?:(?:[^"\\])|(?:\\.))*")"#)] // " string literal "
    #[regex(r#"(?:'(?:(?:[^'\\])|(?:\\.))*')"#)] // ' string literal '

Encountering a string literal longer than ~20,000 characters causes a stack overflow within logos-generated code. Is there an error in one of these expressions?

maciejhirsz commented 6 months ago

Groups are never capturing in logos, so you don't need ?:, and you don't need groups around |, so the string literal can be rewritten as #[regex(r#""([^"\\]|\\.)*""#)]. If that still overflows, try #[regex(r#""([^"\\]+|\\.)*""#)]. I believe it loops internally using tail-call recursion, so greedily looping over all non-escaped, non-quote characters inside the group could help (but I have not tested it).
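To make the reasoning behind the second pattern concrete, here is a rough model (not logos code) that counts how many times the outer `(...)*` group would iterate over the text between the quotes. The function name `group_iterations` and the `greedy_runs` flag are made up for illustration, and the model assumes each group iteration costs one nested call, which is the belief stated above rather than something verified against the generated code.

```rust
/// Counts iterations of the outer `(...)*` group over `body`, the text
/// between the quotes. `body` is assumed to contain no unescaped `"`.
///
/// greedy_runs = false models `([^"\\]|\\.)*`  (one plain char per iteration)
/// greedy_runs = true  models `([^"\\]+|\\.)*` (one run of plain chars per iteration)
fn group_iterations(body: &str, greedy_runs: bool) -> usize {
    let mut chars = body.chars().peekable();
    let mut iterations = 0;
    while let Some(c) = chars.next() {
        iterations += 1;
        if c == '\\' {
            // `\\.` alternative: the escaped character belongs to this iteration.
            chars.next();
        } else if greedy_runs {
            // `[^"\\]+` alternative: swallow the rest of the plain run.
            while let Some(&next) = chars.peek() {
                if next == '\\' {
                    break;
                }
                chars.next();
            }
        }
    }
    iterations
}
```

Under this model, a 20,000-character body without escapes takes 20,000 iterations with the first pattern and a single iteration with the second. A body with many escape sequences still needs one iteration per escape and one per run between escapes, so the `+` variant would raise the limit rather than remove it.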

This is one of those cases that really needs the codegen rewrite (#291) into a loop with enum branching working as faux-goto instead of function calls.
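As an illustration of the "loop with enum branching" idea, here is a minimal hand-written sketch that recognises `"([^"\\]|\\.)*"` with a state enum and a single loop. It is not the codegen from #291 and not what logos emits; `State` and `scan_string_literal` are hypothetical names, and for simplicity an escape is allowed to consume any character.

```rust
/// States of a hand-written scanner for `"([^"\\]|\\.)*"`.
#[derive(Clone, Copy)]
enum State {
    Start,  // expecting the opening quote
    Body,   // inside the literal
    Escape, // just consumed a backslash
}

/// Returns the byte length of the string literal at the start of `input`,
/// or `None` if `input` does not start with a complete literal.
fn scan_string_literal(input: &str) -> Option<usize> {
    let mut state = State::Start;
    let mut chars = input.char_indices();
    loop {
        // Running out of input before the closing quote means no match.
        let (idx, c) = chars.next()?;
        // Each arm picks the next state; no function call per character.
        state = match (state, c) {
            (State::Start, '"') => State::Body,
            (State::Start, _) => return None,
            (State::Body, '"') => return Some(idx + 1), // closing quote is 1 byte
            (State::Body, '\\') => State::Escape,
            (State::Body, _) => State::Body,
            (State::Escape, _) => State::Body,
        };
    }
}
```

Because the state lives in a local variable and the loop never calls itself, stack usage stays constant regardless of how long the literal is.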

rscarson commented 6 months ago

I had already rewritten it as #[regex(r#""([^"\\]|\\.)*""#)] with no luck, but #[regex(r#""([^"\\]+|\\.)*""#)] did the trick! Thanks!

rscarson commented 6 months ago

It still overflows at some point though, just at a larger input size.

Should I leave this issue open for tracking?
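Until the recursion itself is removed, one general stopgap (not something proposed in this thread) is to run the lexing on a thread with a larger stack via `std::thread::Builder::stack_size`. The sketch below uses only the standard library; `run_with_stack` is a hypothetical helper, and `recursive_len` is merely a stand-in for a lexer that uses one stack frame per input byte.

```rust
use std::thread;

/// Runs `work` on a freshly spawned thread with `stack_bytes` of stack
/// and returns its result.
fn run_with_stack<T, F>(stack_bytes: usize, work: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

/// Stand-in for a recursive lexer: recursion depth equals input length.
fn recursive_len(bytes: &[u8]) -> usize {
    match bytes.split_first() {
        Some((_, rest)) => 1 + recursive_len(rest),
        None => 0,
    }
}
```

A call such as `run_with_stack(64 * 1024 * 1024, move || recursive_len(&input))` only moves the limit further out. A stack overflow aborts the process rather than panicking, so it cannot be caught and recovered from.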

maciejhirsz commented 6 months ago

> Should I leave this issue open for tracking?

Ye, let's do that.