hanickadot / compile-time-regular-expressions

Compile Time Regular Expression in C++
https://twitter.com/hankadusikova
Apache License 2.0
3.32k stars 183 forks source link

Ctre causes stack overflow when using certain text with certain regex ?? #252

Closed beemibrahim closed 2 years ago

beemibrahim commented 2 years ago

This Code Fails miserably :

static constexpr auto line_q = ctll::fixed_string{ "^(\\w+)=\"(.*)\"\\s*$" };
    std::string data = "EXAMPLE=\"TEXT TEXT TEXT TEXT TEXT-TEXT-THAT TEXT.TEXT TEXT TEXT + I I II * I II + I NVM_>U=SOMEMORETEXT#&PhK&2 \"\"TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT\"\"\"";
    std::string_view data_view = data;
    auto matches = ctre::search<line_q>(data_view);  // Line causes stack overflow
beemibrahim commented 2 years ago

It seems to continously call ctre::evalute_recursive ???

hanickadot commented 2 years ago

That's how system stack based regex engines are supposed to work. They are using your system stack to store all possible paths thru your subject.

There are a few problems with your regex:

1) if you have anchors on both sides you shouldn't be using ctre::search but ctre::match as that's what is matching whole input (equivalent of search with ^...$) 2) what's the point for \s* at the end? 2) your pattern should be changed to stop on doublequotes \"([^\"]*)\" otherwise it will skip every doublequotes until the end, and then it fails to match it because of missing doublequotes, and then it tries to add one character to \s*

beemibrahim commented 2 years ago

Oh I thought they were the ctre::search and ctre::match were the same : )

beemibrahim commented 2 years ago

I didn't write that regular expression

hanickadot commented 2 years ago

https://compiler-explorer.com/z/TPj983fzb

beemibrahim commented 2 years ago

Wait it fails with ctre::match too

beemibrahim commented 2 years ago

In perl regex and in std::regex it works , but doesnt work with ctre , it should at least throw an exception that can be caught

hanickadot commented 2 years ago

CTRE is not designed to save you from bad regular expressions, I consider them a program. And in some cases especially greedy expressions can lead to such issue. Consider modifying them into possessive / lazy expressions.

a+ ... greedy a+? ... lazy a++ ... possessive

beemibrahim commented 2 years ago

CTRE is not designed to save you from bad regular expressions, I consider them a program. And in some cases especially greedy expressions can lead to such issue. Consider modifying them into possessive / lazy expressions.

a+ ... greedy a+? ... lazy a++ ... possessive

it that because it is done in compile time ??

hanickadot commented 2 years ago

Because using non-system stack would mean it needs to allocate for each evaluation, which would be expensive. Problem of your expression is it's exploding in complexity O(n^2) and for each path take it's storing its matches. mostly because .* in your expression takes everything including double-quotes inside of other double-quotes. And because it's greedy matching algorithm, it tries every possible match starting with longest. So it needs to store possible way for each character between start of the cycle and end of the subject.

Easiest way how to avoid this complexity (don't get me wrong, but this creates problem for other engines too) is to replace .* inside quotes with lazy one .*?

Lazy tries all paths too, but starts with shortest, so it needs only to store last successful match.

beemibrahim commented 2 years ago

https://compiler-explorer.com/z/TPj983fzb

The output is wrong , it returns the value as

TEXT TEXT TEXT TEXT TEXT-TEXT-THAT TEXT.TEXT TEXT TEXT + I I II * I II + I NVM_>U=SOMEMORETEXT#&PhK&2.

When it should return it as

TEXT TEXT TEXT TEXT TEXT-TEXT-THAT TEXT.TEXT TEXT TEXT + I I II * I II + I NVM_>U=SOMEMORETEXT#&PhK&2 ""TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT""

beemibrahim commented 2 years ago

Because using non-system stack would mean it needs to allocate for each evaluation, which would be expensive. Problem of your expression is it's exploding in complexity O(n^2) and for each path take it's storing its matches. mostly because .* in your expression takes everything including double-quotes inside of other double-quotes. And because it's greedy matching algorithm, it tries every possible match starting with longest. So it needs to store possible way for each character between start of the cycle and end of the subject.

Easiest way how to avoid this complexity (don't get me wrong, but this creates problem for other engines too) is to replace .* inside quotes with lazy one .*?

Lazy tries all paths too, but starts with shortest, so it needs only to store last successful match.

Okay I'll try this

hanickadot commented 2 years ago

https://compiler-explorer.com/z/TPj983fzb

The output is wrong , it returns the value as

TEXT TEXT TEXT TEXT TEXT-TEXT-THAT TEXT.TEXT TEXT TEXT + I I II * I II + I NVM_>U=SOMEMORETEXT#&PhK&2.

When it should return it as

TEXT TEXT TEXT TEXT TEXT-TEXT-THAT TEXT.TEXT TEXT TEXT + I I II * I II + I NVM_>U=SOMEMORETEXT#&PhK&2 ""TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT""

Where to you expect it to stop? .* will go thru whole input, it doesn't count pairs of double-quotes. What input would you expect for:

key=""""""""""""""""""" ?

beemibrahim commented 2 years ago

Because using non-system stack would mean it needs to allocate for each evaluation, which would be expensive. Problem of your expression is it's exploding in complexity O(n^2) and for each path take it's storing its matches. mostly because .* in your expression takes everything including double-quotes inside of other double-quotes. And because it's greedy matching algorithm, it tries every possible match starting with longest. So it needs to store possible way for each character between start of the cycle and end of the subject. Easiest way how to avoid this complexity (don't get me wrong, but this creates problem for other engines too) is to replace .* inside quotes with lazy one .*? Lazy tries all paths too, but starts with shortest, so it needs only to store last successful match.

Okay I'll try this

Hey this works , It returns the right output for me :)

beemibrahim commented 2 years ago

https://compiler-explorer.com/z/TPj983fzb

The output is wrong , it returns the value as TEXT TEXT TEXT TEXT TEXT-TEXT-THAT TEXT.TEXT TEXT TEXT + I I II I II + I NVM_>U=SOMEMORETEXT#&PhK&2. When it should return it as TEXT TEXT TEXT TEXT TEXT-TEXT-THAT TEXT.TEXT TEXT TEXT + I I II I II + I NVM_>U=SOMEMORETEXT#&PhK&2 ""TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT""

Where to you expect it to stop? .* will go thru whole input, it doesn't count pairs of double-quotes. What output would you expect for:

key=""""""""""""""""""" ?

Nevermind

beemibrahim commented 2 years ago

I guess I'm gonna have to be more careful with regex while using ctre :|

hanickadot commented 2 years ago

This applies to every back-tracking regex engine. Only engines immune against this are non-backtracking, and they are not able to provide other features.