hanickadot / compile-time-regular-expressions

Compile Time Regular Expression in C++
https://twitter.com/hankadusikova
Apache License 2.0

PCRE "single-line mode" not properly represented in CTRE #282

Open Minty-Meeo opened 1 year ago

Minty-Meeo commented 1 year ago

I want to preface this with the fact that I am quite inexperienced with regular expressions, so I may be wrong about some things.

When I created issue #281, the example I linked for CTRE used a ctre::multiline_starts_with. This was because it was a simplified snippet from a personal project I am attempting to convert to using CTRE. I intended to use ctre::starts_with, as that is the direct analogue for the std::regex mode I was using before. However, ctre::starts_with consistently caused stack overflow crashes. I have now discovered, through trial and error, why this was.

STL: https://godbolt.org/z/vP9YqGP3v CTRE: https://godbolt.org/z/bedTY8jxo

I do not know how to describe it, but it seems that regular expressions of various flavors (when not in multi-line mode) have special rules for the '\n' and '\r' characters that CTRE does not follow. I found a website that helps support this claim: https://regex101.com/r/Syt781/1. Notice that the regex behaves identically in ECMAScript, PCRE, and PCRE2 modes. I say it is a special rule for these characters in particular because other characters, including escape sequences like '\a', do still result in the greedy capture going too far with std::regex: https://godbolt.org/z/1cj3KqMas.

Minty-Meeo commented 1 year ago

I think there is code that tries to achieve this in ctre::evaluate, but it is hidden behind multi-line mode.

Minty-Meeo commented 1 year ago

Here is a simplified example of the std::regex behavior on a string containing '\r' or '\n'. https://godbolt.org/z/q555G3hdo Even when not part of the expression, '\r' or '\n' halts any capture. I had no clue my project relied on this behavior until just today.
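For readers who cannot open the Compiler Explorer links, the behavior described above can be reproduced with a minimal sketch along these lines. The helper name greedy_prefix is made up for illustration and is not taken from the linked snippets; it only assumes std::regex with its default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the linked snippets): returns whatever a
// greedy "(.*)" anchored at the start of `input` captures when std::regex
// uses its default ECMAScript grammar.
std::string greedy_prefix(const std::string& input) {
    static const std::regex re("^(.*)");  // ECMAScript is the default grammar
    std::smatch m;
    if (std::regex_search(input, m, re)) {
        return m[1].str();
    }
    return {};
}
```

With this helper, greedy_prefix("abc\ndef") and greedy_prefix("abc\rdef") both stop at "abc", while greedy_prefix("abc\adef") returns the whole input, because '\a' is not a line terminator.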

Minty-Meeo commented 1 year ago

It seems like ECMAScript is the only flavor available to std::regex with this special rule for '\n' and '\r'. I don't know enough about PCRE to know if the same is true, or if this is the nature of "multi-line" mode for PCRE and it is simply on by default in any online examples I can find.
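A rough way to check this is to run the same pattern under two of the std::regex grammars. The sketch below is only an illustration: the function name greedy_prefix_with is invented here, and the result for the POSIX extended grammar reflects how I understand common standard library implementations to behave (the POSIX dot matches any character except NUL), so treat it as an assumption rather than a guarantee.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: captures a greedy "(.*)" from `input` using the
// grammar passed in `grammar` (e.g. std::regex::ECMAScript or
// std::regex::extended).
std::string greedy_prefix_with(const std::string& input,
                               std::regex::flag_type grammar) {
    const std::regex re("(.*)", grammar);
    std::smatch m;
    if (std::regex_search(input, m, re)) {
        return m[1].str();
    }
    return {};
}
```

Under std::regex::ECMAScript the capture for "abc\ndef" stops at "abc"; under std::regex::extended the dot is not subject to the line-terminator rule, so the capture should cover the whole string.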

iulian-rusu commented 1 year ago

"Even when not part of the expression, '\r' or '\n' halts any capture. I had no clue my project relied on this behavior until just today."

In most regex flavors, the . metacharacter does not match line breaks (\r or \n) by default. As far as I know, CTRE instead lets . match anything by default, including line breaks. std::regex does not behave that way, which is why it halted the capture once it found a line break.

Here is a useful website which explains how the dot character works. In short, there is this flag called "single-line" (or sometimes "dotall") which makes the dot actually match line breaks.

I usually use something like [\d\D] or sometimes [^] (if this syntax is supported) when I want to be absolutely sure the pattern will match anything.
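As a small illustration of the [\d\D] trick, here is a sketch using std::regex (ECMAScript grammar). The helper name capture_all is made up for this example, and [^] is left out because not every engine accepts it.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: captures everything from the start of `input`,
// using the class [\d\D] ("a digit or a non-digit") instead of the dot,
// so that line breaks are matched as well.
std::string capture_all(const std::string& input) {
    static const std::regex re(R"re(^([\d\D]*))re");
    std::smatch m;
    if (std::regex_search(input, m, re)) {
        return m[1].str();
    }
    return {};
}
```

Here capture_all("abc\ndef\r\nghi") returns the full input, line breaks included, whereas the same pattern written with . would stop at the first '\n'.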

Minty-Meeo commented 1 year ago

I see, so this is a quirk exclusive to Perl-Compatible Regular Expressions. I think CTRE makes the mistake of assuming multi-line mode is the opposite of single-line mode, like this website says. While making PR #283, I found in the source code that multi-line mode is what enables the behavior of never matching '\r' or '\n' for CTRE.
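To show that the two modes are separate knobs in at least one other engine, here is a sketch using std::regex. It assumes C++17 and a standard library that implements std::regex_constants::multiline (not all of them do), and the helper name found is invented for this example. It says nothing about how CTRE itself is implemented.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `input`
// when compiled with the given syntax flags.
bool found(const std::string& input, const char* pattern,
           std::regex::flag_type flags) {
    return std::regex_search(input, std::regex(pattern, flags));
}

// ECMAScript grammar with and without the C++17 multiline option.
const std::regex::flag_type plain = std::regex::ECMAScript;
const std::regex::flag_type multi =
    std::regex::ECMAScript | std::regex_constants::multiline;
```

With the input "abc\ndef", the pattern "^def" is only found when multi is used, because multi-line mode changes where ^ may match. The pattern "c.d" is found with neither flag set, because the dot refusing to match '\n' is independent of multi-line mode.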

Minty-Meeo commented 1 year ago

Oh dear, this documentation you linked says PCRE is supposed to allow configuring which characters are line endings. So my PR isn't really PCRE-valid; it just matches the std::regex behavior now. This is a complicated topic.