hanickadot / compile-time-regular-expressions

Compile Time Regular Expression in C++
https://twitter.com/hankadusikova
Apache License 2.0

Unexpected behavior with '?' quantifier #281

Closed: Minty-Meeo closed this 1 year ago

Minty-Meeo commented 1 year ago

STL regex: https://godbolt.org/z/vMqbTdz3a
CTRE regex: https://godbolt.org/z/fKvGKjTc1

I am attempting to switch away from std::regex in a project that reads a multi-line text file, sort of like a binary file, by using regular expressions. I found the '?' quantifier useful for supporting both CRLF and LF line endings, but after switching to CTRE my code broke: for some reason, the carriage return character is being captured. Is this a defect in CTRE, or am I doing something wrong?

hanickadot commented 1 year ago

Hi,

1) In multiline mode, CTRE currently treats only \n as a newline. Hence .+ consumes the \r, the optional \r? matches nothing, and then \n matches. That is the current state of things; properly handling \r\n would fix it (changes would be needed in evaluation.hpp).

2) You can avoid this issue by using [^\r\n]+ instead of .+. This triggers an optimization that turns the loop into a possessive one, which also gives you a much faster regex.

3) (Bonus) If you are parsing some sort of document, look at ctre::tokenize or ctre::range:

for (auto match: ctre::tokenize<regex>(subject)) {
  // each match
}

ctre::tokenize is equivalent to repeated calls of ctre::starts_with; ctre::range is equivalent to repeated calls of ctre::search.
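
To make items 1 and 2 concrete, here is a minimal sketch that uses std::regex rather than CTRE, so it builds without any third-party header. Two assumptions: the patterns are hypothetical (the original ones are only in the godbolt links), and the dot from item 1 is emulated as [^\n], because the default ECMAScript dot in std::regex also excludes \r and so would hide the effect. `first_capture`, `greedy_dot` and `negated_class` are illustrative names, not part of any library.

```cpp
#include <regex>
#include <string>

// Returns capture group 1 of `pattern` matched at the very start of `text`,
// or an empty string when the pattern does not match there.
std::string first_capture(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    // match_continuous anchors the search at the first character.
    if (std::regex_search(text, m, re, std::regex_constants::match_continuous)) {
        return m[1].str();
    }
    return {};
}

// Hypothetical patterns. The C++ escapes \r and \n become the real control
// characters before the regex engine sees them.
// "[^\n]" emulates a dot that excludes only '\n' (assumption, see lead-in).
const std::string greedy_dot = "([^\n]+)\r?\n";
// Item 2: exclude '\r' as well, so the loop can never swallow it.
const std::string negated_class = "([^\r\n]+)\r?\n";
```

With the input "abc\r\nrest", `first_capture(greedy_dot, ...)` yields "abc\r": the greedy loop takes the \r and \r? then matches nothing. `first_capture(negated_class, ...)` yields "abc" for both CRLF and LF input.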

hanickadot commented 1 year ago

4) Alternatively, you can change the loop into .+?, which is a lazy loop: it always tries to match what follows before consuming another character.
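
The lazy variant can be sketched the same way. This again uses std::regex instead of CTRE, with [^\n]+? standing in for a lazy .+? whose dot excludes only \n (an assumption made for illustration). The helper name `split_lines_lazy` is hypothetical. std::sregex_iterator performs repeated searches over the input.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the content of every non-empty line, accepting LF and CRLF endings.
std::vector<std::string> split_lines_lazy(const std::string& text) {
    // Lazy loop: after each consumed character the engine first tries
    // "\r?\n", so a trailing '\r' goes to "\r?" and not to the capture.
    static const std::regex line("([^\n]+?)\r?\n");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), line), end;
         it != end; ++it) {
        out.push_back((*it)[1].str());  // group 1: the line without its ending
    }
    return out;
}
```

For the input "one\r\ntwo\nthree\r\n" this returns the three strings "one", "two" and "three", with no carriage return in any of them. A lazy loop re-checks the tail of the pattern after every character, so the negated character class from item 2 is likely the cheaper option when speed matters.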

Minty-Meeo commented 1 year ago

Thank you for your quick response.