hanickadot / compile-time-regular-expressions

Compile Time Regular Expression in C++
https://twitter.com/hankadusikova
Apache License 2.0
3.37k stars 186 forks source link

Valid regex does not match #257

Open ghlecl opened 2 years ago

ghlecl commented 2 years ago

Hello. I have the following regex:

#include "ctll/fixed_string.hpp"
#include "ctre.hpp"

static constexpr auto ptv_regex = ctll::fixed_string(
    "^(\\d+)(?:[vV](\\d+))?[Pp][Tt][Vv](?:(\\d+)|(\\d+)([AaBbCc]?))?(!)?(?:$|\\^(.+))"
);

which I try to match using:

auto found = ctre::search<ptv_regex>( "1v2PTV1a" );

On regex101.com, this regex matches for the python, PCRE and PCRE2 regex "engines", so I believe the regex is valid and OK. With the CTRE library (version 3.6 from vcpkg and that which is available as trunk on compiler explorer today), it does not match.

Sorry that I don't have the knowledge to try and pinpoint the problem. Just thought I would report it.

Alcaro commented 2 years ago

Yep, here's a bug. I can reduce it to

#include "ctll/fixed_string.hpp"
#include "ctre.hpp"

static constexpr auto ptv_regex = ctll::fixed_string(
    "(?:a|ab)?"
);

bool x()
{
    auto found = ctre::match<ptv_regex>( "ab" );
    return found;
}

https://godbolt.org/z/jqn9vrrWc

hanickadot commented 2 years ago

Thanks for the minimisation, it seems to me there is a deeper problem, and I'm currently don't know how to fix it. Will try to look at it soon.

hanickadot commented 2 years ago

It seems to related when you have a common prefix in selection inside a cycle.

iulian-rusu commented 2 years ago

I am not an expert on how this library implements matching, but it seems to me that there are two factors:

  1. Alternations of the form a|ab will always try to match a before ab.
  2. Cycles are implemented by repeatedly calling evaluate on the content of the cycle:
    while (less_than_or_infinite<B>(i)) {
    auto inner_result = evaluate(..., ctll::list<Content..., end_cycle_mark>());
    // ... match Tail if inner_result matches
    }

    The problem is that, at some point, the expression a|ab has to figure that it needs to match ab immediately instead of a, even if it can only match the prefix a. This cannot happen if the content of the cycle is matched as ctll::list<Content..., end_cycle_mark>, because Content has no idea that Tail comes after it and that it might need to backtrack based on Tail's ability to match. The evaluation of Content thinks the regex ends at end_cycle_mark.

I am writing this because I have also implemented compile-time regular expression matching as a hobby-project (inspired by this library, of course) and I have come across the same bug in my code. The solution that i have found involves using a callback function that calls evaluate for the rest of the expression and passing this function to the evaluation method for Content, so that, after matching Content, the same evaluate method can use the callback to check if Tail matches. It's basically a continuation-passing style approach.