hanickadot / compile-time-regular-expressions

Compile Time Regular Expression in C++
Apache License 2.0

ctre gives different result compared with icu and rust #286

Open DamonsJ opened 1 year ago

DamonsJ commented 1 year ago

Here is the test code:

int test2()
{
    using namespace std::literals;
    //std::string original = "𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘 𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘";
    std::string original = "戦場のヴァルキュリア3";
    auto bdata = original.data();
    static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;

    std::string_view cur_data((char*)original.data(),original.size());

    std::vector<std::pair<std::pair<int32_t, int32_t>, bool>> splits;
    splits.reserve(original.size());
    int prev = 0;
    bool is_matched =false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched){

            auto start_byte_index =  matched.begin() - original.data();
            auto end_byte_index =  matched.end() - original.data();

            if (prev != start_byte_index) {
                std::pair<int32_t, int32_t> p(prev, start_byte_index);
                splits.push_back(
                                 std::pair<std::pair<int32_t, int32_t>, bool>(p, false));
            }
            std::pair<int32_t, int32_t> p(start_byte_index, end_byte_index);
            splits.push_back(std::pair<std::pair<int32_t, int32_t>, bool>(p,
                                                                          true));
            prev = end_byte_index;
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while(is_matched);
    return 0;
}

Rust and ICU give the same result: the matched string is "戦場のヴァルキュリア3". ctre gives two parts, "戦場のヴァルキュリア" and "3". Why does that happen?

hanickadot commented 1 year ago

Can you minimize it?

DamonsJ commented 1 year ago

Yes!

int test2()
{

    std::string original = "戦場のヴァルキュリア3";
    int size_of_str = original.size(); // size_of_str = 31;
    auto bdata = original.data();
    static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;
    std::string_view cur_data((char*)original.data(),original.size());

    int prev = 0;
    bool is_matched =false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched){
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while(is_matched);
    return 0;
}

This code gives me two matches: one is "戦場のヴァルキュリア" and the other is "3".

But when I run the same regex search with the ICU library and with Rust, they give me a single match: "戦場のヴァルキュリア3". Why does that happen?

DamonsJ commented 1 year ago

By the way, if I use this string: std::string original = "Media.Vision"; then ctre, the ICU library, and Rust all give the same three matches:

  1. "Media"
  2. "."
  3. "Vision"

iulian-rusu commented 1 year ago

\w+ in Rust is Unicode-aware: it matches word characters in any script (roughly equivalent to [\p{L}\p{N}_]). In PCRE it only matches ASCII letters, digits, and underscore by default.

https://regex101.com/r/jVmHsw/1
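
To make the ASCII-only behaviour concrete, here is a minimal standard-library sketch that emulates a repeated search for \w+|[^\w\s]+ over raw bytes. It does not use ctre, and the helper names (is_ascii_word, is_ascii_space, split_ascii_only) are hypothetical, chosen only for this illustration. The assumption is that \w and \s are the ASCII classes described above, so every byte of a multi-byte UTF-8 character falls into [^\w\s].

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// ASCII-only \w: letters, digits and underscore.
constexpr bool is_ascii_word(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// ASCII-only \s: space, \t, \n, \v, \f, \r.
constexpr bool is_ascii_space(unsigned char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

// Hypothetical helper: returns the pieces a repeated search for
// \w+|[^\w\s]+ would produce if both classes were ASCII-only.
// The returned views point into `text`, which must outlive them.
std::vector<std::string_view> split_ascii_only(std::string_view text) {
    std::vector<std::string_view> pieces;
    std::size_t i = 0;
    while (i < text.size()) {
        const auto first = static_cast<unsigned char>(text[i]);
        if (is_ascii_space(first)) {  // whitespace matches neither branch
            ++i;
            continue;
        }
        // A run continues while bytes stay in the same class as the first one.
        const bool word_run = is_ascii_word(first);
        std::size_t j = i + 1;
        while (j < text.size()) {
            const auto next = static_cast<unsigned char>(text[j]);
            if (is_ascii_space(next) || is_ascii_word(next) != word_run) break;
            ++j;
        }
        pieces.push_back(text.substr(i, j - i));
        i = j;
    }
    return pieces;
}
```

Under these assumptions, "戦場のヴァルキュリア3" splits into a 30-byte non-word run followed by the word run "3", because all UTF-8 bytes of the Japanese characters are outside the ASCII range. "Media.Vision" splits into "Media", "." and "Vision", which is why all three engines agree on that input.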

marzer commented 1 year ago

For a compile-time regex library to be fully Unicode-aware is a huge ask, FYI @DamonsJ. Unicode is incredibly complex, requiring lots of very large lookup-tables and other short-circuiting mechanisms to implement all the code point identification logic correctly and efficiently.

DamonsJ commented 1 year ago

Thanks very much @marzer @iulian-rusu

I know it is hard to fully support Unicode in a regex library!

For my problem, I wrote the pattern like this:

static constexpr auto pattern = ctll::fixed_string{
        "[\\p{L}\\p{N}\\p{M}\\p{Pc}]+|[^\\p{L}\\p{N}\\p{M}\\p{Pc}\\p{Zs}\\u{A}\\u{B}\\u{C}\\u{D}"
        "\\u{85}\\u{2028}\\u{2029}\\u{DA}]+"};

It works for me, although it is not exactly the same as:

static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};

I hope this helps others who have the same problem!