Open DamonsJ opened 1 year ago
Can you minimize it?
Yes!

    #include <ctre.hpp>
    #include <string>
    #include <string_view>

    // Returns the number of matches found by repeatedly searching the input.
    int test2()
    {
        std::string original = "戦場のヴァルキュリア3";
        // original.size() == 31: the length is counted in UTF-8 bytes, not characters
        static constexpr auto pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
        auto matcher = ctre::search<pattern>;
        std::string_view cur_data(original.data(), original.size());
        int match_count = 0;
        while (auto matched = matcher(cur_data)) {
            ++match_count;
            // Drop everything up to the end of this match, then search again.
            cur_data.remove_prefix(matched.end() - cur_data.data());
        }
        return match_count;
    }

This code gives me two matches: "戦場のヴァルキュリア" and "3".
But when I run the same regex search with the ICU library and with Rust, they give me a single match: "戦場のヴァルキュリア3". Why does that happen?
By the way, if I use the string std::string original = "Media.Vision"; instead, CTRE, the ICU library, and Rust all give the same three matches: "Media", "." and "Vision".
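\w+ in Rust is Unicode-aware: it matches word characters in any script (roughly equivalent to [\p{L}\p{N}_]+).
In PCRE, \w by default matches only ASCII letters, digits and underscore. With that definition, the Japanese characters are consumed by the [^\w\s]+ alternative, and the trailing 3 is then matched separately by \w+.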
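To make the ASCII-only behavior concrete, here is a minimal hand-written sketch that emulates repeated searches for \w+|[^\w\s]+ one byte at a time. It does not use CTRE and is not how CTRE is implemented; the names is_ascii_word, is_ascii_space and ascii_tokens are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// ASCII-only \w: letters, digits and underscore.
constexpr bool is_ascii_word(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// ASCII-only \s: space, \t, \n, \v, \f, \r.
constexpr bool is_ascii_space(unsigned char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

// Hypothetical helper: splits text the way repeated searches for
// \w+|[^\w\s]+ would if \w and \s only knew about ASCII.
std::vector<std::string> ascii_tokens(std::string_view text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        const auto first = static_cast<unsigned char>(text[i]);
        if (is_ascii_space(first)) {  // neither alternative can start here
            ++i;
            continue;
        }
        const bool word_run = is_ascii_word(first);
        const std::size_t start = i;
        // Extend the run while the bytes stay in the same class.
        while (i < text.size()) {
            const auto c = static_cast<unsigned char>(text[i]);
            if (is_ascii_space(c) || is_ascii_word(c) != word_run) break;
            ++i;
        }
        tokens.emplace_back(text.substr(start, i - start));
    }
    return tokens;
}
```

Under these assumptions, every byte of the UTF-8 encoded Japanese text falls into the "not a word character, not whitespace" class, so the sketch produces one 30-byte token followed by "3", while "Media.Vision" splits into three tokens.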
Making a compile-time regex library fully Unicode-aware is a huge ask, FYI @DamonsJ. Unicode is incredibly complex, requiring lots of very large lookup tables and other short-circuiting mechanisms to implement all the code point identification logic correctly and efficiently.
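As a rough illustration of the lookup-table approach, the sketch below classifies code points with a binary search over a sorted range table. The table is deliberately tiny and covers only the scripts that appear in this thread; a real Unicode implementation needs far more entries. All names (CodePointRange, kWordRanges, is_word_code_point, decode_utf8, word_prefix_bytes) are hypothetical, and the decoder assumes well-formed UTF-8 input.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string_view>

struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Illustrative subset only: sorted by 'first', non-overlapping.
constexpr CodePointRange kWordRanges[] = {
    {U'0', U'9'}, {U'A', U'Z'}, {U'_', U'_'}, {U'a', U'z'},
    {0x3041, 0x3096},  // Hiragana letters
    {0x30A1, 0x30FA},  // Katakana letters
    {0x4E00, 0x9FFF},  // CJK Unified Ideographs
};

bool is_word_code_point(char32_t cp) {
    // Find the first range whose upper bound is not below cp.
    const auto it = std::lower_bound(
        std::begin(kWordRanges), std::end(kWordRanges), cp,
        [](const CodePointRange& r, char32_t value) { return r.last < value; });
    return it != std::end(kWordRanges) && it->first <= cp;
}

// Decodes the code point starting at text[i] and advances i past it.
// Assumption: text is well-formed UTF-8 (no validation is performed).
char32_t decode_utf8(std::string_view text, std::size_t& i) {
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(text[k]);
    };
    const unsigned char lead = byte(i);
    const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
    for (std::size_t k = 1; k <= extra; ++k) {
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Byte length of the longest prefix made up only of word code points.
std::size_t word_prefix_bytes(std::string_view text) {
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t next = i;
        if (!is_word_code_point(decode_utf8(text, next))) break;
        i = next;
    }
    return i;
}
```

With this toy table, the whole 31-byte string from the issue counts as a single run of word characters, which mirrors the single match reported for ICU and Rust. The point of the sketch is the shape of the data structure, not its coverage.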
Thanks very much @marzer @iulian-rusu
I know it is hard to fully support Unicode in a regex library!
To solve my problem, I wrote the pattern like this:

    static constexpr auto pattern = ctll::fixed_string{
        "[\\p{L}\\p{N}\\p{M}\\p{Pc}]+|[^\\p{L}\\p{N}\\p{M}\\p{Pc}\\p{Zs}\\u{A}\\u{B}\\u{C}\\u{D}"
        "\\u{85}\\u{2028}\\u{2029}\\u{DA}]+"};

It works for me, although it is not exactly equivalent to:

    static constexpr auto pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};

I hope this helps others who have the same problem!
Here is the test code:
Rust and ICU give the same result: the matched string is "戦場のヴァルキュリア3". CTRE gives two parts, "戦場のヴァルキュリア" and "3". Why does that happen?