rurban / re-engine-PCRE2

use pcre-jit instead of slow perl regex

unicode: add UTF flag if subject is UTF #15

Open rurban opened 7 years ago

rurban commented 7 years ago

If the pattern is not UTF-8 but is ambiguous (it uses classes such as \D or \W, whose meaning depends on UTF mode) and the subject is UTF-8, recompile the pattern with the UTF flag and then match.
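The recompile-on-demand idea can be sketched roughly as follows. This is only an illustration in Python, not the module's XS code. Python's `re` stands in for PCRE2, and `re.ASCII` stands in for compiling without the UTF flag. The `LazyUtfRegex` class, its `recompiles` counter, and the non-ASCII check (a crude stand-in for the UTF8 flag on a Perl subject) are hypothetical names and assumptions made for the example.

```python
import re


class LazyUtfRegex:
    """Hypothetical wrapper: compile without Unicode semantics first and
    recompile with them only when a subject actually requires it."""

    def __init__(self, pattern: str):
        self.pattern = pattern
        # Assumption: re.ASCII models "compiled without the UTF flag".
        self._plain = re.compile(pattern, re.ASCII)
        self._utf = None      # Unicode-aware variant, built on demand
        self.recompiles = 0   # counts how often the UTF variant was built

    @staticmethod
    def _subject_is_utf(subject: str) -> bool:
        # Assumption: any code point above 0x7F marks the subject as "UTF".
        return any(ord(ch) > 0x7F for ch in subject)

    def match(self, subject: str):
        if not self._subject_is_utf(subject):
            # Plain subject: the original compile is sufficient.
            return self._plain.match(subject)
        if self._utf is None:
            # First UTF subject seen: recompile once with Unicode semantics
            # (the default for str patterns in Python's re).
            self._utf = re.compile(self.pattern)
            self.recompiles += 1
        return self._utf.match(subject)


rx = LazyUtfRegex(r"\w")
print(rx.match("a") is not None, rx.recompiles)
print(rx.match("\u0100") is not None, rx.recompiles)
```

In this shape the Unicode-aware variant is built at most once per pattern, and only after a UTF subject is first seen, so plain subjects keep using the original compile.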

failing re_tests:

\w  \x{200C}    yp  $&  \x{200C}
\W  \x{200C}    np  -   -
\w  \x{200D}    yp  $&  \x{200D}
\W  \x{200D}    np  -   -

/^\D{11}/a  \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}    np  -   -
/^\S{11}/a  \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}    np  -   -
/^\W{11}/a  \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}    np  -   -

# [ perl #114272]
\Vn \xFFn/  yp  $&  \xFFn

a?\X         a\x{100}   yp  $&  a\x{100}

rurban commented 7 years ago

plan for the implementation strategy:

todd-richmond commented 1 year ago

Any progress on this? I'm looking to use PCRE2 for better performance, but I need mixed UTF-8 (regex and subject) to work.