unicode: add UTF flag if subject is UTF

rurban commented 7 years ago

If the pattern is not UTF-8 (but is ambiguous, e.g. it uses \D, \W, ...) and the subject is, recompile with the UTF flag and match again.
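One way to decide whether a pattern is "ambiguous" in this sense is to scan it for the escapes whose meaning changes under Unicode semantics. The following is only a minimal sketch: the function name `pattern_has_ambiguous_class` is hypothetical, and the set of escapes it checks (`\d \D \w \W \s \S`) is an assumption taken from the classes named in this issue, not an exhaustive list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of re::engine::PCRE2.
 * Returns true if the pattern contains one of \d \D \w \W \s \S.
 * The set of escapes is an assumption made for illustration. */
static bool pattern_has_ambiguous_class(const char *pat, size_t len)
{
    assert(pat != NULL);
    for (size_t i = 0; i + 1 < len; i++) {
        if (pat[i] != '\\')
            continue;
        /* Guard against NUL: strchr() would match the string terminator. */
        if (pat[i + 1] != '\0' && strchr("dDwWsS", pat[i + 1]) != NULL)
            return true;
        i++; /* skip the escaped character so an escaped backslash is not re-read */
    }
    return false;
}
```

With this sketch, `^\D{11}` is reported as ambiguous, while a plain literal such as `abc`, or an escaped backslash followed by `w`, is not.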

failing re_tests:

\w  \x{200C}    yp  $&  \x{200C}
\W  \x{200C}    np  -   -
\w  \x{200D}    yp  $&  \x{200D}
\W  \x{200D}    np  -   -

/^\D{11}/a  \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}    np  -   -
/^\S{11}/a  \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}    np  -   -
/^\W{11}/a  \x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}    np  -   -

# [perl #114272]
\Vn \xFFn/  yp  $&  \xFFn

a?\X         a\x{100}   yp  $&  a\x{100}

rurban commented 7 years ago

Implementation plan:

If a pattern contains Unicode-sensitive classes like \w, \s, \d, always compile with /u. If the subject is ASCII, compile again with /a and do the ASCII match.
Otherwise, if the pattern was compiled with /a and the subject is UTF-8, recompile with /u.
Cache the optional second pattern in pprivate, as a struct holding compiled_ascii_pattern and compiled_uni_pattern together with the engine. See e.g. re::engine::Hyperscan, where I also store two pointers in pprivate.
Also cache statistics about ASCII/Unicode usage to make better predictions (e.g. two more ints).
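The caching part of this plan could be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's actual code: the struct name `dual_pattern`, the function `dual_pattern_select`, and the `compile_fn` callback are all hypothetical. Only the field names `compiled_ascii_pattern` and `compiled_uni_pattern` come from the plan. In the real engine the callback would presumably wrap `pcre2_compile()` with or without `PCRE2_UTF`, and the `subject_is_utf8` flag would come from something like `SvUTF8()` on the subject. The sketch also simplifies the plan by compiling each variant lazily on first use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical compile callback: returns an opaque compiled pattern. */
typedef void *(*compile_fn)(const char *pattern, bool utf);

/* Hypothetical layout of the data stored in pprivate. */
typedef struct {
    void *compiled_ascii_pattern; /* NULL until first needed */
    void *compiled_uni_pattern;   /* NULL until first needed */
    unsigned ascii_uses;          /* usage statistics for later predictions */
    unsigned uni_uses;
} dual_pattern;

/* Pick the variant matching the subject, compiling and caching it on demand. */
static void *dual_pattern_select(dual_pattern *dp, const char *pattern,
                                 bool subject_is_utf8, compile_fn compile)
{
    assert(dp != NULL && compile != NULL);
    void **slot = subject_is_utf8 ? &dp->compiled_uni_pattern
                                  : &dp->compiled_ascii_pattern;
    if (*slot == NULL)
        *slot = compile(pattern, subject_is_utf8);
    if (subject_is_utf8)
        dp->uni_uses++;
    else
        dp->ascii_uses++;
    return *slot;
}

/* Stand-in compiler for demonstration only: hands out two distinct tokens
 * and counts how often it is called. */
static int demo_compile_calls = 0;
static void *demo_compile(const char *pattern, bool utf)
{
    static int ascii_token, uni_token;
    (void)pattern;
    demo_compile_calls++;
    return utf ? (void *)&uni_token : (void *)&ascii_token;
}
```

Calling `dual_pattern_select` once with an ASCII subject and twice with a UTF-8 subject compiles each variant only once. The two counters then record one ASCII use and two Unicode uses, which is the kind of statistic the plan suggests keeping.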

todd-richmond commented 1 year ago

Any progress on this? I'm looking to use PCRE2 for better performance, but I need mixed UTF-8 (regex and subject) to work.

rurban / re-engine-PCRE2

unicode: add UTF flag if subject is UTF #15