Closed raskad closed 10 months ago
Given that `unicode` is a property of the regex itself, I think this should be attached to the `InputIndexer`. That is, there should be two different `Utf16Input` types - one that decodes surrogates, and one that does not. I suggest the names `Utf16Input` and `Ucs2Input` respectively. Then we won't need to pass around the `unicode` flag.
This is a really good idea! I implemented it exactly like that and tested it with boa. Works perfectly!
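To illustrate the suggestion, here is a minimal sketch of two input types over the same `&[u16]` data, one that combines surrogate pairs into a single code point and one that treats every code unit on its own. Only the names `Utf16Input` and `Ucs2Input` come from the discussion; the trait `CodeUnitIndexer`, its method `next_at`, and the helper `count_elements` are hypothetical and are not the actual `InputIndexer` API in `regress`.

```rust
/// Hypothetical, simplified stand-in for an input indexer over UTF-16 data.
/// This is NOT the real `InputIndexer` trait from `regress`.
trait CodeUnitIndexer {
    /// Returns the element starting at `pos` and the position just past it,
    /// or `None` if `pos` is at or beyond the end of the input.
    fn next_at(&self, pos: usize) -> Option<(u32, usize)>;
}

/// Decodes surrogate pairs (the behaviour wanted for `unicode` regexes).
struct Utf16Input<'a> {
    units: &'a [u16],
}

/// Treats every code unit as its own element (non-`unicode` regexes).
struct Ucs2Input<'a> {
    units: &'a [u16],
}

impl<'a> CodeUnitIndexer for Utf16Input<'a> {
    fn next_at(&self, pos: usize) -> Option<(u32, usize)> {
        let hi = *self.units.get(pos)? as u32;
        // A high surrogate followed by a low surrogate forms one code point.
        if (0xD800..=0xDBFF).contains(&hi) {
            if let Some(&lo) = self.units.get(pos + 1) {
                let lo = lo as u32;
                if (0xDC00..=0xDFFF).contains(&lo) {
                    let cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
                    return Some((cp, pos + 2));
                }
            }
        }
        // Lone surrogates and BMP code units are returned unchanged.
        Some((hi, pos + 1))
    }
}

impl<'a> CodeUnitIndexer for Ucs2Input<'a> {
    fn next_at(&self, pos: usize) -> Option<(u32, usize)> {
        self.units.get(pos).map(|&u| (u as u32, pos + 1))
    }
}

/// Counts how many elements the indexer yields over the whole input.
fn count_elements<I: CodeUnitIndexer>(input: &I) -> usize {
    let (mut pos, mut n) = (0, 0);
    while let Some((_, next)) = input.next_at(pos) {
        pos = next;
        n += 1;
    }
    n
}

fn main() {
    // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
    let units: Vec<u16> = "a\u{1F600}".encode_utf16().collect();
    let utf16 = Utf16Input { units: &units };
    let ucs2 = Ucs2Input { units: &units };
    println!("utf16: {}, ucs2: {}", count_elements(&utf16), count_elements(&ucs2));
}
```

Because the decoding behaviour lives in the type, generic matching code can be monomorphized per input type and no `unicode` flag has to be threaded through each call.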
This PR implements UTF-16 based matching to fix #43.
- Add a `utf16` feature that enables the UTF-16 API and disables some UTF-8 based optimizations.
- Add a `find_from_utf16` API based on the `BacktrackExecutor` for now.
- Add a `Utf16Input` `InputIndexer`.
- Add a `unicode` flag to `InputIndexer` functions, as UTF-16 surrogate handling is based on this flag.
- Move the `cursor` functions `try_slice`, `try_match_lit` and `subrange_eq` into `InputIndexer` to avoid `InputIndexer` functions that have return types which are not directly compatible with UTF-16 input.
- Remove `InputIndexer` functions that were only used internally.

There is a lot more to be done (see https://github.com/ridiculousfish/regress/issues/43#issuecomment-1853268508), but this version allows for UTF-16 support with more 262 tests passing and no regressions in our `boa` testing. In the next steps I would like to fix the few 262 tests that remain and then add tests for UTF-16 input to `regress` itself.