ridiculousfish / regress

REGex in Rust with EcmaScript Syntax
Apache License 2.0

Implement UTF-16 based matching #75

Closed raskad closed 10 months ago

raskad commented 11 months ago

This PR implements UTF-16 based matching to fix #43.

There is a lot more to be done (see https://github.com/ridiculousfish/regress/issues/43#issuecomment-1853268508), but this version adds UTF-16 support, with more test262 tests passing and no regressions in our Boa testing. As next steps, I would like to fix the few remaining test262 failures and then add tests for UTF-16 input to regress itself.

raskad commented 10 months ago

> Given that unicode is a property of the regex itself, I think this should be attached to the InputIndexer. That is, there should be two different Utf16Input types - one that decodes surrogates, and one that does not. I suggest the names Utf16Input and Ucs2Input respectively. Then we won't need to pass around the unicode flag.

This is a really good idea! I implemented it exactly like that and tested it with Boa. It works perfectly!
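
For readers following along, here is a minimal sketch of the idea discussed above: two input types over the same `u16` buffer, one that combines surrogate pairs into a single code point and one that treats every code unit as its own element. The trait `CodeUnitIndexer`, its method `next_element`, and the helper `count_elements` are hypothetical names invented for this illustration; this is not the actual `InputIndexer` trait or the code merged in this PR.

```rust
/// Hypothetical, simplified stand-in for an input indexer.
/// This is NOT regress's real `InputIndexer` trait.
trait CodeUnitIndexer {
    /// Returns the element starting at `pos` and the index just past it,
    /// or `None` if `pos` is at or beyond the end of the input.
    fn next_element(&self, pos: usize) -> Option<(u32, usize)>;
}

/// Decodes surrogate pairs (intended for regexes with the unicode flag).
struct Utf16Input<'a> {
    units: &'a [u16],
}

/// Treats every 16-bit code unit as one element (no unicode flag).
struct Ucs2Input<'a> {
    units: &'a [u16],
}

impl<'a> CodeUnitIndexer for Utf16Input<'a> {
    fn next_element(&self, pos: usize) -> Option<(u32, usize)> {
        let hi = *self.units.get(pos)?;
        // A high surrogate followed by a low surrogate forms one code point.
        if (0xD800..=0xDBFF).contains(&hi) {
            if let Some(&lo) = self.units.get(pos + 1) {
                if (0xDC00..=0xDFFF).contains(&lo) {
                    let cp = 0x10000
                        + ((u32::from(hi) - 0xD800) << 10)
                        + (u32::from(lo) - 0xDC00);
                    return Some((cp, pos + 2));
                }
            }
        }
        // Anything else, including a lone surrogate, is returned as-is.
        Some((u32::from(hi), pos + 1))
    }
}

impl<'a> CodeUnitIndexer for Ucs2Input<'a> {
    fn next_element(&self, pos: usize) -> Option<(u32, usize)> {
        self.units.get(pos).map(|&u| (u32::from(u), pos + 1))
    }
}

/// Counts elements as seen by the given indexer. Generic code like this
/// never needs a separate unicode flag; the input type carries it.
fn count_elements<I: CodeUnitIndexer>(input: &I) -> usize {
    let mut pos = 0;
    let mut count = 0;
    while let Some((_, next)) = input.next_element(pos) {
        pos = next;
        count += 1;
    }
    count
}
```

With the code units `[0x0061, 0xD83D, 0xDE00]` (the letter `a` followed by a surrogate pair), the `Utf16Input` sketch sees two elements and the `Ucs2Input` sketch sees three. Because the choice is encoded in the type, generic matching code can be monomorphized per input type instead of branching on a flag at runtime.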