Closed kmarius closed 2 years ago
I think for now just implementing naively without caching would be best. Ignore the clear performance penalties for now and get the functionality working.
Once the functions are done, maybe add a special jsstring userdata type (a UTF-16 version of a string) that can be passed wherever a string is valid (the user passes either a string or this userdata). That way the higher-level JS functions such as string.search / string.match can store a reference to the UTF-16 converted string without an awkward cache in the registry.
function jsregexp.to_jsstring(string) -> userdata.jsstring
function Regexp.exec(text: string | userdata.jsstring)
function Regexp.test(text: string | userdata.jsstring)
function Regexp.search(text: string | userdata.jsstring) -- for lua add to Regexp object
-- and so on for other functions
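A rough C-side sketch of what such a jsstring could hold, in case it helps the discussion. Everything here is hypothetical: the `jsstring` struct, `jsstring_from_utf8`, and `jsstring_free` are names made up for illustration, not part of the library. The decoder assumes mostly well-formed UTF-8 and substitutes U+FFFD for bytes it cannot decode; it does not reject overlong encodings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical jsstring: an owned UTF-16 copy of a Lua (UTF-8) string. */
typedef struct {
    uint16_t *units; /* UTF-16 code units */
    size_t len;      /* number of code units */
} jsstring;

/* Decode one code point starting at s[*i] and advance *i.
 * Returns U+FFFD for invalid lead bytes or truncated sequences. */
static uint32_t decode_utf8(const unsigned char *s, size_t n, size_t *i) {
    unsigned char b = s[(*i)++];
    uint32_t cp;
    int extra;
    if (b < 0x80) return b;
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else return 0xFFFD;
    while (extra-- > 0) {
        if (*i >= n || (s[*i] & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (uint32_t)(s[(*i)++] & 0x3F);
    }
    return cp > 0x10FFFF ? 0xFFFD : cp;
}

/* Convert once; the result can then be reused across exec/test/search calls. */
jsstring *jsstring_from_utf8(const char *s, size_t n) {
    assert(s != NULL);
    jsstring *js = malloc(sizeof *js);
    if (!js) return NULL;
    /* n UTF-8 bytes never expand to more than n UTF-16 code units. */
    js->units = malloc((n ? n : 1) * sizeof *js->units);
    if (!js->units) { free(js); return NULL; }
    js->len = 0;
    size_t i = 0;
    while (i < n) {
        uint32_t cp = decode_utf8((const unsigned char *)s, n, &i);
        if (cp >= 0x10000) {
            /* Outside the BMP: emit a surrogate pair. */
            cp -= 0x10000;
            js->units[js->len++] = (uint16_t)(0xD800 | (cp >> 10));
            js->units[js->len++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            js->units[js->len++] = (uint16_t)cp;
        }
    }
    return js;
}

void jsstring_free(jsstring *js) {
    if (js) { free(js->units); free(js); }
}
```

In the real binding this struct would presumably live inside a Lua full userdata with a `__gc` metamethod calling the free function, so the converted buffer is released together with the Lua value.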
I am going to merge this as is and handle UTF16 later. Feel free to have a go at the higher level functions.
One idea I had is converting libregexp from UTF-16 to UTF-8. Not sure how easy that would be, but it could make the library drastically simpler.
Now that would be amazing, I don't even know where one would start. (Probably by inquiring with Fabrice Bellard if it is feasible)
@nathanrpage97 I would like your opinion on how we can handle Unicode strings. As in JS, one can build match, findAll, etc. by repeatedly calling reg:exec on the input and getting the next match each time. The starting index is stored inside the RegExp object. Converting the string to UTF-16 on each call to exec would clearly be bad. My idea is to cache the converted string and save a reference to the input string (in the registry?) to check if we are still working on the same input.
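To make the caching idea concrete, here is a minimal C sketch under stated assumptions. The names `utf16_cache`, `cache_get`, and `cache_clear` are hypothetical, and `widen_ascii` is only a stand-in converter that is correct for ASCII input. The cache identifies "the same input" by pointer and length, which is only safe if something (for example a registry reference) keeps the original string alive while it is cached.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical one-entry cache: remembers the last input and its conversion. */
typedef struct {
    const char *src;      /* identity of the last input (pointer, not a copy) */
    size_t src_len;       /* byte length of the last input */
    uint16_t *units;      /* cached UTF-16 code units */
    size_t len;           /* number of cached code units */
    unsigned conversions; /* how many real conversions were performed */
} utf16_cache;

/* Stand-in converter: widens each byte, so it is only correct for ASCII. */
static uint16_t *widen_ascii(const char *s, size_t n) {
    uint16_t *out = malloc((n ? n : 1) * sizeof *out);
    if (!out) return NULL;
    for (size_t i = 0; i < n; i++) out[i] = (unsigned char)s[i];
    return out;
}

/* Return the UTF-16 view of s, converting only when the input changed. */
const uint16_t *cache_get(utf16_cache *c, const char *s, size_t n,
                          size_t *out_len) {
    assert(c != NULL && s != NULL);
    if (c->units == NULL || c->src != s || c->src_len != n) {
        uint16_t *fresh = widen_ascii(s, n);
        if (!fresh) return NULL;
        free(c->units); /* drop the previous conversion */
        c->units = fresh;
        c->src = s;
        c->src_len = n;
        c->len = n;
        c->conversions++;
    }
    if (out_len) *out_len = c->len;
    return c->units;
}

/* Release the cached buffer and reset all fields, including the counter. */
void cache_clear(utf16_cache *c) {
    free(c->units);
    *c = (utf16_cache){0};
}
```

With this shape, a loop that calls exec repeatedly on the same input converts once and then hits the cache on every later call; passing a different string replaces the entry.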