Regular Expressions in Ficus

right now the standard library, module Re, offers 2 functions:

Re.compile(regexp_str: string) : Re.regex_t and
Re.match_(regexp_obj: Re.regex_t, matched_str: string): bool

The implementation uses Rob Pike's VM approach (O(N) regexp matching algorithm, where N is the matched string length), described and implemented at https://swtch.com/~rsc/regexp/; in particular, it's a wrapper on top of re1 library.

The implementation is just at "technology preview" level of functionality and stability. Here is the list of things to make this module really useful:

- [x] change the implementation to directly process UTF32 strings instead of UTF8 strings. This is one of the first things to make, because it will let us to avoid conversion of matched strings to UTF8 (extra step that requires O(N) time and extra O(N) space) and will also let us to return the positions of matched sub-expressions as-is, without UTF8 to UTF32 conversion.
- [x] avoid the use of global variables, in particular, the pool of "fx_re_Sub" buffers (freesub). It should be possible to invoke the regexp engine simultaneously from multiple threads (using different compiled regexp objects or even the same regexp).
- [x] change the error handling. Now in the case of error exit() function is called, which is inappropriate. We should return negative error code to the user (i.e. throw exception using Ficus runtime facilities) with proper deallocation of temporary allocated buffers. Probably, it will require to make the step 4 first.
- [x] rewrite regexp parsing code without using yacc. The regexp syntax is quite simple and it can be parsed using a simple recursive approach. Currently, we use statically compiled into .c Yacc/BISON-based parser, and with any further extensions to regexp syntax (which are expected) we'd need to regenerate .c and update it (add fx_re prefixes etc.).
- [x] optimize match; skip the initial ".*?" processing (the first 2-3 instructions in regexp VM). Now we just run "find" and then check if sub[0] == 0 && sub[1] == str_length.
- [x] return sub-matches indices to the user: Re.match_(r: Re.regex_t, s: string): (bool, (int, int)[]). That is, in addition to the success flag, return positions of submatches as 1D array of 2-element tuples (start_j, end_j). (start_0, end_0) — position of the whole regexp within the searched string. Remember that int in Ficus corresponds to int_ ~ ptrdiff_t in C.
- [x] add syntax r"regexp_str" to Ficus to automatically mirror back slashes.
- [ ] (optional) add s ~ regexp syntax to Ficus (syntactic sugar): if input_filename ~ "{basename:.+}\.fx" { println(f"base name of the ficus source file is {basename}") }. That is, regexp matching, as in Perl, can be an operator with optional value capture. If matching is successful, we can get the named captured values and use them.
- [x] implement Re.find(regexp_obj: Re.regex_t, str: string): (bool, (int, int) []). The semantics matches Re.match_(), but the match can be a substring of str, instead of the whole str. Maybe there should be overloaded version that will take extra offset: int parameter — from where to start the search.
- [x] implement Re.findall(regexp_obj: Re.regex_t, str: string): (int, int) [,]. Returns matrix. i-th row corresponds to i-th match and contains the position of the whole match and positions of sub-matches.
- [x] implement Re.replace(regexp_obj: Re.regexp_t, str: str: string, replace_with: string): string. replace_with string can use \<n> syntax, e.g. \0 (the whole matched string) \1 (the first matched sub-expression), \2(the second matched sub-expression) etc.
- [ ] (optional) We can also extend Re.replace() functionality. \U1 will mean convert the first submatch to uppercase, \L2 will mean convert the second submatch to lowercase, \C0 will mean capitalize the whole submatch (convert the first letter of it to uppercase).
- [x] extend regexp syntax with the following constructions:
- [[^]c1-c2c3-c4c5-c6] syntax for specifying positive and negative character ranges. There should be new instructions in VM to implement this. Maybe it should be a variable-length instruction FX_RE_OP_POSRANGE and FX_RE_OP_NEGRANGE. Single characters can be encoded (internally) by duplicating the beginning and end of the range, e.g. [+-0-9] can be encoded as [+-+\--\-0-9].
- \b - word boundary. There should probably be a new VM instruction (like FX_RE_OP_WORDBOUND) to implement it.
- support \t, \r, \n, \x.., \u...., \U........ syntax to specify various characters.
- built-in class characters: \s (space), \d (decimal digit), \x (hexadecimal digit), \a (letter), \w (letter, decimal digit or underscore), and maybe some others. fx_isdigit() and other standard functions from ficus runtime can be used to implement this part, and probably a new VM instruction (e.g. FX_RE_OP_CHARCLASS <class_id>) should take care of it.
- [ ] (partially emulated by passing multiline=true) Add support for:
- ^ (beginning-of-line) anchor (FX_RE_OP_BEGIN_OF_LINE)
- $ (end-of-line) anchor (FX_RE_END_OF_LINE).
- [x] (apparently, it's not needed, several alternatives, separated with '|', can be used to emulate the same behaviour) (optional). A very useful functionality for data parsing. Not sure about API, but it can be something like (not optimal) Re.matchsome(rl: Re.regex_t list, str: string[, offset: int]): (int, (int, int) []). That is, we check if str or str[offset:] matches with any of regexp from rl. If there are several matches, the longest is chosen. If there are several matches of the same length, the earliest expression in the list is chosen. The index of matched expression is returned. Such functionality can be useful to implement parsers (but using lex should be much more efficient in general for such tasks).
- [x] (optional). Add optional flags to Re.match*(), Re.find*() and Re.replace(). The most important flags are ignore-case and multi-line matches. Need to double-check if it's consistent with other regexp engines. In the case of multi-line matches . or \s can match \r or \n or \r\n, ^ can match beginning of each line, $ - the end of each line.

=======

IMPORTANT NOTE:

It's a lot of work to bring the current very early draft to a solid state, even though many of the items are relatively easy to implement. Probably, instead of trying to make re1-based engine more mature, it makes sense to write a ficus wrapper on top of re2, implemented in C++, where all of the above features (and probably more) are already implemented.

vpisarev / ficus

Regexp in ficus #2

Regular Expressions in Ficus