Pre-compilation of enhanced patterns

jcgoble3 commented 8 years ago

For enhanced patterns covered by #9 (#6, #7, and #8) might it be a good idea to pre-compile these patterns to bytecode? It would likely make these features easier to implement. Again, need to study Python's re module code to see how this is done (in particular, how to build an array of unknown length of numbers; maybe also study Lua's source code compiler to see how it builds the bytecode?). Could also build one byte at a time using Lua's buffer facilities, then convert the string to a proper array of numbers.

jcgoble3 commented 7 years ago

Python's re module actually compiles in pure Python, then passes the list of bytecodes to C, which stores them in a struct. So the length of the array is already known by the time it reaches C. (The actual matching is done in C.) Possibly could use a similar approach here, building it in a sequence table (which could be done using either the C API or pure Lua).

jcgoble3 commented 7 years ago

This comment serves as a scratchpad for possible opcodes.

These are always present:

SUCCESS()
FAILURE()
ANY() -- matches any character
LITERAL(byte) -- matches byte literally
NOTLITERAL(byte) -- matches anything but byte literally
CATEGORY(class, mode) -- matches character class (%d, etc.); mode is 1 for a positive match or 0 for a negated match
RANGE(lowbyte, highbyte) -- matches anything between lowbyte and highbyte, inclusive
NOTRANGE(lowbyte, highbyte) -- matches anything not between lowbyte and highbyte, inclusive
SET(mode, next) -- matches any of subsequent tokens, mode is 1 for positive match or 0 for negated match, if success, jumps to next
- Any number of LITERAL, CATEGORY, RANGE
- FAILURE -- signals end of set
CAPTURE(n) -- starts capture number n (named captures handled outside bytecode); must record last index of backtracking array
ENDCAPTURE(n) -- completes capture number n; must record last index of backtracking array
CAPTUREPOS(n) -- captures string position in capture number n; must record last index of backtracking array
FRONTIER(mode, next) -- the frontier pattern, used like SET
BALANCE(startbyte, endbyte, escapebyte) -- finds balanced startbyte and endbyte, skipping the character after each escapebyte (which may be a dummy value for no escape)
AT(location) -- matches if at specified location; values can be beginning or end of string or line (dependent on single/multi-line mode)
BACKREFERENCE(n) -- backreference to capture number n (named backreferences would be converted to numbers at compile time)

These are present only in basic PUC matching:

OPTIONAL() -- next token is optional (the ? quantifier)
REPEAT() -- next token is repeated greedily (*)
REPEATONE() -- next token is repeated greedily, but must match once (+)
LAZYREPEAT() -- next token is repeated lazily (-)

These are present only in enhanced matching:

SINGLEREPEAT(min, max) -- repeats next single-character token greedily at least min times and no more than max times; records backtracking info once after minth match and then modifies it each time through
SINGLELAZY(min, max) -- same as SINGLEREPEAT, but lazy; records backtracking info after minth match and replaces it each time through
SINGLEPOSSESSIVE(min, max) -- same as SINGLEREPEAT, but possessive; does not record backtracking info
GROUPREPEAT(min, max, next) -- like SINGLEREPEAT, but for a subpattern group that ends with SUCCESS; next is next opcode after subpattern; records backtracking info each time through after completion of minth match
GROUPLAZY(min, max, next) -- same as GROUPREPEAT, but lazy; records backtracking info each time through after completion of minth match, but only keeps one backtracking point (since each time through after that is the result of backtracking)
GROUPPOSSESSIVE(min, max, next) -- same as GROUPREPEAT, but possessive; does not record backtracking info
BRANCH(next) -- used in | alternation at the start of each branch; marks source and pattern position in backtracking array; next is the next BRANCH opcode (or, after last branch, FAILURE)
JUMP(next) -- used at end of each branch to jump to next opcode after branching

jcgoble3 commented 7 years ago

Scratchpad on backtracking:

typedef struct BacktrackInfo {
    void* srcpos;  /* pointer into source char/codepoint array */
    void* minsrc;  /* how far back in source to go before failing; used by LAZY_REPEAT */
    size_t codepos;  /* index into bytecode array */
} BacktrackInfo;

typedef struct CaptureInfo {
    void* start;
    void* end;
    size_t backtrack;
} CaptureInfo

jcgoble3 commented 5 years ago

If we're going to rewrite the whole engine, here's some links on NFA implementations (non-backtracking): https://stackoverflow.com/questions/1084069/building-a-regex-engine-online-resources

In particular: https://swtch.com/~rsc/regexp/regexp1.html

This I think would be a better way to do it. https://perl.plover.com/Regex/article.html offers tips on backreferences as well.

jcgoble3 / lua-matchext

Pre-compilation of enhanced patterns #10