Open jcgoble3 opened 8 years ago
Python's re
module actually compiles in pure Python, then passes the list of bytecodes to C, which stores them in a struct. So the length of the array is already known by the time it reaches C. (The actual matching is done in C.) Possibly could use a similar approach here, building it in a sequence table (which could be done using either the C API or pure Lua).
This comment serves as a scratchpad for possible opcodes.
These are always present:
SUCCESS()
FAILURE()
ANY()
-- matches any characterLITERAL(byte)
-- matches byte
literallyNOTLITERAL(byte)
-- matches anything but byte
literallyCATEGORY(class, mode)
-- matches character class
(%d
, etc.); mode
is 1 for a positive match or 0 for a negated matchRANGE(lowbyte, highbyte)
-- matches anything between lowbyte
and highbyte
, inclusiveNOTRANGE(lowbyte, highbyte)
-- matches anything not between lowbyte
and highbyte
, inclusiveSET(mode, next)
-- matches any of subsequent tokens, mode
is 1 for positive match or 0 for negated match, if success, jumps to next
LITERAL
, CATEGORY
, RANGE
FAILURE
-- signals end of setCAPTURE(n)
-- starts capture number n
(named captures handled outside bytecode); must record last index of backtracking arrayENDCAPTURE(n)
-- completes capture number n
; must record last index of backtracking arrayCAPTUREPOS(n)
-- captures string position in capture number n
; must record last index of backtracking arrayFRONTIER(mode, next)
-- the frontier pattern, used like SET
BALANCE(startbyte, endbyte, escapebyte)
-- finds balanced startbyte
and endbyte
, skipping the character after each escapebyte
(which may be a dummy value for no escape)AT(location)
-- matches if at specified location; values can be beginning or end of string or line (dependent on single/multi-line mode)BACKREFERENCE(n)
-- backreference to capture number n
(named backreferences would be converted to numbers at compile time)These are present only in basic PUC matching:
OPTIONAL()
-- next token is optional (the ?
quantifier)REPEAT()
-- next token is repeated greedily (*
)REPEATONE()
-- next token is repeated greedily, but must match once (+
)LAZYREPEAT()
-- next token is repeated lazily (-
)These are present only in enhanced matching:
SINGLEREPEAT(min, max)
-- repeats next single-character token greedily at least min
times and no more than max
times; records backtracking info once after min
th match and then modifies it each time throughSINGLELAZY(min, max)
-- same as SINGLEREPEAT
, but lazy; records backtracking info after min
th match and replaces it each time throughSINGLEPOSSESSIVE(min, max)
-- same as SINGLEREPEAT
, but possessive; does not record backtracking infoGROUPREPEAT(min, max, next)
-- like SINGLEREPEAT
, but for a subpattern group that ends with SUCCESS
; next
is next opcode after subpattern; records backtracking info each time through after completion of min
th matchGROUPLAZY(min, max, next)
-- same as GROUPREPEAT
, but lazy; records backtracking info each time through after completion of min
th match, but only keeps one backtracking point (since each time through after that is the result of backtracking)GROUPPOSSESSIVE(min, max, next)
-- same as GROUPREPEAT
, but possessive; does not record backtracking infoBRANCH(next)
-- used in |
alternation at the start of each branch; marks source and pattern position in backtracking array; next
is the next BRANCH
opcode (or, after last branch, FAILURE
)JUMP(next)
-- used at end of each branch to jump to next
opcode after branchingScratchpad on backtracking:
typedef struct BacktrackInfo {
void* srcpos; /* pointer into source char/codepoint array */
void* minsrc; /* how far back in source to go before failing; used by LAZY_REPEAT */
size_t codepos; /* index into bytecode array */
} BacktrackInfo;
typedef struct CaptureInfo {
void* start;
void* end;
size_t backtrack;
} CaptureInfo
If we're going to rewrite the whole engine, here's some links on NFA implementations (non-backtracking): https://stackoverflow.com/questions/1084069/building-a-regex-engine-online-resources
In particular: https://swtch.com/~rsc/regexp/regexp1.html
This I think would be a better way to do it. https://perl.plover.com/Regex/article.html offers tips on backreferences as well.
For enhanced patterns covered by #9 (#6, #7, and #8) might it be a good idea to pre-compile these patterns to bytecode? It would likely make these features easier to implement. Again, need to study Python's
re
module code to see how this is done (in particular, how to build an array of unknown length of numbers; maybe also study Lua's source code compiler to see how it builds the bytecode?). Could also build one byte at a time using Lua's buffer facilities, then convert the string to a proper array of numbers.