telekons / one-more-re-nightmare

A fast regular expression compiler in Common Lisp
https://applied-langua.ge/projects/one-more-re-nightmare/
BSD 2-Clause "Simplified" License
138 stars 9 forks source link

Customise prefix code generation #2

Open no-defun-allowed opened 3 years ago

no-defun-allowed commented 3 years ago

one-more-re-nightmare splits a regular expression into a prefix string and the rest of the RE. The prefix is scanned for more efficiently using Boyer-Moore-Horspool, before entering the DFA body generated for the rest of the RE. (This technique is also used in cl-ppcre and GNU grep to my knowledge.)

However, it is possible that BMH is not ideal for processing prefixes. Modern processors have vector arithmetic units which can perform many element comparisons at once, and it is not too hard to design a substring search which uses vectorised code. When combined with heuristics based on properties known about the vector to search, vectorised searching can be faster than clever serial searching algorithms.

It should be possible to allow the client to generate their own prefix scanning code. The interface to the BMH code generator provides most of the required information to generate code, with the lambda list (start length vector prefix aref-generator fail). The first three arguments are symbols naming internal variables o-m-r-n uses, then the prefix sequence, then the :aref-generator argument to compile-regular-expression, then some code to evaluate when there are no more matches.

no-defun-allowed commented 3 years ago

bbfa2cf provides an object for prefix code generation, but it is "bare" and the heuristics it uses are implemented using inheritance and call-next-method instead of a compiler object making decisions.

no-defun-allowed commented 2 years ago

This has sort of been done now, but there is no public interface yet. I have to think about how to expose everything in the right way.