Open martinflack opened 2 years ago
It's currently only intentional for as long as I'm still musing on the nicest way to represent it in a DFA. Gilbert told me it's possible to handle arbitrary lookahead and lookbehind with derivatives, but I still don't get his description. Or I could generalise derivatives to consume boundary information as well as characters.
Is something like this on the virtual in-your-head roadmap? I want to benchmark OMRN for matching of terminals of an LL(1) language and therefore only care about the longest match from START.
It is, but I'm not sure of just what to do. At the moment it's possible I could have a function that just scans once from the start, but the issue is about handling ^ and $ which is more general, and there are a couple of ways to do them.
A way to match just at the start wouldn't be too hard still. Rather than wrapping every RE in grep
as we do, just not wrapping suffices to get a machine that scans only from the start.
I don't know my way through the code well enough to give this a stab. Could you point me to the right lines to edit or some branch so I could run an initial vague benchmark for my case?
Here is a huge kludge to make OMRN generate code to only scan at the start of string. Note that it makes all regular expressions only match at the start of the string; I'd need to make a separate caches for lexing(?) and grepping functions which requires more changes. Also please pull from master as I needed to not hard-code one thing that should be a use of the (restart)
local macro, for this to work.
(in-package :one-more-re-nightmare)
(defclass scan-from-start ()
())
;;; Copied verbatim from the START-CODE method for SCAN-EVERYTHING. Oh no
(defmethod start-code ((strategy scan-from-start) states)
(destructuring-bind (state) states
(let ((expression (state-expression state)))
(cond
((state-never-succeeds-p state)
;; Just return immediately if we're told to match nothing.
`(start (return)))
((re-empty-p expression)
;; Succeed for every character?
`(start
(cond
((> position end)
(return))
(t
(incf position)
,(setf-from-assignments (state-exit-effects state))
(win ,@(win-locations (state-exit-map state)))))))
(t
`(start
(setf position start)
(go ,(find-state-name state :bounds-check))))))))
(defmethod initial-states ((strategy scan-from-start) expression)
;; Don't do parallel GREP scanning stuff.
(list (alpha (add-tags expression) (empty-set))))
(defmethod macros-for-strategy append ((strategy scan-from-start))
'((restart ()
;; Don't actually restart the DFA.
'(return))))
(defun make-default-strategy (layout expression)
(declare (ignore layout expression))
(make-instance (dynamic-mixins:mix 'scan-from-start 'call-continuation)))
After which:
CL-USER> (one-more-re-nightmare:first-match "f" "food")
#(0 1)
CL-USER> (one-more-re-nightmare:first-match "f" "oodf")
NIL
Great project. Is it intentional that a caret
^
does not match the start of the string and a dollar$
does not match the end? Wikipedia seems to think those are included in POSIX regex.