telekons / one-more-re-nightmare

A fast regular expression compiler in Common Lisp
https://applied-langua.ge/projects/one-more-re-nightmare/
BSD 2-Clause "Simplified" License
138 stars 9 forks source link

Match start and end of string #13

Open martinflack opened 2 years ago

martinflack commented 2 years ago

Great project. Is it intentional that a caret ^ does not match the start of the string and a dollar $ does not match the end? Wikipedia seems to think those are included in POSIX regex.

no-defun-allowed commented 2 years ago

It's currently only intentional for as long as I'm still musing on the nicest way to represent it in a DFA. Gilbert told me it's possible to handle arbitrary lookahead and lookbehind with derivatives, but I still don't get his description. Or I could generalise derivatives to consume boundary information as well as characters.

madnificent commented 2 years ago

Is something like this on the virtual in-your-head roadmap? I want to benchmark OMRN for matching of terminals of an LL(1) language and therefore only care about the longest match from START.

no-defun-allowed commented 2 years ago

It is, but I'm not sure of just what to do. At the moment it's possible I could have a function that just scans once from the start, but the issue is about handling ^ and $ which is more general, and there are a couple of ways to do them.

A way to match just at the start wouldn't be too hard still. Rather than wrapping every RE in grep as we do, just not wrapping suffices to get a machine that scans only from the start.

madnificent commented 2 years ago

I don't know my way through the code well enough to give this a stab. Could you point me to the right lines to edit or some branch so I could run an initial vague benchmark for my case?

no-defun-allowed commented 2 years ago

Here is a huge kludge to make OMRN generate code to only scan at the start of string. Note that it makes all regular expressions only match at the start of the string; I'd need to make a separate caches for lexing(?) and grepping functions which requires more changes. Also please pull from master as I needed to not hard-code one thing that should be a use of the (restart) local macro, for this to work.

(in-package :one-more-re-nightmare)

(defclass scan-from-start ()
  ())

;;; Copied verbatim from the START-CODE method for SCAN-EVERYTHING. Oh no 
(defmethod start-code ((strategy scan-from-start) states)
  (destructuring-bind (state) states
    (let ((expression (state-expression state)))
      (cond
        ((state-never-succeeds-p state)
         ;; Just return immediately if we're told to match nothing.
         `(start (return)))
        ((re-empty-p expression)
         ;; Succeed for every character?
         `(start
           (cond
             ((> position end)
              (return))
             (t
              (incf position)
              ,(setf-from-assignments (state-exit-effects state))
              (win ,@(win-locations (state-exit-map state)))))))
        (t
         `(start
           (setf position start)
           (go ,(find-state-name state :bounds-check))))))))

(defmethod initial-states ((strategy scan-from-start) expression)
  ;; Don't do parallel GREP scanning stuff.
  (list (alpha (add-tags expression) (empty-set))))

(defmethod macros-for-strategy append ((strategy scan-from-start))
  '((restart ()
     ;; Don't actually restart the DFA.
     '(return))))

(defun make-default-strategy (layout expression)
  (declare (ignore layout expression))
  (make-instance (dynamic-mixins:mix 'scan-from-start 'call-continuation)))

After which:

CL-USER> (one-more-re-nightmare:first-match "f" "food")
#(0 1)
CL-USER> (one-more-re-nightmare:first-match "f" "oodf")
NIL