mbutterick / brag

Racket DSL for generating parsers from BNF grammars [moved to https://git.matthewbutterick.com/mbutterick/brag]
MIT License

Unexpected behavior in from/stop-before #26

Closed · tgbugs closed this issue 3 years ago

tgbugs commented 3 years ago

The code block below illustrates two issues with from/stop-before (implementation here: https://github.com/mbutterick/brag/blob/6c161ae31df9b4ae7f55a14f754c0b216b60c9a6/brag-lib/brag/support.rkt#L115-L121).

The primary issue is that from/stop-before only stops before the last element in a :seq and does not correctly back up to before the first element of the stop-before pattern.

There is a secondary issue, possibly just a matter of documentation: from/stop-before will match eof as the stop-before pattern.

I'm fairly certain that the second issue is due to the behavior of the complement form used in the implementation.

I am not sure about the underlying cause of the first issue.
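
For reference, the from/stop-before use in the first lexer below appears to expand to roughly the following hand-written equivalent (a paraphrase of the linked source, so the details may differ; complement and any-string come from the underlying lexer language):

(define expanded-lexer-sketch
  ; assumption: (from/stop-before OPEN STOP) expands to roughly
  ; (:seq OPEN (complement (:seq any-string STOP any-string)))
  (lexer-srcloc
   [(:seq "start"
          (complement (:seq any-string
                            (:seq "\n" (:+ "*") (:or " " "\t"))
                            any-string)))
    (token 'ARGH lexeme)]
   [any-char (token 'AC lexeme)]))

If that paraphrase is right, it would at least account for the second issue: complement accepts any string that does not contain the stop pattern as a substring, including the empty string and strings that run all the way to eof, so nothing forces the stop pattern to actually appear in the input.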

#lang racket

(require brag/support)

(define bad-lexer
  (lexer-srcloc
   ; from/stop-before does not correctly backtrack to the point before "\n*"
   [(from/stop-before "start" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token 'ARGH lexeme)]
   [any-char (token 'AC lexeme)]))

(define bad-lexer-2
  ; from/stop-before matches an empty string at end of file
  (lexer-srcloc
   [(from/stop-before "\n" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token 'ARGH lexeme)]
   [any-char (token 'AC lexeme)]))

(define (get-tokens tokenizer)
  (define (sigh next-token [accum '()])
    (let ([t (next-token)])
      (if (eq? t eof)
          accum
          (sigh next-token (cons t accum)))))
  (reverse (sigh tokenizer)))

(define (make-bad-lexer lexer port)
  (define (next-token)
    (lexer port))
  next-token)

(module+ test
  (define (make-port) (open-input-string "start
a
* H
p
argh"))

  (println "start issues")
  (get-tokens (make-bad-lexer bad-lexer (make-port)))
  (println "end issues")
  (get-tokens (make-bad-lexer bad-lexer-2 (make-port)))

  )

Output

"start issues"
(list
 (srcloc-token (token-struct 'ARGH "start\na\n*" #f #f #f #f #f) (srcloc 'string #f #f 1 9))
 (srcloc-token (token-struct 'AC " " #f #f #f #f #f) (srcloc 'string #f #f 10 1))
 (srcloc-token (token-struct 'AC "H" #f #f #f #f #f) (srcloc 'string #f #f 11 1))
 (srcloc-token (token-struct 'AC "\n" #f #f #f #f #f) (srcloc 'string #f #f 12 1))
 (srcloc-token (token-struct 'AC "p" #f #f #f #f #f) (srcloc 'string #f #f 13 1))
 (srcloc-token (token-struct 'AC "\n" #f #f #f #f #f) (srcloc 'string #f #f 14 1))
 (srcloc-token (token-struct 'AC "a" #f #f #f #f #f) (srcloc 'string #f #f 15 1))
 (srcloc-token (token-struct 'AC "r" #f #f #f #f #f) (srcloc 'string #f #f 16 1))
 (srcloc-token (token-struct 'AC "g" #f #f #f #f #f) (srcloc 'string #f #f 17 1))
 (srcloc-token (token-struct 'AC "h" #f #f #f #f #f) (srcloc 'string #f #f 18 1)))
"end issues"
(list
 (srcloc-token (token-struct 'AC "s" #f #f #f #f #f) (srcloc 'string #f #f 1 1))
 (srcloc-token (token-struct 'AC "t" #f #f #f #f #f) (srcloc 'string #f #f 2 1))
 (srcloc-token (token-struct 'AC "a" #f #f #f #f #f) (srcloc 'string #f #f 3 1))
 (srcloc-token (token-struct 'AC "r" #f #f #f #f #f) (srcloc 'string #f #f 4 1))
 (srcloc-token (token-struct 'AC "t" #f #f #f #f #f) (srcloc 'string #f #f 5 1))
 (srcloc-token (token-struct 'ARGH "\na\n*" #f #f #f #f #f) (srcloc 'string #f #f 6 4))
 (srcloc-token (token-struct 'AC " " #f #f #f #f #f) (srcloc 'string #f #f 10 1))
 (srcloc-token (token-struct 'AC "H" #f #f #f #f #f) (srcloc 'string #f #f 11 1))
 (srcloc-token (token-struct 'ARGH "\np\nargh" #f #f #f #f #f) (srcloc 'string #f #f 12 7)))
tgbugs commented 3 years ago

I think that the implementation of complement is in https://github.com/mbutterick/br-parser-tools/blob/32fc3b68d14a027ec70fb5cca38471ebdfed9ee7/br-parser-tools-lib/br-parser-tools/private-lex/stx.rkt#L84-L88.

edit: Further exploration of the expanded form suggests that the issue may be in the definition of lexer-body here https://github.com/mbutterick/br-parser-tools/blob/32fc3b68d14a027ec70fb5cca38471ebdfed9ee7/br-parser-tools-lib/br-parser-tools/lex.rkt#L215-L283. If this is ultimately an issue in br-parser-tools, let me know if I should create a new issue there.

aside: The expansion of lexer includes an unused reference to lexeme-srcloc-p that bloats the expanded form and probably slows down compile times.

mbutterick commented 3 years ago

As you can see, from/stop-before is a dumb little lexer macro. So the question is whether you can get the matching behavior you want without using from/stop-before.

tgbugs commented 3 years ago

As I suspected. I'll dig around to see if I can find a way to implement the behavior I'm looking for and report back. If there is a lexer form that can do limited lookahead, it should be possible.

tgbugs commented 3 years ago

As suspected, there isn't a way to fix this using lexer abbreviations alone, so the documentation of the behavior should be updated.

However, there is a way to fix the issue on the action side by using file-position to set the port to the correct position. The solution resolves both issues, but with some trade-offs, and I do not think it is sufficiently general for inclusion in brag/support, though maybe with a few tweaks it could be?

For my narrow use case, find-last works because I only need to search backward to the last newline in the lexeme; more complicated stop-before patterns would likely require running something like regexp-match-positions for the stop pattern over the lexeme (a rough sketch of that follows the code below).

(define (find-last char str)
  ; search backward with string-ref rather than reversing the string,
  ; which would be slow; if this ever needed to be fast, it could chunk
  ; bytes into cacheline-sized pieces and search for a sub-offset
  ; within each chunk
  (letrec ([search (λ (look-at)
                     (if (eq? (string-ref str look-at) char)
                         (values (substring str 0 look-at) (sub1 look-at))
                         (search (sub1 look-at))))])
    (search (sub1 (string-length str)))))

(define (token-stop-before TOKEN TOKEN-EOF lexeme char input-port start-pos)
  "Encapsulates the machinery needed to correctly reset the location of input-port when
using from/stop-before where the stop-before pattern contains multiple characters."
  (if (eq? (peek-char input-port) eof)
      (token TOKEN-EOF lexeme)
      (let-values ([(lexeme-correct offset) (find-last char lexeme)])
        (file-position input-port (+ (position-offset start-pos) offset))
        (token TOKEN lexeme-correct))))

(define ok-lexer
  (lexer
   [(from/stop-before "start" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token-stop-before 'ARGH 'ARGH-EOF lexeme #\newline input-port start-pos)]
   [any-char (token 'AC lexeme)]))
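
As noted above, for stop patterns whose consumed tail is more than a single character, something like regexp-match-positions could stand in for find-last. A rough, untested sketch (token-stop-before/rx, stop-prefix-rx, and ok-lexer-rx are hypothetical names; it assumes the same (require brag/support) context as the code above, and the regexp is assumed to be anchored to the end of the lexeme, e.g. #px"\n\\*+$" for the pattern used here, mirroring find-last's backward search):

(define (token-stop-before/rx TOKEN TOKEN-EOF lexeme stop-prefix-rx input-port start-pos)
  ; stop-prefix-rx should match, at the end of the lexeme, the part of the
  ; stop pattern that from/stop-before consumed (everything but its final element)
  (if (eof-object? (peek-char input-port))
      (token TOKEN-EOF lexeme)
      (let ([m (regexp-match-positions stop-prefix-rx lexeme)])
        (if m
            (let ([start (caar m)])
              ; rewind the port so the stop pattern is re-read by later rules;
              ; same arithmetic as token-stop-before above
              (file-position input-port (+ (position-offset start-pos) (sub1 start)))
              (token TOKEN (substring lexeme 0 start)))
            ; no trailing stop prefix in the lexeme; leave the port alone
            (token TOKEN lexeme)))))

(define ok-lexer-rx
  (lexer
   [(from/stop-before "start" (:seq "\n" (:+ "*") (:or " " "\t")))
    (token-stop-before/rx 'ARGH 'ARGH-EOF lexeme #px"\n\\*+$" input-port start-pos)]
   [any-char (token 'AC lexeme)]))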
mbutterick commented 3 years ago

OK. I will update the docs. I think token-stop-before would make more sense as an add-on or patch for the underlying parser-tools package, because that’s the primary residence of those lexer matchers.

mbutterick commented 3 years ago

Closed by 243f32a7bb8c52bf330f2eaabc80a5c6715bace9

tgbugs commented 3 years ago

Just a note in case anyone comes across this in the future: from/stop-before can be used with :seq or any pattern that can match more than a single literal token, but it only ever stops before the final element of the pattern.
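
A minimal sketch of what that means in practice (demo-lexer, the input string, and the expected lexeme are illustrative, inferred from the behavior described above rather than re-run; apply-lexer comes from brag/support):

#lang racket
(require brag/support)

(define demo-lexer
  (lexer
   ; the stop pattern has two elements: "\n" and then "*"
   [(from/stop-before "start" (:seq "\n" "*")) (token 'HEAD lexeme)]
   [any-char (token 'OTHER lexeme)]))

; Given "start\n*x", the HEAD lexeme should come out as "start\n":
; the match stops before the final "*" only, not before the leading "\n".
(apply-lexer demo-lexer "start\n*x")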