stchang / parsack

A basic Parsec-like monadic parser combinator library implementation in Racket.
MIT License
50 stars 10 forks source link

Can `lookAhead` return `Consumed` instead of `Empty`? #34

Closed greghendershott closed 10 years ago

greghendershott commented 10 years ago

When I try to use lookAhead as the termination parser in manyTill with the status quo I get:

;; Intent: Parse until "X" but leave it for consumption.
(parse (manyTill $anyChar (lookAhead (char #\X)))
       "abcX")
; parse ERROR: at 1:5:5
; unexpected: "end of input"
;   expected: "X"

However if I change the definition of lookAhead to return Consumed instead of Empty, like so:

;; Parse p and return the result, but don't consume input.
(define (lookAhead p)
  (match-lambda
    [(and input (State inp pos _))
     (match (p input)
       [(Consumed! (Ok result _ (Msg _ str strs)))
        #;(Empty (Ok result input (Msg pos inp strs)))   ;; <--
        (Consumed (Ok result input (Msg pos inp strs)))] ;; <--
       [emp emp])]))

Then it works:

;; Intent: Parse until "X" but leave it in input for consumption.
(parse (manyTill $anyChar (lookAhead (char #\X)))
       "abcX")
(Consumed #<promise!#(struct:Ok (#\a #\b #\c) #(struct:State "X" #0=#(struct:Pos 1 4 4) #hasheq()) #(struct:Msg #0# "" ()))>)

I ran some unit tests against the updated definition and they all pass.

So I'm trying to understand why it was written to use Empty, not Consumed. I don't have a crisp grasp of the flow, so I'm worried I might be overlooking some reason that Empty was in fact important to use?

stchang commented 10 years ago

Conceptually, I imagined lookAhead to be analogous to peek. So you would use lookAhead when you want to see if the upcoming state has some form, and then you use the information gained to guide the parsing from the original position. If lookAhead consumed input, why wouldnt you just use the parser you pass to lookAhead directly?

ps, Haskell parsec's lookAhead works the same way I think.

greghendershott commented 10 years ago

Well the use case I had in mind was parsing HTML tags that have optional closing tags. All of the following are valid HTML, believe it or not:

<ul>
  <li>Zero</li>
  <li>One
  <li>Two
</ul>

The first closes with an explicit </li>, the second closes because another <li> implies a close, and the last closes because the parent ends.

So I was thinking of (manyTill $html-element (close-for-tag 'li)), where close-for-tag would be something like (<or> (close-tag 'li) (lookAhead (<or> (open-tag 'li) (close-tag 'ul) (close-tag 'ol)))).

In other words, don't consume those last three, leave them input for parsing as usual.

Having said all that, I'm going to tackle this differently, so I don't have the use case anymore. That plus you think that would be a weird change to lookAhead, so that is N/A and I'll close it.

stchang commented 10 years ago

Ok I think I understand the problem. Unfortunately, I think you've knocked up against something that is fundamental to Parsec. In the example

(parse (manyTill $anyChar (lookAhead (char #\X))) "abcX")

it errors because $anyChar and (char #\X) overlap. When this happens, Parsec is designed to always favor the parser that consumes more input (if I remember correctly from the original paper, this enables a more efficient implementation somehow). However, a lot of valid HTML violates the "non-overlapping" property, so you get a lot of headaches :)

If I replace your $html-element with a disjoint parser, then it works like you'd expect:

#lang racket
(require parsack)

(define ul-example "<ul><li>Zero</li><li>One<li>Two</ul>")

(define (open-tag name) 
  (try (parser-one (char #\<) (~> (string name)) (char #\>))))
(define (close-tag name) 
  (try (parser-one (char #\<) (char #\/) (~> (string name)) (char #\>))))

(define $li-element 
  (>> (open-tag "li") 
      (manyTill (noneOf "<")    ; <----- replaced $html-element
                (<or> (close-tag "li")
                      (lookAhead (<or> (open-tag "li") 
                                       (close-tag "ul") 
                                       (close-tag "ol")))))))
(define $ul-element
  (>> (open-tag "ul")
      (manyTill $li-element (close-tag "ul"))))

(parse $ul-element ul-example)

=>

(Consumed
 #<promise!#(struct:Ok
             ((#\Z #\e #\r #\o) (#\O #\n #\e) (#\T #\w #\o))
             #(struct:State "" #(struct:Pos 1 37 37) #hasheq())
             #(struct:Msg #(struct:Pos 1 37 37) "" ()))>)

Of course I grossly oversimplified things and maybe it's not feasible to always get the non-overlap. An alternative is to add a new kind of parser. Unfortunately, the root of the problem is with <or>. <or> keeps going until a parser consumes input. What would fix your problem is a "short-circuiting" version of <or> that stops as soon as a parser returns non-failure, even if it consumes no input. I'm happy to add the new <or> (name suggestions welcome) but then there's still the problem that manyTill and friends are implemented in terms of <or> so we would need new versions of every parser that uses <or>.

So let me know which you prefer.

stchang commented 10 years ago

I've pushed a change that I think gives you what you want. I added an <any> parser that is the "short-circuiting" <or> I mentioned before. I then used it to implement manyUntil and many1Until. (Actually, I added some keywords to many to make it more "rackety" and so (manyUntil p end) is just a wrapper for (many p #:till end #:or <any>).

The changes are currently in the dev branch still. I ran your markdown tests and I think it all passes, but rather than merge to master and break your stuff like last time, I'll let you take a look first this time :) Let me know what you think. If you don't like any of the new parser names, I'm open to suggestions. If you approve of everything, I'll merge to master (and update the docs).

Here's your previous example:

> (parse (manyTill $anyChar (lookAhead (char #\X))) "abcX")
parse ERROR: at 1:5:5
unexpected: "end of input"
  expected: "X"
> (parse (manyUntil $anyChar (lookAhead (char #\X))) "abcX")
(Consumed
 #<promise!#(struct:Ok
             (#\a #\b #\c)
             #(struct:State "X" #(struct:Pos 1 4 4) #hasheq())
             #(struct:Msg #(struct:Pos 1 4 4) "" ()))>)

Any here's the html list example, with an overlapping parser:

#lang racket
(require parsack)

(define ul-example "<ul><li>Zero</li><li>One<li>Two</ul>")

(define (open-tag name) 
  (try (parser-one (char #\<) (~> (string name)) (char #\>))))
(define (close-tag name) 
  (try (parser-one (char #\<) (char #\/) (~> (string name)) (char #\>))))

(define $li-element 
  (>> (open-tag "li") 
      (manyUntil $anyChar  ; <----- repeated parser overlaps with end condition
                (<or> (close-tag "li")
                      (lookAhead (<or> (open-tag "li") 
                                       (close-tag "ul") 
                                       (close-tag "ol")))))))
(define $ul-element
  (>> (open-tag "ul")
      (manyTill $li-element (close-tag "ul"))))

(parse $ul-element ul-example)

=>

(Consumed
 #<promise!#(struct:Ok
             ((#\Z #\e #\r #\o) (#\O #\n #\e) (#\T #\w #\o))
             #(struct:State "" #(struct:Pos 1 37 37) #hasheq())
             #(struct:Msg #(struct:Pos 1 37 37) "" ()))>)

The above examples would have also worked if I used manyTill #:or <any> instead of manyUntil.

greghendershott commented 10 years ago

That's awesome!

I was heads-down and able to make progress with the status quo <or> based on the hint you gave. In fact I think I'm close to finishing my reworked html parser. I think I need to focus on testing, fixing bugs, and releasing that, because I've been working on it too long already and there's an outstanding performance bug.

However I'm looking forward to looping back and looking at this. Thanks!!

greghendershott commented 10 years ago

OK, I had a chance to loop back and look at this. It's awesome. In fact I do need it after all. I had a hole in my tests that made me think otherwise.

Tomorrow I plan to merge dev to master. I'll bump the parsack version, too, so I can update my downstream dependency to require it.