Closed greghendershott closed 10 years ago
Conceptually, I imagined lookAhead
to be analogous to peek
. So you would use lookAhead
when you want to see if the upcoming state has some form, and then you use the information gained to guide the parsing from the original position. If lookAhead
consumed input, why wouldnt you just use the parser you pass to lookAhead
directly?
ps, Haskell parsec's lookAhead
works the same way I think.
Well the use case I had in mind was parsing HTML tags that have optional closing tags. All of the following are valid HTML, believe it or not:
<ul>
<li>Zero</li>
<li>One
<li>Two
</ul>
The first closes with an explicit </li>
, the second closes because another <li>
implies a close, and the last closes because the parent ends.
So I was thinking of (manyTill $html-element (close-for-tag 'li))
, where close-for-tag
would be something like (<or> (close-tag 'li) (lookAhead (<or> (open-tag 'li) (close-tag 'ul) (close-tag 'ol))))
.
In other words, don't consume those last three, leave them input for parsing as usual.
Having said all that, I'm going to tackle this differently, so I don't have the use case anymore. That plus you think that would be a weird change to lookAhead
, so that is N/A and I'll close it.
Ok I think I understand the problem. Unfortunately, I think you've knocked up against something that is fundamental to Parsec. In the example
(parse (manyTill $anyChar (lookAhead (char #\X))) "abcX")
it errors because $anyChar
and (char #\X)
overlap. When this happens, Parsec is designed to always favor the parser that consumes more input (if I remember correctly from the original paper, this enables a more efficient implementation somehow). However, a lot of valid HTML violates the "non-overlapping" property, so you get a lot of headaches :)
If I replace your $html-element
with a disjoint parser, then it works like you'd expect:
#lang racket
(require parsack)
(define ul-example "<ul><li>Zero</li><li>One<li>Two</ul>")
(define (open-tag name)
(try (parser-one (char #\<) (~> (string name)) (char #\>))))
(define (close-tag name)
(try (parser-one (char #\<) (char #\/) (~> (string name)) (char #\>))))
(define $li-element
(>> (open-tag "li")
(manyTill (noneOf "<") ; <----- replaced $html-element
(<or> (close-tag "li")
(lookAhead (<or> (open-tag "li")
(close-tag "ul")
(close-tag "ol")))))))
(define $ul-element
(>> (open-tag "ul")
(manyTill $li-element (close-tag "ul"))))
(parse $ul-element ul-example)
=>
(Consumed
#<promise!#(struct:Ok
((#\Z #\e #\r #\o) (#\O #\n #\e) (#\T #\w #\o))
#(struct:State "" #(struct:Pos 1 37 37) #hasheq())
#(struct:Msg #(struct:Pos 1 37 37) "" ()))>)
Of course I grossly oversimplified things and maybe it's not feasible to always get the non-overlap. An alternative is to add a new kind of parser. Unfortunately, the root of the problem is with <or>
. <or>
keeps going until a parser consumes input. What would fix your problem is a "short-circuiting" version of <or>
that stops as soon as a parser returns non-failure, even if it consumes no input. I'm happy to add the new <or>
(name suggestions welcome) but then there's still the problem that manyTill
and friends are implemented in terms of <or>
so we would need new versions of every parser that uses <or>
.
So let me know which you prefer.
I've pushed a change that I think gives you what you want. I added an <any>
parser that is the "short-circuiting" <or>
I mentioned before. I then used it to implement manyUntil
and many1Until
. (Actually, I added some keywords to many
to make it more "rackety" and so (manyUntil p end)
is just a wrapper for (many p #:till end #:or <any>)
.
The changes are currently in the dev
branch still. I ran your markdown tests and I think it all passes, but rather than merge to master
and break your stuff like last time, I'll let you take a look first this time :) Let me know what you think. If you don't like any of the new parser names, I'm open to suggestions. If you approve of everything, I'll merge to master (and update the docs).
Here's your previous example:
> (parse (manyTill $anyChar (lookAhead (char #\X))) "abcX")
parse ERROR: at 1:5:5
unexpected: "end of input"
expected: "X"
> (parse (manyUntil $anyChar (lookAhead (char #\X))) "abcX")
(Consumed
#<promise!#(struct:Ok
(#\a #\b #\c)
#(struct:State "X" #(struct:Pos 1 4 4) #hasheq())
#(struct:Msg #(struct:Pos 1 4 4) "" ()))>)
Any here's the html list example, with an overlapping parser:
#lang racket
(require parsack)
(define ul-example "<ul><li>Zero</li><li>One<li>Two</ul>")
(define (open-tag name)
(try (parser-one (char #\<) (~> (string name)) (char #\>))))
(define (close-tag name)
(try (parser-one (char #\<) (char #\/) (~> (string name)) (char #\>))))
(define $li-element
(>> (open-tag "li")
(manyUntil $anyChar ; <----- repeated parser overlaps with end condition
(<or> (close-tag "li")
(lookAhead (<or> (open-tag "li")
(close-tag "ul")
(close-tag "ol")))))))
(define $ul-element
(>> (open-tag "ul")
(manyTill $li-element (close-tag "ul"))))
(parse $ul-element ul-example)
=>
(Consumed
#<promise!#(struct:Ok
((#\Z #\e #\r #\o) (#\O #\n #\e) (#\T #\w #\o))
#(struct:State "" #(struct:Pos 1 37 37) #hasheq())
#(struct:Msg #(struct:Pos 1 37 37) "" ()))>)
The above examples would have also worked if I used manyTill #:or <any>
instead of manyUntil
.
That's awesome!
I was heads-down and able to make progress with the status quo <or>
based on the hint you gave. In fact I think I'm close to finishing my reworked html parser. I think I need to focus on testing, fixing bugs, and releasing that, because I've been working on it too long already and there's an outstanding performance bug.
However I'm looking forward to looping back and looking at this. Thanks!!
OK, I had a chance to loop back and look at this. It's awesome. In fact I do need it after all. I had a hole in my tests that made me think otherwise.
Tomorrow I plan to merge dev
to master
. I'll bump the parsack version, too, so I can update my downstream dependency to require it.
When I try to use
lookAhead
as the termination parser inmanyTill
with the status quo I get:However if I change the definition of
lookAhead
to returnConsumed
instead ofEmpty
, like so:Then it works:
I ran some unit tests against the updated definition and they all pass.
So I'm trying to understand why it was written to use
Empty
, notConsumed
. I don't have a crisp grasp of the flow, so I'm worried I might be overlooking some reason thatEmpty
was in fact important to use?