stchang / parsack

A basic Parsec-like monadic parser combinator library implementation in Racket.
MIT License
50 stars 10 forks source link

`parse-result` should handle (Empty (OK _ _ _)). #22

Closed greghendershott closed 10 years ago

greghendershott commented 10 years ago

Using randomized testing of my markdown parser, I discovered that sometimes I would get this error:

; parse-result: (Empty (Ok '() (State "" #0=(Pos 0 0 0)) (Msg #0# "end of input" '("space or tab" "new-line" "block" "space or tab" "new-line" "end-of-file"))))

However the parse result should have been OK. It would have been a sensible result for the input.

parse-result was expecting only (Consumed (OK)) and treating all else as an error. I think it should also handle (Empty (Ok)).

In fact I wonder if it needs to because lookAhead can now return something of the form (Empty (Ok)), as a result of: https://github.com/stchang/parsack/pull/17

FWIW, after making the change in this commit, I ran Parsack's tests, as well as my markdown tests, including more of the randomized ones. All passed.

Even so, I'd appreciate your review. Not only do I want to make sure it's correct, I wonder if I've overlooked other places that should be changed similarly?

greghendershott commented 10 years ago

Even so, I'd appreciate your review. Not only do I want to make sure it's correct, I wonder if I've overlooked other places that should be changed similarly?

In other words, even though you've kindly given me the ability to merge this pull request, myself (thanks again!) I'd like you to review the change. But no rush. This is not an urgent showstopper.

stchang commented 10 years ago

Nice, I missed this too. Other things like $eof can return (Empty (Ok ...)) as well. Thanks for catching it.

ps - What are you using for random testing? redex? :)

greghendershott commented 10 years ago

ps - What are you using for random testing? redex? :)

I think that joke is over my head, as is redex. :)

I'm simply generating random characters and running them through the markdown grammar. Because unlike a programming language or even most text protocols, there's no such thing as a syntax error for markdown. Everything should parse to at least ((p () ,the-input-string)), or even just `(,the-input-string). Therefore I don't want users getting non-parses -- ever. As a result I'm trying to catch any corner cases in the grammar that would result in an error, through sheer bloody-minded trial and error.

greghendershott commented 10 years ago

p.s OK I realize I could catch any parse exception and... just return the original input as-is. Maybe eventually I'll add that as a safety net. But right now I'm curious to see what the randomized input might turn up.

stchang commented 10 years ago

Ah ok. I thought you were generating well-formed Markdown, where redex's random tester may help, but I think that's not what you are trying to test here?

greghendershott commented 10 years ago

Oh duh, now I remember the RacketCon talk about using redex for randomized testing.

stchang commented 10 years ago

To try it out, I wrote an html grammar in redex (see https://github.com/stchang/markdown/blob/master/markdown/redex-tests/redex-html.rkt) and then used redex-check to test parse-markdown on randomly generated html inputs.

(Note: the following reveals my ignorance of the html and markdown specs.)

This test failed: (parse-markdown "<img></img>") (for simplicity I generated only pairs of open-close tags). It parsed to '((img ()) "</img>") when I expected '((img ())). A quick google search says that <img></img> is xhtml but not html, but I don't know if that means it is supposed to be valid markdown? For comparison though, a similar case <br></br> parses to '((br ())).

greghendershott commented 10 years ago

To try it out, I wrote an html grammar in redex (see https://github.com/stchang/markdown/blob/master/markdown/redex-tests/redex-html.rkt) and then used redex-check to test parse-markdown on randomly generated html inputs.

Neat. Yes, I want to loop back and improve the HTML grammar to make sure it handles HTML generally correctly, not just the subset of HTML that's likely to be in markdown.

This test failed: (parse-markdown "<img></img>")

This is a specific issue where I need to get crisp on which of the following are legal:

  1. <img /> ?
  2. <img></img> ?
  3. <img> ?
  4. All of the above? But if so how to disambiguate 3 and 3?

I discovered an old markdown doc where I'd used 3 in one case. I made a hack to allow that, commenting it as such:

https://github.com/greghendershott/markdown/blob/master/markdown/parse.rkt#L171-L186

Maybe instead I should have disallowed it and changed the old markdown doc.

And to be honest I'm not 100% sure what's the target -- HTML, XHTML, HTML5? I'll double-check but I think this is one of the very many things that the markdown "spec" under-specifies. Instead I should probably defer to he wisdom of the crowd, i.e. a survey of GitHub, StackOverflow, and BabelMark.

greghendershott commented 10 years ago

After taking out the hack the redex test reported "no counterexamples in 10000 attempts".

OK, it looks like I need to do some homework regarding HTML, XHTML, and HTML5 with respect to self-closing tags (e.g. <br />) and void elements (e.g. <meta> not <meta /> or <meta>/meta>) and decide what the goal is, what is the desired grammar.

Meanwhile, if you wanted to make a pull request with that redex test, I'd happily accept it. That's very cool!

greghendershott commented 10 years ago

Although I want to think more about this, meanwhile I just pushed https://github.com/greghendershott/markdown/commit/65e1487e09fde453d33a0713eb56550b629e78c5 which I think may be sufficient.

greghendershott commented 10 years ago

BTW if you do make a pull request with your redex-test.rkt (as I hope you will!), could you please change the (require redex) to (require redex/reduction-semantics)? At least on OS X that avoids the GUI weirdness with the invisible Racket window while it runs. Also this might make it more likely to run in non-GUI env like Travis CI.