greghendershott / markdown

Markdown parser written in Racket.
101 stars 28 forks source link

parser cannot handle Windows-style line breaks #27

Closed stchang closed 10 years ago

stchang commented 10 years ago

I believe the parser does not handle Windows-style line breaks, ie "\r\n", properly. For example, all the tests that rely on test.md fail in Windows.

Here is another example:

> (parse-markdown "[test][1]\n\n[1]:http://test.com \"test-title\"")
'((p () (a ((href "http://test.com") (title "test-title")) "test")))
> (parse-markdown "[test][1]\r\n\r\n[1]:http://test.com \"test-title\"")
Reference link not defined: (linkref "1")
'((a ((href "")) "test") "\r \r [1]:http://test.com " ldquo "test-title" rdquo)
greghendershott commented 10 years ago

Thanks for catching that!

To get this test to pass:

(check-equal?
   (parse-markdown "[test][1]\n\n[1]:http://test.com \"test-title\"")
   (parse-markdown "[test][1]\r\n\r\n[1]:http://test.com \"test-title\""))

what I tried just now was this:

(require (rename-in parsack
                    [$newline $parsack-newline])

(define $newline
  ;; `pdo` is just my shorthand for `parser-compose`. I would have named it `do` except
  ;; for various editors having another idea about indenting that. :)
  (<?> (pdo (optional (char #\return))
            (char #\newline)
            (return #\newline))
       "new-line"))

Running markdown/perf-test.rkt I didn't see any appreciable slowdown as a result of this.

The status quo $newline in Parsack is simply:

(define $newline (<?> (char #\newline) "new-line"))

Should we change this in Parsack?

Otherwise I could have parse-markdown do a regexp-replace on the input string before feeding it to the parser. On the one hand that's more-targeted and perhaps safer than changing it in Parasack. On the other hand, that's a bit gross, and more importantly, it seems like most parsers would want this? Hmm, maybe not. An HTTP request parser might want to be picky about requiring exactly "\r\n" or"\n". Although, real-world ones seem to tolerate either.

What do you think?

greghendershott commented 10 years ago

p.s. Of course a third choice is that I define a $cr-or-cr-lf parser (maybe a better name than that) and replace all my uses of Parsack's $newline with that -- i.e. essentially what I tried above, that works.

That's definitely what I would do if I didn't know the Parsack author. :) It's also what the Parsack author prefer me to do if they didn't want to risk any backward incompatibility, and/or wanted to match Parsec in this regard exactly. BTW what does Parsec do for this?

stchang commented 10 years ago

Sorry, this report should have been more constructive. Can you use parsack's $eol? (I should probably update the docs to clarify the difference between $newline and $eol).

greghendershott commented 10 years ago

Yeah, my fault, I probably saw $eol very early on, but forgot it existed. I pushed a commit redoing it the easier/simpler way. :)

stchang commented 10 years ago

So one test is still failing (on windows machine):

$ /c/pltpkg/racket/raco test -x test.rkt
raco test: (submod "test.rkt" test)
diff: extra operand
diff: Try `diff --help' for more information.
--------------------
FAILURE
actual:     2
expected:   0
name:       check-equal?
location:   (#<path:c:\Users\Stephen Chang\Documents\racket\markdown\markdown\test.rkt> 37 2 1218 92)
expression: (check-equal? (system/exit-code (~a "diff " test.html " " test.out.html)) 0)

Check failure
--------------------
1/112 test failures

Doing a diff suggests that it's still a line break issue.

stchang commented 10 years ago

Here's one line of diff. Line 11 in test.out.html is broken into two lines:

<p>Here is some word-level formatting. Also, this uses two spaces before
newline to ensure hard line breaks:</p>

while in test.html, it's one line:

<p>Here is some word-level formatting. Also, this uses two spaces before newline to ensure hard line breaks:</p>

And there are many similar others, eg "italic phrase wrapped across two lines"

greghendershott commented 10 years ago

Can you open test.out.html in something capable of seeing what's after "before"?

Is it solely a \r, perhaps? Or some other combo of \r and/or \n? Please let me know.

The file's at:

(build-path (find-system-path 'temp-dir) "test.out.html"))

Maybe C:\tmp\test.out.html for you? The test does not delete it afterwards.

p.s. Yes I should try on Windows myself. But my Windows image is stale. Need to set aside some time to update with latest tools, source, etc. Will do, eventually.

greghendershott commented 10 years ago

Can you open test.out.html in something capable of seeing what's after "before"?

Is it solely a \r, perhaps? Or some other combo of \r and/or \n? Please let me know.

Actually I'm 99% sure you'll find a \r there.

I just pushed a commit. See its comments via link above, but in short: My new take on this is that parse-markdown reverts back to expecting \n line terminations, only. When reading files, callers should use #:mode 'text. #:mode 'text is designed for exactly this; let it do its job, and don't complicate the parser itself with anything except \n.

stchang commented 10 years ago

Yes you were absolutely right about the \r. Here is the result of

(port->string 
 (open-input-file 
  (build-path (find-system-path 'temp-dir) "test.out.html")))

(ie with #:mode = 'binary)

"<!DOCTYPE html>\n<html>\n <head>\n  <meta charset=\"utf-8\" /></head>\n <body>\n  <h1 id=\"heading-level-1\">Heading level 1</h1>\n  <h2 id=\"heading-level-2\">Heading level 2</h2>\n  <h1 id=\"heading-level-1\">Heading level 1</h1>\n  <h2 id=\"heading-level-2\">Heading level 2</h2>\n  <p>Some normal text.</p>\n  <p>Here is some word-level formatting. Also, this uses two spaces before\r newline to ensure hard line breaks:</p>\n  <p><em>Italic</em>.\n   <br /><em>Italic</em>.\n   <br /><strong>Bold</strong>.\n   <br /><strong>Bold</strong>.\n   <br /><strong>Bold with <em>italic</em> inside it</strong>.\n   <br /><em>Italic with <strong>bold</strong> inside it</em>.\n   <br /><code>I am code</code>.</p>\n  <p>I am <em>italic phrase on one line</em>.</p>\n  <p>I am <em>italic phrase wrapped across two\r lines in the Mardkdown source</em>, did that work?</p>\n  <p>This is _surrounded by literal underlines_ and this is *surrounded\r by literal asterisks*.</p>\n  <p>A single backtick in a code span: <code>`</code></p>\n  <p>A backtick-delimited string in a code span: <code>`foo`</code></p>\n  <p>Here&rsquo;s an example &ndash; right here &ndash; of automatic <code>&amp;ndash;</code>, which\r properly <em>is</em> surrounded by spaces.</p>\n  <p>Here&rsquo;s an example&ndash;right here&ndash;of automatic <code>&amp;mdash;</code>, which properly\r is <em>not</em> surrounded by spaces.</p>\n  <p>Did <code>&lt;html&gt;</code> get escaped properly?</p>\n  <p>Here are some footnote<sup><a href=\"#unit-test-footnote-1-definition\" name=\"unit-test-footnote-1-return\">1</a></sup> examples<sup><a href=\"#unit-test-footnote-2-definition\" name=\"unit-test-footnote-2-return\">2</a></sup>. A non-numeric footnote\r label<sup><a href=\"#unit-test-footnote-3-definition\" name=\"unit-test-footnote-3-return\">3</a></sup>.</p>\n  <p>Bulleted (unordered) list:</p>\n  <ul>\n   <li>Bullet 1</li>\n   <li>Bullet 2</li>\n   <li>Bullet 3</li></ul>\n  <p>Ordered list:</p>\n  <ol>\n   <li>Item 1</li>\n   <li>Item 2</li>\n   <li>Item 3</li></ol>\n  <p>Bulleted list with space between bullets:</p>\n  <ul>\n   <li>\n    <p>Item 1</p></li>\n   <li>\n    <p>Item 2</p></li>\n   <li>\n    <p>Item 3</p></li></ul>\n  <p>Bulleted, with sublists:</p>\n  <ul>\n   <li>Bullet 1\n    <ul>\n     <li>Bullet 1a\n      <ul>\n       <li>Bullet 1a1</li></ul></li>\n     <li>Bullet 1b</li></ul></li>\n   <li>Bullet 2\n    <ul>\n     <li>Bullet 2a</li>\n     <li>Bullet 2b</li></ul></li></ul>\n  <p>Ordered, with sublists:</p>\n  <ol>\n   <li>One\n    <ol>\n     <li>One / One</li>\n     <li>One / Two</li></ol></li>\n   <li>Two\n    <ol>\n     <li>Two / One</li>\n     <li>Two / Two</li></ol></li></ol>\n  <p>A code block that was denoted by 4 leading spaces:</p>\n  <pre>Code line 1.\r\nCode line 2. These spaces [    ] should be preserved.\r\nCode line 3. Don't do _italic_ or **bold** in here.\r\nCode line 4. Stuff like &lt;html&gt; and &amp; should be escaped.</pre>\n  <p>A code block that was denoted by backticks:</p>\n  <pre class=\"brush: \r\">Code line 1.\r\nCode line 2. These spaces [    ] should be preserved.\r\nCode line 3. Don't do _italic_ or **bold** in here.\r\nCode line 4. Stuff like &lt;html&gt; and &amp; should be escaped.</pre>\n  <p>A code block that was denoted by backticks, with a lang:</p>\n  <pre class=\"brush: racket\r\">Code line 1.\r\nCode line 2. These spaces [    ] should be preserved.\r\nCode line 3. Don't do _italic_ or **bold** in here.\r\nCode line 4. Stuff like &lt;html&gt; and &amp; should be escaped.</pre>\n  <p>Back to non-code-block. Next, how about a block quote:</p>\n  <blockquote>\n   <p>Here is a multi-line block quote started\r with a single <code>&gt;</code> to open it. Word formatting\r like <em>italic</em> and <strong>bold</strong> <em>should</em> work in this\r kind of block.</p></blockquote>\n  <p>And:</p>\n  <blockquote>\n   <p>Another multi-line block quote, where\r each line starts with its own <code>&gt;</code> char.</p></blockquote>\n  <p>Here is a link: <a href=\"http://www.racket-lang.org/\">Racket</a></p>\n  <p>Here are linkref style links: <a href=\"http://www.racket-lang.org/\" title=\"Racket\">Racket</a> and <a href=\"http://www.google.com/\" title=\"Google\">Google</a>.</p>\n  <p>And here are the footnote refs:</p>\n  <p>Here an auto-link: <a href=\"http://www.racket-lang.org\">http://www.racket-lang.org</a></p>\n  <p>Here an auto-link email: <a href=\"mailto:foo@bar.com\">foo@bar.com</a></p>\n  <p>Finally, here is an image:</p>\n  <p><img src=\"http://racket-lang.org/logo.png\" alt=\"alt\" title=\"Racket logo\" /></p>\n  <p>Here is [normal bracketed text].</p>\n  <p>[More normal backeted text.]</p>\n  <p>Following are <code>&lt;hr&gt;</code> elements:</p>\n  <hr />\n  <hr />\n  <hr />\n  <hr />\n  <p>How about literal HTML?</p>\n  <p>Here&rsquo;s a table:</p>\n  <table border=\"1\">\r \n   <tr>\r \n    <td>Row 1 Col 1</td>\r \n    <td>Row 1 Col 2</td></tr>\r \n   <tr>\r \n    <td>Row 2 Col 1</td>\r \n    <td>Row 2 Col 2</td></tr></table>\n  <p>Here is <code>&lt;span style=\"font-weight:bold\"&gt;span&lt;/span&gt;</code> &ndash; <span style=\"font-weight:bold\">span</span> &ndash; in the middle of a sentence.</p>\n  <pre class=\"brush: \r\">code block</pre><!-- more-->\n  <p>Some text.</p>\n  <p>The end.</p>\n  <div class=\"footnotes\">\n   <ol>\n    <li id=\"unit-test-footnote-1-definition\" class=\"footnote-definition\">\n     <p>Footnote definition 1.\r&nbsp;<a href=\"#unit-test-footnote-1-return\">↩</a></p></li>\n    <li id=\"unit-test-footnote-2-definition\" class=\"footnote-definition\">\n     <p>The first paragraph of the definition.</p>\n     <p>Paragraph two of the definition.</p>\n     <blockquote>\n      <p>A blockquote with\r multiple lines.</p></blockquote>\n     <pre>a code block</pre>\n     <p>A final paragraph.\r&nbsp;<a href=\"#unit-test-footnote-2-return\">↩</a></p></li>\n    <li id=\"unit-test-footnote-3-definition\" class=\"footnote-definition\">\n     <p>Definition of footnote def.\r&nbsp;<a href=\"#unit-test-footnote-3-return\">↩</a></p></li></ol></div></body></html>"
stchang commented 10 years ago

And I agree with the reversion to just \n and using #:mode 'text.

With the newest changes, perf-test.rkt also needs a #:mode 'text, and the diff test in test.rkt still breaks. I replaced the system call with

  (check-equal? (file->string test.html #:mode 'text)
                (file->string test.out.html #:mode 'text))

to get it working.

With these changes, all the tests pass for me. (I would have done a pull request but I messed up my branch with other commits.)

ps - Does Travis support multiple platforms?

greghendershott commented 10 years ago

perf-test.rkt also needs a #:mode 'text,

I didn't think this would matter, because this doesn't care what the output is, just how long it took. But even so, it might change the timing of the parse. In any case sets a good example of usage. Will change. Thanks!

the diff test in test.rkt still breaks. I replaced the system call with

 (check-equal? (file->string test.html #:mode 'text)
               (file->string test.out.html #:mode 'text))

Shoot. I was hoping to keep the diff because the output is so useful when there is a mismatch.

Probably the reference file markdown/test/test.html gets changed from \n to \r\n on git pull -- due to an autocrlf option in your .gitconfig -- but the generated test.out.html has \n?

If so, I could use #:mode 'text when generating it, so it gets \r\n:

  (with-output-to-file test.out.html
                       #:exists 'replace
                       #:mode 'text ; <----- add
                        . . .

And then the diff should work.

greghendershott commented 10 years ago

ps - Does Travis support multiple platforms?

In addition to the Ubuntu env, apparently there's an OS X environment but I don't know much about it. (For something like this markdown project, doubt it would add much value. Linux and OS X are pretty compatible when it comes to vanilla headless things like a markdown parser.)

What would add value is Windows. But AFIK, alas, no.

stchang commented 10 years ago

perf-test.rkt also needs a #:mode 'text,

I didn't think this would matter

It matters because the parse doesn't finish properly. I'm not at home to check the exact problem but it was a hash-ref "1" error so probably a reference got parsed incorrectly, similar to the example I posted a few comments up in this thread.

What would add value is Windows. But AFIK, alas, no.

Ah yeah that's what I was getting at. Too bad.

greghendershott commented 10 years ago

Thanks again for your help. Tonight if you could please, try commit d356bdb and let me know?

(If it's still not working, then I'll find time to freshen my Windows env and test directly.)

stchang commented 10 years ago

Tonight if you could please, try commit d356bdb and let me know?

Yup, will let you know.

stchang commented 10 years ago

Everything passes in Windows now.

greghendershott commented 10 years ago

Great, and thanks for letting me know!