Closed stchang closed 10 years ago
Thanks for catching that!
To get this test to pass:
(check-equal?
(parse-markdown "[test][1]\n\n[1]:http://test.com \"test-title\"")
(parse-markdown "[test][1]\r\n\r\n[1]:http://test.com \"test-title\""))
what I tried just now was this:
(require (rename-in parsack
[$newline $parsack-newline])
(define $newline
;; `pdo` is just my shorthand for `parser-compose`. I would have named it `do` except
;; for various editors having another idea about indenting that. :)
(<?> (pdo (optional (char #\return))
(char #\newline)
(return #\newline))
"new-line"))
Running markdown/perf-test.rkt
I didn't see any appreciable slowdown as a result of this.
The status quo $newline
in Parsack is simply:
(define $newline (<?> (char #\newline) "new-line"))
Should we change this in Parsack?
Otherwise I could have parse-markdown
do a regexp-replace
on the input string before feeding it to the parser. On the one hand that's more-targeted and perhaps safer than changing it in Parasack. On the other hand, that's a bit gross, and more importantly, it seems like most parsers would want this? Hmm, maybe not. An HTTP request parser might want to be picky about requiring exactly "\r\n"
or"\n"
. Although, real-world ones seem to tolerate either.
What do you think?
p.s. Of course a third choice is that I define a $cr-or-cr-lf
parser (maybe a better name than that) and replace all my uses of Parsack's $newline
with that -- i.e. essentially what I tried above, that works.
That's definitely what I would do if I didn't know the Parsack author. :) It's also what the Parsack author prefer me to do if they didn't want to risk any backward incompatibility, and/or wanted to match Parsec in this regard exactly. BTW what does Parsec do for this?
Sorry, this report should have been more constructive. Can you use parsack's $eol
? (I should probably update the docs to clarify the difference between $newline
and $eol
).
Yeah, my fault, I probably saw $eol
very early on, but forgot it existed. I pushed a commit redoing it the easier/simpler way. :)
So one test is still failing (on windows machine):
$ /c/pltpkg/racket/raco test -x test.rkt
raco test: (submod "test.rkt" test)
diff: extra operand
diff: Try `diff --help' for more information.
--------------------
FAILURE
actual: 2
expected: 0
name: check-equal?
location: (#<path:c:\Users\Stephen Chang\Documents\racket\markdown\markdown\test.rkt> 37 2 1218 92)
expression: (check-equal? (system/exit-code (~a "diff " test.html " " test.out.html)) 0)
Check failure
--------------------
1/112 test failures
Doing a diff suggests that it's still a line break issue.
Here's one line of diff. Line 11 in test.out.html
is broken into two lines:
<p>Here is some word-level formatting. Also, this uses two spaces before
newline to ensure hard line breaks:</p>
while in test.html
, it's one line:
<p>Here is some word-level formatting. Also, this uses two spaces before newline to ensure hard line breaks:</p>
And there are many similar others, eg "italic phrase wrapped across two lines"
Can you open test.out.html
in something capable of seeing what's after "before"
?
Is it solely a \r
, perhaps? Or some other combo of \r
and/or \n
? Please let me know.
The file's at:
(build-path (find-system-path 'temp-dir) "test.out.html"))
Maybe C:\tmp\test.out.html
for you? The test does not delete it afterwards.
p.s. Yes I should try on Windows myself. But my Windows image is stale. Need to set aside some time to update with latest tools, source, etc. Will do, eventually.
Can you open test.out.html in something capable of seeing what's after "before"?
Is it solely a \r, perhaps? Or some other combo of \r and/or \n? Please let me know.
Actually I'm 99% sure you'll find a \r
there.
I just pushed a commit. See its comments via link above, but in short: My new take on this is that parse-markdown
reverts back to expecting \n
line terminations, only. When reading files, callers should use #:mode 'text
. #:mode 'text
is designed for exactly this; let it do its job, and don't complicate the parser itself with anything except \n
.
Yes you were absolutely right about the \r
. Here is the result of
(port->string
(open-input-file
(build-path (find-system-path 'temp-dir) "test.out.html")))
(ie with #:mode
= 'binary
)
"<!DOCTYPE html>\n<html>\n <head>\n <meta charset=\"utf-8\" /></head>\n <body>\n <h1 id=\"heading-level-1\">Heading level 1</h1>\n <h2 id=\"heading-level-2\">Heading level 2</h2>\n <h1 id=\"heading-level-1\">Heading level 1</h1>\n <h2 id=\"heading-level-2\">Heading level 2</h2>\n <p>Some normal text.</p>\n <p>Here is some word-level formatting. Also, this uses two spaces before\r newline to ensure hard line breaks:</p>\n <p><em>Italic</em>.\n <br /><em>Italic</em>.\n <br /><strong>Bold</strong>.\n <br /><strong>Bold</strong>.\n <br /><strong>Bold with <em>italic</em> inside it</strong>.\n <br /><em>Italic with <strong>bold</strong> inside it</em>.\n <br /><code>I am code</code>.</p>\n <p>I am <em>italic phrase on one line</em>.</p>\n <p>I am <em>italic phrase wrapped across two\r lines in the Mardkdown source</em>, did that work?</p>\n <p>This is _surrounded by literal underlines_ and this is *surrounded\r by literal asterisks*.</p>\n <p>A single backtick in a code span: <code>`</code></p>\n <p>A backtick-delimited string in a code span: <code>`foo`</code></p>\n <p>Here’s an example – right here – of automatic <code>&ndash;</code>, which\r properly <em>is</em> surrounded by spaces.</p>\n <p>Here’s an example–right here–of automatic <code>&mdash;</code>, which properly\r is <em>not</em> surrounded by spaces.</p>\n <p>Did <code><html></code> get escaped properly?</p>\n <p>Here are some footnote<sup><a href=\"#unit-test-footnote-1-definition\" name=\"unit-test-footnote-1-return\">1</a></sup> examples<sup><a href=\"#unit-test-footnote-2-definition\" name=\"unit-test-footnote-2-return\">2</a></sup>. A non-numeric footnote\r label<sup><a href=\"#unit-test-footnote-3-definition\" name=\"unit-test-footnote-3-return\">3</a></sup>.</p>\n <p>Bulleted (unordered) list:</p>\n <ul>\n <li>Bullet 1</li>\n <li>Bullet 2</li>\n <li>Bullet 3</li></ul>\n <p>Ordered list:</p>\n <ol>\n <li>Item 1</li>\n <li>Item 2</li>\n <li>Item 3</li></ol>\n <p>Bulleted list with space between bullets:</p>\n <ul>\n <li>\n <p>Item 1</p></li>\n <li>\n <p>Item 2</p></li>\n <li>\n <p>Item 3</p></li></ul>\n <p>Bulleted, with sublists:</p>\n <ul>\n <li>Bullet 1\n <ul>\n <li>Bullet 1a\n <ul>\n <li>Bullet 1a1</li></ul></li>\n <li>Bullet 1b</li></ul></li>\n <li>Bullet 2\n <ul>\n <li>Bullet 2a</li>\n <li>Bullet 2b</li></ul></li></ul>\n <p>Ordered, with sublists:</p>\n <ol>\n <li>One\n <ol>\n <li>One / One</li>\n <li>One / Two</li></ol></li>\n <li>Two\n <ol>\n <li>Two / One</li>\n <li>Two / Two</li></ol></li></ol>\n <p>A code block that was denoted by 4 leading spaces:</p>\n <pre>Code line 1.\r\nCode line 2. These spaces [ ] should be preserved.\r\nCode line 3. Don't do _italic_ or **bold** in here.\r\nCode line 4. Stuff like <html> and & should be escaped.</pre>\n <p>A code block that was denoted by backticks:</p>\n <pre class=\"brush: \r\">Code line 1.\r\nCode line 2. These spaces [ ] should be preserved.\r\nCode line 3. Don't do _italic_ or **bold** in here.\r\nCode line 4. Stuff like <html> and & should be escaped.</pre>\n <p>A code block that was denoted by backticks, with a lang:</p>\n <pre class=\"brush: racket\r\">Code line 1.\r\nCode line 2. These spaces [ ] should be preserved.\r\nCode line 3. Don't do _italic_ or **bold** in here.\r\nCode line 4. Stuff like <html> and & should be escaped.</pre>\n <p>Back to non-code-block. Next, how about a block quote:</p>\n <blockquote>\n <p>Here is a multi-line block quote started\r with a single <code>></code> to open it. Word formatting\r like <em>italic</em> and <strong>bold</strong> <em>should</em> work in this\r kind of block.</p></blockquote>\n <p>And:</p>\n <blockquote>\n <p>Another multi-line block quote, where\r each line starts with its own <code>></code> char.</p></blockquote>\n <p>Here is a link: <a href=\"http://www.racket-lang.org/\">Racket</a></p>\n <p>Here are linkref style links: <a href=\"http://www.racket-lang.org/\" title=\"Racket\">Racket</a> and <a href=\"http://www.google.com/\" title=\"Google\">Google</a>.</p>\n <p>And here are the footnote refs:</p>\n <p>Here an auto-link: <a href=\"http://www.racket-lang.org\">http://www.racket-lang.org</a></p>\n <p>Here an auto-link email: <a href=\"mailto:foo@bar.com\">foo@bar.com</a></p>\n <p>Finally, here is an image:</p>\n <p><img src=\"http://racket-lang.org/logo.png\" alt=\"alt\" title=\"Racket logo\" /></p>\n <p>Here is [normal bracketed text].</p>\n <p>[More normal backeted text.]</p>\n <p>Following are <code><hr></code> elements:</p>\n <hr />\n <hr />\n <hr />\n <hr />\n <p>How about literal HTML?</p>\n <p>Here’s a table:</p>\n <table border=\"1\">\r \n <tr>\r \n <td>Row 1 Col 1</td>\r \n <td>Row 1 Col 2</td></tr>\r \n <tr>\r \n <td>Row 2 Col 1</td>\r \n <td>Row 2 Col 2</td></tr></table>\n <p>Here is <code><span style=\"font-weight:bold\">span</span></code> – <span style=\"font-weight:bold\">span</span> – in the middle of a sentence.</p>\n <pre class=\"brush: \r\">code block</pre><!-- more-->\n <p>Some text.</p>\n <p>The end.</p>\n <div class=\"footnotes\">\n <ol>\n <li id=\"unit-test-footnote-1-definition\" class=\"footnote-definition\">\n <p>Footnote definition 1.\r <a href=\"#unit-test-footnote-1-return\">↩</a></p></li>\n <li id=\"unit-test-footnote-2-definition\" class=\"footnote-definition\">\n <p>The first paragraph of the definition.</p>\n <p>Paragraph two of the definition.</p>\n <blockquote>\n <p>A blockquote with\r multiple lines.</p></blockquote>\n <pre>a code block</pre>\n <p>A final paragraph.\r <a href=\"#unit-test-footnote-2-return\">↩</a></p></li>\n <li id=\"unit-test-footnote-3-definition\" class=\"footnote-definition\">\n <p>Definition of footnote def.\r <a href=\"#unit-test-footnote-3-return\">↩</a></p></li></ol></div></body></html>"
And I agree with the reversion to just \n
and using #:mode 'text
.
With the newest changes, perf-test.rkt
also needs a #:mode 'text
, and the diff
test in test.rkt
still breaks. I replaced the system call with
(check-equal? (file->string test.html #:mode 'text)
(file->string test.out.html #:mode 'text))
to get it working.
With these changes, all the tests pass for me. (I would have done a pull request but I messed up my branch with other commits.)
ps - Does Travis support multiple platforms?
perf-test.rkt also needs a #:mode 'text,
I didn't think this would matter, because this doesn't care what the output is, just how long it took. But even so, it might change the timing of the parse. In any case sets a good example of usage. Will change. Thanks!
the
diff
test intest.rkt
still breaks. I replaced the system call with(check-equal? (file->string test.html #:mode 'text) (file->string test.out.html #:mode 'text))
Shoot. I was hoping to keep the diff
because the output is so useful when there is a mismatch.
Probably the reference file markdown/test/test.html
gets changed from \n
to \r\n
on git pull
-- due to an autocrlf
option in your .gitconfig
-- but the generated test.out.html
has \n
?
If so, I could use #:mode 'text
when generating it, so it gets \r\n
:
(with-output-to-file test.out.html
#:exists 'replace
#:mode 'text ; <----- add
. . .
And then the diff
should work.
ps - Does Travis support multiple platforms?
In addition to the Ubuntu env, apparently there's an OS X environment but I don't know much about it. (For something like this markdown project, doubt it would add much value. Linux and OS X are pretty compatible when it comes to vanilla headless things like a markdown parser.)
What would add value is Windows. But AFIK, alas, no.
perf-test.rkt also needs a #:mode 'text,
I didn't think this would matter
It matters because the parse doesn't finish properly. I'm not at home to check the exact problem but it was a hash-ref "1"
error so probably a reference got parsed incorrectly, similar to the example I posted a few comments up in this thread.
What would add value is Windows. But AFIK, alas, no.
Ah yeah that's what I was getting at. Too bad.
Thanks again for your help. Tonight if you could please, try commit d356bdb and let me know?
(If it's still not working, then I'll find time to freshen my Windows env and test directly.)
Tonight if you could please, try commit d356bdb and let me know?
Yup, will let you know.
Everything passes in Windows now.
Great, and thanks for letting me know!
I believe the parser does not handle Windows-style line breaks, ie
"\r\n"
, properly. For example, all the tests that rely ontest.md
fail in Windows.Here is another example: