jgm / pandoc

Universal markup converter
https://pandoc.org

Inline Raw HTML creates unbalanced tags #2146

Closed cefn closed 9 years ago

cefn commented 9 years ago

I have a markdown file which contains some raw HTML. Pandoc doesn't seem to be able to process it without creating problems.

Depending on how I indent it, I either get literal HTML markup appearing as text in the final HTML, or I get unbalanced tags which break my downstream XHTML processing pipeline, a bit like...

<form>
<p>
</form>
</p>

...which I guess should never happen.
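The kind of interleaving shown above can be detected mechanically before it reaches an XHTML pipeline. The following is a minimal sketch using Python's standard `html.parser`; the names `NestingChecker` and `misnested_tags`, and the short list of void elements, are illustrative assumptions and not part of pandoc or of the reporter's script.

```python
from html.parser import HTMLParser

# A small, non-exhaustive set of elements that never take a closing tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Records closing tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, innermost last
        self.errors = []  # descriptions of each mismatch found

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br /> opens nothing, so ignore it.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif self.stack:
            self.errors.append(
                f"</{tag}> closed while <{self.stack[-1]}> was innermost"
            )
        else:
            self.errors.append(f"</{tag}> closed while no element was open")


def misnested_tags(html):
    """Return a list of nesting errors found in the given HTML fragment."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.errors
```

Calling `misnested_tags("<form>\n<p>\n</form>\n</p>")` reports the `</form>` that arrives while `<p>` is still open. A check like this only flags the symptom; it does not explain why pandoc emitted the tags in that order.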

The current source is a pretty simple test case and can be seen at https://raw.githubusercontent.com/ShrimpingIt/website_text/5a313aa6a6d82c53d35ad2cdfa2a57911ec464de/src/content/contribute.md

I've been trying to follow the guidance at http://pandoc.org/demo/example9/pandocs-markdown.html under the heading 'Raw HTML' but making sure matching tags are unindented at the start and end (the form tags in this case) still doesn't force pandoc to treat the lines cleanly as HTML.

The python script which runs the pandoc is https://github.com/ShrimpingIt/website_text/blob/bba7a83d39230b6051e2bfd23c3d6a4972486911/src/python/writeraw.py and the options for the invocation are therefore...

pandoc --from=markdown_github --to=html --standalone {inputpath}

nkalvi commented 9 years ago

Verbatim (code) blocks

Indented code blocks

A block of text indented four spaces (or one tab) is treated as verbatim text: that is, special characters do not trigger special formatting, and all spaces and line breaks are preserved.

In the example, the <input>s are indented with a single tab. Indenting them by less than the rule requires (e.g. with two spaces) would stop them from being treated as code blocks.
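As a rough illustration of the quoted rule, the sketch below tests only the leading whitespace of a single line. The function name `starts_indented_code` is hypothetical, and a real Markdown parser also takes context into account (preceding blank lines, list nesting, and whether the line sits inside a raw HTML block), so this is not how pandoc decides.

```python
def starts_indented_code(line):
    """True when the line begins with four spaces or one tab."""
    return line.startswith("    ") or line.startswith("\t")


# A tab-indented <input> matches the rule; a two-space indent does not.
tab_indented = '\t<input type="hidden" name="cmd" value="_s-xclick" />'
two_spaces = '  <input type="hidden" name="cmd" value="_s-xclick" />'
print(starts_indented_code(tab_indented), starts_indented_code(two_spaces))
```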

nkalvi commented 9 years ago

Or you could try with -f markdown_github-markdown_in_html_blocks if you're not using markdown inside your HTML blocks.

Extension: markdown_in_html_blocks

Standard markdown allows you to include HTML “blocks”: blocks of HTML between balanced tags that are separated from the surrounding text with blank lines, and start and end at the left margin. Within these blocks, everything is interpreted as HTML, not markdown; so (for example), * does not signify emphasis.

Pandoc behaves this way when the markdown_strict format is used; but by default, pandoc interprets material between HTML block tags as markdown.
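For a pipeline that calls pandoc from Python, the reader string with an extension switched off can be assembled as below. This is a sketch under assumptions: `pandoc_command` is a hypothetical helper, and the remaining options simply mirror the invocation quoted earlier in the thread. It relies on pandoc's format syntax, where `-extension` appended to a format name disables that extension.

```python
def pandoc_command(input_path, base_format="markdown_github", disabled_extensions=()):
    """Build an argument list for pandoc with the given extensions disabled."""
    # Appending "-name" to the format turns the named extension off.
    reader = base_format + "".join(f"-{ext}" for ext in disabled_extensions)
    return ["pandoc", f"--from={reader}", "--to=html", "--standalone", input_path]


# Example: disable markdown_in_html_blocks for a single input file.
print(pandoc_command("contribute.md", disabled_extensions=["markdown_in_html_blocks"]))
```

The resulting list can be handed to `subprocess.run` without shell quoting.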

cefn commented 9 years ago

Thanks for taking the time to get back to me.

However, I'm unable to satisfy pandoc by modifying the indentation as you suggest. Did you actually try this and succeed? For me it just creates an element interleaving bug. I'm currently running against https://raw.githubusercontent.com/ShrimpingIt/website_text/553d9009990076256809f5bac8553ace13b126ce/src/content/contribute.md

It's better to run the pandoc export using standard github options for all the source files, rather than having to have a special case where this form appears. For this reason I'm trying to find out how to satisfy pandoc's rule for detecting raw HTML, even if I have to change the HTML source to do that.

Even with the <input> and <img> tags indented with two spaces it creates unbalanced tags as I mentioned in my original post, which I believe is a bug. I don't know why it's not treating the left-aligned <form ...> and </form> tags as defining a raw HTML block. So the original source like this (with child elements indented by two spaces)...

<form style="float:right;" action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
  <input type="hidden" name="cmd" value="_s-xclick" />
  <input type="hidden" name="hosted_button_id" value="8Q7DPJ7Z5YGBN" />
  <input type="image" src="https://www.paypalobjects.com/en_US/GB/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal – The safer, easier way to pay online." />
  <img alt="" border="0" src="https://www.paypalobjects.com/en_GB/i/scr/pixel.gif" width="1" height="1" />
</form>

...gets turned into this, which surely must be a bug; note the badly nested closing 'form' and 'p' tags.

<form style="float:right;" action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">

<p><input type="hidden" name="cmd" value="_s-xclick" /><br /> <input type="hidden" name="hosted_button_id" value="8Q7DPJ7Z5YGBN" /><br /> <input type="image" src="https://www.paypalobjects.com/en_US/GB/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal – The safer, easier way to pay online." /><br /> <img alt="" border="0" src="https://www.paypalobjects.com/en_GB/i/scr/pixel.gif" width="1" height="1" /><br /></form></p>
<p>Practical support for the project can be offered by hitting the donation button, or <a href="/#kit">buying a kit</a>.</p>

If you really can't recreate this bug, then I need to go back to the drawing board with my pandoc configuration. Currently it reports...

pandoc --version
pandoc 1.12.2.1
Compiled with texmath 0.6.5.2, highlighting-kate 0.5.5.1.
Syntax highlighting is supported for the following languages:
    actionscript, ada, apache, asn1, asp, awk, bash, bibtex, boo, c, changelog,
    clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css, curry, d,
    diff, djangotemplate, doxygen, doxygenlua, dtd, eiffel, email, erlang,
    fortran, fsharp, gnuassembler, go, haskell, haxe, html, ini, java, javadoc,
    javascript, json, jsp, julia, latex, lex, literatecurry, literatehaskell,
    lua, makefile, mandoc, markdown, matlab, maxima, metafont, mips, modelines,
    modula2, modula3, monobasic, nasm, noweb, objectivec, objectivecpp, ocaml,
    octave, pascal, perl, php, pike, postscript, prolog, python, r,
    relaxngcompact, rhtml, roff, ruby, rust, scala, scheme, sci, sed, sgml, sql,
    sqlmysql, sqlpostgresql, tcl, texinfo, verilog, vhdl, xml, xorg, xslt, xul,
    yacc, yaml
Default user data directory: /home/cefn/.pandoc

cefn commented 9 years ago

Also added the option suggested so my command line is currently...

pandoc --from=markdown_github --to=html -f markdown_github-markdown_in_html_blocks --standalone {inputpath}

...and the problem with interleaved tags remains, as if pandoc doesn't know that <form ...> is HTML at all. Still outputs as...

<form style="float:right;" action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">

<p><input type="hidden" name="cmd" value="_s-xclick" /><br /> <input type="hidden" name="hosted_button_id" value="8Q7DPJ7Z5YGBN" /><br /> <input type="image" src="https://www.paypalobjects.com/en_US/GB/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal – The safer, easier way to pay online." /><br /> <img alt="" border="0" src="https://www.paypalobjects.com/en_GB/i/scr/pixel.gif" width="1" height="1" /><br /></form></p>

jgm commented 9 years ago

This does seem to me to be a bug. With either markdown-markdown_in_html_blocks or markdown_github, the whole form should be treated as a solid chunk of HTML (regardless of indentation).

jgm commented 9 years ago

Interesting data point:

% pandoc -f markdown_github
<form>
 *hi*
</form>
^D
<form>
 *hi*
</form>
% pandoc -f markdown_github
<form style="float:right;" action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
 *hi*
</form>
^D
<form style="float:right;" action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<p><em>hi</em><br />
</form></p>

So, somehow the attributes in the form tag are preventing pandoc from recognizing it as a block html tag. I have no idea why this would be, need to investigate further.

nkalvi commented 9 years ago

@cefn You're right - I was only viewing the output in the browser; sorry about that.

Just out of curiosity, I tested wrapping the form in <div> (with single tab indentation) and this is what it looks like with pandoc 2146div.md -f markdown_github-markdown_in_html_blocks -s -o 2146div.html:

<h2 id="purchases-and-donations">Purchases and Donations</h2>
<div>
<form style="float:right;" action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
    <input type="hidden" name="cmd" value="_s-xclick" />
    <input type="hidden" name="hosted_button_id" value="8Q7DPJ7Z5YGBN" />
    <input type="image" src="https://www.paypalobjects.com/en_US/GB/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal – The safer, easier way to pay online." />
    <img alt="" border="0" src="https://www.paypalobjects.com/en_GB/i/scr/pixel.gif" width="1" height="1" />
</form>
</div>
<p>Practical support for the project can be offered by hitting the donation button, or <a href="/#kit">buying a kit</a>.</p>

BTW, I tested it with pandoc 1.13.2.1.

jgm commented 9 years ago

Update: it is the action attribute, specifically, that causes the problem. Even weirder, it depends on the value of this attribute!

% pandoc -f markdown_github
<form action="foo">
 *div*
</form>
^D
<form action="foo">
 *div*
</form>
% pandoc -f markdown_github
<form action="https://www.paypal.com/cgi-bin/webscr">
 *hi*
</form>
^D
<form action="https://www.paypal.com/cgi-bin/webscr">
<p><em>hi</em><br /></form></p>

On further testing, it seems that the problem is triggered by a forward slash (/) in any attribute value. This must be something really dumb in the code, but now that we know the trigger it should be fairly easy to track down.

cefn commented 9 years ago

Yay, thanks for looking into it, and I found a real bug!

Happy to have contributed something to what's been a very valuable project for our work at @ShrimpingIt

We have a whole website authored in Haroopad markdown editor and filtered through Pandoc at http://start.shrimping.it ready for a big relaunch soon.

nkalvi commented 9 years ago

@jgm

Could it be due to the following in Readers/HTML.hs (called by `strictHtmlBlock = htmlInBalanced (not . isInlineTag)`)?

htmlInBalanced f = try $ do
  (TagOpen t _, tag) <- htmlTag f
  guard $ '/' `notElem` tag      -- not a self-closing tag
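To see why that guard would reject the PayPal form, here is a small Python model of the check. It is only a sketch: `faulty_is_not_self_closing` mirrors the quoted `notElem` test on the raw tag text, while `stricter_is_not_self_closing` is an assumed alternative that looks at the end of the tag only. The thread does not show the fix that was eventually committed, so the second function should not be read as pandoc's actual code.

```python
def faulty_is_not_self_closing(tag):
    """Mirror of the quoted guard: any '/' anywhere in the tag text fails it."""
    return "/" not in tag


def stricter_is_not_self_closing(tag):
    """Assumed alternative: only a '/' directly before the final '>' counts."""
    return not tag.rstrip().endswith("/>")


plain = '<form action="foo">'
with_url = '<form action="https://www.paypal.com/cgi-bin/webscr">'

# The faulty guard accepts the plain tag but rejects the one whose
# attribute value contains slashes, so the form is not seen as a block.
print(faulty_is_not_self_closing(plain), faulty_is_not_self_closing(with_url))
# The stricter check accepts both opening tags and still rejects <br />.
print(stricter_is_not_self_closing(with_url), stricter_is_not_self_closing("<br />"))
```

This matches the behaviour reported above: `action="foo"` parses as a raw HTML block, while any attribute value containing `/` does not.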

jgm commented 9 years ago

@nkalvi Your diagnosis was correct! I've just committed a fix. Now we get the expected raw HTML with markdown_github, and we get properly nested HTML tags with pandoc markdown.