Open rose00 opened 7 years ago
Which version are you using? I only had v1.16.0.2 around to test this and got the expected output.
It is reproducible.
% pandoc --version
pandoc 1.18
Compiled with pandoc-types 1.17.0.4, texmath 0.8.6.6, highlighting-kate 0.6.3
Default user data directory: /Users/jmacfarlane/.pandoc
Copyright (C) 2006-2016 John MacFarlane
Web: http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t html
Update A<sub>i</sub>: <code>Ai++</code>
It would be possible, with some extra work, to get output like A<sub>i</sub>: <code>A</code><sub><code>i</code></sub><code>++</code>; in fact, I think we do something like this in one of the readers, but I can't recall which.
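For illustration only, here is a minimal Python sketch of that "distribute the code tag over each text run" idea. It is not pandoc's implementation and not taken from any of its readers; the names CodeDistributor and distribute_code are hypothetical, and it assumes well-formed input and ignores any attributes on the code tag.

```python
from html import escape
from html.parser import HTMLParser


class CodeDistributor(HTMLParser):
    """Re-emit HTML with <code> pushed down onto each text run (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.code_depth = 0  # number of <code> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1  # swallow the tag; it is re-added around text
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "code":
            self.code_depth = max(0, self.code_depth - 1)
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = escape(data, quote=False)  # data arrives unescaped; re-escape it
        if self.code_depth > 0:
            text = f"<code>{text}</code>"
        self.out.append(text)


def distribute_code(html_fragment):
    parser = CodeDistributor()
    parser.feed(html_fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(distribute_code('Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>'))
# Update A<sub>i</sub>: <code>A</code><sub><code>i</code></sub><code>++</code>
```

Each resulting code span contains only plain text, so it can be represented without losing the subscript.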
My bad: I had a typo in the closing code tag and only looked for the subscript. Now I get it too.
This "wrinkle" in pandoc causes the largest number of divergences when translating the Java JLS and JVMS documents from HTML to pandoc markdown and back again to HTML.
(Added a suggestion for refactoring the IR of Code spans and CodeBlock blocks. Unfortunately, it would be a breaking change for those who read -t native output, but it would allow HTML to be represented more uniformly, and would tease apart verbatim-mode from code-mode, which are really separate concepts in separate layers of the system, even though they often appear together.)
I'm still having this same old trouble, but now I have a new idea for a fix, which does not require an AST change. When parsing HTML <code>X</code>, if X contains anything incompatible with plain text, then convert that code tag to raw HTML. (Perhaps gate this on whether raw_html is enabled.)
Specifically:
$ echo 'Update A<sub>i</sub>: <my.code>A<sub>i</sub>++</my.code>' | pandoc -f html+raw_html -t markdown
Update A~i~: `<my.code>`{=html}A~i~++`</my.code>`{=html}
$ echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html+raw_html -t markdown
Update A~i~: `<code>`{=html}A~i~++`</code>`{=html}
$ echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html+raw_html -t native
[Plain [Str "Update",Space,Str "A",Subscript [Str "i"],Str ":",Space,RawInline (Format "html") "<code>",Str "A",Subscript [Str "i"],Str "++",RawInline (Format "html") "</code>"]]
The first example shows how unknown tags are handled. The other examples show my new AST-preserving proposal: code tags with embedded markup should be treated as unknown HTML, because, frankly, Markdown doesn't know what to do with such code spans.
I know it's ugly and non-markdown-like, but it renders the HTML much more faithfully at that point, and that's what I need. The point here is to shove the hard-to-render information through the HTML reader into the AST, and not drop it on the floor. From there scripts can DTRT with it in an ad hoc manner, if further adjustment is needed.
The raw_html bit, when turned off, strips my.code tags as if they were never there; clearly, in that case the existing logic should apply, erasing the non-renderable parts of the code span as well.
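To make the proposed rule concrete, here is a minimal Python sketch of the decision only. It is an illustration under stated assumptions, not pandoc code: read_code_span and the "passthrough"/"code" labels are hypothetical names, and the raw_html gate is modeled as a plain boolean.

```python
from html.parser import HTMLParser


class _SpanScanner(HTMLParser):
    """Collect the text of a span and note whether it contains any tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []
        self.has_markup = False

    def handle_starttag(self, tag, attrs):
        self.has_markup = True  # also reached for self-closing tags

    def handle_data(self, data):
        self.text.append(data)


def read_code_span(inner_html, raw_html_enabled):
    """Decide how the interior of a <code> span should be read.

    Returns ("passthrough", inner_html) when the span has embedded markup
    and raw HTML is enabled: the caller would then emit the opening and
    closing code tags as raw HTML and parse the interior normally.
    Otherwise returns ("code", flattened_text), which is the existing
    behaviour of dropping interior markup.
    """
    scanner = _SpanScanner()
    scanner.feed(inner_html)
    scanner.close()
    if scanner.has_markup and raw_html_enabled:
        return ("passthrough", inner_html)
    return ("code", "".join(scanner.text))


print(read_code_span("A<sub>i</sub>++", True))   # ('passthrough', 'A<sub>i</sub>++')
print(read_code_span("A<sub>i</sub>++", False))  # ('code', 'Ai++')
```

Spans that contain only text and character references still take the ordinary code path, so existing documents without embedded markup would be unaffected.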
Here is proof that code tags, passed through the AST as raw HTML, come out the other end as useful HTML (this works today, in pandoc 2.14):
$ echo 'Update A~i~: `<code>`{=html}A~i~++`</code>`{=html}' | pandoc -f markdown -t html
<p>Update A<sub>i</sub>: <code>A<sub>i</sub>++</code></p>
As a partial workaround for this bug, I could preprocess my HTML files textually with sed, changing code tags to my.code tags whenever a code tag is not followed by its closing tag across a simple-enough span. But any line-oriented Unix tool is likely to mis-process spans which contain line breaks. I'm hoping that pandoc can be the proper HTML parser for this job, if only it can get over its preconceptions about what code tags mean in HTML!
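For what it's worth, the workaround can be sketched with a tokenizer instead of sed, which sidesteps the line-break problem. This is a rough Python sketch using only the standard library; CodeRenamer and rename_marked_up_code are hypothetical names, my.code is the placeholder tag from the examples above, and script or style contents are not treated specially.

```python
from html import escape
from html.parser import HTMLParser


class CodeRenamer(HTMLParser):
    """Rename <code> to <my.code> only where the span contains other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_codes = []  # stack of [index into self.out, saw_markup]

    def _note_markup(self):
        for entry in self.open_codes:
            entry[1] = True

    def handle_starttag(self, tag, attrs):
        self._note_markup()  # any tag counts as markup for enclosing spans
        if tag == "code":
            self.open_codes.append([len(self.out), False])
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._note_markup()
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "code" and self.open_codes:
            index, saw_markup = self.open_codes.pop()
            if saw_markup:
                raw = self.out[index]  # original start tag, attributes kept
                self.out[index] = "<my.code" + raw[len("<code"):]
                self.out.append("</my.code>")
                return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rename_marked_up_code(html_text):
    parser = CodeRenamer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The output could then be fed to pandoc -f html+raw_html, so that only the spans with embedded markup take the unknown-tag path.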
To be clear: I'm not asking for changes to the Markdown parser, just to the HTML parser. Nor is there a need to change the parsing of any HTML input other than code spans that have embedded markup, and only when raw HTML is requested. In those cases, since there is a way to pass the data through the AST, it should be passed.
OK, there's more: the handling of pre tags in the HTML parser is similar. When parsing <pre><em>hello</em></pre>, the HTML parser seems to make a unilateral decision to treat embedded markup as either stripped HTML or as literal pointy brackets. Neither is a correct parsing of HTML, I think.
$ echo '<pre><em>hello</em>,<hr>world!</pre>' | pandoc -f html -t native
[CodeBlock ("",[],[]) "hello,world!"]
$ echo '<code><pre><i>hello</i>,<hr>world!</pre></code>' | pandoc -f html+raw_html -t native
[Para [Code ("",[],[]) ""]
,CodeBlock ("",[],[]) "hello,world!"
,Para [RawInline (Format "html") "</code>"]]
$ : Who ordered that??
My browser renders such em and hr tags just fine from HTML.
OK, it's a mess. But I think there's a good way forward through the existing AST. Please consider?
In Readers/HTML.hs, tagToText is used to unconditionally flatten <pre> blocks into flat text. It might return Maybe Text instead of Text, and the sole caller, pCodeBlock, should then, when markup is found (a rare event), discard the list of parsed tags and re-parse as raw HTML, as sketched above.
I think the idea of being sensitive to raw_html and falling back to HTML when the code contains markup is a reasonable one. And note that this would not require an AST change: we already have the capacity to represent raw HTML bits in the AST.
When the HTML reader reads a <code> span, interior markup is dropped.
(A similar point goes for <pre> spans. The HTML parser is aggressive about throwing away legitimate HTML markup within these spans.)
Native Code elements cannot contain markup, because they are designed to carry what other wiki processors call <verbatim> content. Verbatim content does not have embedded markup, and therefore should be rendered with special quoting, for ease of reading and compactness. But none of that is logically tied to the font associated with the HTML <code> tag, nor even to the special handling of <pre> blocks in HTML.
I think this issue comes from a dual purpose for <code> spans: to turn on a certain kind of look (monospace type) and also to cause markdown to use a verbatim encoding syntax. The verbatim function is what seems to erase the interior markup.
I would think that the interior markdown nodes should clearly distinguish code spans from requests to use "verbatim".
Perhaps Code elements should be refactored to use a nested Format "text" when possible, but not when the HTML input has markup.
This would allow the reader to preserve internal markup inside of <code> spans, and use some other kind of output tactic, such as literal <code> markup instead of back-ticks, to express which markdown content is to look like <code>.
Perhaps the "verbatim" syntax should be produced by the markdown writer only if two conditions hold: 1. there is an ambient <code> attribute on the current span, and
Suggested output:
Or, maybe the back-tick syntax needs a way to turn on recursive markdown parsing, in its capacity as a translation for <code> (instead of verbatim).
Or, break up the <code> span, commuting the interior markup outside the code spans.
The theory could be that the <code> tag is a sort of verbatim super-quote (per MD usage) that distributes down to each character in the span. Because of this special role, it is reasonable to break up code spans if there are additional variations in markup inside the <code> span.
Suggested output:
Although it looks ugly to break up an input span, the break-up is required because of the verbatim function of back-tick. Perhaps when converting to HTML the writer should sew consecutive <code> spans into one.