jgm / pandoc

Universal markup converter
https://pandoc.org

<code> spans read from HTML erase superscript spans instead of splitting #3227

Open rose00 opened 7 years ago

rose00 commented 7 years ago

When the HTML reader reads a <code> span, interior markup is dropped.

% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t html
Update A<sub>i</sub>: <code>Ai++</code>
% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t markdown
Update A~i~: `Ai++`
% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t native
[Plain [Str "Update",Space,Str "A",Subscript [Str "i"],Str ":",Space,Code ("",[],[]) "Ai++"]]

(A similar point goes for <pre> spans. The HTML parser is aggressive about throwing away legitimate HTML markup within these spans.)

Native Code elements cannot contain markup, because they are designed to carry what other wiki processors call <verbatim> content. Verbatim content does not have embedded markup, and therefore should be rendered with special quoting, for ease of reading and compactness. But none of that is logically tied to the font associated with the HTML <code> element, nor even to the special handling of <pre> blocks in HTML.

I think this issue comes from a dual purpose for <code> spans, to turn on a certain kind of look (mono-space type) and also to cause markdown to use a verbatim encoding syntax. The verbatim function is what seems to erase the interior markup.

I would think that the interior markdown nodes should clearly distinguish code spans from requests to use "verbatim".

Perhaps Code elements should be refactored to use a nested Format "text" when possible, but not when the HTML input has markup.

-    | Code Attr String      -- ^ Inline code (literal)
+    | Code Attr [Inline]    -- ^ Inline code (list of inlines)

This would allow the reader to preserve internal markup inside of <code> spans, and use some other kind of output tactic, such as literal <code> markup instead of back-ticks, to express which markdown content is to look like <code>.

Perhaps the "verbatim" syntax should be produced by the markdown writer only if two conditions hold:

  1. there is an ambient <code> attribute on the current span, and
  2. the actual characters to output do not contain any further variations in style.

Suggested output:

% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t html
Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>
% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t markdown
Update A~i~: <code>A~i~++</code>
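
The two-condition rule above can be sketched on a toy model. This is only an illustration in Python, not pandoc code: the `Str` and `Sub` classes and the `write_code_span` function are hypothetical stand-ins for a `Code` element that carries a list of inlines, and backtick escaping is ignored.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Str:
    text: str


@dataclass
class Sub:
    children: List["Inline"]


Inline = Union[Str, Sub]


def render(inlines: List[Inline]) -> str:
    """Render toy inlines as pandoc-style markdown (subscripts as ~x~)."""
    out = []
    for node in inlines:
        if isinstance(node, Str):
            out.append(node.text)
        else:
            out.append("~" + render(node.children) + "~")
    return "".join(out)


def write_code_span(children: List[Inline]) -> str:
    """Use back-ticks only when the span holds plain text and nothing else."""
    if all(isinstance(c, Str) for c in children):
        # No further variation in style: verbatim syntax is safe.
        return "`" + "".join(c.text for c in children) + "`"
    # Interior markup present: fall back to literal <code> tags.
    return "<code>" + render(children) + "</code>"


print(write_code_span([Str("x++")]))
print(write_code_span([Str("A"), Sub([Str("i")]), Str("++")]))
```

The first call stays in back-tick form, while the second keeps the subscript and falls back to literal `<code>` tags.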

Or, maybe the back-tick syntax needs a way to turn on recursive markdown parsing, in its capacity as a translation for <code> (instead of verbatim).

% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t markdown
Update A~i~: ` A~i~++ `

Or, break up the <code> span, commuting the interior markup outside the code spans.

The theory could be that the <code> tag is a sort of verbatim super-quote (per MD usage) that distributes down to each character in the span. Because of this special role, it is reasonable to break up code spans if there are additional variations in markup inside the <code> span.

Suggested output:

% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t html
Update A<sub>i</sub>: <code>A</code><sub><code>i</code></sub><code>++</code>
% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t markdown
Update A~i~: `A`~`i`~`++`

Although it looks ugly to break up an input span, the break-up is required because of the verbatim function of back-tick. Perhaps when converting to HTML the writer should sew consecutive <code> spans into one.
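
A minimal sketch of the break-up and the later sewing step, on a toy AST of `(kind, payload)` tuples. The representation and the function names `distribute_code`, `merge_adjacent_code` and `to_markdown` are assumptions made for illustration; none of this is pandoc's actual code.

```python
def distribute_code(children):
    """Push the code-ness of a <code> span down onto each text leaf."""
    out = []
    for kind, payload in children:
        if kind in ("str", "code"):
            out.append(("code", payload))
        else:
            # Markup such as "sub" moves outside; its contents become code.
            out.append((kind, distribute_code(payload)))
    return out


def merge_adjacent_code(inlines):
    """Sew consecutive code spans back into one, as a writer might do."""
    out = []
    for kind, payload in inlines:
        if kind == "code":
            if out and out[-1][0] == "code":
                out[-1] = ("code", out[-1][1] + payload)
            else:
                out.append((kind, payload))
        elif kind == "str":
            out.append((kind, payload))
        else:
            out.append((kind, merge_adjacent_code(payload)))
    return out


def to_markdown(inlines):
    """Render the toy AST; only str, code and sub are modelled."""
    parts = []
    for kind, payload in inlines:
        if kind == "code":
            parts.append("`" + payload + "`")
        elif kind == "sub":
            parts.append("~" + to_markdown(payload) + "~")
        else:
            parts.append(payload)
    return "".join(parts)


span = [("str", "A"), ("sub", [("str", "i")]), ("str", "++")]
print(to_markdown(distribute_code(span)))  # `A`~`i`~`++`
```

In this example no two code spans end up adjacent, so `merge_adjacent_code` leaves the result unchanged; it only matters for inputs where the split produces neighbouring code nodes.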

tarleb commented 7 years ago

Which version are you using? I only had v1.16.0.2 around to test this and got the expected output.

jgm commented 7 years ago

It is reproducible.

% pandoc --version
pandoc 1.18
Compiled with pandoc-types 1.17.0.4, texmath 0.8.6.6, highlighting-kate 0.6.3
Default user data directory: /Users/jmacfarlane/.pandoc
Copyright (C) 2006-2016 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
% echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html -t html
Update A<sub>i</sub>: <code>Ai++</code>

It would be possible to get output like A<sub>i</sub>: <code>A</code><sub><code>i</code></sub><code>++</code> with some extra work; in fact, I think we do something like this in one of the readers, but I can't recall which.

tarleb commented 7 years ago

My bad: had a typo in closing code tag and only looked for the subscript. Now I get it too.

rose00 commented 7 years ago

This "wrinkle" in pandoc causes the largest number of divergences when translating the Java JLS and JVMS documents from HTML to pandoc markdown and back again to HTML.

(Added suggestion for refactoring the IR of Code spans and CodeBlock blocks. It would be a breaking change to those who read -t native, unfortunately, but would allow HTML to be represented more uniformly, and tease apart verbatim-mode from code-mode, which are really separate concepts in separate layers of the system, even though they often appear together.)

rose00 commented 3 years ago

I'm still having this same old trouble, but now I have a new idea for a fix, which does not require an AST change. When parsing HTML <code>X</code>, if X has anything incompatible with plain text, then convert that code tag to raw HTML. (Perhaps gate this on whether raw_html is enabled.)

Specifically:

$ echo 'Update A<sub>i</sub>: <my.code>A<sub>i</sub>++</my.code>' | pandoc -f html+raw_html -t markdown
Update A~i~: `<my.code>`{=html}A~i~++`</my.code>`{=html}

$ echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html+raw_html -t markdown
Update A~i~: `<code>`{=html}A~i~++`</code>`{=html}

$ echo 'Update A<sub>i</sub>: <code>A<sub>i</sub>++</code>' | pandoc -f html+raw_html -t native
[Plain [Str "Update",Space,Str "A",Subscript [Str "i"],Str ":",Space,RawInline (Format "html") "<code>",Str "A",Subscript [Str "i"],Str "++",RawInline (Format "html") "</code>"]]

The first line shows how unknown tags are handled. The other lines show my new AST-preserving proposal, that code tags with embedded markup should be treated as unknown. Because, frankly, Markdown doesn't know what to do with such code spans, so they should be treated as unknown HTML.

I know it's ugly and non-markdown-like, but it renders the HTML much more faithfully at that point, and that's what I need. The point here is to shove the hard-to-render information through the HTML reader into the AST, and not drop it on the floor. From there scripts can DTRT with it in an ad hoc manner, if further adjustment is needed.

The raw_html bit, when turned off, strips my.code tags as if they were never there, and clearly in that case the existing logic should apply, erasing the non-renderable parts of the code span as well.
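
The decision described above can be sketched on a toy AST of `(kind, payload)` tuples. This is an illustration only: `read_code_span` and the tuple representation are hypothetical, and the real reader works on parsed tags in Haskell.

```python
def flatten(nodes):
    """Drop all markup and keep only the text, as the reader does today."""
    return "".join(p if k == "str" else flatten(p) for k, p in nodes)


def read_code_span(children, raw_html_enabled):
    """children: the already-parsed contents of one <code> element."""
    has_markup = any(kind != "str" for kind, _ in children)
    if has_markup and raw_html_enabled:
        # Treat the tags like unknown HTML and keep the interior inlines.
        return ([("raw_html", "<code>")]
                + list(children)
                + [("raw_html", "</code>")])
    # Plain contents, or raw_html disabled: existing behaviour.
    return [("code", flatten(children))]


span = [("str", "A"), ("sub", [("str", "i")]), ("str", "++")]
print(read_code_span(span, raw_html_enabled=True))
print(read_code_span(span, raw_html_enabled=False))  # [('code', 'Ai++')]
```

With the extension enabled the subscript survives between two raw HTML nodes, mirroring the native output shown earlier; with it disabled the span collapses to `Ai++` as it does now.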

Here is proof that code tags, passed via AST as raw HTML, look useful coming out the other end as HTML (this works today, in pandoc 2.14):

echo 'Update A~i~: `<code>`{=html}A~i~++`</code>`{=html}' | pandoc -f markdown -t html
<p>Update A<sub>i</sub>: <code>A<sub>i</sub>++</code></p>

As a partial workaround for this bug, I could preprocess my HTML files, textually with sed, to change code tags to my.code tags, when they fail to match their closing tags over simple-enough spans. But any Unix-oriented tool is likely to mis-process spans which contain line breaks. I'm hoping that pandoc can be the proper HTML parser for this job, if only it can get over its preconceptions about what code tags mean in HTML!
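
For reference, a stopgap along these lines could be written with a regular expression that spans line breaks. This is a rough sketch in Python, not a real HTML parser: the function name `rename_marked_up_code` is made up here, and it assumes `<code>` tags carry no attributes and are never nested.

```python
import re

# DOTALL lets "." match newlines, so spans broken across lines still match.
CODE_SPAN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def rename_marked_up_code(html, new_tag="my.code"):
    """Rename <code> to new_tag only when the span contains further tags."""
    def repl(match):
        inner = match.group(1)
        if "<" in inner:
            return "<%s>%s</%s>" % (new_tag, inner, new_tag)
        return match.group(0)  # plain spans are left alone
    return CODE_SPAN.sub(repl, html)


print(rename_marked_up_code("<code>A<sub>i</sub>\n++</code> and <code>x++</code>"))
```

Only the first span is renamed; the plain `<code>x++</code>` passes through untouched so pandoc can still read it as a verbatim span.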

To be clear: I'm not asking for parsing changes to the Markdown parser, just HTML parser. Nor is there a need for changes to the parsing of any HTML inputs other than code spans that have embedded markup, and only when raw html is requested. In those cases, since there is a way to pass the data through the AST, it should be passed.

OK, there's more too: The handling of pre tags is similar, in the HTML parser. When parsing <pre><em>hello</em></pre>, the HTML parser seems to make a unilateral decision to treat embedded markup as either stripped HTML or as literal pointy brackets. Neither is a correct parsing of HTML, I think.

$ echo '<pre><em>hello</em>,<hr>world!</pre>' | pandoc -f html -t native
[CodeBlock ("",[],[]) "hello,world!"]

$ echo '<code><pre><i>hello</i>,<hr>world!</pre></code>' | pandoc -f html+raw_html -t native
[Para [Code ("",[],[]) ""]
,CodeBlock ("",[],[]) "hello,world!"
,Para [RawInline (Format "html") "</code>"]]

$ : Who ordered that??

My browser renders such em and hr tags just fine from HTML.

OK, it's a mess. But I think there's a good way forward through the existing AST. Please consider?

rose00 commented 3 years ago

In Readers/HTML.hs, tagToText is used to unconditionally flatten <pre> blocks into flat text. It might return Maybe Text instead of Text, and the sole caller, pCodeBlock, should discard the list of parsed tags (a rare event) and re-parse as raw HTML, as sketched above.
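
The shape of that change, modelled loosely in Python with `Optional[str]` standing in for `Maybe Text`. The token format and the names `tags_to_text` and `parse_pre` are assumptions for illustration; they do not reproduce the actual signatures in Readers/HTML.hs, and entity unescaping is left out.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # ("text", "hello") or ("tag", "<em>")


def tags_to_text(tokens: List[Token]) -> Optional[str]:
    """Return the flat text, or None if any markup would be lost."""
    parts = []
    for kind, value in tokens:
        if kind != "text":
            return None
        parts.append(value)
    return "".join(parts)


def parse_pre(tokens: List[Token], raw_html_enabled: bool):
    """Toy counterpart of the caller: fall back to raw HTML on None."""
    text = tags_to_text(tokens)
    if text is not None:
        return ("code_block", text)
    if raw_html_enabled:
        raw = "<pre>" + "".join(value for _, value in tokens) + "</pre>"
        return ("raw_block_html", raw)
    # raw_html disabled: keep the existing lossy flattening.
    return ("code_block", "".join(v for k, v in tokens if k == "text"))


tokens = [("tag", "<em>"), ("text", "hello"), ("tag", "</em>")]
print(parse_pre(tokens, raw_html_enabled=True))
```

The common case (no interior tags) still yields an ordinary code block, so only the rare marked-up `<pre>` pays for the re-parse.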

jgm commented 3 years ago

I think the idea of being sensitive to raw_html and falling back to HTML when the code contains markup is a reasonable one. And note, that would not require an AST change. We already have the capacity to represent raw HTML bits in the AST.