jgm / pandoc

Universal markup converter
https://pandoc.org

Definition lists: HTML comment edge case #7778

Open xrat opened 2 years ago

xrat commented 2 years ago

In the following minimal example, the HTML comment <!-- (…) --> is not recognized as a comment:

Term
: Def
<!--
: comment def
-->

Pandoc v2.16.2, invoked as pandoc --from markdown --to html5, produces:

<dl>
<dt>Term</dt>
<dd>Def &lt;!–
</dd>
<dd>comment def –&gt;
</dd>
</dl>

Adding --strip-comments produces the same output as above. A workaround is to put any character in front of the : inside the comment. In other words, a comment line that starts with : (or ~) is what triggers the bug.
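
To find the lines that need this workaround in a larger document, a small checker along the following lines might help. It is only a sketch: `risky_comment_lines` is a hypothetical helper, not part of pandoc, and it assumes that a definition marker is `:` or `~` indented by at most three spaces and followed by whitespace. It also treats every `<!--` as the start of a comment, so comment delimiters inside code blocks will confuse it.

```python
import re

# Assumption: ':' or '~', indented by at most three spaces and followed by
# whitespace, looks like a definition marker. This mirrors the examples in
# this thread; it is not taken from pandoc's grammar.
MARKER = re.compile(r"^ {0,3}[:~][ \t]")


def risky_comment_lines(text):
    """Return 1-based numbers of lines that start inside an HTML comment
    and begin with something that looks like a definition marker."""
    risky = []
    in_comment = False
    for number, line in enumerate(text.splitlines(), start=1):
        # Check the line using the comment state at the start of the line.
        if in_comment and MARKER.match(line):
            risky.append(number)
        # Then scan the line for comment delimiters to update the state.
        pos = 0
        while True:
            token = "-->" if in_comment else "<!--"
            found = line.find(token, pos)
            if found == -1:
                break
            in_comment = not in_comment
            pos = found + len(token)
    return risky


reported = "Term\n: Def\n<!--\n: comment def\n-->\n"
print(risky_comment_lines(reported))  # [4]
```

Each reported line is one where a character would have to be placed in front of the marker.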

mb21 commented 2 years ago

For what it's worth, the following works:

Term
: Def
<!-- : comment def
-->

Not sure this is a bug... just an edge case in pandoc's Markdown syntax...

xrat commented 2 years ago

@mb21 your suggested workaround is not feasible for larger comments because any line starting with : triggers the bug. IMHO this warrants the bug label.

tarleb commented 2 years ago

FWIW, multimarkdown and kramdown give the same result as pandoc.

Pandoc's CommonMark parser, with the definition_lists extension enabled, behaves more like you expect; try with --from=commonmark_x.

jgm commented 2 years ago

This happens because the way the def list parser works is by gobbling up raw lines comprising the definition, and then parsing after the fact. This method isn't sophisticated enough to skip a multiline HTML comment. Here's another case worth considering.

Term
: Def
test <!--
: comment def
and -->

Note that this case is parsed by the commonmark+definition_lists parser as

<dl>
<dt>Term</dt>
<dd>Def test &lt;!–
</dd>
<dd>comment def and –&gt;
</dd>
</dl>

which is correct given the commonmark principle that block-level structure takes precedence over inline-level structure.
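
The line-gobbling behaviour described above can be illustrated with a toy model. This is a minimal sketch, not pandoc's implementation: `split_definitions` is a hypothetical function, and the marker rule (`:` or `~`, indented by at most three spaces, followed by whitespace) is an assumption based on the examples in this thread. The point is only that the split into definitions happens on raw lines, before anything looks for comments or code spans.

```python
import re

# Assumption: ':' or '~', indented by at most three spaces and followed by
# whitespace, starts a new definition. Hypothetical simplification.
MARKER = re.compile(r"^ {0,3}[:~][ \t]+")


def split_definitions(lines):
    """Group raw lines into definitions before any inline parsing."""
    definitions = []
    for line in lines:
        match = MARKER.match(line)
        if match:
            # A marker line always opens a new definition, even if an HTML
            # comment or code span opened on an earlier line is unclosed.
            definitions.append([line[match.end():]])
        elif definitions:
            # Any other line continues the current definition.
            definitions[-1].append(line.strip())
        # Lines before the first marker (the term) are skipped here.
    return [" ".join(parts) for parts in definitions]


raw = ["Term", ": Def", "<!--", ": comment def", "-->"]
print(split_definitions(raw))  # ['Def <!--', 'comment def -->']
```

By the time each definition is handed to an inline parser, the comment has already been cut in two, so neither half contains a complete <!-- (…) --> pair.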

gpoore commented 2 years ago

This doesn't just happen with HTML comments. It can also occur with other inline elements like inline code. The output makes sense based on the parsing algorithm, but it is somewhat surprising from a user perspective to discover that the validity of inline code depends on line break locations.

Here's an example with inline code:

Term
: Def
`code
: comment def
more code`

Pandoc Markdown produces this:

<dl>
<dt>Term</dt>
<dd>Def `code
</dd>
<dd>comment def more code`
</dd>
</dl>

And CommonMark (-f commonmark+definition_lists) gives the same thing. Simply removing the line break before : comment results in valid inline code.

jgm commented 2 years ago

If you indent your definitions properly, you're less likely to run into problems like this:

Term
:   Def
    `code
:   comment def
    more code`

versus

Term
:   Def
    `code
    :   comment def
    more code`

xrat commented 2 years ago

I am very thankful for Pandoc and its contributors, so please excuse my asking for clarification: why is it acceptable in this case that an HTML comment of the form <!-- (...) --> is not parsed as a comment by Pandoc's Markdown? I do not know of any other such case.

jgm commented 2 years ago

I don't think the issue should have been closed.

fumiyas commented 2 years ago

Same here with list:

* foo
<!--
* bar
-->
* baz
   * baz
<!--
* qux
-->
* end

$ pandoc --from markdown --to html5 list-with-comment.md
<ul>
<li>foo <!--
* bar
--></li>
<li>baz
<ul>
<li>baz &lt;!–</li>
</ul></li>
<li>qux –&gt;</li>
<li>end</li>
</ul>