jgm / pandoc

Universal markup converter
https://pandoc.org

Fenced code block as a list item #9865

Open jsx97 opened 3 months ago

jsx97 commented 3 months ago

A bug report or maybe a request for improvement.

Sometimes it is necessary to have a fenced code block as a list item. As I have discovered, the proper syntax for this is not very intuitive.

pandoc input.md -o output.htm
- example 1
- ```
  list item two
  list item two

In Example 1, each line inside the pre block is indented with two spaces, whereas I expected the lines not to be indented.

-   example 2
-   ```
    list item two

    list item two

Example 2 works fine, but only if the two lines inside the pre block are separated by an empty line. If there is no empty line between them, the markup will be <li><code>list item two list item two</code></li>.

- example 3
- ```
list item two
list item two

Example 3 demonstrates the syntax that works fine. Though we can use it, I would prefer the syntax from Example 1.

jgm commented 3 months ago

Hm. I can reproduce this. It's definitely not intended, and you won't get that behavior with -f commonmark or -f gfm.

% pandoc -t native
- example 1
- ```
  list item two
  list item two
  ```
- list item three

[ BulletList [ [ Plain [ Str "example" , Space , Str "1" ] ] , [ CodeBlock ( "" , [] , [] ) " list item two\n list item two" ] , [ Plain [ Str "list" , Space , Str "item" , Space , Str "three" ] ] ] ]


A bug I would say.

jgm commented 3 months ago

Even more minimal case:

% pandoc
- ```
  item

^D

jgm commented 3 months ago

The problem lies with

listLineCommon :: PandocMonad m => MarkdownParser m Text
listLineCommon = T.concat <$> manyTill
              (  many1Char (satisfy $ \c -> c `notElem` ['\n', '<', '`'])
             <|> fmap snd (withRaw code)
             <|> fmap (renderTags . (:[]) . fst) (htmlTag isCommentTag)
             <|> countChar 1 anyChar
              ) newline

Originally this function was just grabbing the first literal text line of the list item (whose raw contents would be reparsed later). But special handling was added for inline code and HTML comment tags (likely for good reasons which we can look up). Note that ``` can delimit inline code as well as code blocks. So, this function is gobbling the whole thing, instead of just the first line. And because of this, the code that would have removed the extra indentation doesn't get triggered (that's in listLine).
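To make the gobbling concrete, here is a toy reader written against plain `String` and `base` only. It is not pandoc's parser: `rawFirstLine` and `closeSpan` are hypothetical names, and the rule "a run of backticks closes at the next run of the same length" is a simplifying assumption about inline code. It only illustrates how treating backtick spans as atomic lets the "first line" swallow a fenced block.

```haskell
-- | Hypothetical stand-in for a "raw first line" reader. It consumes input
-- up to and including the first newline, except that a backtick-delimited
-- span is consumed as one unit, newlines included.
rawFirstLine :: String -> (String, String)
rawFirstLine [] = ([], [])
rawFirstLine ('\n' : rest) = ("\n", rest)
rawFirstLine s@('`' : _) =
  let (ticks, afterOpen) = span (== '`') s
  in case closeSpan ticks afterOpen of
       -- A closing run exists: swallow the whole span, then keep reading.
       Just (body, afterClose) ->
         let (line, rest) = rawFirstLine afterClose
         in (ticks ++ body ++ ticks ++ line, rest)
       -- No closing run: the backticks are ordinary characters.
       Nothing ->
         let (line, rest) = rawFirstLine afterOpen
         in (ticks ++ line, rest)
rawFirstLine (c : cs) =
  let (line, rest) = rawFirstLine cs
  in (c : line, rest)

-- | Find the next run of backticks with the same length as the opener,
-- returning the text before it and the text after it.
closeSpan :: String -> String -> Maybe (String, String)
closeSpan ticks = go
  where
    go [] = Nothing
    go s@('`' : _) =
      let (run, rest) = span (== '`') s
      in if run == ticks
           then Just ([], rest)
           else prepend run <$> go rest
    go (c : cs) = prepend [c] <$> go cs
    prepend pre (body, rest) = (pre ++ body, rest)
```

In this sketch, `rawFirstLine "plain text\nmore\n"` stops at the first newline, while an input that begins with three backticks, a newline, an indented line, and a closing run of three backticks is returned as a single "line" with its inner indentation untouched.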

Code is a bit of a mess here -- I need to revisit some things, but I'm recording this diagnosis here for when I have a chance to do that.

Ref #5628

jgm commented 3 months ago

The story begins 15 years ago, with this commit: https://github.com/jgm/pandoc/commit/eb2e560d861387414fe03056189f32e54e83851b

That was meant to deal with cases like the following:

- a <!--

- b

-->
- c

That is still a case pandoc handles nicely (whereas commonmark doesn't recognize the HTML comment in this kind of context).

But the cost of dealing with this case was that, in consuming raw content for the list item, we needed to gobble material inside HTML comments. Fine! For many years we did that. But then someone came up with a case like

- a `<!--`
- b `-->`

in which the special characters are quoted in inline code. Well, clearly our "raw line" parser needs to gobble up inline code sections, too. And that's all fine until we have a case like yours. Note that

```
abc
```

would be perfectly valid inline code (were it not parsed first as a code block). So the raw list item parser gobbles up this whole chunk, avoiding the line-by-line reading that strips leading indentation.
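For contrast, the line-by-line reading that strips leading indentation can be sketched as follows. This is an illustration under assumptions, not the code in `listLine`: `stripIndent` and `itemLines` are hypothetical helpers, and the list item's content column is passed in as a plain width.

```haskell
-- | Drop up to @n@ leading spaces from a line, stopping at the first
-- non-space character.
stripIndent :: Int -> String -> String
stripIndent n line
  | n > 0, (' ' : rest) <- line = stripIndent (n - 1) rest
  | otherwise = line

-- | Hypothetical line-by-line reader for one list item's raw text: the
-- first line (already past the list marker) is kept as is, and every
-- continuation line loses up to @width@ columns of indentation.
itemLines :: Int -> String -> [String]
itemLines width raw =
  case lines raw of
    [] -> []
    (firstLine : continuation) ->
      firstLine : map (stripIndent width) continuation
```

With a width of 2, the continuation lines of the fenced block come out flush left, which is what Example 1 expects. When the whole chunk is consumed as one inline-code span instead, this per-line step never runs and the two spaces survive into the code block.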

What a mess!

In this case we could add an additional band-aid to the current pile of band-aids, probably. But I'm tempted to think that this was all a mistake, and that the way to sanity is the approach we took with commonmark, which just makes it very clear that indicators of block structure take precedence over inline parsing, and renders the first example above as

<ul>
<li>
<p>a &lt;!--</p>
</li>
<li>
<p>b</p>
</li>
</ul>
<p>--&gt;</p>
<ul>
<li>c</li>
</ul>

So, I'm tempted to take out all the special-purpose code instead of adding something else that will probably break in some new way in the future...
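A rough sketch of the "block structure first" idea, for illustration only: each line's block role is decided from its first characters, before anything inside the line is looked at. The `Block` type and `blocks` function are hypothetical, and the sketch ignores continuation lines, nesting, and everything else a real CommonMark block parser handles.

```haskell
import Data.List (stripPrefix)

-- | Toy block types: a bullet item or a paragraph, each holding raw text
-- that an inline parser would process later.
data Block = Item String | Para String
  deriving (Eq, Show)

-- | Toy block pass: classify every non-blank line by its prefix alone.
-- An unclosed "<!--" or backtick inside a line cannot change the result,
-- because inline content is never inspected at this stage.
blocks :: String -> [Block]
blocks = map classify . filter (not . null) . lines
  where
    classify l = case stripPrefix "- " l of
      Just body -> Item body
      Nothing   -> Para l
```

Applied to the HTML-comment example, this yields items for `a <!--` and `b`, a paragraph for `-->`, and an item for `c`, which lines up with the structure of the commonmark output shown above.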

jgm commented 3 months ago

See also #7778 for another related case.