jgm / pandoc

Universal markup converter
https://pandoc.org

Pandoc adding new line leads to non-existing unnumbered list in Commonmark output #5597

Open Anders-E opened 5 years ago

Anders-E commented 5 years ago

Overview

I stumbled upon this while converting HTML to Markdown using pandoc. When pandoc wraps long lines of text by inserting newlines, a wrapped line can end up starting with a number followed by a period.

This in turn means that the output contains a list item where the input does not.

Reproduction

Pandoc Version

pandoc 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1

Command Line Used

pandoc -o doc.md doc.html

(tried it with all available Markdown formats and they all produce the same error)

Input used (doc.html)

<p>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 1. aaaaaaaaaaaaaaaaaaaaaa</p>

Output received (doc.md)

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1. aaaaaaaaaaaaaaaaaaaaaa

(Notice the numbered list element)

Expected output

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 1. aaaaaaaaaaaaaaaaaaaaaa

(or the same containing a new line which does not result in a numbered list)
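
The effect can be illustrated without pandoc. The following is a minimal sketch that uses Python's textwrap module as a stand-in for pandoc's wrapping (an assumption: pandoc's own wrapping code is different), applied to the input above. The LIST_MARKER pattern and the helper names are hypothetical and deliberately broad; they only flag lines that look like a list item.

```python
import re
import textwrap

# Hypothetical, simplified pattern: a bullet (-, +, *) or an ordered
# marker (digits followed by . or )) and then whitespace.
LIST_MARKER = re.compile(r"^(?:[-+*]|\d{1,9}[.)])\s")


def wrapped_lines(text, width=72):
    """Wrap text at `width` columns, breaking only at whitespace."""
    return textwrap.wrap(
        text, width=width, break_long_words=False, break_on_hyphens=False
    )


def accidental_list_lines(lines):
    """Return continuation lines that now begin with a list marker."""
    return [line for line in lines[1:] if LIST_MARKER.match(line)]


paragraph = "a" * 72 + " 1. " + "a" * 22
lines = wrapped_lines(paragraph)
for line in accidental_list_lines(lines):
    print("looks like a list item:", line)
```

With the 72-column default, the break falls right before "1.", which is the situation shown in doc.md.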

mb21 commented 5 years ago

Well, in commonmark and pandoc markdown, the list item needs to be preceded by an empty line. So I think that's not a bug.

You can use the --wrap=none option to get your expected result.

Anders-E commented 5 years ago

It seems that's correct, thank you for pointing it out.

However, if you replace the 1. in my input example with -, a bullet list is output, since CommonMark does not require a blank line before bullet lists.

Would this constitute a bug, or should one use --wrap=none to keep these lists from popping up?

mb21 commented 5 years ago

As of now, in pandoc markdown you need the blank line even for bullet lists (this will change at some point in the future).

But indeed, this is even a bug in current commonmark output:

echo '<p>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - aaaaaaaaaaaaaaaaaaaaaaa</p>' | pandoc -f html -t commonmark

Anders-E commented 5 years ago

I tried it with * and + as well. * gets escaped correctly but + leads to the same bug as -.

Also thank you for the very quick replies!

jgm commented 5 years ago

As a workaround you can do --wrap=none

jgm commented 5 years ago

I just tried latest cmark and its commonmark renderer properly escapes these cases. This was unexpected, because pandoc uses libcmark (or rather the amplified version maintained by GitHub) to render commonmark! It should behave the same.

Probably upstream cmark has some changes that aren't yet in GitHub's cmark fork, or perhaps they are but the cmark-gfm package doesn't contain the latest?

jgm commented 5 years ago

I see this commit which is part of the 0.29 release of cmark:

commit 6122d5cc3c5e5e8f94f203daddfd38a36be7aed4
Author: John MacFarlane <jgm@berkeley.edu>
Date:   Sat Apr 6 10:20:02 2019 -0700

    commonmark renderer: improve escaping.

    URL-escape special characters when escape mode is URL,
    and not otherwise.

    Entity-escape control characters (< 0x20) in non-literal
    escape modes.

Looks like these changes are in cmark-gfm 0.2, though, so I'm still not understanding why pandoc isn't working... (EDIT: These changes don't seem relevant to list bullets.)

jgm commented 5 years ago

Hm.

*Text.Pandoc.CSV CMarkGFM> nodeToCommonmark [] (Just 72) $ commonmarkToNode [] [] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n1\\. aaaaaa\n"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n1. aaaaaa\n"

So the escaping isn't done properly in the cmark-gfm Haskell library. Yet if I compile the cmark-gfm C library and run the executable, it is done properly.

jgm commented 5 years ago

Interesting.

% pandoc -f commonmark -t commonmark --wrap=preserve
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1\. aaaaaaaaaa 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1\. aaaaaaaaaa
% pandoc -f commonmark -t commonmark --wrap=auto
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1\. aaaaaaaaaa 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1. aaaaaaaaaa

So, with --wrap=preserve it works fine but with --wrap=auto it fails to escape properly. I can duplicate this using the cmark executable from the C library:

% ./build/src/cmark-gfm -t commonmark --width 0
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1\. aaaaaaaaaa 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1\. aaaaaaaaaa
% ./build/src/cmark-gfm -t commonmark --width 72
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1\. aaaaaaaaaa 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1. aaaaaaaaaa

So this really is a problem in the cmark library, not pandoc itself.
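
The transcripts above suggest that the escaping decision has to be made after the wrap points are known. A minimal sketch of that ordering follows, assuming Python's textwrap as a stand-in for the renderer's wrapping; escape_line_start and wrap_and_escape are hypothetical helpers, not cmark's or pandoc's actual code, and the marker pattern is a simplification of the CommonMark rules.

```python
import re
import textwrap

# Hypothetical, simplified list-marker pattern: a bullet or an ordered
# marker at the start of a line, followed by whitespace or end of line.
LIST_MARKER = re.compile(r"^([-+*]|\d{1,9}[.)])(?=\s|$)")


def escape_line_start(line):
    """Backslash-escape a leading list marker so it stays literal text."""
    match = LIST_MARKER.match(line)
    if match is None:
        return line
    marker = match.group(1)
    rest = line[match.end():]
    if marker[0].isdigit():
        # Ordered marker: escape the delimiter, e.g. "1." becomes "1\."
        return marker[:-1] + "\\" + marker[-1] + rest
    # Bullet marker: escape the bullet itself, e.g. "-" becomes "\-"
    return "\\" + line


def wrap_and_escape(text, width=72):
    """Wrap first, then escape whatever ended up at a line start."""
    lines = textwrap.wrap(
        text, width=width, break_long_words=False, break_on_hyphens=False
    )
    return "\n".join(escape_line_start(line) for line in lines)


print(wrap_and_escape("a" * 72 + " 1. " + "a" * 10))
```

Escaping before wrapping cannot work in this sketch, because whether "1." needs a backslash depends on where the line break lands.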

lewer commented 1 year ago

I am not sure it's the same issue, but I've noticed that escaped characters disappear after line breaks:

% printf "foo  \n\- bar" | pandoc -f commonmark_x -t commonmark_x
foo  
- bar
% printf "foo  \n1\. bar" | pandoc -f commonmark_x -t commonmark_x
foo  
1. bar

This is annoying because it creates a list.

Instead I would expect:

% printf "foo  \n\- bar" | pandoc -f commonmark_x -t commonmark_x
foo  
\- bar

jgm commented 1 year ago

Escapes are not represented in the AST, so they will not round-trip.
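
A toy illustration of that point, assuming a drastically simplified "AST" that is just a tagged tuple (parse_text is a hypothetical helper, not pandoc's reader, and it treats its input purely as inline text): once the backslash is consumed during parsing, the escaped and unescaped sources are indistinguishable, so the writer has to decide on its own whether an escape is needed.

```python
import re
import string

# Backslash followed by an ASCII punctuation character is an escape.
ESCAPE = re.compile(r"\\([" + re.escape(string.punctuation) + r"])")


def parse_text(source):
    """Return a toy 'AST' node holding the text with escapes resolved."""
    return ("Str", ESCAPE.sub(r"\1", source))


escaped = parse_text(r"1\. bar")
plain = parse_text("1. bar")

# Both sources yield the same node; the backslash is gone.
print(escaped == plain)
```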