jgm / commonmark-hs

Pure Haskell commonmark parsing library, designed to be flexible and extensible

gfm parsing oddity with links and raw HTML #147

Closed: jgm closed this issue 7 months ago

jgm commented 7 months ago

Note: this may only affect platforms with CR+LF line endings.

Discussed in https://github.com/jgm/pandoc/discussions/9406

Originally posted by **TripleCamera** February 3, 2024

Hi. I am using pandoc to convert markdown to html. For the following lines:

```markdown
aaabbb

aaa<span></span>bbb

[link](https://baidu.com)aaabbb

[link](https://baidu.com)aaa<span></span>bbb
```

When the source language is [`commonmark`](https://pandoc.org/try/?params=%7B%22text%22%3A%22aaabbb%5Cn%5Cnaaa%3Cspan%3E%3C%2Fspan%3Ebbb%5Cn%5Cn%5Blink%5D%28https%3A%2F%2Fbaidu.com%29aaabbb%5Cn%5Cn%5Blink%5D%28https%3A%2F%2Fbaidu.com%29aaa%3Cspan%3E%3C%2Fspan%3Ebbb%5Cn%22%2C%22to%22%3A%22html%22%2C%22from%22%3A%22commonmark%22%2C%22standalone%22%3Afalse%2C%22embed-resources%22%3Afalse%2C%22table-of-contents%22%3Afalse%2C%22number-sections%22%3Afalse%2C%22citeproc%22%3Afalse%2C%22html-math-method%22%3A%22plain%22%2C%22wrap%22%3A%22auto%22%2C%22highlight-style%22%3Anull%2C%22files%22%3A%7B%7D%2C%22template%22%3Anull%7D), the raw HTML tags are preserved when following a link:

```html
<p>aaabbb</p>
<p>aaa<span></span>bbb</p>
<p><a href="https://baidu.com">link</a>aaabbb</p>
<p><a href="https://baidu.com">link</a>aaa<span></span>bbb</p>
```

However, when the source language is [`gfm`](https://pandoc.org/try/?params=%7B%22text%22%3A%22aaabbb%5Cn%5Cnaaa%3Cspan%3E%3C%2Fspan%3Ebbb%5Cn%5Cn%5Blink%5D%28https%3A%2F%2Fbaidu.com%29aaabbb%5Cn%5Cn%5Blink%5D%28https%3A%2F%2Fbaidu.com%29aaa%3Cspan%3E%3C%2Fspan%3Ebbb%5Cn%22%2C%22to%22%3A%22html%22%2C%22from%22%3A%22gfm%22%2C%22standalone%22%3Afalse%2C%22embed-resources%22%3Afalse%2C%22table-of-contents%22%3Afalse%2C%22number-sections%22%3Afalse%2C%22citeproc%22%3Afalse%2C%22html-math-method%22%3A%22plain%22%2C%22wrap%22%3A%22auto%22%2C%22highlight-style%22%3Anull%2C%22files%22%3A%7B%7D%2C%22template%22%3Anull%7D), they are escaped:

```html
<p>aaabbb</p>
<p>aaa<span></span>bbb</p>
<p><a href="https://baidu.com">link</a>aaabbb</p>
<p><a href="https://baidu.com">link</a>aaa&lt;span&gt;&lt;/span&gt;bbb</p>
```

I have read the specs and couldn't find any difference for links & raw HTML. Is this a bug in Pandoc?

jgm commented 7 months ago

Minimal repro in try pandoc

jgm commented 7 months ago

As noted in the linked discussion, this only affects parsing with CR+LF line endings.

The issue may be related to https://github.com/jgm/commonmark-hs/issues/136
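
For anyone wanting to check the line-ending dependence locally, here is a minimal sketch. It writes the last test line once with LF and once with CR+LF, then converts both with pandoc if it happens to be installed. The file names `lf.md` and `crlf.md` are placeholders chosen for illustration, and the result may well differ between pandoc versions.

```shell
# Write the same one-line test case twice: once with LF, once with CR+LF.
# (lf.md and crlf.md are hypothetical file names.)
printf '[link](https://baidu.com)aaa<span></span>bbb\n' > lf.md
printf '[link](https://baidu.com)aaa<span></span>bbb\r\n' > crlf.md

# The two files should differ in size by one byte (the extra \r).
wc -c lf.md crlf.md

# Convert both as gfm if pandoc is available; otherwise skip quietly.
if command -v pandoc >/dev/null 2>&1; then
  for f in lf.md crlf.md; do
    printf '%s: ' "$f"
    pandoc -f gfm -t html "$f"
  done
fi
```

If the two conversions disagree, the difference is attributable to the line ending alone, since nothing else in the input changes.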

jgm commented 7 months ago

Observations:

  1. This bug appears with -f gfm but NOT -f commonmark. So it has to do with an extension. Need to isolate which extension with further testing.

  2. The bug is in commonmark-hs, not pandoc itself.

    % echo -e "[link](https://baidu.com)aaa<span></span>bbb\n" | commonmark -xgfm
    <p><a href="https://baidu.com">link</a>aaa&lt;span&gt;&lt;/span&gt;bbb</p>

  3. I can reproduce it even with LF line endings using commonmark-cli, so I'm not sure why things seem different with pandoc.

I will transfer this to commonmark-hs.
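
One way to isolate the responsible extension, sketched here under assumptions, is to run the same input through `commonmark` with one extension enabled at a time and check whether the `<span>` comes out escaped. The helper names `classify` and `check_ext` are made up for this example. Of the extension names in the loop, only `gfm` and `autolinks` appear in this thread; the others are guesses that may need adjusting to whatever the CLI actually accepts.

```shell
input='[link](https://baidu.com)aaa<span></span>bbb'

# classify (hypothetical helper): report whether rendered HTML contains
# an escaped <span> tag.
classify() {
  case "$1" in
    *'&lt;span&gt;'*) echo escaped ;;
    *)                echo preserved ;;
  esac
}

# check_ext (hypothetical helper): render the input with a single
# extension enabled via -x, as in the transcript above.
check_ext() {
  out=$(printf '%s\n' "$input" | commonmark "-x$1")
  echo "$1: $(classify "$out")"
}

# Only run the loop when commonmark-cli is installed.
if command -v commonmark >/dev/null 2>&1; then
  # Names other than autolinks are assumptions, not confirmed here.
  for ext in autolinks pipe_tables strikethrough task_lists; do
    check_ext "$ext"
  done
fi
```

Any extension reported as `escaped` on its own is a candidate for the cause.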

jgm commented 7 months ago

Using -xautolinks instead of -xgfm produces the issue. So it can be attributed to the autolinks extension.

jgm commented 7 months ago

The code for the autolinks extension is quite bad and needs work! There is an extensive set of tests here that we might attend to. And here is a description of the syntax: https://unifiedjs.com/explore/package/micromark-extension-gfm-autolink-literal/#syntax

Some work in issue147 branch.

TripleCamera commented 7 months ago

Thank you. :smiling_face_with_three_hearts: