syntax-tree / mdast-util-gfm-autolink-literal

mdast extension to parse and serialize GFM autolink literals
https://unifiedjs.com
MIT License

hostname inside link text with formatting is wrongly autolinked #1

Closed: tripodsan closed this issue 3 years ago

tripodsan commented 3 years ago

Subject of the issue

Consider the following markdown:

[**www.richardianson.com**](https://richardianson.com/)

which renders correctly on GitHub: www.richardianson.com

but with mdast-util-gfm-autolink-literal, it parses to:

root[1] (1:1-1:56, 0-55)
└─0 paragraph[2] (1:1-1:56, 0-55)
    ├─0 text "[**" (1:1-1:4, 0-3)
    └─1 link[1] (1:4-1:56, 3-55)
        │ title: null
        │ url: "http://www.richardianson.com**](https://richardianson.com/)"
        └─0 text "www.richardianson.com**](https://richardianson.com/)" (1:4-1:56, 3-55)

instead of:

root[1] (1:1-1:56, 0-55)
└─0 paragraph[1] (1:1-1:56, 0-55)
    └─0 link[1] (1:1-1:56, 0-55)
        │ title: null
        │ url: "https://richardianson.com/"
        └─0 strong[1] (1:2-1:27, 1-26)
            └─0 text "www.richardianson.com" (1:4-1:25, 3-24)

Workaround

escape the hostname:

[**www\.richardianson\.com**](https://richardianson.com/)

tripodsan commented 3 years ago

btw, the serialization doesn't escape the hostname

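// Imports are not shown in the original comment; the ones below are assumed
// (builder helpers from mdast-builder, plugins from the remark ecosystem).
import {unified} from 'unified';
import stringify from 'remark-stringify';
import gfm from 'remark-gfm';
import {root, paragraph, link, strong, text} from 'mdast-builder';

// Build the tree that the parser was expected to produce.
const mdast = root([
  paragraph([
    link('https://richardianson.com/', undefined, [
      strong([
        text('www.richardianson.com'),
      ])
    ]),
  ]),
]);

// Serialize the tree back to markdown with GFM support enabled.
const doc = unified()
  .use(stringify)
  .use(gfm)
  .stringify(mdast);

console.log(doc);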

gives:

[**www.richardianson.com**](https://richardianson.com/)

So it's rather hard to handle this properly in the client: the serialized output is reparsed with the same wrong autolink. The only workaround is to move all formatting out of the link children.
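
A minimal sketch of that workaround, assuming plain mdast-shaped objects. `hoistFormatting` is a hypothetical helper, not part of any library, and the thread does not confirm that the rewritten tree round-trips correctly in every case. It turns `link > strong > text` into `strong > link > text`, so the link's direct child is plain text:

```javascript
// Hypothetical helper (not a library API): move strong/emphasis wrappers
// from inside a link to outside it.
function hoistFormatting(node) {
  if (!Array.isArray(node.children)) return node

  const children = node.children.map(hoistFormatting)
  const only = children.length === 1 ? children[0] : undefined
  const isWrapper = only && (only.type === 'strong' || only.type === 'emphasis')

  if (node.type === 'link' && isWrapper) {
    // link > strong > ...  becomes  strong > link > ...
    // Recurse so that nested wrappers (strong > emphasis) are hoisted too.
    return {
      type: only.type,
      children: [hoistFormatting({...node, children: only.children})]
    }
  }

  return {...node, children}
}

// The tree from the serialization example above, written as plain objects.
const tree = {
  type: 'root',
  children: [{
    type: 'paragraph',
    children: [{
      type: 'link',
      url: 'https://richardianson.com/',
      title: null,
      children: [{
        type: 'strong',
        children: [{type: 'text', value: 'www.richardianson.com'}]
      }]
    }]
  }]
}

const hoisted = hoistFormatting(tree)
```

The helper only handles a link whose single child is a formatting node. Links that mix plain and formatted text would need a different strategy, which is left open here.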

wooorm commented 3 years ago

What GFM (well, github.com, because this isn’t documented in the GFM spec and doesn’t work like CommonMark) seems to be doing here is rather complex... Take this markdown:

**www.a.com**](b)

Yields:

www.a.com](b)

^ So, all the characters are valid in a URL. And the attention (strong) does not “break out” of the URL. Nor does the label end (](b)).

Whereas:

[**www.a.com**](b)

Yields:

www.a.com

^ So, apparently, the URLs are parsed after the label end matches a label start. But before attention (**) is parsed? 🤔
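
One way to model the observed ordering is to treat literal autolinking as a pass over the tree that runs after links have been resolved, and that skips anything already inside a link. The sketch below is an illustration under that assumption, not the library's implementation. `linkifyLiterals`, `splitText`, and the simplified `www.` pattern are all hypothetical, and the real GFM matching rules are stricter.

```javascript
// Simplified stand-in for literal autolink detection (assumption: only
// "www." literals, ending at whitespace or "<").
const wwwLiteral = /www\.[^\s<]+/g

// Walk the tree and linkify text, but leave existing links untouched.
function linkifyLiterals(node, ignore = ['link', 'linkReference']) {
  if (ignore.includes(node.type) || !Array.isArray(node.children)) return node

  const children = node.children.flatMap((child) =>
    child.type === 'text' ? splitText(child.value) : [linkifyLiterals(child, ignore)]
  )

  return {...node, children}
}

// Split one text value into text and link nodes.
function splitText(value) {
  const nodes = []
  let last = 0

  for (const match of value.matchAll(wwwLiteral)) {
    if (match.index > last) {
      nodes.push({type: 'text', value: value.slice(last, match.index)})
    }
    nodes.push({
      type: 'link',
      url: 'http://' + match[0],
      title: null,
      children: [{type: 'text', value: match[0]}]
    })
    last = match.index + match[0].length
  }

  if (last < value.length) nodes.push({type: 'text', value: value.slice(last)})
  // Keep the original text node when nothing matched and the value is empty.
  return nodes.length > 0 ? nodes : [{type: 'text', value}]
}

// A paragraph with an existing link plus a bare hostname in plain text.
const paragraph = {
  type: 'paragraph',
  children: [
    {
      type: 'link',
      url: 'b',
      title: null,
      children: [{type: 'strong', children: [{type: 'text', value: 'www.a.com'}]}]
    },
    {type: 'text', value: ' and www.b.com'}
  ]
}

const result = linkifyLiterals(paragraph)
```

In this model the hostname inside the existing link stays as text, while the bare `www.b.com` in the surrounding paragraph becomes its own link node.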

wooorm commented 3 years ago

For serialization, link and image might have to be removed, if I can’t think of anything else

wooorm commented 3 years ago

this should all be fixed with https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/commit/a2fac797b315885cf6e279eb7577f1db3a8562e4 btw!