micromark / micromark-extension-gfm-autolink-literal

micromark extension to support GFM autolink literals
https://unifiedjs.com
MIT License
7 stars 3 forks source link

Regression: links inside html are now link nodes #5

Closed tripodsan closed 3 years ago

tripodsan commented 3 years ago

don't know if this is really a bug, but at least a regression.

with 0.5.4, a link inside html was parsed as a text node:

markdown:

| links |
|-------|
|<p><a href="https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be">https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be</a></p>|

mdast:
root[1] (1:1-5:1, 0-176)
└─0 table[2] (2:1-4:155, 1-175)
    │ align: [null]
    ├─0 tableRow[1] (2:1-2:10, 1-10)
    │   └─0 tableCell[1] (2:1-2:10, 1-10)
    │       └─0 text "links" (2:3-2:8, 3-8)
    └─1 tableRow[1] (4:1-4:155, 21-175)
        └─0 tableCell[5] (4:1-4:155, 21-175)
            ├─0 html "<p>" (4:2-4:5, 22-25)
            ├─1 html "<a href=\"https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be\">" (4:5-4:81, 25-101)
            ├─2 text "https://www.youtube.com/watch?v=kd2i50J9MZI&feature=youtu.be" (4:81-4:146, 101-166)
            ├─3 html "</a>" (4:146-4:150, 166-170)
            └─4 html "</p>" (4:150-4:154, 170-174)

but with version 0.5.5, it is parsed into a link node:

markdown:

| links |
|-------|
|<p><a href="https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be">https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be</a></p>|

mdast:
root[1] (1:1-5:1, 0-176)
└─0 table[2] (2:1-4:155, 1-175)
    │ align: [null]
    ├─0 tableRow[1] (2:1-2:10, 1-10)
    │   └─0 tableCell[1] (2:1-2:10, 1-10)
    │       └─0 text "links" (2:3-2:8, 3-8)
    └─1 tableRow[1] (4:1-4:155, 21-175)
        └─0 tableCell[5] (4:1-4:155, 21-175)
            ├─0 html "<p>" (4:2-4:5, 22-25)
            ├─1 html "<a href=\"https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be\">" (4:5-4:81, 25-101)
            ├─2 link[1] (4:81-4:146, 101-166)
            │   │ title: null
            │   │ url: "https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be"
            │   └─0 text "https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be" (4:81-4:146, 101-166)
            ├─3 html "</a>" (4:146-4:150, 166-170)
            └─4 html "</p>" (4:150-4:154, 170-174)

As a consumer of the mdast, it is very difficult to remove the link inside the <a>.... I don't know what the correct behaviour should be...

tripodsan commented 3 years ago

I just realized that the same happens in 0.5.4 when the inner text is sourrounded with spaces:

const doc = `
| links |
|-------|
|<p><a href="https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be"> https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be </a></p>|
`;
wooorm commented 3 years ago

I believe actual and expected in your issue are the same thing?

tripodsan commented 3 years ago

I believe actual and expected in your issue are the same thing?

oh, I think I copy pasted the wrong snipped.... one sec :-)

wooorm commented 3 years ago

Btw, GH doesn’t render it nicely either:

Take this, to disambiguate which link “wins”:

| links |
|-------|
|<p><a href="#">https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be</a></p>|

Gives:

<table role="table">
    <thead>
        <tr>
            <th>links</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>
                    <a href="#"></a>
                    <a rel="nofollow" href="https://www.youtube.com/watch?v=kd2i50J9MZI&amp;#x26;feature=youtu.be">https://www.youtube.com/watch?v=kd2i50J9MZI&amp;#x26;feature=youtu.be</a>
                </p>
            </td>
        </tr>
    </tbody>
</table>

Note that the autolink literal is still used.

Rendered here:

links

https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be

wooorm commented 3 years ago

The same is true for gists btw

wooorm commented 3 years ago

So, what would the expected output be for you? I know that this is a change, but gh also has the two links, and exactly matching how gh sanitizes their html isn't really something for this project either 🤔

tripodsan commented 3 years ago

In my example, the link href and the link text are the same. so this can be considered as an already autolinked HTML result. in this case, I think the gfm should not autolink it again.

wooorm commented 3 years ago

so this can be considered as an already autolinked HTML result. in this case, I think the gfm should not autolink it again.

But this is not how GFM handles it: the autolink literal is linked. They do have two <a>s. So I don’t see why this extension should do that?

For the actual GFM output: given that they sanitize the output, which is done by parsing the output of markdown as HTML and then cleaning it, and given that HTML parsing closes an open anchor element when another opening anchor tag is opened:

d = document.createElement('div')
d.innerHTML = '<a><a></a></a>'
console.log(d.innerHTML)

Yields:

<a></a><a></a>

I think the current behavior is correct: if micromark/remark outputs '<a><a></a></a>' and GH outputs <a></a><a></a>, both are seen as the same thing in browsers.

tripodsan commented 3 years ago

but when add the HTML from above:

<p><a href="https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be">https://www.youtube.com/watch?v=kd2i50J9MZI&#x26;feature=youtu.be</a></p>

https://www.youtube.com/watch?v=kd2i50J9MZI&feature=youtu.be

you can see that github doesn't add an additional <a>:

image

So ideally, the gfm-autolink extension detects that this is a link inside an <a> with the same href, and doesn't generate a link node.

wooorm commented 3 years ago

That’s a different case, and micromark/remark (most probably) handle that correctly, see: https://gist.github.com/wooorm/41e4ba5779fc89b78ece9d6a67faaa41.

Because p (or div) is a known “block” tag (ref: https://spec.commonmark.org/0.29/#html-blocks), the rest of the line is seen as raw HTML: emphasis and such won‘t work there either. Whereas in a table cell, or by using a non-block tag such as a (or span), the parser works differently, and “inline” things such as autolink literals or emphasis do work.

tripodsan commented 3 years ago

hmm... you're correct.

wooorm commented 3 years ago

Btw, there’s definitely a bug with autolink literals, which is very weird, and I’m working on here: https://github.com/micromark/micromark-extension-gfm-autolink-literal/pull/6#issuecomment-770957354.

But for this specific case, mircomark/mdast/remark are as close we can get to how GH parses, I think

wooorm commented 3 years ago

Closing this. Meanwhile, there are new releases for this project and the corresponding mdast utility to match GH!