syntax-tree / mdast-util-to-markdown

mdast utility to serialize markdown
http://unifiedjs.com
MIT License

roundtripping emphasis in emphasis edge case changes document structure #12

Open ChristianMurphy opened 3 years ago

ChristianMurphy commented 3 years ago

Subject of the issue

***emphasis*in emphasis*

is stringified as:

\***emphasis*in emphasis*

which changes the structure

Your environment

Steps to reproduce

parse

***emphasis*in emphasis*

which has the structure

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text"
                },
                {
                    "type": "emphasis",
                    "children": [
                        {
                            "type": "emphasis",
                            "children": [
                                {
                                    "type": "text"
                                }
                            ]
                        },
                        {
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}

and stringify it:

\***emphasis*in emphasis*

the resulting markdown has a different structure than the original

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text"
                },
                {
                    "type": "emphasis",
                    "children": [
                        {
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}

:notebook: Comparing how the two pieces of markdown are parsed with https://spec.commonmark.org/dingus, it appears that each one is parsed as expected, so the parser matches CommonMark in both cases.

Expected behavior

structure is the same

Actual behavior

structure is different
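
To make "structure is the same" checkable, the two trees from the steps above can be reduced to their nesting of node types and compared. This is only a sketch: `skeleton` and `wrap` are hypothetical helpers written for this illustration, not part of mdast-util-to-markdown, and the trees are typed in by hand from the JSON shown earlier.

```javascript
// Reduce an mdast-like node to a string of nested node types,
// ignoring values and positions (hypothetical helper).
function skeleton(node) {
  const children = (node.children || []).map(skeleton)
  return children.length > 0
    ? node.type + '(' + children.join(',') + ')'
    : node.type
}

const text = {type: 'text'}
const emphasis = (children) => ({type: 'emphasis', children})
// Wrap inline nodes in root > paragraph, as in the trees from the issue.
const wrap = (children) => ({
  type: 'root',
  children: [{type: 'paragraph', children}]
})

// Tree parsed from the original input.
const original = wrap([text, emphasis([emphasis([text]), text])])
// Tree parsed from the stringified output.
const roundtripped = wrap([text, emphasis([text])])

console.log(skeleton(original))
// root(paragraph(text,emphasis(emphasis(text),text)))
console.log(skeleton(roundtripped))
// root(paragraph(text,emphasis(text)))
console.log(skeleton(original) === skeleton(roundtripped)) // false
```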

wooorm commented 3 years ago

Here are some more examples:

a ***b*c d*

a \***b*c d*

a ***b* d*

a \***b* d*

Yields (CM dingus):

<p>a *<em><em>b</em>c d</em></p>
<p>a ***b<em>c d</em></p>
<p>a *<em><em>b</em> d</em></p>
<p>a *<em><em>b</em> d</em></p>

So whether that escape “works” also depends on what comes after the “run”.

I don’t really see an (easy) fix 🤔
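
One way to see why the character after the run matters: in CommonMark, a `*` run can open emphasis if it is left-flanking and close emphasis if it is right-flanking, and both depend on the neighbouring characters. The sketch below is a simplified, ASCII-only version of those definitions; `classifyAsteriskRun` is a hypothetical name, and Unicode whitespace and punctuation are not handled.

```javascript
// Simplified flanking checks (ASCII only). `undefined` stands for the
// start or end of the line, which counts as whitespace.
const isWhitespace = (ch) => ch === undefined || /\s/.test(ch)
const isPunctuation = (ch) =>
  ch !== undefined && /[!-\/:-@\[-`{-~]/.test(ch)

// Classify a run of `*` by the characters right before and after it.
function classifyAsteriskRun(before, after) {
  const leftFlanking =
    !isWhitespace(after) &&
    (!isPunctuation(after) || isWhitespace(before) || isPunctuation(before))
  const rightFlanking =
    !isWhitespace(before) &&
    (!isPunctuation(before) || isWhitespace(after) || isPunctuation(after))
  // For `*`, opening needs left-flanking and closing needs right-flanking.
  return {canOpen: leftFlanking, canClose: rightFlanking}
}

// The `*` after `b` in `a ***b*c d*`: between two letters.
console.log(classifyAsteriskRun('b', 'c')) // { canOpen: true, canClose: true }
// The `*` after `b` in `a ***b* d*`: followed by a space.
console.log(classifyAsteriskRun('b', ' ')) // { canOpen: false, canClose: true }
```

In the first case the middle `*` can both open and close, so the extra length restriction on pairing applies to it; in the second it can only close, which is consistent with the escaped and unescaped forms giving the same HTML there.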

ChristianMurphy commented 3 years ago

A related edge case which can happen on the tail of emphasis next to emphasis

*a*_b__

github-actions[bot] commented 3 years ago

Hi! This was marked as ready to be worked on! Note that while this is ready to be worked on, nothing is said about priority: it may take a while for this to be solved.

Is this something you can and want to work on?

Team: please use the area/* (to describe the scope of the change), platform/* (if this is related to a specific one), and semver/* and type/* labels to annotate this. If this is first-timers friendly, add good first issue and if this could use help, add help wanted.

wooorm commented 2 years ago

I came up with a way to solve this, I think: https://github.com/syntax-tree/unist/discussions/60#discussioncomment-2111096.

wooorm commented 1 week ago

FYI, syntax-tree/unist#60 is solved, but it is different from this. Most of the linked issues above are solved because of syntax-tree/unist#60, but not this issue itself.


I have been thinking more and more about this, and it gets incredibly complex.

*x*y z*

**x*y z*

***x*y z*

****x*y z*

*****x*y z*

Yields:

xy z*

*xy z*

**xy z*

***xy z*

****xy z*

(dingus: https://spec.commonmark.org/dingus/?text=*x*y%20z*%0A%0A**x*y%20z*%0A%0A***x*y%20z*%0A%0A****x*y%20z*%0A%0A*****x*y%20z*%0A).

Please see the points near here: https://spec.commonmark.org/0.31.2/#can-open-emphasis.

There’s something at play here: the later delimiter runs, and whether the length of the opening run is divisible by 3, affect each other, and escaping a marker changes that length.

I have no clue how to deal with such complex escaping needs, where pulling a thread in one place makes something happen somewhere entirely different.
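
The interaction described here comes from the rule in the linked spec section: if either delimiter can both open and close emphasis, the sum of the two run lengths must not be a multiple of 3, unless both lengths are multiples of 3. A minimal sketch of just that check, with `ruleOfThreeAllows` and the object shape being assumptions made for illustration:

```javascript
// Each run is described by its length and whether it can open/close.
// Returns true when the "multiple of 3" rule lets the two runs pair up.
function ruleOfThreeAllows(opener, closer) {
  const eitherIsBoth =
    (opener.canOpen && opener.canClose) || (closer.canOpen && closer.canClose)
  if (!eitherIsBoth) return true // The rule only applies to "both" runs.
  const sum = opener.length + closer.length
  if (sum % 3 !== 0) return true
  return opener.length % 3 === 0 && closer.length % 3 === 0
}

// The `*` between `x` and `y` can both open and close.
const middle = {length: 1, canOpen: true, canClose: true}

// `***x*y z*`: opening run of 3, sum is 4, so the runs may pair.
console.log(ruleOfThreeAllows({length: 3, canOpen: true, canClose: false}, middle)) // true

// `\***x*y z*`: the escape shortens the run to 2, sum is 3, pairing is blocked.
console.log(ruleOfThreeAllows({length: 2, canOpen: true, canClose: false}, middle)) // false
```

Escaping one marker changes the opening run from 3 to 2, which flips the result for a closer further along in the text.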

wooorm commented 1 week ago

Perhaps... I will look into it later. My brain is broken. But maybe we can do this by switching markers when we have to escape a surrounding marker:

***x*y z*

\*_*x*y z_

***x*

\*\*_x_

*x**

_x_\*

_x__

*x*\_

Each pair generates the same HTML.
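
A rough sketch of one possible reading of this idea, not the library's implementation: escape literal markers in text, and write emphasis with `_` instead of `*` whenever the text right next to it starts or ends with a literal `*`. The `serialize` function and the node shapes are assumptions for illustration; it only looks at adjacent text siblings, so cases such as emphasis directly next to emphasis, or `_` inside a word, are not handled.

```javascript
// Serialize a list of inline nodes ({type: 'text', value} or
// {type: 'emphasis', children}) to markdown.
function serialize(nodes) {
  return nodes
    .map((node, index) => {
      // Escape every literal emphasis marker in text.
      if (node.type === 'text') return node.value.replace(/[*_]/g, '\\$&')
      const previous = nodes[index - 1]
      const next = nodes[index + 1]
      const before =
        previous && previous.type === 'text' ? previous.value.slice(-1) : ''
      const after = next && next.type === 'text' ? next.value.charAt(0) : ''
      // Switch to `_` when a literal `*` touches the emphasis.
      const marker = before === '*' || after === '*' ? '_' : '*'
      return marker + serialize(node.children) + marker
    })
    .join('')
}

const t = (value) => ({type: 'text', value})
const em = (children) => ({type: 'emphasis', children})

console.log(serialize([t('*'), em([em([t('x')]), t('y z')])])) // \*_*x*y z_
console.log(serialize([t('**'), em([t('x')])])) // \*\*_x_
console.log(serialize([em([t('x')]), t('*')])) // _x_\*
console.log(serialize([em([t('x')]), t('_')])) // *x*\_
```

The four calls reproduce the second line of each pair above from the tree that the first line would parse to.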