miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License

Enable tables which interrupt a paragraph (like GFM does) #166

Closed minaminao closed 1 year ago

minaminao commented 1 year ago

Input markdown:

    A:
    |a|b|
    |-|-|
    |a|b|

ast_renderer outputs:

    {
      "type": "Paragraph",
      "children": [
        {
          "type": "RawText",
          "content": "A:"
        },
        {
          "type": "LineBreak",
          "soft": true,
          "content": ""
        },
        {
          "type": "RawText",
          "content": "|a|b|"
        },
        {
          "type": "LineBreak",
          "soft": true,
          "content": ""
        },
        {
          "type": "RawText",
          "content": "|-|-|"
        },
        {
          "type": "LineBreak",
          "soft": true,
          "content": ""
        },
        {
          "type": "RawText",
          "content": "|a|b|"
        }
      ]
    },

I believe the output should be a Table, not RawText.

Actual output in GitHub:

A:

| a | b |
|---|---|
| a | b |

pbodnar commented 1 year ago

@minaminao, thanks for the report. After some analysis, I would consider this a change request which could be stated like this:

Enable tables which interrupt a paragraph (like GFM does)

Let me explain why. The following, slightly modified markdown from your example works as expected in mistletoe:

    A:

    |a|b|
    |---|---|
    |a|b|

I.e., when there is a blank line between the blocks, the table is parsed correctly. Also note the 3 consecutive dashes used in the separator cells: this is covered by #131, so I won't dive into it here.

The GFM spec, being an extension of the CommonMark spec, clearly states when a given token type may "interrupt a paragraph". Unfortunately, the GitHub authors apparently forgot to specify this for their table extension token type. Yet it is evident from GitHub's own Markdown rendering here that they do support this interruption.

Therefore we could try to support it as well, but it won't be a trivial change: before adding a corresponding Table.start() check to the line iteration inside Paragraph.read(), we need to extend the existing Table.start() check (right now it is a simple return '|' in line) and also the related code. This change will imply some performance penalty, so we need to be cautious. Also, for backwards-compatibility reasons, we should probably make this an optional (switchable) feature of mistletoe.
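To make the "extend the start check" point more concrete, here is a rough sketch of what a stricter check might look like: it requires a pipe in the candidate header line and a matching delimiter row right after it. The helper names (split_row, is_delimiter_row, looks_like_table_start) are hypothetical and not part of mistletoe, and the sketch ignores escaped pipes and pipes inside code spans.

```python
import re

# Hypothetical helpers for illustration only; this is not mistletoe's code.
# One delimiter cell: optional colons around one or more dashes ("-", ":--", ":-:").
_DELIMITER_CELL = re.compile(r"^\s*:?-+:?\s*$")


def split_row(line):
    """Split a table row into cells, ignoring one leading and one trailing pipe."""
    stripped = line.strip()
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return stripped.split("|")


def is_delimiter_row(line):
    """Return True if every cell of the line looks like a delimiter cell."""
    if "|" not in line:
        return False
    return all(_DELIMITER_CELL.match(cell) for cell in split_row(line))


def looks_like_table_start(line, next_line):
    """Stricter alternative to a bare `'|' in line` check (assumed rules)."""
    if "|" not in line or next_line is None:
        return False
    if not is_delimiter_row(next_line):
        return False
    # Header and delimiter rows must have the same number of cells.
    return len(split_row(line)) == len(split_row(next_line))
```

A check like this needs to look one line ahead, which hints at where the performance cost mentioned above would come from if it ran on every paragraph line.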

To conclude, while the requested change makes sense, I would probably not put it on the todo list for the very next mistletoe version. But I can imagine trying to implement this together with the other table-related issues...

CC to @anderskaplan, our top contributor, who can maybe also share some thoughts on this? :)

minaminao commented 1 year ago

Thanks for the analysis :)

(I found this in a table with more than 3 consecutive -'s, but simplified the table for reporting, and thereby unexpectedly ran into another issue, #131.)

anderskaplan commented 1 year ago

As mistletoe has the ambition to support custom tokens, it would be nice if it also had a mechanism to customize which tokens are allowed to interrupt a paragraph. As you point out, @pbodnar, this is currently hard-coded in Paragraph.read(). With such a mechanism in place, it should be straightforward to enable the desired behavior for tables.
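One possible shape for such a mechanism, purely as a sketch: the paragraph reader receives a list of block rules that are allowed to interrupt it, instead of hard-coding them. All names here (BlockRule, starts_table, starts_heading, read_paragraph) are made up for illustration and are not mistletoe's API; the start predicates are deliberately simplistic.

```python
# Hypothetical sketch: none of these names are mistletoe's actual API.

class BlockRule:
    """A block type described by a name and a start predicate."""

    def __init__(self, name, start):
        self.name = name
        self.start = start  # callable(line, next_line) -> bool


def starts_table(line, next_line):
    # Simplified: a pipe in the line, followed by a row of pipes, dashes, colons.
    if "|" not in line or next_line is None:
        return False
    delimiter = next_line.strip()
    return "-" in delimiter and "|" in delimiter and set(delimiter) <= set("|-: ")


def starts_heading(line, next_line):
    # Simplified ATX heading check.
    return line.startswith("#")


def read_paragraph(lines, interrupters):
    """Consume paragraph lines until a blank line or an interrupting block.

    Returns (paragraph_lines, remaining_lines).
    """
    paragraph = []
    for index, line in enumerate(lines):
        next_line = lines[index + 1] if index + 1 < len(lines) else None
        if not line.strip():
            return paragraph, lines[index:]
        # The first line always belongs to the paragraph; later lines may interrupt it.
        if paragraph and any(rule.start(line, next_line) for rule in interrupters):
            return paragraph, lines[index:]
        paragraph.append(line)
    return paragraph, []


lines = ["A:", "|a|b|", "|-|-|", "|a|b|"]
default_rules = [BlockRule("Heading", starts_heading)]
gfm_like_rules = default_rules + [BlockRule("Table", starts_table)]

# Without the table rule, all four lines stay in one paragraph.
print(read_paragraph(lines, default_rules))
# With the table rule, the paragraph ends after "A:".
print(read_paragraph(lines, gfm_like_rules))
```

Keeping the list of interrupters as data would also give a natural place to make the table behavior switchable, as suggested earlier in the thread.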

I have some ideas about how to do this, so I'll try and make a PR out of them.

pbodnar commented 1 year ago

@anderskaplan, that looks promising - having it implemented in a more generic way, not just for Table, would be great. :)

anderskaplan commented 1 year ago

Created two PRs: one "infrastructure" fix (#186) and a draft PR for the new mechanism + table change (#187).