miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License
841 stars 119 forks source link

Problem rendering pipe characters in code blocks within tables #85

Closed huettenhain closed 3 years ago

huettenhain commented 5 years ago

Hello, thank you heartily for this great library. It took me quite few months to encounter a bug, which I am reporting here.

Mistletoe seems unable to render pipe characters in code blocks within tables. Here is an example of observed behavior:

mistletoe [version 0.7.2] (interactive)
Type Ctrl-D to complete input, or Ctrl-C to exit.
>>> | Table | Header |
... |---    |---     |
... | `<|>` | `<|>`  |
...
... ^Z

<table>
<thead>
<tr>
<th align="left">Table</th>
<th align="left">Header</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">`&lt;</td>
<td align="left">&gt;`</td>
<td align="left">`&lt;</td>
<td align="left">&gt;`</td>
</tr>
</tbody>
</table>
>>>

The following was my expected rendering:

Table Header
<\|> <\|>

Interestingly, this behavior can not be expected in GFM, which requires escapes for pipes: https://github.com/github/markup/issues/1078

However, escaping pipes is not working in mistletoe:

>>> | Table | Header |
... |---    |---     |
... |`<\|>` | `<\|>` |
... ^Z

<table>
<thead>
<tr>
<th align="left">Table</th>
<th align="left">Header</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">`&lt;\</td>
<td align="left">&gt;`</td>
<td align="left">`&lt;\</td>
<td align="left">&gt;`</td>
</tr>
</tbody>
</table>
pbodnar commented 3 years ago

@huettenhain, thanks for the report. It looks like the GFM's approach seems to be even somewhat in-line with the CommonMark specification (https://spec.commonmark.org/0.30/#precedence). So I would go their way, i. e. to enable to escape | whenever it is inside a cell. I cannot quite imagine any other solution. That will mean the following changes in the code:

  1. Ignore escaped pipes when splitting a table row to cells. (This is quite easy to do, it's quite common in regex patterns used in this project.)
  2. Replace all escaped pipes with just pipes before parsing each cell's content. These escapes can be anywhere, not just within a code block. (I will have to get into the parsing logic a little bit more in here.)

What do you think? It would be great to have some feedback before starting with the implementation.

huettenhain commented 3 years ago

That seems perfectly reasonable to me, and I should add that I am very happy and grateful to see this nice library brought to life again.

pbodnar commented 3 years ago

That seems perfectly reasonable to me,

OK, thank you for your feedback, I think I will look at this next week then.

I should add that I am very happy and grateful to see this nice library brought to life again.

Yes, I was given write permissions by @miyuchina who seems to like my contributions, so hopefully I can bring the library a little bit further, even though it may have lost some users in the previous years. I take it as a challenge. :)

pbodnar commented 3 years ago

Just looking at this. I have some interesting findings:

  1. According to https://www.markdownguide.org/extended-syntax, one can escape pipe by using its character reference (&#124;).
  2. Yet, according to https://github.github.com/gfm/#entity-and-numeric-character-references, this is not possible within code blocks and code spans, because character references are not recognized there.

So no change in the direction of this issue actually. I'm only re-classifying it: it is probably not a bug, but rather an enhancement.

pbodnar commented 3 years ago

Another finding: It looks like even GitHub is not handling this edge case:

|a|b|
|-|-|
|ending with slash: \\|other cell|

This results in just one cell in the row:

a b
ending with slash: \|other cell

Of course, users will hopefully never do something like that. So I wouldn't implement handling of this edge case now. It would probably require using something more complicated than a simple row-string split (which just checks for any \ not being present right in front of splitting |, because a more precise regex look-behind cannot be used).

pbodnar commented 3 years ago

@huettenhain, so I prepared the implementation in a dedicated branch. Can you please check it out, together with my notes above?

pbodnar commented 3 years ago

close: I merged the branch, enjoy and feel free to reopen in case of any problems.

huettenhain commented 3 years ago

Ah, I am so sorry, I just couldn't get to this. I think this is a good way to proceed, I will let you know if any problems arise.

pbodnar commented 3 years ago

No problem, thank you. 👍