mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Differences between MD4C and GFM tables #136

Closed dominickpastore closed 3 years ago

dominickpastore commented 3 years ago

I noticed a few table examples in the GFM spec that MD4C handles differently. test/tables.txt differs from the GFM spec quite a bit, so I'm not sure whether these differences are intentional. I know there are some other issues with tables in the GFM spec (e.g. the ones mentioned in #108).


Example 200: Pipes in a cell's content must be escaped, even inside other spans. This one is surprising:

  1. It directly contradicts another part of the spec, "All backslashes are treated literally," with respect to code spans.
  2. This is very different from the implementation previously described at https://talk.commonmark.org/t/parsing-strategy-for-tables/2027/46.

| f\|oo  |
| ------ |
| b `\|` az |
| b **\|** im |

GFM Spec:

<table>
<thead>
<tr>
<th>f|oo</th>
</tr>
</thead>
<tbody>
<tr>
<td>b <code>|</code> az</td>
</tr>
<tr>
<td>b <strong>|</strong> im</td>
</tr>
</tbody>
</table>

MD4C:

<table>
<thead>
<tr>
<th>f|oo</th>
</tr>
</thead>
<tbody>
<tr>
<td>b <code>\|</code> az</td>
</tr>
<tr>
<td>b <strong>|</strong> im</td>
</tr>
</tbody>
</table>

Example 203: Number of header columns must match the columns in the delimiter row.

| abc | def |
| --- |
| bar |

GFM Spec:

<p>| abc | def |
| --- |
| bar |</p>

MD4C:

<table>
<thead>
<tr>
<th>abc</th>
</tr>
</thead>
<tbody>
<tr>
<td>bar</td>
</tr>
</tbody>
</table>

Example 205: <tbody> tags must be omitted when the table has no body rows.

| abc | def |
| --- | --- |

GFM Spec:

<table>
<thead>
<tr>
<th>abc</th>
<th>def</th>
</tr>
</thead>
</table>

MD4C:

<table>
<thead>
<tr>
<th>abc</th>
<th>def</th>
</tr>
</thead>
<tbody>
</tbody>
</table>

mity commented 3 years ago

Example 200: Pipes in a cell's content must be escaped, even inside other spans.

I'm aware of this one. There is some ironic history involved.

  1. The early MD4C table implementation did not handle pipes in code spans inside a table well at all, as I simply forgot to consider that case.
  2. The original GFM implementation (before the migration to the forked cmark) behaved exactly as the linked thread, https://talk.commonmark.org/t/parsing-strategy-for-tables/2027/46, describes.
  3. I rewrote MD4C's table implementation to match it, partly to improve compatibility and partly also because I agreed with all the rationale in that thread.
  4. After GFM migrated to cmark, they changed the behavior to the current one.

To be honest, I'm quite reluctant to follow unless reality (say, a wider and clear consensus among other implementations) forces me to. That is partly because I believe very few documents are affected, partly because there is no clear consensus across implementations, and partly because I disagree with their rationale for the change.

AFAIK their reasoning is based on the CommonMark precedence rules. My counter-argument is that although table cells are block elements in HTML, in GFM they behave more like inline spans on the Markdown side, and that's what matters here because we are a Markdown parser, not an HTML parser.
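
To make the difference in Example 200 concrete, here is a toy Python sketch of the two treatments of `\|`. It is not taken from MD4C or cmark-gfm: the function names are hypothetical, and it assumes code spans are delimited by single backticks only. It only reproduces the two outputs shown above for this one example.

```python
import re

def split_row(row):
    """Split a table row into cells on pipes not preceded by a backslash."""
    row = row.strip()
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|") and not row.endswith("\\|"):
        row = row[:-1]
    return [cell.strip() for cell in re.split(r"(?<!\\)\|", row)]

def unescape_everywhere(cell):
    """GFM-spec-like behavior: '\\|' becomes '|' even inside code spans."""
    return cell.replace("\\|", "|")

def unescape_outside_code(cell):
    """Behavior matching the MD4C output above: backslashes inside
    code spans stay literal. Assumes single-backtick code spans."""
    parts = re.split(r"(`[^`]*`)", cell)
    # Odd-indexed parts are the captured code spans; leave them untouched.
    return "".join(
        part if i % 2 else part.replace("\\|", "|")
        for i, part in enumerate(parts)
    )

cell = split_row(r"| b `\|` az |")[0]
print(unescape_everywhere(cell))    # b `|` az
print(unescape_outside_code(cell))  # b `\|` az
```

Both functions agree on `b **\|** im`, which is why only the code-span row differs between the two outputs.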

Example 203: Number of header columns must match the columns in the delimiter row.

Maybe they added some new examples to their spec. I don't remember this rule/example at all. Their additions to the spec were always heavily under-specified, so it's good if they are making things clearer.
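
The rule in Example 203 can be sketched as a check on the first two rows. This is a minimal illustration with hypothetical helper names, not code from MD4C or cmark-gfm, and it ignores details such as indentation and pipes inside code spans.

```python
import re

def split_cells(row):
    """Split a row into cells on pipes not preceded by a backslash."""
    row = row.strip()
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|") and not row.endswith("\\|"):
        row = row[:-1]
    return [cell.strip() for cell in re.split(r"(?<!\\)\|", row)]

def starts_table(header_row, delimiter_row):
    """Return True only if the delimiter row is well formed and has
    exactly as many cells as the header row (the Example 203 rule)."""
    header = split_cells(header_row)
    delimiter = split_cells(delimiter_row)
    if not all(re.fullmatch(r":?-+:?", cell) for cell in delimiter):
        return False
    return len(header) == len(delimiter)

print(starts_table("| abc | def |", "| --- |"))        # False
print(starts_table("| abc | def |", "| --- | --- |"))  # True
```

Under this check the input from Example 203 is not a table, so it would fall through to paragraph handling as in the GFM output.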

Example 205: <tbody> tags must be omitted when the table has no body rows.

This is more about the renderer, but makes sense too.
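
As a renderer-side illustration of Example 205, here is a small Python sketch that emits `<tbody>` only when there is at least one body row. The `render_table` function is hypothetical and unrelated to MD4C's actual renderer; it takes already-parsed cell text.

```python
from html import escape

def render_table(header, body_rows):
    """Render a table as HTML, omitting <tbody> when body_rows is empty."""
    out = ["<table>", "<thead>", "<tr>"]
    out += [f"<th>{escape(cell)}</th>" for cell in header]
    out += ["</tr>", "</thead>"]
    if body_rows:  # the Example 205 rule: no rows, no <tbody>
        out.append("<tbody>")
        for row in body_rows:
            out.append("<tr>")
            out += [f"<td>{escape(cell)}</td>" for cell in row]
            out.append("</tr>")
        out.append("</tbody>")
    out.append("</table>")
    return "\n".join(out)

print(render_table(["abc", "def"], []))
```

With an empty row list the output matches the GFM result for Example 205: a `<thead>` and no `<tbody>`.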

dominickpastore commented 3 years ago

To be honest, I'm quite reluctant to follow unless reality (say, a wider and clear consensus among other implementations) forces me to. That is partly because I believe very few documents are affected, partly because there is no clear consensus across implementations, and partly because I disagree with their rationale for the change.

AFAIK their reasoning is based on the CommonMark precedence rules. My counter-argument is that although table cells are block elements in HTML, in GFM they behave more like inline spans on the Markdown side, and that's what matters here because we are a Markdown parser, not an HTML parser.

Your reasoning makes a lot of sense to me, and I think it's more in line with the way most people (including me) would expect Markdown to work.

mity commented 3 years ago

I opened two new reports so we can track, discuss, and close the issues individually.

So closing this one.