mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Tables: <tbody> tag should be omitted when table has no body rows #138

Closed mity closed 3 years ago

mity commented 3 years ago

(from #136, to make one report per individual issue)

Example 205: <tbody> tags must be omitted when the table has no body rows.

| abc | def |
| --- | --- |

GFM Spec:

<table>
<thead>
<tr>
<th>abc</th>
<th>def</th>
</tr>
</thead>
</table>

MD4C:

<table>
<thead>
<tr>
<th>abc</th>
<th>def</th>
</tr>
</thead>
<tbody>
</tbody>
</table>

mity commented 3 years ago

I agree the HTML output should not exhibit the empty <tbody>...</tbody> tag.

That said, I'm wondering whether the parser should skip the application-provided callback for MD_BLOCK_TBODY altogether in that case, or whether it should keep calling it and leave it to the HTML renderer to special-case the empty body and suppress the <tbody></tbody> output.
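To make the second option concrete, here is a minimal sketch of a renderer that still receives the body-block callbacks but defers writing `<tbody>` until something actually appears inside it. It does not use the MD4C API: `BlockType`, `Renderer`, `enter_block`, `leave_block`, `text` and `feed_header_only_table` are hypothetical stand-ins, and only the idea of a TBODY block comes from the discussion above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the parser's table block types. */
typedef enum { BLK_TABLE, BLK_THEAD, BLK_TBODY, BLK_TR, BLK_TH, BLK_TD } BlockType;

typedef struct {
    char out[512];
    size_t len;
    int tbody_pending;  /* TBODY entered, opening tag not written yet */
} Renderer;

static void emit(Renderer *r, const char *s) {
    size_t n = strlen(s);
    assert(r->len + n < sizeof r->out);  /* sketch only: fixed-size buffer */
    memcpy(r->out + r->len, s, n + 1);   /* copies the terminating NUL too */
    r->len += n;
}

static const char *open_tag(BlockType t) {
    switch (t) {
    case BLK_TABLE: return "<table>\n";
    case BLK_THEAD: return "<thead>\n";
    case BLK_TBODY: return "<tbody>\n";
    case BLK_TR:    return "<tr>\n";
    case BLK_TH:    return "<th>";
    case BLK_TD:    return "<td>";
    }
    return "";
}

static const char *close_tag(BlockType t) {
    switch (t) {
    case BLK_TABLE: return "</table>\n";
    case BLK_THEAD: return "</thead>\n";
    case BLK_TBODY: return "</tbody>\n";
    case BLK_TR:    return "</tr>\n";
    case BLK_TH:    return "</th>\n";
    case BLK_TD:    return "</td>\n";
    }
    return "";
}

void renderer_init(Renderer *r) {
    r->out[0] = '\0';
    r->len = 0;
    r->tbody_pending = 0;
}

void enter_block(Renderer *r, BlockType t) {
    if (t == BLK_TBODY) {            /* remember it, but write nothing yet */
        r->tbody_pending = 1;
        return;
    }
    if (r->tbody_pending) {          /* first child proves the body is non-empty */
        emit(r, open_tag(BLK_TBODY));
        r->tbody_pending = 0;
    }
    emit(r, open_tag(t));
}

void leave_block(Renderer *r, BlockType t) {
    if (t == BLK_TBODY && r->tbody_pending) {  /* body was empty: drop both tags */
        r->tbody_pending = 0;
        return;
    }
    emit(r, close_tag(t));
}

void text(Renderer *r, const char *s) { emit(r, s); }

/* Simulates the callbacks for "| abc | def |" with no body rows,
 * including the empty TBODY enter/leave pair. */
void feed_header_only_table(Renderer *r) {
    enter_block(r, BLK_TABLE);
    enter_block(r, BLK_THEAD);
    enter_block(r, BLK_TR);
    enter_block(r, BLK_TH); text(r, "abc"); leave_block(r, BLK_TH);
    enter_block(r, BLK_TH); text(r, "def"); leave_block(r, BLK_TH);
    leave_block(r, BLK_TR);
    leave_block(r, BLK_THEAD);
    enter_block(r, BLK_TBODY);
    leave_block(r, BLK_TBODY);
    leave_block(r, BLK_TABLE);
}
```

With this approach the callback sequence stays the same for every table, and only the HTML writer decides that an empty body produces no markup.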

I'm mostly concerned about custom (non-HTML) renderers and about what's better in general. I tend to think that some renderers might assume the MD_BLOCK_TBODY callbacks are called for every table, and could break if they are not.
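As an illustration of that concern, consider a plain-text renderer that draws the rule under the header at the moment the body starts. This is a sketch of a hypothetical custom renderer, not MD4C code: the `Event` kinds, `TableEvent` and `render_plain` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event kinds a custom renderer might receive for a table. */
typedef enum {
    EV_ENTER_THEAD, EV_LEAVE_THEAD, EV_ENTER_TBODY, EV_LEAVE_TBODY, EV_ROW
} Event;

typedef struct {
    Event kind;
    const char *row;  /* pre-joined cell text, used only for EV_ROW */
} TableEvent;

/* Renders rows as lines and draws the header rule when the body begins.
 * It silently assumes every table delivers EV_ENTER_TBODY. */
size_t render_plain(const TableEvent *ev, size_t n, char *out, size_t cap) {
    size_t len = 0;
    assert(cap > 0);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        const char *s = NULL;
        if (ev[i].kind == EV_ROW)
            s = ev[i].row;
        else if (ev[i].kind == EV_ENTER_TBODY)
            s = "-----";
        if (s == NULL)
            continue;
        size_t need = strlen(s) + 1;   /* text plus newline */
        if (len + need >= cap)
            break;                     /* stop rather than overflow */
        memcpy(out + len, s, need - 1);
        out[len + need - 1] = '\n';
        len += need;
        out[len] = '\0';
    }
    return len;
}
```

If the body events were skipped for a header-only table, this renderer would print the header row without its rule. Whether real-world renderers are written this way is exactly the open question.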

Does anyone have an opinion about which is better?

(In retrospect, I also think that naming all the MD_BLOCK_xxx and MD_SPAN_xxx blocks after the corresponding HTML elements was a mistake, as it may lead people to wrong assumptions about how close the Markdown constructs and their HTML counterparts are. The parser API should be output-format agnostic. But we cannot fix that anymore for source-compatibility reasons.)

dominickpastore commented 3 years ago

I see you already made this decision, but I wanted to offer some thoughts on the last point:

(In retrospect, I also think that naming all the MD_BLOCK_xxx and MD_SPAN_xxx blocks after the corresponding HTML elements was a mistake, as it may lead people to wrong assumptions about how close the Markdown constructs and their HTML counterparts are. The parser API should be output-format agnostic. But we cannot fix that anymore for source-compatibility reasons.)

I don't think that's such a bad thing. The blocks do follow HTML semantics. By that I mean table blocks use an HTML-like structure, where header rows are wrapped by a special header block and body rows are wrapped by a body block. I could imagine an alternate implementation where, instead of having header and body containers, there's a separator node that delimits the end of the header rows. (This would be closer to what e.g. LaTeX does, where header rows aren't special, but you normally put a double or bold line between them and the body.)
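To show what that alternate shape might look like, here is a small sketch that converts container-style events into a flat stream with a single separator node after the header rows. Everything in it (`ContainerEvent`, `FlatNode`, `flatten`) is hypothetical and only meant to contrast the two models; it is not how MD4C works.

```c
#include <assert.h>
#include <stddef.h>

/* Container-style events, as in the HTML-like model (hypothetical names). */
typedef enum {
    EV_ENTER_THEAD, EV_LEAVE_THEAD, EV_ENTER_TBODY, EV_LEAVE_TBODY, EV_ROW
} ContainerEvent;

/* Flat model: rows, plus one separator marking the end of the header rows. */
typedef enum { NODE_ROW, NODE_SEPARATOR } FlatNode;

/* Maps container events to flat nodes; returns the number of nodes written. */
size_t flatten(const ContainerEvent *in, size_t n, FlatNode *out, size_t cap) {
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        FlatNode node;
        switch (in[i]) {
        case EV_ROW:         node = NODE_ROW;       break;
        case EV_LEAVE_THEAD: node = NODE_SEPARATOR; break;
        default:             continue;  /* other boundaries carry no flat node */
        }
        assert(len < cap);              /* sketch only: caller sizes the buffer */
        out[len++] = node;
    }
    return len;
}
```

In the flat model an empty body needs no special case: a header-only table is just a row followed by a separator, whether or not the body events are delivered.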

I suppose there could be debate over what semantics would be the most renderer-agnostic, but I think the HTML semantics do a pretty good job there (and the Markdown spec itself is clearly biased toward HTML semantics, maybe not w.r.t. tables, but in other ways). These are semantics that most people should already be familiar with, and using the HTML names serves as self-documentation that yes, wrapping table headings in a heading block is the structure used, not something else.