jgm / pandoc

Universal markup converter
https://pandoc.org

Table row and column groups from Markdown #9856

Open allefeld opened 3 weeks ago

allefeld commented 3 weeks ago

Describe your proposed improvement and the problem it solves.

It would be good to be able to create HTML row groups (multiple tbody elements) and column groups (colgroup) from Markdown. Such groups supply the semantics behind styling such as stronger divider lines, which makes it easier to find one's way around larger tables.

The internal Pandoc representation of tables allows for the presence of multiple table bodies, but as far as I can tell there is no way to create such a structure from Markdown input.

An obvious approach would be to extend grid tables to support a third kind of divider character besides - and =, but ASCII does not contain any suitable horizontally oriented characters. One possibility is to modify the edge character, +. For example, a divider in which the first and last edge are * instead of + would separate different table bodies.

Another structural feature of HTML tables that does not seem to be represented by Pandoc internally is column groups. If they were implemented, column groups could be created from Markdown in an analogous way: the uppermost and lowermost + of a vertical divider would be * instead of +.

Example:

+----------*----------+----------+
| Header 1 | Header 2 | Header 3 |
+==========+==========+==========+
|   Row 1  |   Data   |   Data   |
+----------+----------+----------+
|   Row 2  |   Data   |   Data   |
*----------+----------+----------*
|   Row 3  |   Data   |   Data   |
+----------+----------+----------+
|   Row 4  |   Data   |   Data   |
+==========+==========+==========+
|  Footer  |   Data   |   Data   |
+----------*----------+----------+

In this table there are two column groups, containing 1 and 2 columns respectively, and two row groups, containing 2 rows each.
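To make the reading rule concrete, here is a minimal Python sketch of how a reader could recover the groups from the example. It is not pandoc code. The function name `find_groups` is hypothetical, and the sketch assumes a header row closed by `=`, an optional footer opened by a second `=` line, and no row or column spans.

```python
EXAMPLE = """
+----------*----------+----------+
| Header 1 | Header 2 | Header 3 |
+==========+==========+==========+
|   Row 1  |   Data   |   Data   |
+----------+----------+----------+
|   Row 2  |   Data   |   Data   |
*----------+----------+----------*
|   Row 3  |   Data   |   Data   |
+----------+----------+----------+
|   Row 4  |   Data   |   Data   |
+==========+==========+==========+
|  Footer  |   Data   |   Data   |
+----------*----------+----------+
"""


def find_groups(table, marker="*"):
    """Return (column_group_sizes, row_group_sizes) for the proposed syntax.

    Hypothetical helper, not part of pandoc. Assumes a header, no spans.
    """
    lines = [ln.rstrip() for ln in table.strip().splitlines()]
    top = lines[0]

    # Column groups: walk the corners of the top border. A corner written
    # as the marker (or the final corner) closes the current column group.
    corners = [i for i, ch in enumerate(top) if ch in "+" + marker]
    col_groups, size = [], 0
    for pos in corners[1:]:
        size += 1
        if top[pos] == marker or pos == corners[-1]:
            col_groups.append(size)
            size = 0

    # Row groups: only body rows count, i.e. rows between the header
    # separator and the footer separator (or the bottom border).
    eq_seps = [i for i, ln in enumerate(lines)
               if not ln.startswith("|") and ln[1] == "="]
    body_start = eq_seps[0] + 1 if eq_seps else 1
    body_end = eq_seps[1] if len(eq_seps) > 1 else len(lines) - 1
    row_groups, size = [], 0
    for ln in lines[body_start:body_end + 1]:
        if ln.startswith("|"):
            continue  # cell content; a row is counted at its closing separator
        size += 1
        if ln.startswith(marker) and ln.endswith(marker):
            row_groups.append(size)  # marked edges close a row group
            size = 0
    if size:
        row_groups.append(size)
    return col_groups, row_groups


print(find_groups(EXAMPLE))  # ([1, 2], [2, 2])
```

The result matches the description above: column groups of 1 and 2 columns, and two row groups of 2 rows each.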

Describe alternatives you've considered.

The alternative would be to directly include HTML code.
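For reference, the raw HTML for the example above would need one colgroup element per column group and one tbody element per row group. The following Python sketch generates that markup. The function name `render_table` and its argument layout are assumptions made for illustration only.

```python
from html import escape


def render_table(head, bodies, foot, col_groups):
    """Render a table with several <tbody> and <colgroup> elements.

    Hypothetical helper: `bodies` is a list of row groups, each a list of
    rows; `col_groups` lists how many columns each column group spans.
    """
    def row(cells, tag):
        inner = "".join(f"<{tag}>{escape(c)}</{tag}>" for c in cells)
        return f"<tr>{inner}</tr>"

    parts = ["<table>"]
    parts += [f'<colgroup span="{n}"></colgroup>' for n in col_groups]
    parts.append("<thead>" + row(head, "th") + "</thead>")
    for body in bodies:
        parts.append("<tbody>" + "".join(row(r, "td") for r in body) + "</tbody>")
    parts.append("<tfoot>" + row(foot, "td") + "</tfoot>")
    parts.append("</table>")
    return "\n".join(parts)


html_table = render_table(
    head=["Header 1", "Header 2", "Header 3"],
    bodies=[
        [["Row 1", "Data", "Data"], ["Row 2", "Data", "Data"]],
        [["Row 3", "Data", "Data"], ["Row 4", "Data", "Data"]],
    ],
    foot=["Footer", "Data", "Data"],
    col_groups=[1, 2],
)
print(html_table)
```

A stylesheet can then target `tbody` and `colgroup` borders to draw the stronger divider lines.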

kysko commented 3 weeks ago

(...) first and last edge are * instead of + would separate different table bodies.

(...) [for] column groups (...) uppermost and lowermost + of a vertical divider would be * instead of +

Interesting, but the problem could be visibility. Maybe 'o' instead of '*'?

Also, for more complex tables, you eventually have to consider the (sub)header in a pandoc Table Tbody, for which the '=' separator is a good candidate.

Also (bis), in principle two tbodies could have different Row Head Column numbers.

(...) a third kind of divider character besides - and =, but ASCII does not contain any suitable horizontally oriented characters

I've played with a tentative syntax in my own experimental md Table reader/writer during the pandemic, where for the subtable (tbody) division I use the '~' separator.

If you don't mind, I could give examples below (although I think this could go in the Discussions section).

allefeld commented 3 weeks ago

o instead of * would be fine, too.

I didn't think of ~ for the horizontal, makes sense, but what about the vertical?

I wasn't aware of intermediate heads.

Please feel free to add examples, or to start a discussion referencing this.

kysko commented 3 weeks ago

Firstly, I see I have not really addressed your colgroups issue. I mentioned Row Head Columns (RHC), which are a kind of column grouping within a TBody, but that's not the same thing.

As you say, multiple colgroups are not represented internally, although they could be encoded somehow in the Table attributes.

Some support for grid table attributes was added in pandoc 3.1.11.1, at least for the ID, at the end of a grid table caption; if it could be extended to any attribute, that would be an option for expressing colgroups even if they are never shown visually in the grid table itself.


I wasn't aware of intermediate heads

These subheads can complicate the syntax.

Let's say the usual = separator is chosen for TBody subhead. Consider then the following, using either your "bookend syntax" (if I may call it that) (I'll use o for visibility), or the ~ separator:

+-----+   or  +-----+
| A   |       | A   |
+=====+       +=====+
| B   |       | B   |
o-----o       +~~~~~+
| C   |       | C   |
+-----+       +-----+

In either syntax above, we would have two TBodies. But, do we have a first TBody with subhead A and body B, or do we have two TBodies, B and C, with A as TableHead?

Any syntax must be able to distinguish those two cases. Let's illustrate this by some possible solutions.

With the ~ separator, an ugly solution I had come up with was to introduce a double separator when there was a TableHead, which would give:

: A is a THead          or     : A is Not a THead

+-----+                        +-----+   <- TBody 1              
| A   |   <- THead             | A   |      <- subhead of TBody 1
+=====+                        +=====+                           
+~~~~~+                        | B   |      <- body of TBody 1   
| B   |   <- TBody 1           +~~~~~+                           
+~~~~~+                        | C   |   <- TBody 2              
| C   |   <- TBody 2           +-----+                           
+-----+                      

(The distinct = separator for the THead gives a distinct line on which to place the global column alignments.)

But now I see that your idea would solve it in a cleaner way, since we can consider the THead as some kind of special TBody:

: A is a THead          or     : A is Not a THead

+-----+                        +-----+   <- TBody 1
| A   |   <- THead             | A   |      <- subhead of TBody 1
o=====o                        +=====+
| B   |   <- TBody 1           | B   |      <- body of TBody 1
o-----o                        o-----o
| C   |   <- TBody 2           | C   |   <- TBody 2
+-----+                        +-----+

(However, it would "steal" the alignment locations from the cells immediately below, which only matters if individual cell alignment on the grid is ever officially implemented.)

I'd prefer a distinct separator for subtables, but I admit getting rid of that double separator in the case above is satisfying.

Another possible solution: maybe use your bookend indicators for the subtables, and use the ~ separator for the subheads?

: A is a THead          or     : A is Not a THead

+-----+                        +-----+   <- TBody 1
| A   |   <- THead             | A   |      <- subhead of TBody 1
+=====+                        +~~~~~+
| B   |   <- TBody 1           | B   |      <- body of TBody 1
o-----o                        o-----o
| C   |   <- TBody 2           | C   |   <- TBody 2
+-----+                        +-----+

Anyways... just throwing ideas out there...
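To check that the last variant really distinguishes the two cases, here is a small Python sketch that reads a single-column table under those rules: `=` closes the THead, `~` closes a TBody subhead, and a separator with `o` at both ends closes a TBody. The function `parse_blocks` is hypothetical and assumes one line per cell and one column; it is not pandoc's reader.

```python
def parse_blocks(table, bookend="o"):
    """Split a one-column grid table into (thead_rows, tbodies).

    Hypothetical sketch of the 'bookends for subtables, ~ for subheads'
    variant. Each tbody is a dict with 'subhead' and 'rows' lists.
    """
    head, bodies = [], []
    current = {"subhead": [], "rows": []}
    pending = []  # cell lines seen since the previous separator
    for ln in table.strip().splitlines()[1:]:  # skip the top border
        ln = ln.strip()
        if ln.startswith("|"):
            pending.append(ln.strip("|").strip())
            continue
        fill = ln[1]  # the character the separator is drawn with
        if fill == "=":
            head.extend(pending)  # '=' closes the table head
        elif fill == "~":
            current["subhead"].extend(pending)  # '~' closes a subhead
        else:
            current["rows"].extend(pending)  # '-' closes an ordinary row
        pending = []
        if ln[0] == bookend and ln[-1] == bookend:
            bodies.append(current)  # bookends close the current TBody
            current = {"subhead": [], "rows": []}
    if current["subhead"] or current["rows"]:
        bodies.append(current)
    return head, bodies


WITH_THEAD = """
+-----+
| A   |
+=====+
| B   |
o-----o
| C   |
+-----+
"""

WITH_SUBHEAD = WITH_THEAD.replace("=", "~")

print(parse_blocks(WITH_THEAD))
print(parse_blocks(WITH_SUBHEAD))
```

In the first table A ends up in the THead and the two TBodies hold B and C. In the second, the THead is empty and A becomes the subhead of the first TBody.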

(There was a similar problem with the TableFoot, but tarleb solved it by imposing a last separator with =.)


what about the vertical?

A possible character for a vertical separator is § (for RHCs), but it's admittedly a bit ugly, and not ASCII (and if we accept non-ASCII, there is a better candidate in the box-drawing group (U+2551)).

Elsewhere, I think some have suggested double pipes (like U+2551, but as two characters).

tarleb commented 3 weeks ago

The current table syntax already leads to subtle mistakes, such as #9740. Adding more syntax to grid tables is very likely going to lead to more such problems.

The tilde ~ isn't being used yet in table syntax, but it looks similar to a dash in most fonts, which would add one more potential source of problems for authors.

I tend to think that a lightweight markup format like Markdown is just not suited to expressing this level of detail, and that the preferable solution is just to resort to raw HTML.

bpj commented 3 weeks ago

What about a +html_tables extension allowing the Markdown parser to parse HTML tables, presumably with a custom markdown=parse attribute on the <table> element to allow and parse Markdown syntax inside cells and caption?

I'm thinking of writing a new list-to-table filter with potentially three list levels where the top level is - head - body - foot, probably subject to an attribute on the enclosing div.
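As a rough illustration of the three-level idea, the Python sketch below uses nested lists as a stand-in for a parsed Markdown list: sections at the top level, rows at the second, cells at the third. The function `list_to_table` and the data layout are assumptions, not the design of any existing filter. The point is only that repeating the body section yields multiple table bodies.

```python
def list_to_table(sections):
    """Group (kind, rows) pairs into head, bodies, and foot.

    Hypothetical sketch: `kind` is 'head', 'body', or 'foot'; `rows` is a
    list of rows, each a list of cell strings. Every 'body' section
    becomes its own table body.
    """
    table = {"head": [], "bodies": [], "foot": []}
    for kind, rows in sections:
        if kind == "body":
            table["bodies"].append(rows)
        elif kind in ("head", "foot"):
            table[kind].extend(rows)
        else:
            raise ValueError(f"unknown section: {kind!r}")
    return table


sections = [
    ("head", [["Header 1", "Header 2"]]),
    ("body", [["Row 1", "Data"], ["Row 2", "Data"]]),
    ("body", [["Row 3", "Data"], ["Row 4", "Data"]]),
    ("foot", [["Footer", "Data"]]),
]
table = list_to_table(sections)
print(len(table["bodies"]))  # 2
```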

tarleb commented 3 weeks ago

The native_divs and native_spans extensions control somewhat similar functionality, so there's a case to be made for native_tables and native_figures extensions. But maybe this functionality should be left to filters, I'm not sure.

jgm commented 2 weeks ago

I see the appeal of html_tables, but anything that combines HTML parsing with markdown parsing gets to be a giant pain.

I think some version of "list tables" might be a better approach for complex layouts.

allefeld commented 2 weeks ago

I admit that this is stretching the abilities of readable Markdown, and using tables in HTML markup is an alternative, if they are parsed and transformed for non-HTML output. "List tables" would be fine, too.

Downgrading my request: Could we get colgroup representation in the AST, read from HTML, and with support for Word and LaTeX output?


More generally speaking, when preparing tables for a paper, I had the idea that a dedicated external file format for them could be useful. In an academic workflow, figures are created separately and then included and combined with captions etc. Why don't we do the same with tables? The table-file format could e.g. simply be HTML, stripped down to what the AST supports. They would then be included with the same syntax as figures, ![caption](table-file.html).
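A preprocessing step for this could start by finding image references whose target is an HTML file. The Python sketch below does only that, with a regular expression. The name `find_table_includes` is hypothetical, and the pattern is a simplification that ignores titles, attributes, and escaped brackets.

```python
import re

# Matches ![caption](path.html); a simplified pattern, not full Markdown.
TABLE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+\.html)\)")


def find_table_includes(markdown):
    """Return (caption, path) pairs for image references to .html files."""
    return TABLE_REF.findall(markdown)


text = "See ![Results](table-file.html) and ![Plot](fig.png)."
print(find_table_includes(text))  # [('Results', 'table-file.html')]
```

A filter could then read each referenced file and substitute the parsed table for the image, keeping the caption.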