jgm / pandoc

Universal markup converter
https://pandoc.org

Markdown reader - support new table features #6317

Open mb21 opened 4 years ago

mb21 commented 4 years ago

Add support for (at least some of) the new table features introduced in pandoc-types/pull/66.

It would be good if at least one of pandoc markdown's table syntaxes supported that; grid tables seem like the obvious candidate. Something like:

+---------------+---------------+--------------------+
| Fruit         | Price         | Advantages         |
+===============+===============+====================+
| rowspan                       | - built-in wrapper |
|                               | - bright color     |
+---------------+---------------+--------------------+
| subheader     | Price         | Advantages         |
+===============+===============+====================+
| Oranges       | colspan       | - cures scurvy     |
|               |               | - tasty            |
+---------------+               +--------------------+
|| Row header   |               | - cures scurvy     |
||              |               | - tasty            |
+---------------+---------------+--------------------+
| Table foot    | Price         | Advantages         |
+===============+===============+====================+

This would roughly tick off the following of the new table features:

It does have the disadvantage that if the last rows look like header rows, they are simply treated as the table foot.
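To make the head/foot ambiguity concrete, here is a minimal Python sketch (not pandoc's Haskell parser) of one way the `===` separators could be interpreted. The function name `classify_rows` and the labels it returns are hypothetical, and the sketch assumes a grid table without row or column spans.

```python
def classify_rows(lines):
    """Label each row of a span-free grid table as head, subhead, body, or foot.

    A row is "header-like" when the separator line that closes it uses '='.
    """
    rows = []      # one entry per row: the separator character that closes it
    current = []   # content lines collected for the row being read
    for line in lines:
        if line.startswith("+"):          # separator line
            if current:
                rows.append("=" if "=" in line else "-")
                current = []
        else:                             # content line of the current row
            current.append(line)

    kinds = ["subhead" if sep == "=" else "body" for sep in rows]

    # Header-like rows at the very top form the table head.
    i = 0
    while i < len(kinds) and kinds[i] == "subhead":
        kinds[i] = "head"
        i += 1

    # Header-like rows at the very bottom are treated as the table foot,
    # which is exactly the ambiguity mentioned above.
    j = len(kinds) - 1
    while j >= i and kinds[j] == "subhead":
        kinds[j] = "foot"
        j -= 1
    return kinds
```

Under these assumptions, a trailing row closed by a `+===+` line can only ever be read as a foot, never as a subheader with an empty body after it.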

mb21 commented 4 years ago

For captions and table attributes, inspired by https://github.com/jgm/pandoc/issues/3177#issuecomment-421261363, we could use the syntax of a native div wrapping nothing but a table:

::: {#tableId}

+---------------+---------------+
| Fruit         | Price         |
+===============+===============+
| Bananas       | $1.34         |
|               |               |
+---------------+---------------+

: long caption is backward-compatible
:
: but now, just like with blockquotes, it can contain blocks.
and it can wrap lazily

:::

This would be mostly backwards-compatible with pandoc-crossref I think? @lierdakil ?

Placement of the short caption is trickier though...

lierdakil commented 4 years ago

I would have to modify pandoc-crossref to work with the new AST anyway, so might as well adapt to the new syntax, whatever it ends up being.

That said, I'm not exactly a fan of overloading the native div syntax, it can lead to some surprising behaviour, and will likely break some workflows.

Perhaps we could use something like this instead?

: {#tableId}
+---------------+---------------+
| Fruit         | Price         |
+===============+===============+
| Bananas       | $1.34         |
|               |               |
+---------------+---------------+

: long caption is backward-compatible
:
: but now, just like with blockquotes, it can contain blocks.
and it can wrap lazily

The lack of empty line between : {#tableId} and the table itself should I believe avoid ambiguity wrt table captions above tables, and the syntax is similar, but less noisy.

mb21 commented 4 years ago

I would have to modify pandoc-crossref to work with the new AST anyway, so might as well adapt to the new syntax, whatever it ends up being.

But users wouldn't have to change their markdown files? Or am I mistaken, or is this a rare case anyway?

lierdakil commented 4 years ago

Internally, pandoc-crossref represents a table-with-attributes as a table-in-a-div, and that works on the syntax level, too. However, I believe most users use the short-cut syntax of adding {#tableId} to the end of the caption. Which isn't the most elegant thing in the world, but it worked for a while, and I'm not going to remove it, at least not until the next major release (which will take a while).

As for table-in-a-div, it's debatable whether to keep it or not, but probably I'll keep it as a variant syntax for the foreseeable future, because backward-compatibility is a thing I think about sometimes.

despresc commented 4 years ago

There is also the simple table and multiline table syntax, which is independent of the syntax for the overall table attributes and caption. I posted this in my pull request before, but something like this:

        Item
--------------------------  ---------
Animal    Description           Price
--------- ----------------  ---------
Gnat      per-gram              13.65
          each                   0.01
Gnu       stuffed               92.50
Emu       stuffed               33.33
Armadillo frozen                 8.99

which should be parsed like an existing simple table, except that multiple header lines are allowed, and the alignments of columns are determined by the last header line. The parser would have to go back and fill in the cell dimensions after header parsing, but if the existing rule that cells cannot cross column boundaries were kept for the other header lines, then this would be easier. That would mean this table:

 h1     h2
----   ----
   large
-----------
1
2
3

might have a second header row with two cells larg and e, and two columns, the first right-aligned and the second left-aligned (and full of empty cells in the body). This depends on the exact rules, but it would be similar to what the existing parser does in the body.

This (and the multiline table version) would allow for multiple table head lines and row spans in the table head, in addition to whatever table caption or attribute syntax is allowed.
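The cell-splitting rule described above can be sketched in a few lines of Python. This is an illustration only, not pandoc's parser: `column_starts` and `split_row` are hypothetical names, and the rule assumed here (cut every row at the offset where each column's run of dashes begins) is just one reading of "cells cannot cross column boundaries".

```python
import re


def column_starts(dashed_line):
    """Return the offset at which each run of dashes begins."""
    return [m.start() for m in re.finditer(r"-+", dashed_line)]


def split_row(row, starts):
    """Cut a row at every column start; the last cell runs to the end of the line."""
    cuts = [0] + starts[1:] + [max(len(row), starts[-1])]
    return [row[a:b].strip() for a, b in zip(cuts, cuts[1:])]
```

With the second example table, `split_row("   large", column_starts("----   ----"))` yields the two cells `larg` and `e`, and a body line such as `1` yields `1` plus an empty second cell.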

jgm commented 4 years ago

There are some suggestions for extensions to pipe table syntax in the commonmark forum: see especially

Extending grid table syntax as suggested above makes sense. For the caption, I think we'd want a syntax that can allow arbitrary block-level content. Making it like definition list definitions might make sense (with the 4-space indent).

:   My caption is here.

    Second paragraph of caption.

        indented code inside caption.

But I am also somewhat tempted by the "overloading fenced div" approach, which gives us a uniform way to add table attributes and also degrades nicely. (Everything after the table itself could be considered the caption.)

If there's going to be a special way to add attributes to the table, why not just

{#id .class}

on a line by itself right before the table? (NB in my commonmark-hs I've implemented an extension allowing attributes to be placed on any block level element this way.)
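As a rough illustration of how cheap it is to recognise such a line, here is a Python sketch that accepts a line consisting of nothing but an attribute block. It is a simplification and not pandoc's or commonmark-hs's attribute grammar: `parse_attr_line` is a hypothetical name, and quoted values containing spaces are not handled.

```python
import re


def parse_attr_line(line):
    """Parse a line like '{#id .class key=val}' into (id, classes, key-values).

    Returns None when the line is not a standalone attribute block.
    """
    match = re.fullmatch(r"\{(.*)\}", line.strip())
    if match is None:
        return None
    ident, classes, pairs = "", [], {}
    for token in match.group(1).split():
        if token.startswith("#"):
            ident = token[1:]
        elif token.startswith("."):
            classes.append(token[1:])
        elif "=" in token:
            key, _, value = token.partition("=")
            pairs[key] = value.strip('"')
        else:
            return None  # unknown token: treat the line as ordinary text
    return ident, classes, pairs
```

A reader could try this on the line immediately before a table and fall back to normal paragraph parsing whenever it returns `None`.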

We need a solution for short captions. A simple thing would be to take the first sentence of the caption, but that's probably not robust enough.
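A small Python sketch shows why the first-sentence heuristic is fragile. The function name `naive_short_caption` is hypothetical, and the sentence rule (cut at the first `.`, `!` or `?` followed by whitespace or end of text) is an assumption made only for this illustration.

```python
import re


def naive_short_caption(caption):
    """Return the caption up to and including the first sentence terminator."""
    match = re.search(r"[.!?](?:\s|$)", caption)
    if match is None:
        return caption
    return caption[: match.start() + 1]
```

For "Fruit prices. Collected in 2020." this gives "Fruit prices.", but for "Prices in approx. USD per kg. Collected in 2020." it stops at the abbreviation and gives "Prices in approx.".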

lierdakil commented 4 years ago

If there's going to be a special way to add attributes to the table, why not just {#id .class}

Works for me, if it works. I was just being wary of potential ambiguities, but now that I think about it, those are probably not an issue.

But I am also somewhat tempted by the "overloading fenced div" approach

It's not a great solution, because then there's no concise way to have a table in a div, which might be used for styling purposes or marking parts for filters. Most notably, this breaks syntactical backward compatibility (granted, probably for a minority of edge cases), but I would argue it's a bad idea overall to tack unintuitive contextual semantics onto an existing syntax that has (in theory) a very specific meaning. In my experience, it will just lead to surprises down the line, and not the good kind.

Everything after the table itself could be considered the caption.

This would be especially painful in some cases. FWIW, I do this for code blocks in pandoc-crossref (with some limitations), but that's because it's one of the few bad options I have, and not because it's a good idea.

jgm commented 4 years ago

because then there's no concise way to have a table in a div

One way to reduce this impact would be to require the table divs to be marked up somehow, e.g. with class table.

lierdakil commented 4 years ago

One way to reduce this impact would be to require the table divs to be marked up somehow, e.g. with class table.

Which we're generally trying to avoid due to i18n concerns IIRC. So it'd be at best a stopgap.

mb21 commented 4 years ago

Making [captions] like definition list definitions might make sense

yeah, or like blockquotes, but with the : instead of the >. Blockquotes are arguably a markdown feature more familiar to most users, and should be mostly the same except for indentation rules?

[attributes] on a line by itself right before the table? (NB in my commonmark-hs I've implemented an extension allowing attributes to be placed on any block level element this way.)

ah yes, if that's a general principle that works, that's great as well.

About overloading the div syntax: I guess to make a final decision, that should be done as part of the figure syntax? #3177

For me, we could also decide to go ahead with implementing the grid table I posted in the original post of this issue, and worry about attributes and long captions later. Or should we do this directly in commonmark-hs? I'm not so up to date on what the state of progress is there...?

jgm commented 4 years ago

Yes, if someone wants to work on allowing col/rowspans in grid table syntax, that's fine and it can be done without deciding about captions and identifiers. The syntax you propose looks okay to me. I agree that the issues about captions and identifiers should be thought about in connection with figures.

commonmark-hs currently has pipe tables but I haven't tried to implement grid tables there. It would be good to do this, though!

lrosenthol commented 4 years ago

Just keep in mind that grid tables are really bad for multi-line cells. Pipe tables (à la AsciiDoc) are probably a better approach.

jgm commented 4 years ago

See above for a link to some suggestions for pipe tables, which pandoc supports too. There's no reason we couldn't find a way to do col/rowspans in both kinds of tables.

bpj commented 4 years ago

Just for the record the correct word for "row header" is stub.

rickywu commented 4 years ago

Any plan to support markdown writer for new table feature?

jgm commented 4 years ago

Yes, of course we'll need to support whatever formats we decide on in the writer too. I opened a new issue for that.

bwl21 commented 4 years ago

To be honest, tables are some of the most annoying issues in Markdown, in particular if the table gets complex

I think there are contradicting requirements:

I therefore propose to support at least one table format which does not require that the table appear as tabular in the source text, and to use a more appropriate table format such as:

the-solipsist commented 4 years ago

I think there are contradicting requirements:

* table shall be powerful

* table shall appear as tabular in the source text

I tend to agree. While the original impetus of Markdown might have been to have a format that is simple enough to publish as-is, Pandoc Markdown is also meant to capture sufficient complexity to be the authoring format for conversion into multiple formats.

That having been said, pipe_tables (unlike grid_tables and simple_tables) allows for "compressed" or "non-aligned" tables, and so is easy enough to write as it doesn't require a "tabular"-looking table. And unlike a format like CSV, which is also easy to write, pipe_tables has the potential to allow for cell-level alignment, multiple header rows, colspans/rowspans, captions, and multi-line cells (to support unnumbered and numbered lists).

In particular, I like this proposal on a sufficiently-complex pipe_tables format, and think discussion around it would be beneficial: https://talk.commonmark.org/t/tables-in-pure-markdown/81/145

I also wouldn't be opposed to Pandoc Markdown natively supporting HTML5 tables syntax, since those too are simple to write and most end tags aren't required: https://talk.commonmark.org/t/tables-in-pure-markdown/81/124

I think it is also noteworthy that column spans and row spans are normally discouraged if your document is to be rendered accessibly by screen readers. So complex tables should generally be avoided whenever accessibility is a concern (as it usually should be).

jgm commented 4 years ago

Here are some of my other thoughts on the issue of pipe table extensions: https://talk.commonmark.org/t/tables-in-pure-markdown/81/134

bwl21 commented 4 years ago

@jgm thanks for the pointer to some other thoughts.

I see that tables have been the subject of a long discussion. But I also do not see any practical progress in this respect. How bad ...

So I really wish pandoc would support native HTML5 tables with markdown as table cell content. Then we would have a solution to the issue until the discussion converges.

the-solipsist commented 4 years ago

Here are some of my other thoughts on the issue of pipe table extensions: https://talk.commonmark.org/t/tables-in-pure-markdown/81/134

As far as I can see, the main feature-level differences (i.e., non-syntactical difference) between @jgm's proposal and aoudad's proposal are that aoudad's proposal provides for:

Features that neither proposal has:

The various features they both have in common are:

It seems to me that syntactically they are mostly similar, with a couple of differences: multi-line cells (: vs. ! / +), and table captions ([Caption Text] (underneath) / |Caption Text| (in first cell) vs. : / Table: (above or underneath)).

I hope I didn't miss any important differences.

Do folks think it is worth having per-cell alignment and row headers?

At any rate, would it make sense to have a feature-rich non-graphical table syntax (such as HTML5's, which seems to be both easy to type since it can do away with most end tags and has all the required features) be readily understood by Pandoc such that it is convertible into multiple formats without needing a separate filter to accomplish this?

bpj commented 4 years ago

As for a more powerful grid/pipe table syntax to me it is important that there is an easy way to mark a column as a stub (often erroneously called "row header") column or more generally to mark a cell as what in HTML terms is a TH element. I'm thinking perhaps replace the pipe(s) to the left (or to the right in an RTL document) with (a) bang(s).

|        | Head 1 | Head 2 | Head 3
|--------|--------|--------|--------
! Stub 1 |        |        |
! Stub 2 |        |        |
! Stub 3 |        |        |

Ideally the broken bar character ¦ U+00A6 could be used or even the double vertical line ‖ U+2016 to the right. Personally I see no problem with using non-ASCII — at least Latin-1 — punctuation for syntax but I can understand that there might be disagreement; I have all Latin-1 punctuation characters available on my Swedish Linux keyboard but not everyone may be so lucky.

Whichever characters are used for syntax it is important that they can be backslash escaped inside cell content.
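The escaping requirement can be illustrated with a short Python sketch that splits one row of the bang-marked table above. It is only a sketch under stated assumptions: `parse_row` is a hypothetical name, only a leading `!` is treated as a stub marker, and an escaped backslash directly before a pipe (`\\|`) is not handled.

```python
import re


def parse_row(row):
    """Split a pipe-table row into (is_stub, text) cells.

    A leading '!' marks the first cell as a stub; '\\|' and '\\!' inside a
    cell are unescaped to literal '|' and '!'.
    """
    row = row.strip()
    first_is_stub = row.startswith("!")
    if row[:1] in ("|", "!"):
        row = row[1:]                      # drop the leading delimiter
    if row.endswith("|") and not row.endswith("\\|"):
        row = row[:-1]                     # drop an unescaped trailing pipe
    parts = re.split(r"(?<!\\)\|", row)    # split on pipes not preceded by a backslash
    texts = [p.replace("\\|", "|").replace("\\!", "!").strip() for p in parts]
    return [(first_is_stub and i == 0, text) for i, text in enumerate(texts)]
```

For example, `parse_row(r"! Stub 1 | a \| b | c |")` returns a stub cell `Stub 1` followed by the ordinary cells `a | b` and `c`.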

bpj commented 4 years ago

As for more powerful syntaxes which clash with the "tables-should look like tables" principle the most common requirement is probably the ability to write a table as a list of lists. I have written a filter which converts lists of lists into tables. Note that it currently only works with pandoc < 2.10 (if anybody understands the pandoc 2.10 table model a pull request is most welcome! :-), but it shows that the filter approach to this works well.

ssfdust commented 3 years ago

Any news here?

jgm commented 3 years ago

I think that a more powerful grid table format would be a good first step. Something like https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#grid-tables (perhaps extended to support multiple headers etc.). We could use the same parser to support rst grid tables with row/colspans. I still like the idea of extending pipe table syntax, but the grid table syntax is less controversial.

rickywu commented 3 years ago

I agree with using HTML format; then we can use tui.editor (https://github.com/nhn/tui.editor) to render it. Just an idea.

the-solipsist commented 3 years ago

Hi @jgm. Since the pipe_tables format extension isn't yet settled, and grid_tables format needs extending too, how about the HTML5 suggestion?

I also wouldn't be opposed to Pandoc Markdown natively supporting HTML5 tables syntax, since those too are simple to write and most end tags aren't required: https://talk.commonmark.org/t/tables-in-pure-markdown/81/124

So I really wish pandoc would support native HTML5 tables with markdown as table cell content. Then we would have a solution to the issue until the discussion converges.

At any rate, would it make sense to have a feature-rich non-graphical table syntax (such as HTML5's, which seems to be both easy to type since it can do away with most end tags and has all the required features) be readily understood by Pandoc such that it is convertible into multiple formats without needing a separate filter to accomplish this?

jgm commented 3 years ago

HTML5 tables: it's an interesting idea, but one must think about how this would interact with the way raw HTML currently works in pandoc's markdown.

The current expectation is that raw HTML will be passed through verbatim to HTML (and other formats that accept HTML, like markdown and epub), and that it will be ignored by other formats. Parsing HTML tables as native Table elements would violate that expectation and could lead to problems (e.g. for people who include both an HTML and a LaTeX version of a table to cover both formats).

There's also the issue of how it would interact with markdown_in_html_blocks (enabled by default), which allows text nodes in tables to be interpreted as markdown.

Just to throw out an idea that would avoid these issues, one could introduce an explicit fencing syntax that means: parse the following chunk of HTML (or whatever other format) using the appropriate pandoc reader, and include the result into the AST.

This would differ from our current "raw attribute" syntax, which always creates a RawBlock.

Example:

+++ html
 <table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table> 
+++

Of course, this would not degrade well in implementations that didn't support the special syntax. A sneakier approach would be to use HTML comments or processing instructions:

<?read?>
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table> 

The "read" instruction would tell pandoc to try to parse a following raw block (which could be raw latex, raw html, or raw anything using a fence and a raw attribute) and parse it from its native format. The advantage of this is that the instruction would just be ignored by implementations that don't support this feature (e.g. on GitHub), so you could at least get the HTML table out in HTML output, while with pandoc you'd have the increased power of being able to convert it to any format.
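The opt-in check itself would be simple. The Python sketch below only shows the detection step, not the actual reading of the block; `split_read_instruction` is a hypothetical name, and treating the instruction as a literal `<?read?>` prefix of the same block is an assumption made for illustration.

```python
def split_read_instruction(block):
    """Return (should_parse, remainder) for a raw block.

    should_parse is True only when the block starts with the '<?read?>'
    processing instruction; the instruction is stripped from the remainder.
    """
    marker = "<?read?>"
    stripped = block.lstrip()
    if stripped.startswith(marker):
        return True, stripped[len(marker):].lstrip()
    return False, block
```

Blocks without the instruction are returned unchanged, which matches the current expectation that raw HTML is passed through verbatim.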

jgm commented 3 years ago

Alternatively we could have a special attribute in the HTML, e.g.

<table data-parse="1">
...

bwl21 commented 3 years ago

For me it would be important that the tables cells could be markdown (with lists and multiple paragraphs, even images)

Nested tables is IMHO less important

I like the approach with the processing instruction.

<?pandoc table="parse-markdown"?>
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
    <th>Bio</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
   <td>
    Jill was born and had a good childhood. Then she
    * went to school
    *  went to university
    * got familiar with Pandoc

 now she is a happy user of [pandoc](www.pandoc.org)
  </td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
    <td>
    Eve was born and had a good childhood. Then she
    * went to school
    *  went to university
    * got familiar with Pandoc

 now she is a happy user of [pandoc](www.pandoc.org)
   </td>
  </tr>
</table>

<table data-pandoc="parse markdown"> is also fine.

jgm commented 3 years ago

I don't think it would be easy to support markdown inside the table cells, if we did this. We'd need to call the html reader to parse the included content, and of course it doesn't know about markdown. Though I suppose one could export a version of readHtml that is parameterized with a parser for plain text nodes, and supply a parser that interprets this as markdown...

bwl21 commented 3 years ago

I don't think it would be easy to support markdown inside the table cells, if we did this.

I understand. Nevertheless, my main issue is to support tables with complex content in their cells. Therefore I proposed to indicate such a table by table="parse-markdown".

So without markdown support within table cells I need to write plain HTML, which would at least be a solution for creating complex tables in a Markdown document.

mb21 commented 3 years ago

Just my personal usage: I would want to use grid table syntax for smallish tables, and place a csv file for bigger tables (using a pandoc filter, or would be cool if built-in, see #553).

the-solipsist commented 3 years ago

Just my personal usage: I would want to use grid table syntax for smallish tables, and place a csv file for bigger tables (using a pandoc filter, or would be cool if built-in, see #553).

In my playing around with tables, I found that pipe tables and csv were roughly equivalent, with the | being the equivalent of , in a csv. So, excuse me if this is a stupid question, but why would a grid_table (more complex syntax, needs alignment so needs a specialized text editor like Emacs) be more suitable for smallish tables, rather than pipe tables?
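The rough equivalence between CSV and pipe tables can be sketched with Python's standard `csv` module. This is only an illustration, not a pandoc feature: `csv_to_pipe_table` is a hypothetical helper, it assumes the first CSV record is the header, and it escapes literal pipes in cells with a backslash.

```python
import csv
import io


def csv_to_pipe_table(text):
    """Convert CSV text into a compact (non-aligned) pipe table."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    def fmt(cells):
        # Escape pipes so cell content cannot be mistaken for a delimiter.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "|" + "|".join("---" for _ in header) + "|"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

The output is exactly the kind of "compressed" pipe table that needs no column alignment in the source.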

BTW, this discussion on a PSV format at the CommonMark forum has some interesting thoughts on pipe tables as a CSV-like format.

Interesting idea, but one must think about how this would interact with the way raw HTML currently works in pandoc's markdown.

Indeed, this is more complicated than I'd foolishly anticipated. But it seems as though it may prove to be a more easily solvable issue compared to settling on new extensions to grid/pipe table formats! :-)

mb21 commented 3 years ago

@the-solipsist Bigger tables, I keep in external files and edit with spreadsheet software, that's why they need to be csv. Smaller tables I keep in the markdown file and edit manually or with vim. Could be pipe tables or grid tables for me for that case, but as jgm mentioned, seems easier to add the new table features to grid tables.

bwl21 commented 3 years ago

Bigger tables, I keep in external files and edit with spreadsheet software,

@mb21 if you edit the table e.g. in Excel and have rich text with multiple paragraphs in a cell in combination with column/rowspans ... this is the use case where we struggle in Markdown.

jankap commented 3 years ago

Guys, I highly appreciate all your work here. I've had a look at all the relevant issues and ended up here, and it seems like this is the only issue left before colspan and rowspan tables work when converting from MD to HTML. Is that assumption correct?

If yes, what is missing to get it integrated to pandoc?

Thanks a lot :)

tarleb commented 3 years ago

That's correct. What's missing is

If your question was meant as an offer to participate, you could add support for reST-style grid tables, as the Markdown and reStructuredText parsers share this code. See function gridTableWith in file src/Text/Pandoc/Parsing.hs. That would enable Markdown users to profit from most of the new table features.

Help is welcome.

jankap commented 3 years ago

Thank you very much for your confirmation. While I'd love to contribute, I'm afraid that Julia and Matlab skills don't help much here, and there's no time to dig into Haskell while writing a thesis, I'm sorry :(

While there seem to be some good ideas in the thread you mentioned, there's no discussion for grid tables anymore, i.e. we "just" need to find somebody who is capable of writing the parser for grid tables, right?

I'm going to ask my colleagues but don't have high hopes to find a Haskell guy :(

tarleb commented 3 years ago

While there seem to be some good ideas in the thread you mentioned, there's no discussion for grid tables anymore, i.e. we "just" need to find somebody who is capable of writing the parser for grid tables, right?

Yes, that's right.

BTW, if you just need some way to add tables to your Markdown, then you could write the table as HTML (or LaTeX, if you prefer) and use a Lua filter to turn it into a full table.

function RawBlock(raw)
  if raw.format:match 'html' and raw.text:match '%<table' then
    return pandoc.read(raw.text, raw.format).blocks
  end
end

The table would be embedded like this:

```{=html}
<table>
<thead><tr><th colspan="2">foo</th></tr></thead>
<tbody><tr><td>1</td><td>2</td></tr></tbody>
</table>
```

jankap commented 3 years ago

BTW, if you just need some way to add tables to your Markdown, then you could write the table as HTML (or LaTeX, if you prefer) and use a Lua filter to turn it into a full table.

function RawBlock(raw)
  if raw.format:match 'html' and raw.text:match '%<table' then
    return pandoc.read(raw.text, raw.format).blocks
  end
end

The table would be embedded like this:

```{=html}
<table>
<thead><tr><th colspan="2">foo</th></tr></thead>
<tbody><tr><td>1</td><td>2</td></tr></tbody>
</table>
```

I wasn't aware of the {=html} syntax. This relates to the raw_attribute part of Pandoc, right? Does the Lua filter create a table that can be used from HTML and PDF targets, or just HTML?

Regarding LaTeX: Right now, I'm including a (SVG) table which actually is created by LaTeX :D Unfortunately, I'm using KaTeX which does not support table or tabular environments... Might be worth thinking about that choice, too.

Thanks!

waldyrious commented 1 year ago

Regarding the row header / stub feature (btw, what's the rationale for the "stub" name, @bpj?), instead of the syntax proposed in the opening comment by @mb21:

+-------------+-----------------+-----------------+
|             | Column header 1 | Column header 2 |
+=============+=================+=================+
|| Row header | foo             | bar             |
+-------------+-----------------+-----------------+

...I find the syntax proposed here to be more readable, intuitive, and consistent with grid tables' column header syntax:

+------------++-----------------+-----------------+
|            || Column header 1 | Column header 2 |
+============++=================+=================+
| Row header || foo             | bar             |
+------------++-----------------+-----------------+

Unicode character side note: Of course, we could use the double pipe ‖ (U+2016), as proposed by @bpj, which could improve the appearance of the table:

+------------+-----------------+-----------------+
|            ‖ Column header 1 | Column header 2 |
+============+=================+=================+
| Row header ‖ foo             | bar             |
+------------+-----------------+-----------------+

...but for something that ought to be editable in text mode, I don't think the minor aesthetic improvement justifies the added editing challenges.

I also think this addresses a point raised by @the-solipsist above, about support for multiple column/row headers within the same table. It seems to me that marking row headers at the division rather than at the start of the line would help with that:

+--------+--------++----------+
|                 || Header 1 |
|                 || Header 2 |
+========+========++==========+
| Stub 1 | Stub 2 || foobar   |
+--------+--------++----------+

Note how this is also consistent with how multiple column headers are implemented in grid tables today.

bpj commented 2 months ago

Regarding the row header / stub feature (btw, what's the rationale for the "stub" name?),

@waldyrious It is the proper term for it in typography. If you wonder how the term arose, I don't know, but I see no reason for inventing half-baked new terms when one has already existed for at least two centuries.