jgm / pandoc

Universal markup converter
https://pandoc.org

Table headers lost when converting from JATS to HTML #8868

Closed coryschires closed 1 year ago

coryschires commented 1 year ago

Problem

When converting from JATS to HTML, Pandoc drops table headers if the table includes more than one header row. More specifically, Pandoc seems to retain only the first header row.

Given the following (abridged) JATS XML:

<table-wrap>
  <table>
    <thead>
      <tr>
        <th>Numbers 1</th>
        <th>Numbers 2</th>
        <th>Colors 1</th>
        <th>Colors 2</th>
      </tr>
      <tr>
        <th>Positive</th>
        <th>Negative</th>
        <th>Light</th>
        <th>Dark</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>7</td>
        <td>-4</td>
        <td>Pink</td>
        <td>Brown</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

The following command:

pandoc table_with_multiple_header_rows.xml -f jats -t html -o table_with_multiple_header_rows.html

Will produce the following HTML:

<div class="table-wrap">
  <table>
    <thead>
      <tr class="header">
        <th>Numbers 1</th>
        <th>Numbers 2</th>
        <th>Colors 1</th>
        <th>Colors 2</th>
      </tr>
    </thead>
    <tbody>
      <tr class="odd">
        <td>7</td>
        <td>-4</td>
        <td>Pink</td>
        <td>Brown</td>
      </tr>
    </tbody>
  </table>
</div>

Solution / Expected Behavior

Because HTML supports multiple rows of headers, I would expect all table headers to be retained when converting from JATS to HTML:

<div class="table-wrap">
  <table>
    <thead>
      <tr class="header">
        <th>Numbers 1</th>
        <th>Numbers 2</th>
        <th>Colors 1</th>
        <th>Colors 2</th>
      </tr>
      <tr>
        <th>Positive</th>
        <th>Negative</th>
        <th>Light</th>
        <th>Dark</th>
      </tr>
    </thead>
    <tbody>
      <tr class="odd">
        <td>7</td>
        <td>-4</td>
        <td>Pink</td>
        <td>Brown</td>
      </tr>
    </tbody>
  </table>
</div>
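
One way to check a conversion result against this expectation is to count the rows inside the thead of the generated HTML. The following is only a minimal sketch using Python's standard html.parser; the names HeaderRowCounter and count_header_rows are made up for illustration, and the two sample strings are abridged versions of the actual and expected output shown above.

```python
from html.parser import HTMLParser


class HeaderRowCounter(HTMLParser):
    """Counts <tr> elements that appear inside a <thead> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_thead = False
        self.header_rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self.in_thead = True
        elif tag == "tr" and self.in_thead:
            self.header_rows += 1

    def handle_endtag(self, tag):
        if tag == "thead":
            self.in_thead = False


def count_header_rows(html):
    """Return the number of header rows found in an HTML string."""
    parser = HeaderRowCounter()
    parser.feed(html)
    parser.close()
    return parser.header_rows


# Abridged from the actual output above: only one header row survives.
ACTUAL = """
<table><thead>
<tr class="header"><th>Numbers 1</th><th>Numbers 2</th></tr>
</thead><tbody>
<tr class="odd"><td>7</td><td>-4</td></tr>
</tbody></table>
"""

# Abridged from the expected output above: both header rows are kept.
EXPECTED = """
<table><thead>
<tr class="header"><th>Numbers 1</th><th>Numbers 2</th></tr>
<tr><th>Positive</th><th>Negative</th></tr>
</thead><tbody>
<tr class="odd"><td>7</td><td>-4</td></tr>
</tbody></table>
"""

print(count_header_rows(ACTUAL), count_header_rows(EXPECTED))
```

To check a real conversion, the same function could be given the contents of table_with_multiple_header_rows.html instead of the sample strings.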

Steps to recreate

Pandoc Version

pandoc 3.1
Features: +server +lua
Scripting engine: Lua 5.4

Sample JATS XML files

table_with_multiple_header_rows.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article article-type="research-article" dtd-version="3.0" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <table-wrap>
      <table>
        <thead>
          <tr>
            <th>Numbers 1</th>
            <th>Numbers 2</th>
            <th>Colors 1</th>
            <th>Colors 2</th>
          </tr>
          <tr>
            <th>Positive</th>
            <th>Negative</th>
            <th>Light</th>
            <th>Dark</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>7</td>
            <td>-4</td>
            <td>Pink</td>
            <td>Brown</td>
          </tr>
        </tbody>
      </table>
    </table-wrap>
  </body>
</article>

Commands to reproduce

pandoc table_with_multiple_header_rows.xml -f jats -t html -o table_with_multiple_header_rows.html

Thanks for your help! I know @noahmalmed has recently contributed some table fixes (with help from @tarleb). So, if it's not too complicated, I bet we could file a pull request to fix this issue as well.

Any tips to point us in the right direction would be greatly appreciated!

noahmalmed commented 1 year ago

Ah, I see the problem.

https://github.com/jgm/pandoc/blob/4a6950200ff148b44b38c20f015100bb4d7e5033/src/Text/Pandoc/Readers/JATS.hs#L274-L275

Here, the code parses only one row because it uses filterChild. It should do something similar to what we do when parsing the body (https://github.com/jgm/pandoc/blob/4a6950200ff148b44b38c20f015100bb4d7e5033/src/Text/Pandoc/Readers/JATS.hs#L278 ). That is, we should use filterChildren to grab all of the rows that may exist in the thead.
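
To illustrate the difference between a first-match lookup and an all-matches lookup, here is a rough analogy in Python rather than the reader's actual Haskell. It assumes that ElementTree's find and findall behave roughly like filterChild and filterChildren for this purpose; the function names header_rows_first_only and header_rows_all are made up for the sketch and do not exist in Pandoc.

```python
import xml.etree.ElementTree as ET

# Abridged version of the table from the report (two header rows).
JATS_TABLE = """
<table>
  <thead>
    <tr><th>Numbers 1</th><th>Numbers 2</th></tr>
    <tr><th>Positive</th><th>Negative</th></tr>
  </thead>
  <tbody>
    <tr><td>7</td><td>-4</td></tr>
  </tbody>
</table>
""".strip()


def row_cells(tr):
    """Return the text of every <th>/<td> cell in a row."""
    return [cell.text for cell in tr if cell.tag in ("th", "td")]


def header_rows_first_only(table):
    """First-match lookup (analogous to filterChild): at most one header row."""
    thead = table.find("thead")
    if thead is None:
        return []
    tr = thead.find("tr")
    return [] if tr is None else [row_cells(tr)]


def header_rows_all(table):
    """All-matches lookup (analogous to filterChildren): every header row."""
    thead = table.find("thead")
    if thead is None:
        return []
    return [row_cells(tr) for tr in thead.findall("tr")]


table = ET.fromstring(JATS_TABLE)
print(header_rows_first_only(table))  # the second header row is dropped
print(header_rows_all(table))         # both header rows are kept
```

The same first-match versus all-matches distinction would apply to any other table section that is allowed to contain several rows.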

noahmalmed commented 1 year ago

According to the JATS spec, it is valid for the footer to also have multiple rows. Because the footer code is pretty similar, we should probably account for multiple rows there too.

noahmalmed commented 1 year ago

These changes ended up being pretty small and I added them to a PR I currently have open: https://github.com/jgm/pandoc/pull/8795/files

jgm commented 1 year ago

Looks like this was merged, so this issue can be closed, right?

noahmalmed commented 1 year ago

@jgm Yep 👍