jgm / pandoc

Universal markup converter
https://pandoc.org

HTML tables not parsed correctly with thead/td #5077

Closed by Crell 5 years ago

Crell commented 5 years ago

As discovered in #5014, parsing of HTML tables appears to be broken when the table header uses an uncommon format.

Specifically, this input:

<table>
    <thead>
        <tr><td>Name</td></tr>
    </thead>
    <tbody>
    <tr>
        <td>Accounts</td>
    </tr>
    </tbody>
</table>

Note that the <thead> block contains <td> cells rather than <th> cells. While this is uncommon, it is valid HTML. However, the parser does not appear to read it as a valid table in this case. When I try to convert it to markdown (any variant) using the online converter (http://pandoc.org/try/), I get the raw text of the cells with no table formatting, only line breaks:

Name

Accounts

If I change the <td> tags to <th>, or remove the <thead>, it parses properly either way and the output is the expected markdown table. (Although of course in the latter case the resulting table has no header, as expected.)
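To make the expected behavior concrete, here is a minimal sketch in Python, using only the standard library's html.parser. It is not pandoc's implementation (pandoc's HTML reader is written in Haskell), and the names TableParser, html_table_to_markdown, and SAMPLE are hypothetical. It only illustrates the rule being asked for: a row inside <thead> is a header row whether its cells are <td> or <th>.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows, remembering which rows sit inside <thead>.

    Illustrative only: handles a single, non-nested table.
    """

    def __init__(self):
        super().__init__()
        self.header_rows = []        # rows found inside <thead>
        self.body_rows = []          # every other row
        self._in_thead = False
        self._row = None             # cells of the row being read, or None
        self._row_is_header = False
        self._cell = None            # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr":
            self._row = []
            # The enclosing <thead> decides header status, not the cell tag.
            self._row_is_header = self._in_thead
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            target = self.header_rows if self._row_is_header else self.body_rows
            target.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_markdown(html):
    """Render the first table in `html` as a markdown pipe table."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    rows = parser.header_rows + parser.body_rows
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def line(cells):
        padded = cells + [""] * (width - len(cells))
        return "| " + " | ".join(padded) + " |"

    # Pipe tables need a header line; use blank cells when there is no <thead>.
    header = parser.header_rows[0] if parser.header_rows else []
    # Any additional <thead> rows are emitted as ordinary rows in this sketch.
    rest = parser.header_rows[1:] + parser.body_rows
    return "\n".join([line(header), line(["---"] * width)] + [line(r) for r in rest])


SAMPLE = """
<table>
    <thead>
        <tr><td>Name</td></tr>
    </thead>
    <tbody>
    <tr>
        <td>Accounts</td>
    </tr>
    </tbody>
</table>
"""

print(html_table_to_markdown(SAMPLE))
# | Name |
# | --- |
# | Accounts |
```

In this sketch, swapping the <td> in the header for <th> gives the same output, which matches the behavior described above for the <th> case.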

I've confirmed this as far back as 1.19 (what Ubuntu 18.04 ships) and in the online converter, which I presume runs the latest stable release.

jgm commented 5 years ago

This was already fixed in commit 1cfdd3662f667f8119e441b60ba8d718b75f90ca. Note that Ubuntu's stable is often quite far behind the latest pandoc release. You might consider using the deb packages we provide.

Crell commented 5 years ago

You're quick! :smile: Thanks.