jgm / pandoc

Universal markup converter
https://pandoc.org

HTML to MD table conversion issue with markdown_strict+pipe_tables #5014

Closed ulodciv closed 5 years ago

ulodciv commented 5 years ago

Something seems to go wrong with tables with markdown_strict+pipe_tables:

echo '<table><tr><td><p>x</p></td></tr></table>' | pandoc --from html --to 'markdown_strict+pipe_tables'
<table>
<colgroup>
<col width="100%" />
</colgroup>
<tbody>
<tr class="odd">
<td><p>x</p></td>
</tr>
</tbody>
</table>

Without the <p> tag, the problem goes away:

echo '<table><tr><td>x</td></tr></table>' | pandoc --from html --to 'markdown_strict+pipe_tables'
|     |
|-----|
| x   |

jgm commented 5 years ago

With this output format, we fall back to HTML tables (which after all are valid markdown) when the table can't be represented as a pipe table. Pipe tables don't allow block-level content inside the cells, so if the table cells are not all "Plain" we fall back to HTML.

In this case we could probably get away with treating it as plain, though.
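To make the fallback rule concrete, here is a minimal Python sketch of the decision described above. It is only a toy model, not pandoc's actual (Haskell) writer code: the `Block` class, the function names, and the `allow_single_para` switch are all hypothetical, and treating an empty cell as acceptable is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Toy stand-in for a pandoc block element (hypothetical, not pandoc's type)."""
    kind: str       # e.g. "Plain", "Para", "BulletList"
    text: str = ""


Cell = List[Block]


def cell_fits_pipe_table(cell: Cell, allow_single_para: bool = False) -> bool:
    """Return True if the cell holds no block-level content.

    Assumption: an empty cell is fine. A cell with one Plain block is fine.
    With allow_single_para=True, a cell holding exactly one Para is also
    accepted, which models the "treat it as plain" relaxation.
    """
    if not cell:
        return True
    if len(cell) != 1:
        return False
    kind = cell[0].kind
    return kind == "Plain" or (allow_single_para and kind == "Para")


def choose_table_syntax(rows: List[List[Cell]], allow_single_para: bool = False) -> str:
    """Pick "pipe" only if every cell fits; otherwise fall back to "html"."""
    fits = all(
        cell_fits_pipe_table(cell, allow_single_para)
        for row in rows
        for cell in row
    )
    return "pipe" if fits else "html"


# <td>x</td> parses to a Plain cell; <td><p>x</p></td> parses to a Para cell.
plain_table = [[[Block("Plain", "x")]]]
para_table = [[[Block("Para", "x")]]]

print(choose_table_syntax(plain_table))                          # pipe
print(choose_table_syntax(para_table))                           # html
print(choose_table_syntax(para_table, allow_single_para=True))   # pipe
```

Under this model, the table from the first comment falls back to HTML only because its single cell is a Para; relaxing the check for a lone Para would let it render as a pipe table.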

Crell commented 5 years ago

I am seeing a similar problem. Same input (HTML tables with p tags inside), but the result isn't an HTML table but just the raw text, with no table formatting at all. I'd be fine with it giving me an HTML table, but just the raw text with line breaks isn't happy-making.

My full command is:

pandoc --wrap=none --atx-headers -t markdown_strict+ascii_identifiers-auto_identifiers+blank_before_header+blank_before_blockquote+backtick_code_blocks+multiline_tables in.html -o out.md

Am I doing something wrong there? (I also tried the other 2 markdown formats, no change with either.)

jgm commented 5 years ago

@Crell we can't help unless you give a minimal sample input to test with. Trying your command line with the original input in the first comment above gives me an HTML table. Please give your pandoc version number too.

Crell commented 5 years ago

Here's the in and out files, stripped down and anonymized to just the table. (And with .txt extensions to keep GitHub happy.)

in.txt

out.txt

The build is technically running in a script that's looping on a bunch of files, but I don't think that's relevant in context. (Non-table content is all behaving correctly so far.)

jgm commented 5 years ago

pandoc -t native shows that this isn't being parsed as a Table. Why? For convenience I reproduce it here:

<table id="bkmrk-name-description-lin" style="border-collapse: collapse; width: 100%;" border="1">
    <thead><tr>
        <td style="width: 19.6296%;">Name</td>
        <td style="width: 47.037%;">Description</td>
        <td style="width: 33.3333%;">Link</td>
    </tr></thead>
    <tbody>
    <tr>
        <td style="width: 19.6296%;">Accounts</td>
        <td style="width: 47.037%;">
            <p>Blah blah</p>
            <p> </p>
            Stuff here</td>
        <td style="width: 33.3333%;">
            <p><a href="https://google.com">google.com</a></p>
            <p><a href="https://google.com">google.com</a></p>
            <p><a href="https://google.com">google.com</a></p>
        </td>
    </tr>
    <tr>
        <td style="width: 19.6296%;">Zendesk</td>
        <td style="width: 47.037%;">
            <p>Support ticketing system and help desk.</p>
            <p> </p>
            <p>Customers and Agents access Zendesk using SSO integration with Accounts.</p>
            <p> </p>
            <p>SSO can be bypassed using a special URL (shown).</p>
            <p> </p>
            <p>Everyone inside the company can access Zendesk using a shared user account using the <strong>team member</strong> role.</p>
        </td>
        <td style="width: 33.3333%;">
            <p><a href="https://google.com">http://google.com/</a></p>
            <p> </p>
            <p><a href="https://google.com">https://google.com</a></p>
        </td>
    </tr>
    <tr>
        <td style="width: 19.6296%;">Project UI</td>
        <td style="width: 47.037%;">
            <p> </p>
        </td>
        <td style="width: 33.3333%;">
            <p> </p>
        </td>
    </tr>
    <tr>
        <td style="width: 19.6296%;">Slack</td>
        <td style="width: 47.037%;">
            <p> </p>
        </td>
        <td style="width: 33.3333%;">
            <p> </p>
        </td>
    </tr>
    <tr>
        <td style="width: 19.6296%;">Github</td>
        <td style="width: 47.037%;">
            <p>Repositories for public documentation and project examples.</p>
            <p> </p>
            <p>Moar text</p>
        </td>
        <td style="width: 33.3333%;">
            <p>https://github.com/</p>
        </td>
    </tr>
    </tbody>
</table>

The way to figure this out is to try to create a more minimal test case.

As a workaround you could try -f html+raw_html, which will pass through the HTML table commands verbatim.

jgm commented 5 years ago

Here's a more minimal case:

<table>
  <thead>
    <tr>
        <td>Name</td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td>Accounts</td>
    </tr>
    </tbody>
</table>

Crell commented 5 years ago

Huh. OK, fiddling around with the minimal case from @jgm and my original source, I've found a pattern. If there's a <thead> block, the whole table is skipped and rendered as raw text. -f html+raw_html doesn't change it, using markdown instead of markdown_strict doesn't change it.

If I remove the <thead> tags but leave everything else inside it in place, and use markdown as the output format, then the table is parsed and outputted as a nice markdown pipe table; that's true whether I use pipe_tables or multiline_tables.

If, however, I use markdown_strict (because markdown leaves a ton of extra flotsam in the output that I don't want) then the table markup is passed through to the output as HTML, but with most extra classes and whitespace formatting removed.

So I see a couple of issues:

1) The presence of a <thead> seems to break the parser. (<tbody> is fine.)
2) markdown_strict has no table support, and even trying to enable it with +pipe_tables isn't working. (I would expect it to work with the extension, but this may be known behavior.)
3) In markdown mode, tables get rendered as pipe tables regardless of what extension is selected. On the flipside, multi-line tables parse fine regardless of whether I have +multiline_tables enabled.

For reference, I'm using pandoc 1.19.2.4 as installed by Ubuntu 18.04.

jgm commented 5 years ago

I see the problem. Currently pandoc expects <th> rather than <td> inside a <thead>. This is usually what you have; your table is an exception, which is why this hasn't come up before.

jgm commented 5 years ago

Re 2) it would work if the HTML parser were parsing this as a table. That's the problem here.

Re 3) These things may be cleaned up in the latest version, 2.4. You're using quite an old version, but if you can reproduce issues with the latest version (or the online converter at pandoc.org/try), feel free to report them.

jgm commented 5 years ago

I fixed the HTML reader issue. There remains the original, Markdown writer issue: allowing pipe tables when the cells have Para instead of Plain.

Crell commented 5 years ago

I managed to reproduce the thead/td problem with the online converter (thanks, I didn't know that existed). I'll file a new bug issue for that specifically as I assume that's using the current version. Thanks!

(That also gives me a workaround for now; the source files I have are exports from a CMS, but I can hand-fix the td/th question for now so I'm not blocked.)
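For anyone stuck on a pandoc version that predates the reader fix, the hand-fix could be scripted. The following is a rough Python sketch that rewrites <td> cells inside <thead> to <th> before the file is passed to pandoc. The function name `promote_thead_cells` is made up for illustration, and the regex approach assumes reasonably regular CMS output (no nested tables inside a <thead>); a real HTML parser would be more robust.

```python
import re

# Match each <thead>...</thead> section (non-greedy, across newlines).
THEAD_RE = re.compile(r"(<thead\b[^>]*>)(.*?)(</thead\s*>)", re.IGNORECASE | re.DOTALL)
TD_OPEN_RE = re.compile(r"<td\b", re.IGNORECASE)
TD_CLOSE_RE = re.compile(r"</td\s*>", re.IGNORECASE)


def promote_thead_cells(html: str) -> str:
    """Turn <td> into <th> inside every <thead>, leaving <tbody> cells alone.

    Attributes on the cells (style, class, ...) are preserved because only
    the tag name is rewritten.
    """
    def fix(match: re.Match) -> str:
        inner = TD_OPEN_RE.sub("<th", match.group(2))
        inner = TD_CLOSE_RE.sub("</th>", inner)
        return match.group(1) + inner + match.group(3)

    return THEAD_RE.sub(fix, html)


sample = (
    '<table><thead><tr><td style="width: 19.6296%;">Name</td></tr></thead>'
    "<tbody><tr><td>Accounts</td></tr></tbody></table>"
)
print(promote_thead_cells(sample))
```

The script only touches header cells, so body cells that still contain <p> elements are left exactly as exported.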

tbussmann commented 5 years ago

> There remains the original, Markdown writer issue: allowing pipe tables when the cells have Para instead of Plain.

I think this was fixed by PR #5524 in the meantime; at least, the problem from the OP is no longer reproducible.