Closed ulodciv closed 5 years ago
With this output format, we fall back to HTML tables (which after all are valid markdown) when the table can't be represented as a pipe table. Pipe tables don't allow block-level content inside the cells, so if the table cells are not all "Plain" we fall back to HTML.
In this case we could probably get away with treating it as plain, though.
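The fallback rule described above can be sketched as a toy model. This is not pandoc's code (pandoc is written in Haskell); the `Block` class, the function names, and the `allow_single_para` switch are all hypothetical, and the rule that an empty cell counts as simple is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # "Plain" or "Para", mirroring the names used in this thread
    text: str


Cell = List[Block]


def cell_is_simple(cell: Cell, allow_single_para: bool = False) -> bool:
    """Decide whether a cell could be written inside a pipe table.

    Assumption: an empty cell or a single Plain block is simple. With
    allow_single_para=True, a lone Para is also treated as plain.
    """
    if len(cell) == 0:
        return True
    if len(cell) != 1:
        return False
    allowed = {"Plain", "Para"} if allow_single_para else {"Plain"}
    return cell[0].kind in allowed


def choose_table_syntax(rows: List[List[Cell]], allow_single_para: bool = False) -> str:
    """Return "pipe" when every cell is simple, otherwise fall back to "html"."""
    if all(cell_is_simple(cell, allow_single_para) for row in rows for cell in row):
        return "pipe"
    return "html"


plain_table = [[[Block("Plain", "Name")], [Block("Plain", "Link")]]]
para_table = [[[Block("Plain", "Name")], [Block("Para", "Blah blah")]]]

print(choose_table_syntax(plain_table))
print(choose_table_syntax(para_table))
print(choose_table_syntax(para_table, allow_single_para=True))
```

The third call models the relaxation suggested above: a cell holding one paragraph is treated as plain, so the table no longer forces the HTML fallback.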
I am seeing a similar problem. Same input (HTML tables with p tags inside), but the result isn't an HTML table but just the raw text, with no table formatting at all. I'd be fine with it giving me an HTML table, but just the raw text with line breaks isn't happy-making.
My full command is:
pandoc --wrap=none --atx-headers -t markdown_strict+ascii_identifiers-auto_identifiers+blank_before_header+blank_before_blockquote+backtick_code_blocks+multiline_tables in.html -o out.md
Am I doing something wrong there? (I also tried the other 2 markdown formats, no change with either.)
@Crell we can't help unless you give a minimal sample input to test with. Trying your command line with the original input in the first comment above gives me an HTML table. Please give your pandoc version number too.
Here's the in and out files, stripped down and anonymized to just the table. (And with .txt extensions to keep GitHub happy.)
The build is technically running in a script that's looping on a bunch of files, but I don't think that's relevant in context. (Non-table content is all behaving correctly so far.)
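For anyone trying to reproduce the looping setup, a minimal sketch of such a batch script might look like the following. The flags and the output format string are copied from the command above; the function names and the directory layout are hypothetical, and it assumes a `pandoc` binary on the PATH that still accepts `--atx-headers`.

```python
import subprocess
from pathlib import Path
from typing import List

# Output format string copied verbatim from the command quoted in this thread.
OUTPUT_FORMAT = (
    "markdown_strict+ascii_identifiers-auto_identifiers"
    "+blank_before_header+blank_before_blockquote"
    "+backtick_code_blocks+multiline_tables"
)


def pandoc_command(src: Path, dst: Path) -> List[str]:
    """Build the argument list for converting one HTML file to Markdown."""
    return [
        "pandoc", "--wrap=none", "--atx-headers",
        "-t", OUTPUT_FORMAT,
        str(src), "-o", str(dst),
    ]


def convert_all(src_dir: Path, dst_dir: Path) -> None:
    """Convert every .html file in src_dir; requires pandoc to be installed."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.glob("*.html")):
        dst = dst_dir / (src.stem + ".md")
        subprocess.run(pandoc_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with the long format string.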
pandoc -t native shows that this isn't being parsed as a Table. Why? For convenience I reproduce it here:
<table id="bkmrk-name-description-lin" style="border-collapse: collapse; width: 100%;" border="1">
<thead><tr>
<td style="width: 19.6296%;">Name</td>
<td style="width: 47.037%;">Description</td>
<td style="width: 33.3333%;">Link</td>
</tr></thead>
<tbody>
<tr>
<td style="width: 19.6296%;">Accounts</td>
<td style="width: 47.037%;">
<p>Blah blah</p>
<p> </p>
Stuff here</td>
<td style="width: 33.3333%;">
<p><a href="https://google.com">google.com</a></p>
<p><a href="https://google.com">google.com</a></p>
<p><a href="https://google.com">google.com</a></p>
</td>
</tr>
<tr>
<td style="width: 19.6296%;">Zendesk</td>
<td style="width: 47.037%;">
<p>Support ticketing system and help desk.</p>
<p> </p>
<p>Customers and Agents access Zendesk using SSO integration with Accounts.</p>
<p> </p>
<p>SSO can be bypassed using a special URL (shown).</p>
<p> </p>
<p>Everyone inside the company can access Zendesk using a shared user account using the <strong>team member</strong> role.</p>
</td>
<td style="width: 33.3333%;">
<p><a href="https://google.com">http://google.com/</a></p>
<p> </p>
<p><a href="https://google.com">https://google.com</a></p>
</td>
</tr>
<tr>
<td style="width: 19.6296%;">Project UI</td>
<td style="width: 47.037%;">
<p> </p>
</td>
<td style="width: 33.3333%;">
<p> </p>
</td>
</tr>
<tr>
<td style="width: 19.6296%;">Slack</td>
<td style="width: 47.037%;">
<p> </p>
</td>
<td style="width: 33.3333%;">
<p> </p>
</td>
</tr>
<tr>
<td style="width: 19.6296%;">Github</td>
<td style="width: 47.037%;">
<p>Repositories for public documentation and project examples.</p>
<p> </p>
<p>Moar text</p>
</td>
<td style="width: 33.3333%;">
<p>https://github.com/</p>
</td>
</tr>
</tbody>
</table>
The way to figure this out is to try to create a more minimal test case.
As a workaround you could try -f html+raw_html, which will pass through the HTML table commands verbatim.
Here's a more minimal case:
<table>
<thead>
<tr>
<td>Name</td>
</tr>
</thead>
<tbody>
<tr>
<td>Accounts</td>
</tr>
</tbody>
</table>
Huh. OK, fiddling around with the minimal case from @jgm and my original source, I've found a pattern. If there's a <thead> block, the whole table is skipped and rendered as raw text. -f html+raw_html doesn't change it, and neither does using markdown instead of markdown_strict.
If I remove the <thead> tags but leave everything else inside them in place, and use markdown as the output format, then the table is parsed and output as a nice markdown pipe table; that's true whether I use pipe_tables or multiline_tables.
If, however, I use markdown_strict (because markdown leaves a ton of extra flotsam in the output that I don't want) then the table markup is passed through to the output as HTML, but with most extra classes and whitespace formatting removed.
So I see a couple of issues:
1) The presence of a <thead> seems to break the parser. (<tbody> is fine.)
2) markdown_strict has no table support, and even trying to enable it with +pipe_tables isn't working. (I would expect it to work with the extension, but this may be known behavior.)
3) In markdown mode, tables get rendered as pipe tables regardless of what extension is selected. On the flip side, multi-line tables parse fine regardless of whether I have +multiline_tables enabled.
For reference, I'm using pandoc 1.19.2.4 as installed by Ubuntu 18.04.
I see the problem. Currently pandoc expects <th> rather than <td> inside a <thead>. This is usually what you have; your table is an exception, which is why this hasn't come up before.
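To check whether a given export is affected, one option is to count the <td> cells that sit inside a <thead>. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and it only detects the pattern without changing the file.

```python
from html.parser import HTMLParser


class TheadCellChecker(HTMLParser):
    """Counts <td> start tags that appear while a <thead> is open."""

    def __init__(self):
        super().__init__()
        self.thead_depth = 0
        self.td_in_thead = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "thead":
            self.thead_depth += 1
        elif tag == "td" and self.thead_depth > 0:
            self.td_in_thead += 1

    def handle_endtag(self, tag):
        if tag == "thead" and self.thead_depth > 0:
            self.thead_depth -= 1


def count_td_in_thead(html: str) -> int:
    """Return how many <td> cells occur inside <thead> sections."""
    checker = TheadCellChecker()
    checker.feed(html)
    checker.close()
    return checker.td_in_thead


# The minimal case from this thread: a <td> header cell inside <thead>.
sample = (
    "<table><thead><tr><td>Name</td></tr></thead>"
    "<tbody><tr><td>Accounts</td></tr></tbody></table>"
)
print(count_td_in_thead(sample))
```

A non-zero count means the table uses <td> where the reader expects <th>.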
Re 2) it would work if the HTML parser were parsing this as a table. That's the problem here.
Re 3) These things may be cleaned up in the latest version, 2.4. You're using quite an old version, but if you can reproduce issues with the latest version (or the online converter at pandoc.org/try), feel free to report them.
I fixed the HTML reader issue. There remains the original, Markdown writer issue: allowing pipe tables when the cells have Para instead of Plain.
I managed to reproduce the thead/td problem with the online converter (thanks, I didn't know that existed). I'll file a new bug issue for that specifically as I assume that's using the current version. Thanks!
(That also gives me a workaround for now; the source files I have are exports from a CMS, but I can hand-fix the td/th question for now so I'm not blocked.)
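The hand-fix could also be scripted as a preprocessing step before running pandoc. The following is a rough sketch, not a general HTML rewriter: it uses regular expressions, the function name is hypothetical, and it assumes each <thead> section is well-formed and does not contain a nested table.

```python
import re

# Matches one <thead>...</thead> section (non-greedy, spanning newlines).
THEAD_RE = re.compile(r"<thead\b[^>]*>.*?</thead>", re.IGNORECASE | re.DOTALL)
TD_OPEN_RE = re.compile(r"<td\b", re.IGNORECASE)
TD_CLOSE_RE = re.compile(r"</td\s*>", re.IGNORECASE)


def promote_header_cells(html: str) -> str:
    """Rewrite <td> to <th> inside <thead> sections, leaving <tbody> untouched."""

    def fix(match):
        section = TD_OPEN_RE.sub("<th", match.group(0))
        return TD_CLOSE_RE.sub("</th>", section)

    return THEAD_RE.sub(fix, html)


before = (
    '<table><thead><tr><td style="width: 19.6296%;">Name</td></tr></thead>'
    "<tbody><tr><td>Accounts</td></tr></tbody></table>"
)
print(promote_header_cells(before))
```

Attributes on the header cells are preserved because only the tag name is replaced.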
Re "There remains the original, Markdown writer issue: allowing pipe tables when the cells have Para instead of Plain": I think this was fixed by PR #5524 in the meantime; at least the problem from the OP is no longer reproducible.
Something seems to go wrong with tables with markdown_strict+pipe_tables:
Without the <p> tag, the problem goes away: