jgm / pandoc

Universal markup converter
https://pandoc.org

Handling of nested tables from HTML to LaTeX #3586

Open GiantCrocodile opened 7 years ago

GiantCrocodile commented 7 years ago

When I use

pandoc -f html -t latex --output test.pdf "<absolute path>\Hausschwein - Wikipedia.htm"

to parse this downloaded Wikipedia article (https://de.wikipedia.org/wiki/Hausschwein), I get this error:

! Argument of \LT@nofcols has an extra }.
<inserted text>
                \par
l.147 \begin{longtable}[]{@{}ll@{}}

pandoc: Error producing PDF
pandoc 1.19.2.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4

I read that pandoc can be used for tasks like this, so I expected it to work, or at least not to fail with an error. If this is not a supported use case, I'm sorry; I'm new to pandoc.

GiantCrocodile commented 7 years ago

Converting from HTML to ODT does not work completely either:

[pandoc warning] Could not determine image size in `data:image/png,%89PNG%0D%0A%
1A%0A%00%00%00%0DIHDR%00%00%00%01%00%00%00%01%08%02%00%00%00%90wS%DE%00%00%00%01
sRGB%00%AE%CE%1C%E9%00%00%00%09pHYs%00%00%0B%13%00%00%0B%13%01%00%9A%9C%18%00%00
%00%07tIME%07%DB%0B%0A%17%041%80%9B%E7%F2%00%00%00%19tEXtComment%00Created%20wit
h%20GIMPW%81%0E%17%00%00%00%0CIDAT%08%D7c%60%60%60%00%00%00%04%00%01%274%27%0A%0
0%00%00%00IEND%AEB%60%82': could not determine image type

As a result, converting the ODT to PDF fails too, because of a libpng error.

mb21 commented 7 years ago

Can you provide the smallest amount of HTML in that file that triggers the error?

The [pandoc warning] Could not determine image size is only a warning...
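
As a side check, one can test whether the image embedded in that data: URI is itself a valid PNG. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and it says nothing about how pandoc handles such URIs internally. It percent-decodes the payload and compares the first bytes with the 8-byte PNG signature.

```python
from urllib.parse import unquote_to_bytes

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def data_uri_payload(uri: str) -> bytes:
    """Return the raw bytes of a percent-encoded (non-base64) data: URI."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:"):
        raise ValueError("not a data: URI")
    if header.endswith(";base64"):
        raise ValueError("base64 payloads are not handled in this sketch")
    return unquote_to_bytes(payload)


def looks_like_png(data: bytes) -> bool:
    """True if the bytes begin with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


# Only the first bytes of the URI from the warning, shortened for illustration.
uri_prefix = "data:image/png,%89PNG%0D%0A%1A%0A%00%00%00%0DIHDR"
print(looks_like_png(data_uri_payload(uri_prefix)))
```

If the decoded bytes carry the PNG signature, the image data in the HTML is probably fine and the question becomes how the payload is decoded further down the pipeline.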

GiantCrocodile commented 7 years ago

@mb21 How do I find out which line triggers the error? As far as I can tell, the output gives no information about where in the input it fails. It could be related to this section: https://de.wikipedia.org/wiki/Hausschwein#Anzahl_der_gehaltenen_Schweine because the error mentions a table.

mb21 commented 7 years ago

Unfortunately, the only way is trial and error: first check whether the problem is in the first or the second half of the file, then in which quarter, and so on. Alternatively, output to .tex and inspect that file. You can also check whether it's already fixed in the latest nightly builds.
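
The halving procedure can be automated. Below is a rough sketch, assuming the HTML has already been split into top-level blocks, that exactly one block is the culprit, and that any slice containing it fails on its own. The names `find_failing_block` and `pandoc_fails` are hypothetical; `pandoc_fails` simply shells out to pandoc and is not exercised in the demo, which uses a stand-in predicate instead.

```python
import os
import subprocess
import tempfile
from typing import Callable, Sequence


def find_failing_block(blocks: Sequence[str],
                       fails: Callable[[Sequence[str]], bool]) -> int:
    """Return the index of the single block that makes `fails` return True.

    Assumes exactly one culprit, and that any slice containing it fails.
    """
    lo, hi = 0, len(blocks)
    if not fails(blocks[lo:hi]):
        raise ValueError("the full input does not fail")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(blocks[lo:mid]):
            hi = mid      # culprit is in the first half
        else:
            lo = mid      # culprit is in the second half
    return lo


def pandoc_fails(chunk: Sequence[str]) -> bool:
    """Hypothetical predicate: True if pandoc cannot turn the chunk into a PDF."""
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            ["pandoc", "-f", "html", "-o", os.path.join(tmp, "out.pdf")],
            input="\n".join(chunk), text=True, capture_output=True)
    return result.returncode != 0


# Demo with a stand-in predicate so the sketch runs without pandoc installed.
blocks = [
    "<p>intro</p>",
    "<p>text</p>",
    "<table><tr><td><table></table></td></tr></table>",
    "<p>end</p>",
]
fake_fails = lambda chunk: any(b.count("<table") > 1 for b in chunk)
print(find_failing_block(blocks, fake_fails))  # prints 2
```

With a real document, `find_failing_block(blocks, pandoc_fails)` would need about log2(n) pandoc runs for n blocks. If several blocks fail independently, the function still returns one of them, and the search can be repeated after removing it.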

GiantCrocodile commented 7 years ago

It seems to be related to this HTML:

<removed afterwards>

After I removed this part I get this error:

!pdfTeX error: pdflatex (file ./tex2pdf.6204/39e9cda2c77eb8e20f79f2ef82d503f466
2ee611.png): libpng: internal error
 ==> Fatal error occurred, no output PDF file produced!
libpng error: Not a PNG file

pandoc: Error producing PDF

mb21 commented 7 years ago

The problem seems to be the nested tables, minimal example:

<table>
<tr>
<td>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
</table>
</td>
</tr>
</table>

jgm commented 7 years ago

Here is the latex that pandoc produces for the above minimal example:

\begin{longtable}[]{@{}l@{}}
\toprule
\begin{minipage}[t]{0.97\columnwidth}\raggedright
\begin{longtable}[]{@{}ll@{}}
\toprule
foo & bar\tabularnewline
\bottomrule
\end{longtable}\strut
\end{minipage}\tabularnewline
\bottomrule
\end{longtable}

This produces the error on the nested \begin{longtable}.

jgm commented 7 years ago

It says here that longtable can't be nested. We could try to detect nested tables and use tabular for those (though this may also require other changes).
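
Until something like that exists in pandoc, nested tables can at least be located in the HTML input before converting. The sketch below is not pandoc's detection and works on the raw HTML rather than on pandoc's document model; the class and function names are made up for illustration. It uses Python's standard `html.parser` to report the position of every `<table>` that opens inside another table.

```python
from html.parser import HTMLParser


class NestedTableFinder(HTMLParser):
    """Collect the positions of <table> tags opened inside another table."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of currently open <table> elements
        self.nested = []   # (line, column) of each nested <table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth > 0:
                # getpos() gives the 1-based line and 0-based column of the tag
                self.nested.append(self.getpos())
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def find_nested_tables(html: str):
    """Return a list of (line, column) positions of nested tables."""
    finder = NestedTableFinder()
    finder.feed(html)
    finder.close()
    return finder.nested


# The minimal example from this thread.
sample = "\n".join([
    "<table>", "<tr>", "<td>",
    "<table>", "<tr>", "<td>foo</td>", "<td>bar</td>", "</tr>", "</table>",
    "</td>", "</tr>", "</table>",
])
print(find_nested_tables(sample))  # prints [(4, 0)]
```

The reported positions point at the tables that would end up as a longtable inside a longtable, so they are the places to flatten or rewrite by hand before running pandoc.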

anatolyborodin commented 6 years ago

@jgm I have an ODT file with a nested table, and the nested table is ignored even in the output of pandoc -t json. Is this a bug, or just the way it works now? Or is my version just too old?

pandoc 1.19.2.4

jgm commented 6 years ago

This could be a limitation of the ODT reader, but I'm not very familiar with that.

dbitouze commented 2 years ago

I have the same problem converting from .docx to .tex with the attached file test.docx. Is there a way to fix this issue?