Open sentientmachine opened 5 years ago
This must be the infamous Open Source Entitlement hitting us finally. Thanks for reporting, we might eventually look into the issue, despite your impolite manners.
Ha, sorry for being rude. But my beard length going down the hall entitles me to Level 4 open source entitlements when the wind blows from the east on Tuesdays.
Workaround 1 helps isolate the input bug:
In the .docx document that produced the error above (the one that should have yielded ExampleFail.tex), do a Select-all, Copy, and paste into a new file, Untitled2.docx.
cd /home/el/bin/docx2tex; ./d2t Untitled2.docx
The .tex output is successfully produced. A libreoffice select-all, copy and paste performs some kind of normalization on the faulty .docx nested table object without destroying the variation in its rows and columns.
OpenOffice or LibreOffice might create OOXML (docx) structures in a legal yet unexpected way. The tool should (in the sense of: “we should make it so”, not in the sense of: “it should already be Ok”) convert tables saved by recent versions of LibreOffice correctly provided they are valid OOXML, so I think we will fix this soon.
I've reproduced the error closer to the source. This screenshot tells the story:
The conversion of "CALS tables" to LaTeX tables fails because it doesn't handle variation in the number of columns or rows.
The conversion error is asserted here: https://github.com/transpect/xslt-util/blob/74bb4f7d3c15b8649a71dfc55dae085ab6dfd38e/calstable/xsl/normalize.xsl
So now I can create an SSCCE using microsoft word, linux libreoffice and docx2tex thustly:
whatever.docx
Workaround 2:
docx2tex can't handle Microsoft Word tables with an inconsistent number of columns and rows. If you must use them, a cleansing operation is to copy and paste those tables using libreoffice -writer into a fresh libreoffice document with docx format. Then all is well.
This .docx is a minimum possible document to illuminate the problem, it's just an empty word document with a table containing inconsistent number of rows: http://www.filedropper.com/ssccefordocx2tex
Microsoft Word has an option to join cells of a table horizontally on a row-by-row basis, whereas libreoffice doesn't seem to allow me to do so. However, I can copy and paste such tables and the distinctions aren't destroyed; the copy/paste cleanses them. So maybe you can program in an auto-cleanse XSL.
Thanks for the repro. I don’t think it’s related to merged cells per se. It occurs when there are merged cells within nested tables. Investigating…
The error doesn’t occur if I revert to https://github.com/transpect/xslt-util/commit/271dd78f0aabed9e5d3b877bab2f80b3b314ebd2. So there seems to be a regression. It is caused by another fix that improved other aspects of CALS table normalization and that is not covered by any test yet, apparently. With the old version, the LaTeX code that was generated didn’t compile though. This seems to be related to the table nesting, too, but at a later stage. I will try to fix the calstable bug, and if the problem with the generated LaTeX code then persists, @mkraetke needs to look into this. We will not commit to a time frame for a fix.
I was able to resolve the first error (not pushed the commit yet). However, there are more fundamental reasons why both your sample files don’t compile.
The default mode of operation for docx2tex is to resolve embedded tables, that is, to add more columns and rows to the containing table so that the embedded table becomes part of the containing table. The outer table’s rows and columns will turn into merged cells. But this only works if the embedded table occupies a full cell of the containing table, with no paragraphs and/or other embedded tables in the same cell. sscce_for_docx2tex.docx cannot be processed because it violates this condition. We’ll probably add a message to the log file that explains this restriction. It’s unlikely that we will be able to fix this.
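The full-cell condition described above can be checked before running the converter. The following is only a rough sketch, not part of docx2tex: the function names are hypothetical, and it assumes that empty paragraphs can be ignored, since Word typically leaves an empty paragraph after a nested table in a cell. Whether docx2tex treats such empty paragraphs the same way is an assumption.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _has_text(paragraph):
    """True if the paragraph contains any non-whitespace text run."""
    return any((t.text or "").strip() for t in paragraph.iter(f"{{{W}}}t"))


def cells_breaking_full_cell_rule(document_xml):
    """Hypothetical pre-flight check.

    Returns one (nested_table_count, nonempty_paragraph_count) tuple for
    every cell that holds a nested table together with other content.
    Assumption: empty paragraphs do not count as content.
    """
    root = ET.fromstring(document_xml)
    offenders = []
    for cell in root.iter(f"{{{W}}}tc"):
        nested = cell.findall("w:tbl", NS)  # direct children only
        if not nested:
            continue
        paragraphs = [p for p in cell.findall("w:p", NS) if _has_text(p)]
        if len(nested) > 1 or paragraphs:
            offenders.append((len(nested), len(paragraphs)))
    return offenders


def check_docx(path):
    """Run the check on the main document part of a .docx file."""
    with zipfile.ZipFile(path) as archive:
        return cells_breaking_full_cell_rule(archive.read("word/document.xml"))
```

Calling `check_docx("sscce_for_docx2tex.docx")` would list the cells to clean up by hand; an empty list means no cell mixes a nested table with text or with a second nested table.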
The alternative to resolving embedded tables is to keep them nested (there’s an option for this that is currently not exposed in w2t). But the LaTeX code that we currently generate for this case creates extraneous \begin{table}/\end{table} around the embedded tabularx environments. I think it should just put them in curly braces instead if they are embedded. This is a thing that I will eventually look into with @mkraetke.
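Until the generator changes, the curly-brace idea can be tried as a post-processing step on the generated .tex file. This is a minimal sketch under stated assumptions, not docx2tex behaviour: `brace_nested_tables` is a hypothetical helper, it only looks at `table` and `tabularx` environments, and it does not handle a `\caption` left inside a replaced wrapper. Whether the result compiles depends on the rest of the document.

```python
import re

# Matches \begin{table} (with optional placement such as [htbp]),
# \end{table}, and \begin{tabularx} / \end{tabularx}.
TOKEN = re.compile(
    r"\\begin\{table\}(?:\[[^\]]*\])?"
    r"|\\end\{table\}"
    r"|\\(?:begin|end)\{tabularx\}"
)


def brace_nested_tables(tex):
    """Replace table wrappers that sit inside a tabularx with { ... }.

    Top-level table environments are left untouched.
    """
    out = []
    pos = 0
    depth = 0          # how many tabularx environments are currently open
    replaced = []      # one flag per open table: was it turned into "{"?
    for match in TOKEN.finditer(tex):
        out.append(tex[pos:match.start()])
        pos = match.end()
        token = match.group(0)
        if token == r"\begin{tabularx}":
            depth += 1
            out.append(token)
        elif token == r"\end{tabularx}":
            depth -= 1
            out.append(token)
        elif token.startswith(r"\begin{table}"):
            nested = depth > 0
            replaced.append(nested)
            out.append("{" if nested else token)
        else:  # \end{table}
            nested = replaced.pop() if replaced else False
            out.append("}" if nested else token)
    out.append(tex[pos:])
    return "".join(out)
```

Since the intention was to hand-tweak the .tex output anyway, running this once over the file and then compiling is a cheap way to see whether the braces are enough.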
The other sample file, ExampleFail.docx, failed for other reasons. One is that definition list environments don’t seem to be supported yet in cells. I think they need to be wrapped in a \parbox. There were other errors related to generated \FontAwesome and \privateuse macros. Again, @mkraetke and I might occasionally look into these things.
Since these errors don’t affect our daily production lines (that produce hundreds of thousands of pages per year), we are unlikely to look at them with high priority. However, we are constantly trying to improve the tool, and your examples are certainly helpful since they are demanding in terms of table nesting requirements, despite their small size.
I tried also the workaround, pasting the document into an empty LibreOffice document. But the embedded tables came out the same way. I’m using LO 6.0.0.3, maybe LO 5 flattened the tables while LO 6 keeps them nested. So this workaround is not working for me.
Let me stress again that colspans and rowspans are not the issue. They are supported in principle. The main problem is nested tables, but other issues have also shown up that are related to special characters and definition lists.
Thanks for the quick turnaround, that sounds right. The above workarounds handled my cases and I can tweak the input .docx to remove the bad table. Maybe a better error message would help future users realize the limitation quicker, without having to trial and error input files. Maybe even a flag to aggressively string-join-flatten horizontal and string-join-flatten-vertically the offending table.
I'd prefer a best-attempt result .tex file even if the nested subtable was not exactly represented, because my intention was to tweak the .tex file as needed to clean it up anyway.
Looks like the escape from Microsoft Island is not as easy as it sounds. No big surprise. :+1:
Bug Report:
My OS: Linux Gentoo Base System release 2.24.1.12 64 bit PC desktop
Java: 1.8.0_66
Shell: bash 4.3.42 (x86_64-pc-linux-gnu)
Install: cd /home/el/bin; git clone https://github.com/transpect/docx2tex --recursive
The input docx has a few unicode shenanigans, but nothing too out of band: http://www.filedropper.com/examplefail
Run your code: cd /home/el/bin/docx2tex; ./d2t ExampleFail.docx
Failure .log file: http://www.filedropper.com/examplefaild2t
What I expected: I expected some kind of output file ExampleFail.tex containing latex code.
Quarantining the bug, proving the bug isn't on my side:
Use libreoffice version 5.2.3.3 -writer to create a new empty .docx document containing the ascii text asdf.
Save the above file as Untitled.docx using the Microsoft Word 2007-2013 XML (.docx) format.
Openoffice -writer produces this Untitled.docx: http://www.filedropper.com/untitled_22
Run the code: cd /home/el/bin/docx2tex; ./d2t Untitled.docx
docx2tex works as expected, the contents of Untitled.tex render by pdflatex to a similar looking pdf.
The problem is in the table layouts.