transpect / docx2tex

Converts Microsoft Word docx to LaTeX
BSD 2-Clause "Simplified" License

xslt-util/calstable/xpl and com.xmlcalabash conversion errors #41

Open sentientmachine opened 5 years ago

sentientmachine commented 5 years ago

Bug Report:

My OS: Linux Gentoo Base System release 2.24.1.12, 64-bit PC desktop
Java: 1.8.0_66
Shell: bash 4.3.42 (x86_64-pc-linux-gnu)

Install: cd /home/el/bin; git clone https://github.com/transpect/docx2tex --recursive

The input docx has a few unicode shenanigans, but nothing too out of band: http://www.filedropper.com/examplefail

Run your code: cd /home/el/bin/docx2tex; ./d2t ExampleFail.docx

Failure .log file: http://www.filedropper.com/examplefaild2t

What I expected: an output file ExampleFail.tex containing LaTeX code.

Quarantining the bug, proving the bug isn't on my side:

  1. Use libreoffice version 5.2.3.3 -writer to create a new empty .docx document containing the ASCII text asdf.

  2. Save the above file as Untitled.docx using format Microsoft Word 2007-2013 XML (.docx) format.

  3. Openoffice -writer produces this Untitled.docx: http://www.filedropper.com/untitled_22

  4. Run the code: cd /home/el/bin/docx2tex; ./d2t Untitled.docx

  5. docx2tex works as expected: pdflatex renders the contents of Untitled.tex to a similar-looking PDF.

The problem is in the table layouts.

gimsieke commented 5 years ago

This must be the infamous Open Source Entitlement hitting us finally. Thanks for reporting, we might eventually look into the issue, despite your impolite manners.

sentientmachine commented 5 years ago

Ha, sorry for being rude. But my beard length going down the hall entitles me to Level 4 open source entitlements when the wind blows from the east on Tuesdays.

Workaround 1 helps isolate the input bug:

  1. Create a new empty Libreoffice .docx document.
  2. Open the ExampleFail.docx that produced the error above, do a Select-all, Copy, and paste into the new document, then save it as Untitled2.docx
  3. Run the code: cd /home/el/bin/docx2tex; ./d2t Untitled2.docx
  4. A .tex output is successfully produced.

A libreoffice select-all, copy and paste seems to perform some kind of normalization on the faulty nested table object in the .docx, without destroying the variation in its rows and columns.

gimsieke commented 5 years ago

OpenOffice or LibreOffice might create OOXML (docx) structures in a legal yet unexpected way. The tool should (in the sense of: “we should make it so”, not in the sense of: “it should already be Ok”) convert tables saved by recent versions of LibreOffice correctly provided they are valid OOXML, so I think we will fix this soon.

sentientmachine commented 5 years ago

I've reproduced the error closer to the source. This screenshot tells the story:

https://ibb.co/vQjwS35

The conversion of "CALS tables" to LaTeX tables fails because it doesn't handle variation in the number of columns or rows.
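To make "variation in the number of columns" concrete, here is a minimal sketch that counts, for each row of a CALS table, how many entries it has and how many columns those entries cover once namest/nameend spans are resolved. The sample markup is a hypothetical 2x2 table with the top row merged, and the function name row_widths is made up for illustration. It is not what normalize.xsl does, and it ignores morerows (row spans) and nested tables.

```python
import xml.etree.ElementTree as ET

# Hypothetical CALS fragment: two columns, first row is one entry spanning both.
CALS = """
<tgroup cols="2">
  <colspec colname="c1"/>
  <colspec colname="c2"/>
  <tbody>
    <row><entry namest="c1" nameend="c2">merged</entry></row>
    <row><entry>a</entry><entry>b</entry></row>
  </tbody>
</tgroup>
"""

def row_widths(tgroup_xml):
    """Return (declared_cols, [(entry_count, covered_columns), ...])."""
    tgroup = ET.fromstring(tgroup_xml)
    # Map each column name to its 1-based position.
    colpos = {c.get("colname"): i
              for i, c in enumerate(tgroup.findall("colspec"), start=1)}
    rows = []
    for row in tgroup.iter("row"):
        entries = row.findall("entry")
        covered = 0
        for entry in entries:
            start, end = entry.get("namest"), entry.get("nameend")
            if start in colpos and end in colpos:
                # A horizontal span covers every column from namest to nameend.
                covered += colpos[end] - colpos[start] + 1
            else:
                covered += 1
        rows.append((len(entries), covered))
    return int(tgroup.get("cols")), rows

print(row_widths(CALS))
```

In this sample the rows differ in entry count (1 versus 2), yet both cover the 2 declared columns once the span is taken into account.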

The conversion error is asserted here: https://github.com/transpect/xslt-util/blob/74bb4f7d3c15b8649a71dfc55dae085ab6dfd38e/calstable/xsl/normalize.xsl

So now I can create an SSCCE using microsoft word, linux libreoffice and docx2tex thustly:

  1. In Windows, make a new empty Microsoft Word document.
  2. Choose table -> insert table, accept default 2x2 table.
  3. In the 2x2 table, join the upper two cells together horizontally.
  4. Save it as whatever.docx
  5. Run the code in linux: cd /home/el/bin/docx2tex; ./d2t whatever.docx
  6. You get the errors as described in the first post.

Workaround 2:

docx2tex can't handle Microsoft Word tables with an inconsistent number of columns and rows. If you must use them, a cleansing operation is to copy and paste those tables using libreoffice -writer into a fresh libreoffice document with docx format. Then all is well.

This .docx is the smallest document I could make that illuminates the problem. It's just an empty Word document with a table containing an inconsistent number of rows: http://www.filedropper.com/ssccefordocx2tex

Microsoft Word has an option to join cells of a table horizontally on a row-by-row basis, whereas libreoffice doesn't seem to allow me to do so. However, I can copy and paste such tables and the distinctions aren't destroyed; the copy/paste cleanses them. So maybe you can program in an auto-cleanse xsl.

gimsieke commented 5 years ago

Thanks for the repro. I don’t think it’s related to merged cells per se. It occurs when there are merged cells within nested tables. Investigating…

gimsieke commented 5 years ago

The error doesn’t occur if I revert to https://github.com/transpect/xslt-util/commit/271dd78f0aabed9e5d3b877bab2f80b3b314ebd2. So there seems to be a regression. It is caused by another fix that improved other aspects of CALS table normalization and that is not covered by any test yet, apparently. With the old version, the LaTeX code that was generated didn’t compile though. This seems to be related to the table nesting, too, but at a later stage. I will try to fix the calstable bug, and if the problem with the generated LaTeX code then persists, @mkraetke needs to look into this. We will not commit to a time frame for a fix.

gimsieke commented 5 years ago

I was able to resolve the first error (not pushed the commit yet). However, there are more fundamental reasons why both your sample files don’t compile.

The default mode of operation for docx2tex is to resolve embedded tables, that is, to add more columns and rows to the containing table so that the embedded table becomes part of the containing table. The outer table’s rows and columns will turn into merged cells. But this only works if the embedded table occupies a full cell of the containing table, with no paragraphs and/or other embedded tables in the same cell. sscce_for_docx2tex.docx cannot be processed because it violates this condition. We’ll probably add a message to the log file that explains this restriction. It’s unlikely that we will be able to fix this.

The alternative to resolving embedded tables is to keep them nested (there’s an option for this that is currently not exposed in w2t). But the LaTeX code that we currently generate for this case creates extraneous \begin{table}/\end{table} around the embedded tabularx environments. I think it should just put them in curly braces instead if they are embedded. This is a thing that I will eventually look into with @mkraetke.

The other sample file, ExampleFail.docx, failed for other reasons. One is that definition list environments don’t seem to be supported yet in cells. I think they need to be wrapped in a \parbox. There were other errors related to generated \FontAwesome and \privateuse macros. Again, @mkraetke and I might occasionally look into these things.

Since these errors don’t affect our daily production lines (that produce hundreds of thousands of pages per year), we are unlikely to look at them with high priority. However, we are constantly trying to improve the tool, and your examples are certainly helpful since they are demanding in terms of table nesting requirements, despite their small size.

I also tried the workaround of pasting the document into an empty LibreOffice document. But the embedded tables came out the same way. I’m using LO 6.0.0.3; maybe LO 5 flattened the tables while LO 6 keeps them nested. So this workaround is not working for me.

Let me stress again that colspans and rowspans are not the issue. They are supported in principle. The main problem is nested tables, but other issues have also shown up that are related to special characters and definition lists.

sentientmachine commented 5 years ago

Thanks for the quick turnaround, that sounds right. The above workarounds handled my cases and I can tweak the input .docx to remove the bad table. Maybe a better error message would help future users realize the limitation quicker, without having to trial and error input files. Maybe even a flag to aggressively string-join-flatten horizontal and string-join-flatten-vertically the offending table.

I'd prefer a best-attempt result .tex file even if the nested subtable was not exactly represented, because my intention was to tweak the .tex file as needed to clean it up anyway.
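As a rough idea of what such a best-attempt flattening could look like, here is a small sketch on plain Python data rather than on docx or CALS markup. A table is modelled as a list of rows, each row a list of cells, and a cell is either a string or a nested table. The function names and the separators are made up for illustration; nothing like this exists in docx2tex as far as this thread shows.

```python
def flatten_cell(cell, col_sep=" | ", row_sep=" / "):
    """Return plain text for a cell; a nested table is string-joined."""
    if isinstance(cell, str):
        return cell
    # The cell is a nested table: join its cells horizontally, then its rows.
    return row_sep.join(
        col_sep.join(flatten_cell(c, col_sep, row_sep) for c in row)
        for row in cell
    )

def flatten_table(table):
    """Replace every nested table in the outer table with joined text."""
    return [[flatten_cell(cell) for cell in row] for row in table]

# Hypothetical outer 2x2 table whose top-right cell holds a 2x2 subtable.
nested = [["a", [["x", "y"], ["z", "w"]]],
          ["b", "c"]]
print(flatten_table(nested))
```

The outer table keeps its shape and the subtable's structure is lost, which matches the "not exactly represented" trade-off above.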

Looks like the escape from Microsoft Island is not as easy as it sounds. Not a big surprise. :+1: