transpect / docx2tex

Converts Microsoft Word docx to LaTeX
BSD 2-Clause "Simplified" License

space lost #24

Closed gamboz closed 6 years ago

gamboz commented 6 years ago

In the following docx file, the space between "40" and "MHz" is lost: https://medialab.sissa.it/owncloud/index.php/s/zkxFGDvNAehVatl I'm not sure if it is an error, but I'm reporting it because the appearance of the tex/pdf and docx file differ.

mkraetke commented 6 years ago

Thanks for the report, I'm investigating your issue.

gimsieke commented 6 years ago

Seems to be an issue of omml2mml

mkraetke commented 6 years ago

The issue is that our docx2hub module converts

<m:t xml:space="preserve">40 </m:t>

to

<mml:mn>40</mml:mn>
<mml:mi> </mml:mi>

I think the whitespace should be coded either as "\ " (a backslash followed by a space) or as \text{ }

gimsieke commented 6 years ago

Maybe we should convert an mi that only contains (significant) whitespace to mtext or mspace. Then the TeX code will probably be ok. There’s an mml-space-handling option in docx2hub.xpl. We currently pass it only to the MathType converter. This option should eventually be passed to omml2mml.xsl, too (and acted upon accordingly). But turning it into an mtext for now is probably the quickest solution.
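The mtext idea above can be sketched outside the pipeline. The following Python snippet is only an illustration, not the omml2mml.xsl code: the function name whitespace_mi_to_mtext and the sample markup are hypothetical, and it assumes that "only contains whitespace" means a childless mi whose text is non-empty but strips to nothing.

```python
import xml.etree.ElementTree as ET

MML = "http://www.w3.org/1998/Math/MathML"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("mml", MML)


def whitespace_mi_to_mtext(root):
    """Rename every mi that holds only whitespace to mtext (hypothetical helper)."""
    # Collect the candidates first so that renaming cannot disturb the iteration.
    for el in list(root.iter(f"{{{MML}}}mi")):
        if len(el) == 0 and el.text and el.text.strip() == "":
            el.tag = f"{{{MML}}}mtext"
            # Mark the whitespace as significant so later steps keep it.
            el.set(XML_SPACE, "preserve")
    return root


# Made-up sample: a number, a whitespace-only mi, and a real identifier.
sample = (
    f'<mml:mrow xmlns:mml="{MML}">'
    "<mml:mn>40</mml:mn><mml:mi> </mml:mi><mml:mi>x</mml:mi>"
    "</mml:mrow>"
)
row = whitespace_mi_to_mtext(ET.fromstring(sample))
print(ET.tostring(row, encoding="unicode"))
```

Only the whitespace-only mi is renamed; the mi that holds a real identifier is left alone.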

mkraetke commented 6 years ago

I resolved one issue: omml2mml.xsl now converts the m:t with whitespace to

<mml:mn>40</mml:mn>
<mml:mtext xml:space="preserve"> </mml:mtext>
<mml:mtext>MHz</mml:mtext>

Unfortunately, there seems to be a bit of MathML normalization in our pipeline, which drops the mtext. I'll investigate this further.

gimsieke commented 6 years ago

I just noticed that mml-space-handling is already being honored! If you just invoke docx2hub.xpl, the default setting of mspace will kick in and the resulting expression will look like:

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline">
   <mml:mn>40</mml:mn>
   <mml:mspace width="0.25em"/>
   <mml:mtext>MHz</mml:mtext>
</mml:math>

I haven’t tested it with the full docx2tex pipeline though.
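For comparison, the effect of the mspace setting can be imitated in a few lines of Python. This is a rough sketch under assumptions, not the docx2hub implementation: the function name is hypothetical, and a fixed width of 0.25em per whitespace-only token is assumed from the output shown above.

```python
import xml.etree.ElementTree as ET

MML = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("mml", MML)

# Token elements that may end up holding nothing but whitespace.
TOKEN_TAGS = {f"{{{MML}}}mi", f"{{{MML}}}mtext"}


def whitespace_tokens_to_mspace(root, width="0.25em"):
    """Swap whitespace-only mi/mtext elements for mspace (hypothetical helper)."""
    # Snapshot the parents first; the tree is modified inside the loop.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            only_space = len(child) == 0 and child.text and not child.text.strip()
            if child.tag in TOKEN_TAGS and only_space:
                space = ET.Element(f"{{{MML}}}mspace", {"width": width})
                space.tail = child.tail  # keep any text that followed the token
                parent[index] = space
    return root


# Made-up input resembling the xml-space style output discussed in this thread.
sample = (
    f'<mml:math xmlns:mml="{MML}" display="inline">'
    '<mml:mn>40</mml:mn><mml:mtext xml:space="preserve"> </mml:mtext>'
    "<mml:mtext>MHz</mml:mtext></mml:math>"
)
math = whitespace_tokens_to_mspace(ET.fromstring(sample))
print(ET.tostring(math, encoding="unicode"))
```

An explicit mspace carries no text content, so it cannot be mistaken for author-intended text in later steps.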

mkraetke commented 6 years ago

This option was set to xml-space, so <mml:mtext xml:space="preserve"> </mml:mtext> is the appropriate output for that value. Unfortunately, after I fixed the MathML normalization, lots of \text{} commands showed up in our test data. This happens when authors use text style in the equation editor where it is not necessary. To write 40 MHz, you do not need an equation editor at all.

However, I've changed mml-space-handling from xml-space to mspace for docx2tex, which results in fewer unintended \text{} commands where authors wrote their equations sloppily. The equation now reads as follows:

detector at $40\:\mathrm{MHz}$, i.e.,

gamboz commented 6 years ago

Hi, thank you for the fast solution.

I'm not sure if it is related to this issue, but since the last commit, my clone of docx2tex fails. I've also tried with a fresh checkout. The errors are related to conf.csv not validating and to an "Undeclared variable in XPath expression: $image-output-dir". The first error disappears if I specify the conf.xml file with the "-c" option of d2t.

Please find the docx file and the d2t log here: https://medialab.sissa.it/owncloud/index.php/s/6I6rKxHflXeu3co

mkraetke commented 6 years ago

The bug is fixed; I recently added an option to pass a custom image directory.