michal-h21 / make4ht

Build system for tex4ht

ODT output for fractions and non-italicized words in math #55

Open gl-utah opened 2 years ago

gl-utah commented 2 years ago

The file "problem.tex"

\documentclass{article}
\begin{document}
$U$, namely
$$ N_1^* \quad\hbox{such that $x$ and $y$ is\ } \frac{1}{2+3} + 4 $$
also $\rm Cu^{++} + SO_4^{- -}$.
\end{document}

has the LaTeX result correct, but make4ht's ODT translation appears incorrectly in LibreOffice. The problems are: an extra space between the U and the comma, and also before the final period; a misplaced subscript on N_1^*; "such that" and "is" should not be italicized; "and" is misinterpreted as a logical operator; "+4" is incorrectly put in the denominator; and the chemical symbols are in italics when they should not be italicized. I create the ODT file by running make4ht -c config.cfg -e build.lua -f odt problem.tex with the build.lua and config.cfg files attached (with .txt extensions) (see Issue #54). I tried replacing \hbox with \textrm but it did not help; nevertheless, I am open to replacing commands from Plain TeX in my input file with their LaTeX versions if it would help.

build.lua.txt config.cfg.txt

michal-h21 commented 2 years ago

Ah, this LibreOffice math rendering is really frustrating. On the one hand, there are some issues with your TeX code: I would use \mbox instead of \hbox, and \mathrm instead of \rm, since \rm just isn't supported by TeX4ht. So a fixed version could look like this:

\documentclass{article}
\begin{document}
$U$, namely
$$ {N}_{1}^{*} \quad\mbox{such that $x$ and $y$ is\ } \frac{1}{2+3} + 4 $$
also ${\mbox{Cu}}^{++} + {\mbox{SO}}_4^{- -}$.
\end{document}

And it looks fine in Firefox:

[Screenshot from 2021-12-06 16-24-45]

I also found that LO really wants an <mtext> element in the argument of the <msubsup> element. It fails with the current build file for {N}_{1}^{*} because of that. So modify the build file to use "mtext" instead of "mi" on line 19.

With these changes, it looks like this:

[Screenshot from 2021-12-06 16-35-45]

I hope that I will be able to fix the space after "U", but I don't understand at all why it joins "+ 4" to the fraction. It is just nonsensical, and I don't really know how to fix that.

gl-utah commented 2 years ago

Yes, I see that I should use \mbox instead of \hbox; thank you very much for pointing that out! I could not get \mathrm to be reflected in the ODT file, but I do not need \mathrm since \mbox works.

For the fraction: the ODT file is showing { frac { 1 } { { 2 + 3 } } + 4 }, and as you say, it is very strange that this does not produce the correct output. Is this a bug in LibreOffice? It seems that an ODT syntax leading to the correct output is { { { 1 } over { 2 + 3 } } + 4 }. The document "LibreOffice Math Guide" (https://documentation.libreoffice.org/assets/Uploads/Documentation/en/MG4.x/PDF/MG44-MathGuide.pdf) says to use "over" for fractions, and does not mention "frac". "over" works like the Plain TeX command \over, I think. Would it be possible for make4ht to translate LaTeX's \frac{1}{2} into an ODT file's {{1} over {2}} ? If so, would that fix the problem?
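
To make the \over analogy concrete, here is a small sketch in TeX itself (illustrative only; it is not something make4ht emits). Both displays typeset the same expression, and the braces around the \over group play the role of the outer braces in the StarMath form above:

```latex
\documentclass{article}
\begin{document}
% LaTeX spelling: numerator and denominator are explicit arguments
\[ \frac{1}{2+3} + 4 \]
% Plain TeX spelling: \over splits the enclosing group, so the braces
% are needed to keep "+ 4" out of the denominator
\[ {1 \over 2+3} + 4 \]
\end{document}
```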

The original fraction problem persists in Word if I use LibreOffice to create the DOCX file, but if I open the ODT file directly in Word, Word gives the correct output. It is not often that one gets correct output from incorrect input! Or I guess the ODT file is correct and LO is interpreting it incorrectly?

michal-h21 commented 2 years ago

I would recommend not using plain TeX constructs in LaTeX math; they are quite hard for TeX4ht to handle correctly :/

We produce only MathML code, not the StarMath that the LibreOffice Math Guide describes, so it is not possible to change "frac" to "over", I am afraid. In the past, I thought that I could write code to transform MathML to this LibreOffice format, but then I found that it would be quite difficult, and that it isn't supported in Word anyway. So I abandoned that idea.

I've found a mail about OpenOffice math support here: https://www.mail-archive.com/dev@sw.openoffice.org/msg00200.html I wouldn't be surprised if not much has changed since then, because math support is usually not something that many users and developers care about, and it is really hard to do properly. This is the reason why the only browser that supports MathML is Firefox, and its support still isn't complete. Other browsers can support MathML only using MathJax, which has great MathML support. It is a pity that it cannot also support word processors.

So my conclusion is that the fraction code produced by TeX4ht is correct, LO interprets it incorrectly, and Word interprets it correctly. The bad thing is that Word fails in other cases where LO works.

gl-utah commented 2 years ago

It's interesting to learn about this history. Since my final target is Microsoft Word, I guess I ought to ignore this LO problem. As you say in your last sentence, though, "Word fails in other cases where LO works."

Since make4ht works well at translating LaTeX to html, would LaTeX -> HTML -> Word be a better route than LaTeX -> ODT -> Word? But then I wonder about footnotes.

And Word (at least Word 2016) certainly has problems with fractions also: [Image: MicrosoftWordEquation]

Overall, somewhat discouraging. I do really appreciate your help despite this!

Were you able to make any progress on the extra spaces being inserted after all in-line mathematical objects?

michal-h21 commented 2 years ago

Regarding the LaTeX -> HTML -> Word way, I think you would find other issues, like footnotes, and probably others. You should try it though. Don't forget to try the "mathml" option, as it is the only way the math can keep some formatting information. The downside is that ODT uses MathML as well, so it is quite likely that the issues from ODT will remain. You can also try pictures for math, which will keep the appearance, but you will lose the ability to edit formulas.

For the wrong fraction rendering, it seems that there is one attribute that can force fractions to be displayed as a/b: "bevelled": https://developer.mozilla.org/en-US/docs/Web/MathML/Element/mfrac. It should be disabled by default, but maybe Word decided to enable it? You can try the following build file, which explicitly disables it:

local domfilter = require "make4ht-domfilter"

local function just_operators(list)
  -- count the <mo> elements in list; the caller compares this count
  -- with #list to test whether the list contains only <mo>
  local mo = 0
  for _, x in ipairs(list) do
    if x:get_element_name() == "mo" then mo = mo + 1 end
  end
  return mo
end
local process = domfilter {
  function(dom)
    for _, x in ipairs(dom:query_selector("mo")) do
      local siblings = x:get_siblings()
      -- test if current element list contains only <mo>
      if just_operators(siblings) == #siblings then
        if #siblings == 1 then
          -- a single <mo> is translated to <mtext>
          x._name = "mtext"
          x:set_attribute("mathvariant", "normal")
        else
          -- multiple <mo> translate to <mtext>
          local text = {}
          for _, el in ipairs(siblings) do
            text[#text+1] = el:get_text()
          end
          -- replace first <mo> text with concatenated text content
          -- of all <mo> elements
          x._children = {}
          local text_el = x:create_text_node(table.concat(text))
          x:add_child_node(text_el)
          -- change <mo> to <mtext>
          x._name = "mtext"
          -- remove subsequent <mo>
          for i = 2, #siblings do
            siblings[i]:remove_node()
          end
        end
      end
    end
    for _, x in ipairs(dom:query_selector("mfrac")) do
      -- try to fix wrong <mfrac> form in Word (it displays it as   x / y)
      if not x:get_attribute("bevelled") then
        x:set_attribute("bevelled", "false")
      end
    end
    return dom
  end
}

Make:match("4om$", process)
-- Make:match("html$", process)

pkra commented 2 years ago

Random bystander

"The downside is that ODT uses MathML as well, so it is quite likely that the issues from ODT will remain."

As far as I've gathered, that's not quite accurate. The internal format is actually a linear (asciimath-like) syntax (much like Word has a separate XML which also comes with an equivalent linearization called "unicode math").

When given an ODT with MathML in it, LibreOffice appears to first convert it to its linear format and then back to MathML, replacing the original MathML. This process has a lot of bugs, unfortunately. (You can test this by opening an ODT, saving it again, and looking inside the file.)

michal-h21 commented 2 years ago

Hi Peter, thanks for your reply. I meant that if LO or Word have issues with MathML import, these issues should be similar regardless of whether that MathML comes from HTML, or from ODT. I guess that LO will translate it to its internal format in both cases. And yes, it is quite buggy. It renders the following MathML:

<math xmlns:xlink='http://www.w3.org/1999/xlink' xmlns='http://www.w3.org/1998/Math/MathML'><mrow>
                     <mfrac><mrow><mn>1</mn></mrow> 
<mrow><mn>2</mn> <mo>+</mo> <mn>3</mn></mrow></mfrac> <mo>+</mo> <mn>4</mn>
</mrow></math>

As: [Screenshot from 2021-12-09 15-49-35]

And Firefox renders it like this: [Screenshot from 2021-12-09 15-52-57]

michal-h21 commented 2 years ago

"Were you able to make any progress on the extra spaces being inserted after all in-line mathematical objects?"

Yes, I've fixed that in TeX4ht sources.

gl-utah commented 2 years ago

Although Word 2016 has the problem with fractions which I wrote about above, I just checked Word 2019, and it handles the fraction correctly. In other words, when Word 2019 opens the ODT file (with the make4ht build file attached to my original post), it renders the fraction correctly even once the editing window has been opened and closed. LibreOffice renders it incorrectly, and LO also exports a DOCX file which is incorrect when opened in Word. So for fractions, having Word 2019 (not 2016) open the ODT file is the best way to go.

Unfortunately, Word, even Word 2019, cannot directly open the ODT file which make4ht generates from

\documentclass{article}
\usepackage[hyphens]{url}
\begin{document}
\url{google.com} $a$
\end{document}

When one tries to open the file despite the error, the math is not rendered. LibreOffice has no problem with this ODT file. I ran into other commands (my personal LaTeX macros, including an equation numbering command I mentioned in Issue #56) which, when run through make4ht, generated ODT files which neither Word 2016 nor Word 2019 can open, but which LibreOffice can open without problems. Therefore, for my use case, it seems best to use LibreOffice to open the ODT file and export a DOCX file, then fix the fractions using Word 2019.

I also tried LaTeX -> HTML -> DOCX. Using make4ht -c config.cfg -e build.lua <filename>, the math came into Word as images, not equations. They looked good, but I suppose publishers want equations they can edit. Using make4ht -c config.cfg -e build.lua <filename> "mathml,mathjax" was, once imported by Word, much worse: no images, but no equations either, just plain symbols with no subscripts or superscripts or other formatting. In conclusion, to get to DOCX, it seems that ODT is a better intermediate format than HTML at this point.

Thank you for fixing in the TeX4ht sources the extra spaces being inserted after in-line math.

michal-h21 commented 2 years ago

It is good that they managed to fix the equations issue in the newer version. Regarding the failing files, could you try to make an MWE? It is possible that TeX4ht generates a wrong ODT (but then it would fail in LO too, I guess).

I've tried your \url sample in Office 365, and it couldn't open it either. I validated the ODT file in the ODF validator, and it is valid. One thing that I could imagine Word having a problem with is that the URL is missing the https:// part. Once I changed the command to \url{https://google.com}, Office 365 could open the ODT file. I expect that desktop Word may work in a similar way.
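
For reference, the adjusted sample would look like the sketch below. The only change from your file is the added https:// scheme, and I am assuming nothing else in the preamble matters:

```latex
\documentclass{article}
\usepackage[hyphens]{url}
\begin{document}
% the explicit scheme is the only change from the failing sample
\url{https://google.com} $a$
\end{document}
```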

I didn't expect that Word would fail so badly with the HTML import, as it already should have some MathML support. I also don't understand why it shouldn't support images. I thought that it should support most features, except for footnotes and other stuff that has no native support in HTML.

gl-utah commented 2 years ago

Yes, I can confirm that although Office 2016 and Office 2019 cannot open an ODT file produced from a LaTeX file with \url{google.com}, they can open an ODT file coming from \url{https://google.com}. Thanks for finding this!

In the following file, there are exactly two lines between \begin{document} and \end{document} which are commented out. If either one of these is un-commented, the ODT file created can be opened by LibreOffice but not by Microsoft Word (either 2016 or 2019). LO can even open the ODT file created by un-commenting both of those lines. If both lines remain commented out, both LO and Microsoft Word can open the ODT file.

\documentclass{article}
\makeatletter
% The following macro changes left-quote marks to
% mathematical "prime" symbols but otherwise passes
% on its arguments unaltered.  See the last example
% in the TeXbook p. 219.
\def\doch@ngequotetoprime#1{{\ch@ngequotetoprime#1\end}}
\def\ch@ngequotetoprime#1{%
\ifx#1\end \let\next=\relax
    \else  \ifx#1'\({}^\prime\)\else #1\fi% or just \('\)
    \let\next=\ch@ngequotetoprime\fi
\next}
%
\def\enum#1(#2:#3){\refstepcounter{equation} \label{#1#2}
              \eqno{\mathrm{(#1\theequation#3)}}}
\def\eoldnum#1(#2:#3){\eqno{\mathrm{(#1\ref{#1#2}#3)}}}
\def\eq#1(#2:#3){(#1\ref{#1#2}\doch@ngequotetoprime{#3})}
\makeatother

\begin{document}
$$ dS = dQ_{\mathrm{rev}}/T  \,,\enum A(entropy:)$$
and
$$ x=2 \,. \eoldnum A(entropy:')$$
%Reference \eq A(entropy:).

\begin{table}\small\centering
%\begin{tabular}{r|c|c}& a & b \\ \hline\end{tabular}
\end{table}
\end{document}

As for the LaTeX -> HTML -> DOCX route, in case I was unclear, using make4ht -c config.cfg -e build.lua <filename> was rather successful; the only problem was that the math came into Word as images, not equations, so it can no longer be edited. Using make4ht -c config.cfg -e build.lua <filename> "mathml,mathjax", on the other hand, made the math come into Word as ASCII symbols like one might find in Notepad, so that was not at all a success.

michal-h21 commented 2 years ago

There are certainly issues with your custom macros that cause wrong MathML structure. I would modify it in this way:

\documentclass{article}
\makeatletter
% The following macro changes left-quote marks to
% mathematical "prime" symbols but otherwise passes
% on its arguments unaltered.  See the last example
% in the TeXbook p. 219.
\def\doch@ngequotetoprime#1{{\ch@ngequotetoprime#1\end}}
\def\ch@ngequotetoprime#1{%
\ifx#1\end \let\next=\relax
    \else  \ifx#1'\ensuremath{{}^\prime}\else #1\fi% or just \('\)
    \let\next=\ch@ngequotetoprime\fi
\next}
%
\def\enum#1(#2:#3){\refstepcounter{equation} \label{#1#2}
              \eqno{\mathrm{(#1\theequation#3)}}}
\def\eoldnum#1(#2:#3){\eqno{({\mathrm{#1\ref{#1#2}}}#3)}}
\def\eq#1(#2:#3){(#1\ref{#1#2}\doch@ngequotetoprime{#3})}
\makeatother

\begin{document}
$$ dS = dQ_{\mathrm{rev}}/T  \,,\enum A(entropy:)$$
and
$$ x=2 \,. \eoldnum A(entropy:')$$
Reference \eq A(entropy:).

\begin{table}\small\centering
% \begin{tabular}{r|c|c}& a & b \\ \hline\end{tabular}
\end{table}
\end{document}

I changed \( ... \) to \ensuremath, moved ( outside of \mathrm, as it should produce a math operator, and added some more braces. It still doesn't open in Word, though.

gl-utah commented 2 years ago

Thank you very much for your message. I can confirm that even with your changes, Word still cannot open the ODT file.

The LaTeX output is: [Image: LaTeX_MWE]

The ODT file opened in LibreOffice is: [Image: LO_MWE]

The equation numbers should be flush right; the "A" in the equation numbers should be upright, not italic; and the prime symbol is too small, almost illegible. Otherwise it looks good.

Asking LO to export a DOCX file, then opening that in Word 2019, yields: [Image: Word_MWE]

Again, the equation numbers are in the wrong place, and the "A" in the equation numbers should be upright. In addition, the subscript rev in equation (A1) should be set in an upright font. The prime symbol is easy to read, which is an improvement over LO.

The line Reference \eq A(entropy:). is what prevents Word from opening the ODT file. But commenting that line out and un-commenting the third-to-last line, \begin{tabular}{r|c|c}& a & b \\ \hline\end{tabular}, also leads to an ODT file which Word cannot open. So it's not only my custom macros which create an ODT file which Word cannot open.