latex3 / tagging-project

Issues related to the tagging project
https://latex3.github.io/tagging-project/
LaTeX Project Public License v1.3c

math inside math breaks mathml generation #727

Open u-fischer opened 1 month ago

u-fischer commented 1 month ago

If math is used inside math, e.g.

\DocumentMetadata{uncompress,pdfversion=2.0,pdfstandard=ua-2,testphase={phase-III,math}} 
\documentclass{book}
\usepackage{unicode-math}
\begin{document}
$ a=b \mbox{$x$} $

\ExplSyntaxOn
%$ a=b \mbox{$\luamml_flag_ignore: x$} $
\ExplSyntaxOff

\end{document}

the generated mathml contains duplicate entries with the same hash but different content:

<div>
<h2>\mml 1</h2>
<p>$a=b \mbox {$x$}$</p>
<p>09AFE0B55E68D79A4BC7841C2B4B7CCB</p>

<math xmlns="http://www.w3.org/1998/Math/MathML">
 <mi>
 𝑥
 </mi>
</math>
</div>

<div>
<h2>\mml 1</h2>
<p>$a=b \mbox {$x$}$</p>
<p>09AFE0B55E68D79A4BC7841C2B4B7CCB</p>

<math xmlns="http://www.w3.org/1998/Math/MathML">
 <mi>
 𝑎
 </mi>
 <mo lspace="0.278em" rspace="0.278em">
 =
 </mo>
 <mi>
 𝑏
 </mi>
 <mtext>
 <math xmlns="http://www.w3.org/1998/Math/MathML">
 <mi>
 𝑥
 </mi>
 </math>
 </mtext>
</math>
</div>

This is a serious problem, as e.g. hyperref inserts a \smash, which produces four empty mathml fragments through the mathpalette command ;-(. This can be avoided by adding a \luamml_flag_ignore: (the inner math is then a simple 𝑥 inside an mtext), but perhaps it is also possible to process the inner math and only suppress the writing of the chunk. @zauguin @davidcarlisle what would be the best option to handle this?

car222222 commented 1 month ago

Sorry that I am neither.

Is this not just one of the many problem cases that occur with the use of boxes within mathmode?

There are also, of course, many other things legal in mathmode that will cause problems (maybe cross-reference/links stuff, and line-breaking within inline maths?).

davidcarlisle commented 1 month ago

We could presumably fix the counting and md5 hash if we want to generate mathml for the nested cases, but it seems the current momentum is towards just attaching AF to the outer math, which is then responsible for making a mathml-html-mathml version of the whole formula. In which case we need to lose the inner math chunks completely.

car222222 commented 1 month ago

@davidcarlisle That will work until you put something (not more math) inside the box that cannot be translated into suitable MathML.

In most cases there will be something that is not mathml, it needs to be converted to html not mathml.

Just as in tex having

\[  aaa  \mbox{$x$} ...\]

is valid but pointless, as you could just use x. All sensible examples have something other than nested math in the mbox.

u-fischer commented 1 month ago

@davidcarlisle yes I agree that we do not want the inner chunks to create fragments. But how do we want them to be represented inside the outer mathml? As

<mtext>
 <math xmlns="http://www.w3.org/1998/Math/MathML">
 <mi>
 𝑥
 </mi>
 </math>
 </mtext>

or as

<mtext>
 𝑥
 </mtext>

? (Assuming that the inner math is real, fake math should probably be handled differently)

davidcarlisle commented 1 month ago

<mtext><span><math><mi>x I think. That is, mtext containing inline html with further nested math. <math> directly nested inside <mtext> isn't valid, and the option of <mtext> 𝑥 </mtext> is only available in trivial cases like $x$. If there was any structure at all, such as $x^2$ or $\frac{a}{b}$, you would need markup.

car222222 commented 1 month ago

That second seems semantically incorrect.
It is not text but math, I assume!

Apologising again for not being the correct person to reply here.

car222222 commented 1 month ago

@davidcarlisle I had assumed that mtext was for non-math text, not for further math??

u-fischer commented 1 month ago

@davidcarlisle ok. So we need @zauguin to tell us 1) how one can add a span and 2) how to prevent luamml from writing a mathml chunk while still processing the inner math.

davidcarlisle commented 1 month ago

@davidcarlisle I had assumed that mtext was for non-math text, not for further math??

In html, mtext has a content model of text or any html phrasing elements, so you can have a span, and within that span you can have any inline html, including further <math>. So it directly models \[ .... \mbox{... $x$, where you can nest math inside math so long as you have a text construct between them. It's the same here: you cannot nest a <math> (or any mathml element) directly in an <mtext>, but you can have an html <span> which contains a <math>.
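
For the example from the top of this issue, the structure described here could look roughly like the following. This is a hand-written sketch, not luamml output: the operator spacing attributes are copied from the fragment posted above, the span carries no attributes, and it assumes the MathML is embedded in a document parsed as html (in an XML serialization the span would need the XHTML namespace declared).

```html
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>𝑎</mi>
  <mo lspace="0.278em" rspace="0.278em">=</mo>
  <mi>𝑏</mi>
  <mtext>
    <!-- the html span is the "text construct" between outer and inner math -->
    <span>
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <mi>𝑥</mi>
      </math>
    </span>
  </mtext>
</math>
```

The only difference from the invalid variant is the span: the inner math is no longer a direct child of mtext.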

car222222 commented 1 month ago

That all makes sense.

But you cannot simply use

<mtext> x </mtext>

for a "math x", can you?

Since this would "mean" a "text x", would it not?

-- Still no response from the other party! Why, oh why?

davidcarlisle commented 1 month ago

Well, of the two options originally suggested, one was invalid (so not mathml). Conversely, <mtext> 𝑥 </mtext> is valid but relies on Unicode rather than element tagging to provide the mathematical semantics for U+1D465 MATHEMATICAL ITALIC SMALL X, so it means "text consisting of a math italic x". Whether or not that is the same as "math consisting of x" I leave to philosophers.

davidcarlisle commented 1 month ago

@car222222 I don't think this is so different from

<mtext>مرحبا بالعالم</mtext>

ideally the text would have element markup giving language and directional information, but if you just have plain text and rely on the Unicode bidi algorithm to set the direction, it's not wrong.

Similarly you might prefer the nested math structure to be fully marked up, but if it's not then if it is valid you still have to handle it.

car222222 commented 1 month ago

Who has to handle what, and when?

car222222 commented 1 month ago

It may not be different, but that does not make it good!

Why not always more clearly distinguish "text" and math? Even "text within math"!

Maybe it does not matter in HTML or LaTeX how these things are marked up, but in general it is not "semantically appropriate" to mix everything up like this, or is it?

car222222 commented 1 month ago

More constructively: should mbox in mathmode therefore always translate to mtext > span?

car222222 commented 1 month ago

A better example for explaining what this is all about might be:

x + y

Can this also now be encoded without using mathmode? . . . provided the "correct, but unknown to most of us, slots" are used for the x and the y (not to mention the +, but that is easier to remember/access!).

davidcarlisle commented 1 month ago

Can x+y also now be encoded without using mathmode?

That isn't a question that is necessarily under our control. The system can only generate structural tagging for the document the author provides.

\documentclass{article}

\usepackage{fontspec}
\setmainfont{STIX Two Math}

\begin{document}

Can x + y also now be encoded without using mathmode?

Can 𝑥 + 𝑦 also now be encoded without using mathmode?

\end{document}

This works today. Arguably it's not that well marked up, but it isn't necessarily wrong, and it shouldn't necessarily error when tagging is enabled just because the plain text looks as if it might be math.

But such uses are typically in non-mathematical documents which just have the occasional math-like text, so I wouldn't really expect it in documents showing math-nested-in-math. But expect the unexpected...

davidcarlisle commented 1 month ago

More constructively: should mbox in mathmode therefore always translate to mtext > span?

I think a qualified answer is "yes" for any \mbox that an author might have added.

So the typical ... f(x)=0 \mbox{ if $x<0$} cases.
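
As an illustration of that typical case, f(x)=0 \mbox{ if $x<0$} might map to something like the sketch below. It is written by hand under the mtext > span assumption discussed in this thread, it is not what luamml produces today, and spacing attributes on the operators are left out for brevity.

```html
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>𝑓</mi>
  <mo>(</mo>
  <mi>𝑥</mi>
  <mo>)</mo>
  <mo>=</mo>
  <mn>0</mn>
  <mtext>
    <!-- text and nested math from the \mbox both live inside the span -->
    <span> if
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <mi>𝑥</mi>
        <mo>&lt;</mo>
        <mn>0</mn>
      </math>
    </span>
  </mtext>
</math>
```

Here the word "if" stays plain text while the condition keeps its full math markup, which the <mtext> 𝑥 </mtext> shortcut could not provide.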

Hidden \hbox like \boldsymbol{x} being essentially \mbox{\boldmath$x$} should stay hidden, but as ever, distinguishing which case you are in in practice isn't always easy.