mhchem / MathJax-mhchem

3rd-party extension to MathJax for typesetting chemical equations

Apache License 2.0

Poor MathML output #23

Open NSoiffer opened 2 years ago

NSoiffer commented 2 years ago

Issue Summary

The mhchem package produces very ugly MathML when there is a much better solution (one that is also better for assistive technology). Solving this would solve #22. Technically, this is not a bug report, but the upshot is that with the current output, recognition and proper voicing of the chemistry output is extremely difficult.

At the end of this issue is a relatively simple example that generates poor MathML output. It has (at least) four problems:

there are extra mrows,
multiple elements are in a single mi,
mphantom and mpadded are needlessly used,
the sub/superscripts are not attached to the chemical element they adorn.

Let me go through these one by one...

The first one (extra mrows) is pretty obvious. In the output below, there is:

      <mrow data-mjx-texclass="ORD">
        <mi data-mjx-auto-op="false">SO</mi>
      </mrow>

There is no need for the mrow.

The second is also pretty obvious -- "SO" is in a single mi. That's a misuse of mi and is equivalent to putting x and y into a single mi and then making that italic (via mathvariant="italic") so that it looks ok. Given this is a chemistry package, is it hard to recognize an element and put it in its own mi? If it is a single letter, then the default italics are overridden with mathvariant as in

    <mi mathvariant="normal">S</mi>

It is valid to add mathvariant="normal" to <mi>Na</mi> if that makes generation simpler (i.e., you can always add it to any chemical element).
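
As a rough illustration of the one-mi-per-element idea, here is a minimal Python sketch. The `split_symbols` helper and its regular expression are assumptions made for this example (they are not part of mhchem or MathJax), and the pattern is deliberately naive: it accepts any capital letter followed by an optional lowercase letter, so a real implementation would want a lookup table of actual element symbols.

```python
import re
import xml.etree.ElementTree as ET

# Naive assumption: an element symbol is one capital letter plus an
# optional lowercase letter. Invalid symbols such as "Xx" are accepted.
SYMBOL = re.compile(r"[A-Z][a-z]?")

def split_symbols(run):
    """Return one <mi mathvariant="normal"> per element symbol in run."""
    tokens = SYMBOL.findall(run)
    if "".join(tokens) != run:
        raise ValueError(f"not a plain run of element symbols: {run!r}")
    nodes = []
    for token in tokens:
        # mathvariant="normal" is added unconditionally, as suggested above.
        mi = ET.Element("mi", {"mathvariant": "normal"})
        mi.text = token
        nodes.append(mi)
    return nodes

for node in split_symbols("SO"):
    print(ET.tostring(node, encoding="unicode"))
```

Adding `mathvariant="normal"` to every symbol keeps the generator simple, at the cost of a redundant attribute on two-letter symbols.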

The third problem's solution is maybe less obvious. mphantom and mpadded are being used to force the scripts to all have the same vertical position. But MathML has a much simpler solution for this usage: mmultiscripts. From the MathML spec:

This element allows the representation of any number of vertically-aligned pairs of subscripts and superscripts, attached to one base expression. It supports both postscripts and prescripts. Missing scripts must be represented by the empty element none. All of the upper scripts should be baseline-aligned and all the lower scripts should be baseline-aligned.

Since all chemical elements have a capital letter, all scripts will be horizontally aligned.

An example of the better output, which also attaches the scripts to the element (and therefore solves the last problem) is:

<math>
    <mi>S</mi>
    <mmultiscripts>
        <mi>O</mi>
        <mn>4</mn>
        <mrow>
            <mn>2</mn>
            <mo>&#x2212;</mo>
        </mrow>
    </mmultiscripts>
</math>

I've removed the MathJax-generated attributes for clarity. Compare this to the mhchem/MathJax output below.

This is also a better solution for the prescripts used in nuclear chemistry.

There is a problem for electron, proton, etc., and scripts because the base doesn't have a capital letter. The mphantom/mpadded trick could be used here (inside an mrow around the base element), but is not ideal. This leads to an alternative to mmultiscripts: use msub/msup/msubsup with the attributes subscriptshift and superscriptshift. There are two problems with this approach:

you need to figure out the amount of shift that is appropriate (use font relative units).
MathML Core does not support these attributes (MathJax does). This means that when MathJax someday makes use of MathML in an output jax, it will need to do something (probably ugly) to make this work.

Finally, here is the example: \ce{SO4^2−}

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mstyle mathcolor="#a33e00">
    <mrow data-mjx-texclass="ORD">
      <mrow data-mjx-texclass="ORD">
        <mi data-mjx-auto-op="false">SO</mi>
      </mrow>
      <msub>
        <mrow data-mjx-texclass="ORD">
          <mrow data-mjx-texclass="ORD">
            <mpadded width="0">
              <mphantom>
                <mi>A</mi>
              </mphantom>
            </mpadded>
          </mrow>
        </mrow>
        <mrow data-mjx-texclass="ORD">
          <mrow data-mjx-texclass="ORD">
            <mpadded height="0">
              <mn>4</mn>
            </mpadded>
          </mrow>
        </mrow>
      </msub>
      <msup>
        <mrow data-mjx-texclass="ORD">
          <mrow data-mjx-texclass="ORD">
            <mpadded width="0">
              <mphantom>
                <mi>A</mi>
              </mphantom>
            </mpadded>
          </mrow>
        </mrow>
        <mrow data-mjx-texclass="ORD">
          <mn>2</mn>
          <mo>&#x2212;</mo>
        </mrow>
      </msup>
    </mrow>
  </mstyle>
</math>

mhchem commented 2 years ago

Thanks, @NSoiffer, for your long elaboration. But I'd like to disagree on some important points.

Your "example of the better output" (JsFiddle) is nonsense. Its rendering is wrong on many levels:

"SO" must be upright (otherwise it would be variables)
the "2–" must not be above the "4", according to the rules, e.g. IUPAC.
It looks semantically wrong, because it attaches "2–" to "O" in the XML structure

Just because JAWS has a bug and reads out phantom content (#22), is not a good reason to resort to sizes that are font-dependent and will lead to wrong renderings with different fonts.

I agree that the generated MathML looks a bit convoluted, like double mrows. Maybe we could improve a bit there. But please be aware that we have a mhchem→TeX→MathML chain, i.e. we do not go from mhchem syntax to MathML directly.

For assistive technologies, I would make the bold claim that using the mhchem syntax for Braille and read-out would be much better. It gets complicated when mhchem and real math are mixed, though, like in $m_{\ce{H2O}}$ or $\ce{Fe(CN)_{$\frac{6}{2}$}}$.

NSoiffer commented 2 years ago

I'm not sure how I ended up with that mmultiscripts example as it is wrong (as you point out). The proper one (using the mathvariant attr I said in the text should be used) is:

<math>
    <mi mathvariant="normal">S</mi>
    <mmultiscripts>
        <mi mathvariant="normal">O</mi>
        <mn>4</mn>
        <none/>
        <none/>
        <mrow>
            <mn>2</mn>
            <mo>&#x2212;</mo>
        </mrow>
    </mmultiscripts>
</math>

This also pushes the 2- out as it should have been. See the codepen rendering.
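
The pairing rule behind the two `<none/>` elements (postscripts come in subscript/superscript pairs, and a missing script is an explicit `none`) can be sketched in Python. The `leaf` and `multiscripts` helpers are hypothetical names invented for this illustration; only `xml.etree.ElementTree` from the standard library is used.

```python
import xml.etree.ElementTree as ET

def leaf(tag, text, **attrs):
    """Create a token element such as <mi>, <mn> or <mo>."""
    node = ET.Element(tag, attrs)
    node.text = text
    return node

def multiscripts(base, pairs):
    """Build <mmultiscripts> from a base and (sub, sup) postscript pairs.

    A missing script is passed as None and emitted as <none/>.
    """
    node = ET.Element("mmultiscripts")
    node.append(base)
    for sub, sup in pairs:
        for script in (sub, sup):
            node.append(ET.Element("none") if script is None else script)
    return node

# The charge "2-" with a real minus sign (U+2212).
charge = ET.Element("mrow")
charge.append(leaf("mn", "2"))
charge.append(leaf("mo", "\u2212"))

root = ET.Element("math")
root.append(leaf("mi", "S", mathvariant="normal"))
root.append(multiscripts(
    leaf("mi", "O", mathvariant="normal"),
    [(leaf("mn", "4"), None), (None, charge)],
))
print(ET.tostring(root, encoding="unicode"))
```

Passing the scripts as pairs makes it hard to forget a placeholder: the first pair carries only the subscript 4, the second only the superscript, which is what pushes the charge one column to the right.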

For assistive technologies, I would make the bold claim that using the mhchem syntax for Braille and read-out would be much better.

I'm a little dubious about that. It can certainly be done, but MathML has more structure, which simplifies things, and one doesn't have to have a full TeX interpreter to deal with all the hacks people do in TeX. What TeX has are macro names which can give semantic information for speech (e.g., binom). MathML 4 is adding an intent attr, which some more semantic TeX implementations, such as PreTeXt, have said they will experiment with generating.

mhchem commented 2 years ago

Thanks. That looks much better. I would even claim that the semantic grouping should be

<math>
    <mmultiscripts>
        <mrow>
            <mi mathvariant="normal">S</mi>
            <mmultiscripts>
                <mi mathvariant="normal">O</mi>
                <mn>4</mn>
                <none/>
            </mmultiscripts>
        </mrow>
        <none/>
        <mrow>
            <mn>2</mn>
            <mo>&#x2212;</mo>
        </mrow>
    </mmultiscripts>
</math>

but that would require semantic knowledge that the mhchem syntax does not contain. (JsFiddle)

The typographic subtleties, that the phantom solves, are not done here. I want to have all subscripts and superscripts on the same height. Not even with A and O, the superscripts are the same height. (JsFiddle)

For assistive technologies, I would make the bold claim that using the mhchem syntax for Braille and read-out would be much better.

I'm a little dubious about that.

My counter argument: There is a reason that chemistry books in Braille use a different (shorter) notation for chemistry than they do for generic math.

mhchem commented 2 years ago

Okay, back to the question of whether we can simplify the MathML output, e.g. remove the double mrow. An improved rendering would require a change in how MathJax works. We would need to change the mhchem→TeX math→MathML chain into something that skips TeX math. Or maybe change the TeX math interface, so that it includes a special \multiscripts{base}{left-sub}{left-sup}{right-sup-presub}{right-sub}{right-sup-stacked}{right-sup-postsub} that has a simpler MathML output. I don't know how easily this would be possible. Including @dpvc.

dpvc commented 2 years ago

First, about the extra mrow elements: MathJax's internal representation keeps information about the original TeX input that is needed in order to get the spacing right (among other things), and when MathJax produces MathML output, it includes nodes and classes that carry that information. That way, if MathJax's MathML output is used as input to its MathML input jax, it can reproduce the internal format that would have come from the original TeX input. The <mrow data-mjx-texclass="ORD"> and similar mrow elements are used for that purpose, so are not redundant from that standpoint. If these are removed, that could lead to incorrect layout, since the needed information will be missing. It is true that some of them could be removed without problems, so MathJax could prune some of these when it generates MathML output; mhchem should not have to worry about such things.

In terms of the example you have been looking at, while I am not a chemist, my understanding of the sulphate ion given by \ce{SO4^2-} is that the subscript 4 goes with the O, as there are 4 oxygen atoms and one sulphur atom, and that the superscript 2- indicates that the ion as a whole has a charge of -2. That is, the superscript is for the whole ion, while the subscript is for the oxygen alone, as in (S(O_4))^2-. Neil's representation using mmultiscripts with a base of O would put the superscript and subscript both on the oxygen, which is a different meaning: that there are 4 oxygen and each has charge -2. Again, not being a chemist, I don't know which of these is correct, but it seems to me that the offset superscript is meant to signify that the 2- goes with the whole ion, without having to use parentheses.

In that case, the LaTeX markup naturally would be

{\mathrm{S}\mathrm{O}_4}^{2-}

yielding $$\Large {\mathrm{S}\mathrm{O}_4}^{2-}$$

with MathML (simplified):

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <msup>
    <mrow>
      <mi mathvariant="normal">S</mi>
      <msub>
        <mi mathvariant="normal">O</mi>
        <mn>4</mn>
      </msub>
    </mrow>
    <mrow>
      <mn>2</mn>
      <mo>&#x2212;</mo>
    </mrow>
  </msup>
</math>

This corresponds to your last example with nested mmultiscripts nodes, though there is no need for mmultiscripts in this case, as it is being used just to get individual subscripts and superscripts, where msub and msup would do just fine (with one caveat that I will discuss shortly). If you were using prescripts, then mmultiscripts would be useful for that. MathJax does have two macros that handle that: \sideset in the AMS extension, and \prescript in the mathtools extension. Note that \sideset is supposed to be used only for a base that is a large operator, though MathJax doesn't enforce that, and that the prescripts are left-justified for this macro (with \prescript they are right-justified). Neither of them supports more than one pre- or postscript of each type. A more general mechanism for producing mmultiscripts elements could certainly be implemented.

The caveat is that the positioning of a subscript when there is no superscript may be higher than when there is a superscript. That is, the position of the 4 in {\mathrm{O}_4}^{2-} is different from that in \mathrm{O}_4^{2-}, as shown in: $$\Large {\mathrm{O}_4}^{2-}\quad \mathrm{O}_4^{2-}$$ whereas the positioning using mmultiscripts will always be the lower position, regardless of whether there is a superscript. One can get the same effect visually with {\mathrm{O}_4^{}}^{2-}, though this will probably mess up accessibility.

In terms of placement of superscripts, if the base is a single letter (or a letter with certain accents), the placement should be the same regardless of the letter. In your example of \mathrm{A}^8\mathrm{O}^8, the CHTML output is

<mjx-math display="true" style="margin-left: 0px; margin-right: 0px;" class="MJX-TEX" aria-hidden="true">
  <mjx-msup>
    <mjx-mi class="mjx-n">
      <mjx-c class="mjx-c41"></mjx-c>
    </mjx-mi>
    <mjx-script style="vertical-align: 0.413em;">
      <mjx-mn class="mjx-n" size="s">
        <mjx-c class="mjx-c38"></mjx-c>
      </mjx-mn>
    </mjx-script>
  </mjx-msup>
  <mjx-msup>
    <mjx-mi class="mjx-n">
      <mjx-c class="mjx-c4F"></mjx-c>
    </mjx-mi>
    <mjx-script style="vertical-align: 0.413em;">
      <mjx-mn class="mjx-n" size="s">
        <mjx-c class="mjx-c38"></mjx-c>
      </mjx-mn>
    </mjx-script>
  </mjx-msup>
</mjx-math>

Note that both mjx-script tags have the same 'vertical-align' value and content, and so they should be positioned the same. Unfortunately, not all browsers are very accurate about these things, and so they don't always produce exactly the same positioning. There is very little MathJax can do about that. Here's what I see for this example in your jsFiddle as well as in my own pages:

which is nicely aligned for me. When the base is not a single letter (like if it were a letter with a superscript already), then the position may be different but for plain letters, they should all line up. That should also be true for subscripts, as long as they either all have no superscript or all have a superscript.

NSoiffer commented 2 years ago

Let me lead with that I'm not a chemist and haven't taken a chemistry course in decades, so I blew it with the structure/semantics of \ce{SO4^2-} and mmultiscripts.

@mhchem wrote:

My counter argument: There is a reason that chemistry books in Braille use a different (shorter) notation for chemistry than they do for generic math.

The only reference I found for changes to Nemeth for chemistry is one that shortens some arrows and adds some symbols. It doesn't change the rules for sub/superscripts (Nemeth does have a special rule/shortcut for numeric subscripts). For UEB, the guidelines don't change the output for chemistry at all. I've implemented both in MathCAT, but not the arrow modifications for chemistry. Recognizing chemistry is needed first, and that is what I'm working on. If you saw a reference that I missed, please let me know.

As with @dpvc, I don't see a difference for the A^8 O^8 example you provided -- the superscripts are aligned on both Chrome and Firefox in windows for me.

@dpvc wrote:

One can get the same effect visually with {\mathrm{O}_4^{}}^{2-}, though this will probably mess up accessibility.

Works fine for MathPlayer, but your mileage may vary with other math-to-speech transforms as they tend to be more simplistic (as is apparently JAWS with mphantom).

@dpvc: thanks for your detailed response. The point of using mmultiscripts vs msub (etc) is the alignment of the scripts even in the absence of a sub/superscript, but your solution of an empty script might work. I'm not clear though on whether you are suggesting that @mhchem can use some of the simpler markup you showed or not.

dpvc commented 2 years ago

The point of using mmultiscripts vs msub (etc) is the alignment of the scripts even in the absence of a sub/superscript

Yes, that was why I gave my caveat above. It would certainly be possible to make a macro that produces mmultiscripts for a subscript rather than msub.

I'm not clear though on whether you are suggesting that @mhchem can use some of the simpler markup you showed or not.

Yes, I think that notation is already available and can be used without additional macros being defined. Although the empty superscript hack works visually, I would probably recommend the mmultiscripts approach if you always want the lower positions for the subscripts. This would require a new macro, as \sideset and \prescript both end up producing plain msub/msup/msubsup if there are no prescripts.

@NSoiffer, I'm not sure if you are aware of the work-flow for mhchem, which is that it takes the contents of \ce{...} and converts it to another TeX string and then that is parsed by MathJax. The mhchem package does not produce MathML directly itself, so most of the MathML suggestions are not something mhchem would do directly, other than by changing the TeX that it uses in order to end up with the better MathML (e.g., using \mathrm{S}\mathrm{O} rather than \mathrm{SO}).
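
To make the "change the TeX, not the MathML" point concrete, here is a toy Python sketch of a \ce-like string being turned into a TeX string with one \mathrm per element. Everything in it is hypothetical: `toy_ce_to_tex` and its tiny grammar (element symbols with optional counts, then an optional ASCII `^charge`) are invented for this example and are far smaller than what mhchem accepts. Wrapping the charge around the whole formula is this sketch's own choice, following the {\mathrm{S}\mathrm{O}_4}^{2-} markup suggested earlier in the thread; it is not what mhchem emits.

```python
import re

# Hypothetical toy grammar: one or more "Symbol[count]" atoms, then an
# optional charge such as ^2- or ^+. Real mhchem input is much richer.
FORMULA = re.compile(r"((?:[A-Z][a-z]?\d*)+)(?:\^(\d*[+-]))?")
ATOM = re.compile(r"([A-Z][a-z]?)(\d*)")

def toy_ce_to_tex(ce):
    """Translate a very small subset of \\ce{...} content into TeX."""
    match = FORMULA.fullmatch(ce)
    if match is None:
        raise ValueError(f"unsupported input: {ce!r}")
    atoms, charge = match.groups()
    tex = ""
    for symbol, count in ATOM.findall(atoms):
        # One \mathrm per element, so each symbol becomes its own token.
        tex += rf"\mathrm{{{symbol}}}"
        if count:
            tex += f"_{{{count}}}"
    if charge:
        # Assumption of this sketch: the charge belongs to the whole ion.
        tex = f"{{{tex}}}^{{{charge}}}"
    return tex

print(toy_ce_to_tex("SO4^2-"))
```

The only point of the sketch is that the shape of the eventual MathML is decided by the TeX string produced at this stage, which is the lever mhchem actually has.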

Note also that @mhchem is the author of the original LaTeX package as well as MathJax's extension. When he first wrote the extension (replacing my crappy one from long ago), he wanted to make sure that both LaTeX and MathJax were consistent, and so rewrote the LaTeX package so that both would use the same algorithms internally. Since the LaTeX package must produce LaTeX (not MathML), keeping the package entirely in terms of TeX notation is an important factor, here. While it would be possible to make new macros that produce mmultiscripts for MathJax output, Martin would have to provide LaTeX-based definitions of the same macros in order to have mhchem work in actual LaTeX. So using as much standard LaTeX as possible is also an advantage.

My concern about using ^{} is that Volker's SRE currently will end up with "to the power" with nothing afterward. That can probably be fixed, but it currently doesn't handle this well.

mhchem commented 2 years ago

Going the side-path of "accessible chemistry".

The only reference I found for changes to Nemeth for chemistry is one that shortens some arrows and adds some symbols.

Erm, if you jump from section 3 to section 7, you will see that this is a completely different notation. \ce{5 C2O4^{2-}(aq)} = #5,c2,o4"^2-"(aq) This notation omits the indication of uppercase letters, subscripts, etc. (It's not quite clear how exactly this will translate to Braille characters. 6-dot Braille needs two characters for numbers, 8-dot braille (Eurobraille) needs just one.) Maybe you could provide the Braille representation of the corresponding MathML, but I am sure it will be quite different and more lengthy.

Besides special mathematical Braille systems (English-speaking countries: Nemeth Braille, German-speaking countries: 6 dot Braillemathematikschrift or 8 dot SMSB), LaTeX seems to be the choice for a lot of people. Searching with English terms is tricky, but German terms return several pages of spot-on search results (search for: latex als mathematikschrift für blinde). If LaTeX is used for mathematics, it would also work well for chemistry, in particular with the mhchem (\ce) syntax, I'd reckon.

Disclaimer: While I have a good knowledge about chemical typography, I forgot everything about what these notations actually mean or how you would pronounce it. Also, what I know about Braille is second-hand knowledge from the Internet.

mhchem commented 2 years ago

@dpvc

he wanted to make sure that both LaTeX and MathJax were consistent, and so rewrote the LaTeX package so that both would use the same algorithms internally.

I have to admit this isn't true. mhchem for LaTeX and mhchem for MathJax work differently internally. The MathJax version is much more forgiving of sloppy input (partly because of your first implementation :-)) and – blame on me – it supports a few more special notations than the LaTeX version. Nobody complained yet and motivated me.

dpvc commented 2 years ago

I have to admit this isn't true.

I stand corrected, and apologize for the misinformation. Thanks for the correction!

zorkow commented 2 years ago

@NSoiffer I am confused what you mean by poor MathML. If two representations generate the same typeset expression, are they not equivalent? There might be good vs. bad styles, similar to the syntax of programming languages. If you are trying to infer semantics exclusively from the structure of the syntax, you will always be at the mercy of whatever software has generated that syntax. For example, I would assume mrow elements have only grouping functions, similar to groups in SVG, and while they might add styling information they should not carry any "meaning". SRE would simply strip superfluous mrows, rewrite multiscripts without left indices into sub/superscripts, etc.
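
A minimal Python sketch of the first of those normalizations, unwrapping every mrow that has exactly one child, might look like the following. This is not SRE's code; `strip_mrows` is a hypothetical helper written for this illustration. It drops the `data-mjx-texclass` attribute together with the mrow, which is acceptable for a consumer that only wants structure but loses the spacing information described earlier in the thread. It also assumes un-namespaced input (no `xmlns`), since ElementTree would otherwise prefix every tag name with the namespace.

```python
import xml.etree.ElementTree as ET

def strip_mrows(node):
    """Return a copy of node with every single-child <mrow> unwrapped."""
    children = [strip_mrows(child) for child in node]
    if node.tag == "mrow" and len(children) == 1:
        # The wrapper adds no grouping, so its only child replaces it.
        return children[0]
    copy = ET.Element(node.tag, dict(node.attrib))
    # Keep token text such as "SO"; drop whitespace-only layout text.
    copy.text = (node.text or "").strip() or None
    copy.extend(children)
    return copy

source = """
<mrow data-mjx-texclass="ORD">
  <mrow data-mjx-texclass="ORD">
    <mi data-mjx-auto-op="false">SO</mi>
  </mrow>
</mrow>
""".strip()

result = strip_mrows(ET.fromstring(source))
print(ET.tostring(result, encoding="unicode"))
```

Because a single-child mrow is replaced by exactly one node, the argument count of a surrounding msub or msup is unchanged, so the rewrite cannot turn valid MathML into invalid MathML.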

Also, why don't you simply use the more informative LaTeX notation here, instead of throwing out the knowledge and then trying to regain it by interpreting the MathML? A few years back I experimented with some chemistry rules on the basis of mhchem in SRE and ultimately wrote some for the LaTeX representations directly rather than any intermediary MathML. Hence they are not yet in use, and the maths interpretation is what you currently see, as @dpvc has pointed out. However, I hope to use them soon, as we are currently moving towards using LaTeX more directly in SRE (the [latex-to-speech] package is a first step).

This is in particular in order to support accessibility. As @mhchem pointed out the Nemeth use case is rather small, given that over the last decades in a number of European countries, visually impaired children are being taught LaTeX directly, starting in primary school, which particularly helps communicating mathematics with non-specialist teachers and thus furthers inclusive education. In some of these countries (notably Germany and Slovakia, but others are sure to follow) specialist Braille systems have been superseded entirely by translating LaTeX code into 8-dot Euro Braille. I feel it would be a terrible precedent to ignore these stakeholders in order to enable some rare symbols in Nemeth that will -- let's be honest -- ultimately only be used and understood by very few people.

Now, whether LaTeX will also be used for chemistry and if so if it will be mhchem I would not know... But I am currently working with some people in Baden Wuerttemberg, so I am sure going to find out.

NSoiffer commented 2 years ago

Apologies, the earlier replies got buried in my mail stream. So a little catch up: @dpvc: thanks for the history/description of how mhchem and mathjax work together. The correction by @mhchem indicates that maybe some other macros would produce simpler MathML.

@mhchem re braille:

\ce{5 C2O4^{2-}(aq)} = #5,c2,o4"^2-"(aq) This notation omits the indication of uppercase letters, subscripts, etc.

Maybe you were a little hasty in looking at the RHS: dot 6 ("," in ASCII braille) is the capitalization char in Nemeth (and UEB). You'll notice that it is before the 'c' and the 'o'.

Also, simple numeric subscripts in Nemeth (regardless of it being chemistry) skip the subscript indicator. It's an optimization in Nemeth but not UEB. It does lead to a complication though when the subscript is followed by a number, so a special rule is needed for that case. Many translators that I tried fail to handle that special case correctly. If you look at example 7.1-3, you see that there is a subscript indicator (dots 5-6/";" in ASCII braille). Nemeth code always states the nesting of the sub/superscripts, so you'll see two subscript indicators before the "s" and "p". So again, I don't see anything special for Nemeth code for chemistry here.

@zorkow was on a call today with me and others. I was aware Germany and a few other European countries were teaching LaTeX in grade school. However, I learned from @zorkow today that there is a move to use (a predefined subset of) LaTeX for the braille math and that is replacing the old braille math code. With 8 dot braille, that makes a lot of sense. I do agree that assuming they expand that to include mhchem, that would be a real boon for accessibility since your design is clear and concise.

As @zorkow also pointed out, maintaining the TeX via semantics/annotation in MathML would make it easier to produce the braille for the LaTeX assuming it started life as TeX (and also that it didn't have unknown macros/weird redefinition of the syntax).

@NSoiffer I am confused what you mean by poor MathML. If two representations generate the same typeset expression are they not equivalent?

@zorkow: by "poor" I mean significantly more complicated than needed. It is definitely not the case that two notations that produce the same rendering are equivalent. Display may be by far the most important usage, but accessibility and search are other usages. It is sort of like saying two programs are the same if they produce the same result: one might take a linear amount of time and the other an exponential amount of time, or one might be 20 lines of code and the other 1,000 lines. But I think this type of discussion is not productive and hopefully you don't feel compelled to extend it.

I do agree that there is good semantic info in the macro being used. The Math WG is adding an intent arg to MathML to capture exactly this type of information ("author intent"). I have focused the design of MathCAT around the idea of "intent" and others are working on generating intent from TeX, so we should have examples of the workflow that goes from authors to AT by the time the MathML 4 spec becomes a recommendation (which will be a while). If mhchem/MathJax eventually generates intent, it would make it much easier to reliably generate good speech for all AT.

zorkow commented 2 years ago

Sorry, for replying so late, I was on holidays.

I agree with @NSoiffer, we should not protract this discussion; so here are a few final remarks from my side:

Display may be by far the most important usage, but accessibility and search are other usages.

All three points are debatable:

I am not aware of many systems that actually use MathML as a basis for display.
MathML is NOT accessible out of the box. Otherwise, why the need for specialised systems (or intent)? So preprocessing has to happen, no matter what. In fact, the design decision I regretted the most in SRE was to base its input on MathML rather than something more powerful.
While I doubt that there are many use cases for formula search, beyond secondary school math before exams, all half-way working search engines (e.g. searchonmath) are based on LaTeX. And even for MathML I would assume some canonicalization would be important.

It is sort of like saying two programs...

Poor comparison, in my opinion, as any "mess" added to a MathML expression will only add constant overhead to the XML parsing. But I am not an expert on the MathML spec and there might be hidden traps.

What I find more worrying is that this good XML property is now discarded with intent. From what I have learned so far about the intent attribute is that now strings with holes (variables) have to be understood by AT, adding unpredictable complexities via

a need for additional string parsing
evaluation of variables
lookup in the DOM, which is not compatible with recursive descent
potential risk of circularity

In many ways we already have a version of this idea in MathJax, using the semantic embeddings via O(1) data attributes. When we first published that we had some severe pushback as "this was not in line with the MathML spec". But now MathML is trying to reinvent that via programming in strings?

And, if they are serious about intent, then surely that could be applied to any DOM structure as we do in MathJax. Why the insistence on a single input language like MathML?

In the call @NSoiffer mentions, @NSoiffer suggested that I would try to break things to "retain job security" with MathJax. I do find this a tad insulting; none of my math work was ever done for commercial gain. Everything was always open source and transparent.

As such I strongly object to the idea of enforcing a particular format simply because it serves current commercial technologies, instead of facilitating a paradigm shift in math accessibility, which might be far more helpful for the actual target audience (i.e., the next generation of visually impaired students). In particular, pushing a "standard" as a requirement for accessibility, even if that is detrimental to accessibility efforts in non-Anglo-Saxon parts of the world, is in my opinion unethical, but unfortunately seems to be in line with a US-centric view that is all too common in W3C working groups.

NSoiffer commented 2 years ago

My turn to apologize for a late reply -- I was on the road for four weeks...

Let me start by saying I thought that you (@zorkow) know me well enough that any mention of retaining job security was meant as a joke. You know I respect your work and all you have done for the community. I'm sorry to hear you were offended by it and I apologize for that.

As for the rest of your comments, we agree it is not productive to extend this discussion, so I won't go through a point by point rebuttal and just say I disagree with a number of your comments. If you want to discuss them in more detail, I suggest opening a different issue, perhaps in the MathML issue tracker.