Chemistry and math layout

w3c / mathml

MathML4 editors draft

https://w3c.github.io/mathml/

Chemistry and math layout #92

Closed NSoiffer closed 1 year ago

NSoiffer commented 5 years ago

Layout of chemical formulas is very similar to laying out math. People often use math editors to enter those formulas, which means they will show up as MathML on the web.

Here are some known differences:

1. Chemical elements should not be in italics. For multi-letter chemical elements such as Na (sodium), that's not a problem, but for single letter ones such as H (hydrogen), that will be a problem because mi will normally use italics.
2. Subscripts and superscripts should be positioned slightly differently from the normal typesetting rules (see the TeX book, p179). The idea is that you want the scripts to all align, regardless of whether there is a single sub/superscript or whether both are present. I think this means that if any script is present, it should be treated as if both scripts are present (i.e., use the layout rules for msubsup).
3. Potentially automatic linebreaking rules might want to be different (brought up on 28/5/19 call, but no details were given).

Note: @davidcarlisle was tasked with looking into the TeX chemistry packages and those packages might reveal other layout differences.

'1' can be solved with MathML today by using mathvariant="normal". Alternatively, maybe the sans-serif math alphabetics in Unicode can be used if chemical elements are always sans-serif. However, these solutions don't help with semantics (#92), so another solution might be preferable. For units (meters, seconds, etc.), the MathML WG came out with a note that suggests using class="MathML-Unit". A similar thing could be done here. Alternatively, these might be tagged with some sort of "role" information and that semantic info could be pulled into the rendering. Personally, I think semantics and display should be kept separate.

'2' can be solved with MathML today with a hack of using mphantom or mspace or something else for the empty script if only one real script is present. If '1' is solved via some semantic info that says this is a chemical element, then the layout rules can be adjusted using this info. As with '1', I personally think semantics and display should be kept separate. Allowing none as a value for msubsup is cleaner than mphantom, etc. Another alternative is to introduce an attribute on msub and msup that says "use msubsup layout rules", e.g., msubsuplayout=true/false, with default false.

Until we know more about linebreaking of chemical formulas (which should be really rare), I don't have any proposals.

Using some of the above to put up a straw man proposal

<mrow>
  <msubsup>
    <mi role="math-element">Fe</mi>
    <mn>2</mn>
    <mrow> <mo>+</mo> <mn>2</mn> </mrow>
  </msubsup>
  <msub msubsuplayout="true">
    <mi role="math-element">Cr</mi>
    <mn>2</mn>
  </msub>
  <msub msubsuplayout="true">
    <mi mathvariant="normal" role="math-element">O</mi>
    <mn>4</mn>
  </msub>    
</mrow>

Here's an alternative using "none":

<mrow>
  <msubsup>
    <mi role="math-element">Fe</mi>
    <mn>2</mn>
    <mrow> <mo>+</mo> <mn>2</mn> </mrow>
  </msubsup>
  <msubsup>
    <mi role="math-element">Cr</mi>
    <mn>2</mn>
    <none/>
  </msubsup>
  <msubsup>
    <mi mathvariant="normal" role="math-element">O</mi>
    <mn>4</mn>
    <none/>    
  </msubsup>    
</mrow>
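
The "none" alternative above could be produced mechanically by an authoring tool. The following is a minimal Python sketch of that idea, not part of the proposal itself. It assumes a MathML fragment without a namespace declaration, and the function name pad_scripts is hypothetical. Note that an empty `<none/>` child of msubsup is what this thread proposes; it is not something MathML allows today.

```python
import xml.etree.ElementTree as ET


def pad_scripts(mathml: str) -> str:
    """Rewrite every msub/msup as msubsup with an empty <none/> script.

    Assumption: the input uses un-namespaced element names (no xmlns).
    """
    root = ET.fromstring(mathml)
    # Snapshot the elements first, because the tree is modified in the loop.
    for el in list(root.iter()):
        if el.tag == "msub" and len(el) == 2:
            el.tag = "msubsup"
            el.append(ET.Element("none"))     # base, subscript, empty superscript
        elif el.tag == "msup" and len(el) == 2:
            el.tag = "msubsup"
            el.insert(1, ET.Element("none"))  # base, empty subscript, superscript
    return ET.tostring(root, encoding="unicode")


example = "<mrow><msub><mi>Cr</mi><mn>2</mn></msub><msup><mi>H</mi><mo>+</mo></msup></mrow>"
print(pad_scripts(example))
```

With this approach the renderer only ever sees msubsup for chemical scripts, so a single set of layout rules applies whether one or both scripts are present.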

josephwright commented 5 years ago

On line-breaking, people do break in the middle of a formula, but a break should keep each element name together with any subscripts. So one might break "C6H12O6" as "C6-H12O6" or "C6H12-O6" but not anywhere else. However, as you say, it's pretty unusual to break such cases: normally it's formal names that are 'fun'.
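
As a rough illustration of that rule, here is a small Python sketch that lists the permitted single break points of a simple formula. The regular expression and the function name break_candidates are assumptions made for this example; the pattern only handles an element symbol followed by an optional count, and it rejects anything with parentheses, charges, or abbreviations such as THF.

```python
import re

# One group = an element symbol (capital letter, optional lowercase letter)
# followed by its optional count. This simple pattern is an assumption.
GROUP = re.compile(r"[A-Z][a-z]?\d*")


def break_candidates(formula: str) -> list[str]:
    """Return every way to break `formula` once, between two groups."""
    groups = GROUP.findall(formula)
    if "".join(groups) != formula:
        # Something other than plain symbol+count groups was present.
        raise ValueError(f"unsupported formula: {formula!r}")
    return [
        "".join(groups[:i]) + "-" + "".join(groups[i:])
        for i in range(1, len(groups))
    ]


print(break_candidates("C6H12O6"))  # ['C6-H12O6', 'C6H12-O6']
```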

josephwright commented 5 years ago

Perhaps worth noting on sub/superscripts that IUPAC have said that compound ions should have charges after any subscript numbers (https://iupac.org/wp-content/uploads/2015/07/Green-Book-PDF-Version-2011.pdf, p 51). Thus what in TeX-like terms might be expressed as SO_{4}^{2-} should have the 2- clearly after the 4, as the charge applies to the entire ion: in TeX-like terms one would use SO_{4}{}^{2-}. (See http://mirror.ctan.org/macros/latex/contrib/chemformula/chemformula_en.pdf page 11 for an 'automated' approach.)

mhchem commented 5 years ago

I am the author of mhchem (for LaTeX, MathJax, KaTeX). First of all, math typesetting is well suited for chemistry, but there are many more fine details that you have not yet mentioned, like bonds, inner dashes and dots, italic prefixes etc. Upright Greek characters are an important, but often missing, feature. You might want to take a look at https://mhchem.github.io/MathJax-mhchem/ to see a collection of examples. As there are so many fine details, a more structured approach than a thread of comments like this could be needed. From my experience, I would avoid semantic markup, i.e. giving each part of η²-C₂H₄ a description of why it is typeset as it is. The same notation (upright Greek, dash, dot, ...) can have several very special chemical meanings, depending on the field of chemistry, with new meanings being added (and forgotten) all the time. I can see in the examples above that you are suggesting typographic semantics. This is exactly what I would recommend.

physikerwelt commented 5 years ago

@mhchem thank you for your input. I hope we will not end up using deprecated versions this time.

davidfarmer commented 5 years ago

The same notation (upright greek, dash, dot, ...) can have several very special chemical meanings, depending on the field of chemistry, with new meanings being added (and forgotten) all the time.

Suppose I am authoring in mhchem in LaTeX. Is that markup semantic?

The point of my question is that if the original source is semantic, then I would like to retain that information all the way to the browser.

I understand that the same notation (the same, visually) can have different meanings, but that is exactly the problem I want to address. A simple math example: What does |X| mean?

The answer is that I can't tell without knowledge and context. And without knowing what it means, I can't pronounce it.

But if the LaTeX source was \card{X} then I know for sure that it means "cardinality of X", as opposed to absolute value or determinant. If the LaTeX source was |X| then too bad. But my hope is that authors can be induced to write more semantically.

Hence my question: is mhchem semantic now? If not, how hard will it be to make it semantic?

I am happy with an answer that applies most of the time, for the first half of the undergraduate curriculum.

mhchem commented 5 years ago

The mhchem syntax is not semantic in your sense. Example: it says what part goes into a superscript, but it does not say why. I see that this leads to problems with speech output and machine interpretation. But I don't think users would be willing to have a dozen commands to create a semantic right-hand superscript when a simple ^ would do the trick. I think the "straightforward", "little typing" syntax is what makes mhchem popular. (By the way, spoken chemistry is also highly context-dependent with a lot of "homophones" like "C twelve". But my experience with that is very limited.)

NSoiffer commented 5 years ago

Great to see all this input!

@davidcarlisle was tasked with finding out a bit more about the various TeX packages for chemistry and hence getting an expert like @mhchem providing input.

I probably should have added those links into the original issue as looking at those packages can be instructive. Here are the links:

He also included a link to siunitx, a package for units that is tangentially related to this thread.

What struck me on skimming through them is how much they were focused on shortening the input. Here's an mhchem example from the pdf: \ce{Hg^2+ ->[I-] HgI2 ->[I-] [Hg^{II}I4]^2-} Note that {}s are not used around the multichar superscript, subscripts are implied, and \atop isn't used for the arrow annotations. chemformula is similar in its design to shorten input.

One of the goals of the refresh effort is to be explicit about the layout rules. If we plan to include chemical layout in MathML (which I think we should), we need to make sure MathML can handle any differences. We also need to decide whether to add things to MathML such as attributes that make it easy to handle the differences or whether we require authoring tools to make the tweaks explicit. As a simple case, mi has the default value of mathvariant="auto" to simplify tagging of mi. That's not what is desired for chemistry. There are three possible ways to deal with this:

1. Require tools to generate mathvariant="normal" on all mi
2. Equivalently, have tools generate mathvariant="normal" on the math element
3. Expand the meaning of auto (the default) so that it understands that in a chemistry context (however that is specified), an upright font should be used.
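
To make the first option concrete, here is a minimal Python sketch of what an authoring tool's post-processing step might look like. It is an illustration under stated assumptions, not a recommended implementation: the function name force_upright is hypothetical, the input is assumed to be a MathML fragment without a namespace declaration, and any mathvariant the author already set is left alone.

```python
import xml.etree.ElementTree as ET


def force_upright(mathml: str) -> str:
    """Add mathvariant="normal" to every mi that has no mathvariant yet.

    Assumption: the input uses un-namespaced element names (no xmlns).
    """
    root = ET.fromstring(mathml)
    for mi in root.iter("mi"):
        # setdefault keeps an explicit author choice such as "italic".
        mi.attrib.setdefault("mathvariant", "normal")
    return ET.tostring(root, encoding="unicode")


example = '<mrow><mi>H</mi><mn>2</mn><mi mathvariant="italic">x</mi></mrow>'
print(force_upright(example))
```

Leaving explicit values untouched matters for mixed cases where a formula deliberately contains an italic identifier.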

Note that currently mathvariant is planned to be left in full but removed from core as a legal value on mstyle (#1, #89), and hence from math. So the second option above would be legal in the full spec, but is not legal in core.

Hence, it is important to collect a list of the differences between math and chemistry. Once we have that list, then we look at various options for each and hopefully come up with a unified strategy for dealing with those differences. We may also find that MathML is missing a couple of features that need to be added.

mhchem commented 5 years ago

You might also be interested in the mhchem for MathJax manual. It has more special syntax than the LaTeX version and a live "test-drive" at the end of the page where you can type in and see the results immediately.

Most parts of chemical equations are in an upright font, but not all. If you think about extending auto (option 3), you could make this quite complex. The x in \ce{NO_x} is italic, and so are the n in \ce{Fe^n+} and the i in \ce{i-Pr}. But I think this would not fit well with your aim of having a concise MathML spec that is not too difficult to implement.

Please think about chemistry-in-math and math-in-chemistry use cases, e.g. $C_p[\ce{H2O(l)}]$ , \ce{CuS($hP12$)} and the examples above.

And here are a few examples that might need extended layouting options:

davidfarmer commented 5 years ago

I just went through the first dozen or so pages of the PDF documentation for the mhchem bundle, and it looked quite semantic to me. By that I mean, one could write a script to parse the contents of \ce{} and determine unambiguously what it means. Some examples: \ce{H+} \ce{CrO4^2-} \ce{^227_90Th+} \ce{(NH4)2S}. It seems that the uses of ^ and _ are completely unambiguous. Given the (finite!) list of elements, you can recognize numbers and the + and - symbols, and you can parse the expression.

Reactions: \ce{A <--> B} \ce{A ->[H2O] B} \ce{SO4^2- + Ba^2+ -> BaSO4 v} Those are made of the pieces mentioned above, along with a small number of new pieces.

There are a few other things, but in each case the notation seems to be unambiguous. Maybe the case of nested expressions (chem in math in chem) is tricky, but there are not really that many different things that can happen.

Unless I am missing something, this situation is pretty similar to how I view LaTeX markup.

Is an equals sign distinguished from a double bond by the spaces around it?

mhchem commented 5 years ago

Well, what is semantic? The mhchem syntax is a typographic notation; the transformation mhchem->LaTeX is unambiguous. Take \ce{[Hg^{II}I4]^2-}, for instance. There are two right-hand superscripts with different meanings (oxidation number and charge). One has to look at the content to understand the meaning. Sometimes, in contexts where one is not interested in charges at all, one can even write arabic oxidation numbers (instead of roman ones). When it comes to new concepts, chemists simply define (locally) a new meaning for a certain notation. This way, the same typographic notation could be used in different branches of chemistry and they might not even be aware of that – or even understand the meaning of the other notation.

Yes, spaces are a very important semantic element in mhchem syntax.

Don't expect a finite list of elements. Chemists make up new names all the time (D, T, M, THF), some are just conventions within a single article.

davidcarlisle commented 5 years ago

One interesting thing that people can do with the mhchem for mathjax live demo feature is to see the generated mathml(3) code using mathjax right menu view mathml code option. One thing I notice when trying a few examples there is the use of

            <msup>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mn>2</mn>
                <mo>+</mo>
              </mrow>
            </msup>

The phantom X appears in several places, acting as a \mathstrut to force superscripts to a fixed height that does not depend on the base. It might be nice if we could make that simpler. superscriptshift might actually be enough: it is a minimum shift rather than a fixed shift, but it would still be enough to force the same height on upper and lower case letters in the base.

In full I picked this example

\ce{$K = \frac{[\ce{Hg^2+}][\ce{Hg}]}{[\ce{Hg2^2+}]}$}

which generated this MathML

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mstyle mathcolor="#a33e00">
    <mrow class="MJX-TeXAtom-ORD">
      <mi>K</mi>
      <mo>=</mo>
      <mfrac>
        <mrow>
          <mo stretchy="false">[</mo>
          <mrow class="MJX-TeXAtom-ORD">
            <mrow class="MJX-TeXAtom-ORD">
              <mi mathvariant="normal">H</mi>
              <mi mathvariant="normal">g</mi>
            </mrow>
            <msup>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mn>2</mn>
                <mo>+</mo>
              </mrow>
            </msup>
          </mrow>
          <mo stretchy="false">]</mo>
          <mo stretchy="false">[</mo>
          <mrow class="MJX-TeXAtom-ORD">
            <mrow class="MJX-TeXAtom-ORD">
              <mi mathvariant="normal">H</mi>
              <mi mathvariant="normal">g</mi>
            </mrow>
          </mrow>
          <mo stretchy="false">]</mo>
        </mrow>
        <mrow>
          <mo stretchy="false">[</mo>
          <mrow class="MJX-TeXAtom-ORD">
            <mrow class="MJX-TeXAtom-ORD">
              <mi mathvariant="normal">H</mi>
              <mi mathvariant="normal">g</mi>
            </mrow>
            <msub>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded height="0">
                    <mn>2</mn>
                  </mpadded>
                </mrow>
              </mrow>
            </msub>
            <msup>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mn>2</mn>
                <mo>+</mo>
              </mrow>
            </msup>
          </mrow>
          <mo stretchy="false">]</mo>
        </mrow>
      </mfrac>
    </mrow>
  </mstyle>
</math>

davidfarmer commented 5 years ago

Let's use a different term than "semantic". Let's say "requires only local context to infer meaning". And let's stick to the undergraduate curriculum: of course some researcher can write a paper using some crazy and inconsistent notation. That markup will be misinterpreted by a screen reader no matter what we do.

By "local context" I mean: only that one equation, and little knowledge of the subject.

My assertion is that mhchem syntax "requires only local context to infer meaning". In particular, "^" does not mean "exponent". It means "oxidation number" when followed by roman numerals, and it means "charge" when followed by an integer and a plus or minus.

Thus, one could process mhchem syntax into a form that explicitly (and verbosely) encodes the semantics.

It is the same for math: "^" does not mean "superscript". It means "upper limit" when following \int, and it means other things depending on where it occurs, and one only needs local information to deduce what it is. As long as one uses "^" only for those cases where local context determines meaning, all is good and we have semantic source.

Of course, someone could write A^T for "the transpose of A". But they shouldn't. They should write \transpose{A} . That has the added benefit of making it easy to use the convention of writing the "T" on the left, or as lower case. The markup is the same, but the macro definition is different. For this to work, one needs to think in terms of encoding the meaning, not encoding the appearance. And there may need to be an extra preprocessing step before converting to the output format.

mhchem commented 5 years ago

I'd say, in more than 99% of the cases, one can infer the meaning by "local context". I skipped through the Green Book and the Red Book and found at least these meanings of right superscripts:

charge: ^- ^2- ^3- ^+ ^2+ ^3+ ^0 (when on a particle)...
oxidation: ^I ^II ^III ^IV ^-I ^-II ^-III ^0 (when at an element) ^{(I)} ^{(II)} ^{(III)} ...
excited: ^*
radical: ^. ^2.
radical and charge: ^.- ^(2.)- ^(2.)2+ ...
Kroeger Vink notation has completely different semantics: ^x ^. ^.. ^2. ^' ^''
hapticity: \eta^2 \eta^3 \eta^4
number of donor atoms: \kappa^2
(bonding number: \lambda^5)

There are more, for sure.
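
To show how far "local context" gets with such a list, here is a small Python sketch that classifies a right superscript by its content alone. It is an assumption-laden illustration, not how mhchem works: the patterns cover only four of the cases above, and the names RULES and classify_superscript are made up for this example.

```python
import re

# Each rule pairs a label with a pattern for the superscript text.
# Only a few of the meanings listed above are covered (an assumption
# made to keep the sketch short).
RULES = [
    ("charge", re.compile(r"\d*[+-]")),
    ("oxidation number", re.compile(r"-?[IVX]+")),
    ("excited state", re.compile(r"\*")),
    ("radical", re.compile(r"\d*\.")),
]


def classify_superscript(script: str) -> str:
    """Return the first label whose pattern matches the whole superscript."""
    for label, pattern in RULES:
        if pattern.fullmatch(script):
            return label
    return "unknown"


for script in ["2+", "II", "-II", "*", "2.", "0"]:
    print(script, "->", classify_superscript(script))
```

The superscript 0 deliberately falls through to "unknown": per the list above it is a charge on a particle but an oxidation number at an element, so the content of the superscript alone cannot decide it.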

I don't see your point, why one should not be able to infer the meaning of A^{\mathrm{T}}. (A^T would definitely be false.) What could go wrong if a screenreader read every instance of "latin uppercase italic letter, with a right superscript upright T" as "the transpose of A" (or whatever letter)?

davidfarmer commented 5 years ago

I like the list of 9 examples of "^" in mhchem. It seems like the first 5 occur when an element is to the left of the "^", and what occurs to the right is different. Thus, all 5 of those are unambiguous.

I don't know enough about the others to say anything.

Back to A^t, which I wrote this time, as some people do, with a lower case "t". It is impossible to tell what A^t means if that is all you see. It could be an exponential function of the real variable t. I'd guess it is more often that, than it is the transpose of the matrix A. And when all you have is A^t, you don't even know whether or not A stands for a matrix.

That is what can go wrong if a screen reader always reads A^t as "the transpose of A". And that is why the meaning should be encoded, instead of guessed.

mhchem commented 5 years ago

We are deviating too much here, I guess, but let me add 2 points. First, when you argue with "some people write it non-standard", that can be the case in chemistry too. Second, your mathematical notation is sloppy. An operator is to be set in an upright font, a variable in italics. (This is a universal scientific notation. Looking at StackExchange, the chemical community observes this much more strictly than the physical and mathematical community.)

davidfarmer commented 5 years ago

I appreciate that there are standards for typography, but it is a fact that most popular introductory linear algebra textbooks typeset matrices as slanted capital letters. The other extreme is the Wikipedia page on matrices, which sets them upright and bold.

But the more important point I want to make is that no amount of typography addresses the issue of what A^t means. Even if the "A" is upright (and/or bold), it is impossible to tell that A is a matrix, and it is impossible to tell that A^t is the transpose of A. That is why I want to encode semantic information.

Actually, the motivation is making it possible to pronounce correctly without having to guess. I am not proposing to encode the fact that A is a matrix (although I do not object to encoding that), just encoding that "^t" is the "transpose".

The point of this thread is how to do similar encoding for chemistry, so that charge/valence is pronounced correctly. Maybe encoding that something is an element is more important than encoding that something is a matrix?

mhchem commented 5 years ago

I was talking about the T or t. These are operators, so they are to be typeset upright, so there is no confusion with a variable t. I use the chance to bring this thread back to the chemistry topic: I recommend the IUPAC document On the use of italic and roman fonts for symbols in scientific text.

NSoiffer commented 1 year ago

Lots of good discussion, but I don't see anything here that intent (with the addition of isa) doesn't solve, so I'm closing this.