w3c / mathml-docs

Notes and other documents by the Math WG. See also https://github.com/w3c/mathml-core (repo for the MathML Core spec) and https://github.com/w3c/mathml (repo for the MathML 4 spec).
https://www.w3.org/Math/

TeX2MathML converter implementation guidelines #39

Open physikerwelt opened 2 years ago

physikerwelt commented 2 years ago

TLDR: Can we create a list of LaTeX commands that generate all elements described by the core spec?

The goal of the Wikimedia community group math is to improve the display of mathematical expressions in Wikipedia. Indeed, using browser-based MathML rendering to deliver high-quality formulae is desirable. The new MathML core specification seems promising as it appears to be detailed enough to implement and evaluate MathML rendering engines based on the spec. Therefore, there are good reasons to be optimistic. Once the spec is final and the rendering engines have been implemented, reasonable MathML markup will lead to appealing rendering results that the community will appreciate.

However, the de facto standard in 2022 for authoring and rendering mathematical formulae is the TeX family of formats. Therefore, I suggest a deeper investigation of the conversion process from TeX-like input formats to MathML. We need conversion tools that generate the intended MathML 4 output from TeX-like input as a prerequisite for our new MathML 4 standard to become a success story. In 2018, we evaluated several TeX2MathML conversion tools, including those listed on our tool page. At that point, we created a manual gold standard dataset for Presentation and Content MathML. However, the gold standard dataset's quality might not be optimal, as it was influenced by LaTeXML. In particular, we used LaTeXML to generate the initial version of the MathML output and fixed only the problems we happened to spot in that output.

Therefore, I suggest creating a non-normative document describing how to convert TeX expressions to the corresponding MathML core expression. While this task is open-ended, I recommend stopping after all elements described in the MathML core spec have at least one corresponding LaTeX input.

Once that is completed, and if we still have enthusiasm, we could extend the exercise from core to intent. Here one could stop, for example, after having touched all symbols with the planned custom style tag annotations and their corresponding Content MathML representations.

Disclaimer: I am currently considering implementing a texvc to MathML converter in PHP. For a TDD workflow, it would therefore be good to be able to generate meaningful test cases.
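A minimal sketch of how such a list could double as a coverage check for a TDD workflow. The element set below is only an illustrative subset of MathML Core, and the LaTeX-to-element pairs are assumptions about what a typical converter would emit, not an agreed mapping.

```python
# Illustrative subset of MathML Core element names (not the full list from the spec).
CORE_ELEMENTS = {
    "mi", "mn", "mo", "mfrac", "msqrt", "mroot",
    "msub", "msup", "msubsup", "mover", "munder",
}

# Hypothetical test cases: LaTeX input -> the element it is meant to exercise.
CASES = {
    r"x": "mi",
    r"2": "mn",
    r"+": "mo",
    r"\frac{a}{b}": "mfrac",
    r"\sqrt{x}": "msqrt",
    r"\sqrt[3]{x}": "mroot",
    r"x_i": "msub",
    r"x^2": "msup",
    r"x_i^2": "msubsup",
    r"\overline{x}": "mover",
}


def uncovered(elements, cases):
    """Return the elements that no LaTeX test case targets yet, sorted by name."""
    return sorted(set(elements) - set(cases.values()))


print(uncovered(CORE_ELEMENTS, CASES))  # prints ['munder']
```

The exercise proposed above would be finished for the core elements once `uncovered` returns an empty list for the complete element set.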

dginev commented 2 years ago

Collecting macros for creating a tiny tree with each of the MathML Core elements is certainly quite doable.

My fear is that to make it useful, you'd also have to collect the macros needed for generating the different meaningful values for each of the MathML Core attributes.

And then macros to generate some of the idiomatic expression trees.


For example, would such a list care that a script of an <msup> is better attached to a completed parenthetical base, rather than the closing fence?

( \ldots )^2

attached to the full parenthetical base (latexml with grammar enabled):

  <msup>
    <mrow>
      <mo stretchy="false">(</mo>
      <mi mathvariant="normal">…</mi>
      <mo stretchy="false">)</mo>
    </mrow>
    <mn>2</mn>
  </msup>

vs attaching to the closing fence (via mathjax):

  <mo stretchy="false">(</mo>
  <mo>&#x2026;</mo>
  <msup>
    <mo stretchy="false">)</mo>
    <mn>2</mn>
  </msup>

vs not attaching at all (latexml with grammar disabled via --noparse):

  <mo>(</mo>
  <mi mathvariant="normal">…</mi>
  <mo>)</mo>
  <msup>
    <mi/>
    <mn>2</mn>
  </msup>

The elements are about the same, but the trees are markedly different. Well, there's also apparent debate whether the ellipsis is an <mi> or <mo> -- which is also a question for a useful list, should most of the math symbols be given clear MathML Core targets?
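To make the difference between the three trees concrete, here is a small sketch that parses each variant and reports what the `<msup>` base actually is. The fragments are the ones quoted above, wrapped in a `<math>` root so that each parses as a single document; the helper name `msup_base` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

FULL_BASE = (
    '<math><msup>'
    '<mrow><mo stretchy="false">(</mo><mi mathvariant="normal">…</mi>'
    '<mo stretchy="false">)</mo></mrow>'
    '<mn>2</mn></msup></math>'
)
CLOSING_FENCE = (
    '<math><mo stretchy="false">(</mo><mo>&#x2026;</mo>'
    '<msup><mo stretchy="false">)</mo><mn>2</mn></msup></math>'
)
NO_BASE = (
    '<math><mo>(</mo><mi mathvariant="normal">…</mi><mo>)</mo>'
    '<msup><mi/><mn>2</mn></msup></math>'
)


def msup_base(mathml):
    """Return (tag, text content) of the first child of the first <msup>."""
    root = ET.fromstring(mathml)
    msup = next(root.iter("msup"))
    base = msup[0]
    return base.tag, "".join(base.itertext())


for name, source in [("full base", FULL_BASE),
                     ("closing fence", CLOSING_FENCE),
                     ("no base", NO_BASE)]:
    print(name, msup_base(source))
```

The three calls return `("mrow", "(…)")`, `("mo", ")")` and `("mi", "")` respectively, so a test that only counts element names would not distinguish the variants, while a test on the script base does.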

I fear that a useful list will have to spend a lot more writing in talking about tree structure, than the individual leaf elements. Still worth starting, but I'd expect it to hit 50-100 pages in size pretty quickly if we include that area of consideration.

davidfarmer commented 2 years ago

In order to get the intent attributes, there will need to be TeX macros corresponding to the intents.

For example, suppose we want to convey this (true but mostly useless) fact: If x squared is strictly between 0 and 100 then the absolute value of x is in the open interval from 0 to 10.

Possible TeX markup to capture that meaning (although it might not be pronounced as written above) is: If $0 < x^2 < 100$ then $\abs{x} \in \oointerval{0}{10}$.

The "in" macro is standard.

The macros "abs" and "oointerval" need to be defined. (And maybe have different names.) Depending on how the macro is defined, the open-open (i.e., open at both ends) interval could be written as \oointerval{0, 10}, making it a function of one parameter but requiring the author to type the comma.

My point is that the absolute value and the open interval have to be macros, because this version requires guessing the intents: If $0 < x^2 < 100$ then $|x| \in (0, 10)$.

It will be good to have a discussion about how we think actual authors will write material that captures the intent.
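As a rough illustration of what a converter could emit for such macros, the sketch below builds MathML with `intent` and `arg` attributes. Everything here is an assumption made for the example: the function names, the concept names `absolute-value` and `open-interval`, and the exact intent syntax, which should be checked against the current MathML 4 draft.

```python
import xml.etree.ElementTree as ET


def leaf(tag, text):
    """Create a token element such as <mi>x</mi> or <mn>0</mn>."""
    element = ET.Element(tag)
    element.text = text
    return element


def fenced(intent, open_fence, close_fence, named_args, separator=","):
    """Wrap named arguments in fences inside an <mrow> carrying an intent."""
    row = ET.Element("mrow", {"intent": intent})
    row.append(leaf("mo", open_fence))
    for index, (name, child) in enumerate(named_args):
        if index > 0:
            row.append(leaf("mo", separator))
        child.set("arg", name)  # lets the intent refer to this child as $name
        row.append(child)
    row.append(leaf("mo", close_fence))
    return row


def abs_macro(x):
    # Hypothetical output for \abs{x}
    return fenced("absolute-value($x)", "|", "|", [("x", x)])


def oointerval_macro(a, b):
    # Hypothetical output for \oointerval{a}{b}
    return fenced("open-interval($a,$b)", "(", ")", [("a", a), ("b", b)])


interval = oointerval_macro(leaf("mn", "0"), leaf("mn", "10"))
print(ET.tostring(interval, encoding="unicode"))
```

With plain `|x|` or `(0, 10)` as input, a converter has no macro name to hang the intent on, which is the guessing problem described above.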

physikerwelt commented 2 years ago

> My point is that the absolute value and the open interval have to be macros, because this version requires guessing the intents: If $0 < x^2 < 100$ then $|x| \in (0, 10)$.

@davidfarmer I think even in a controlled environment like Wikipedia, with a very restricted set of commands, it is hard to predict what people will actually write. Often there is a lot of formatting included. For example, for your interval example, the actual code in Wikipedia looks like this:

Both notations are described in [[International standard]] [[ISO 31-11]]. Thus, in [[set builder notation]], : \begin{align} {\color{Maroon}(} a,b{\color{Maroon})} = \mathopen{\color{Maroon}]}a,b\mathclose{\color{Maroon}[} &= {x\in\R\mid a{\color{Maroon}{}<{}}x{\color{Maroon}{}<{}}b}, \{} {\color{DarkGreen}[}a,b{\color{Maroon})} = \mathopen{\color{DarkGreen}[} a,b\mathclose{\color{Maroon}[} &= {x\in\R\mid a{\color{DarkGreen}{}\le{}} x{\color{Maroon}{}<{}}b}, \{} {\color{Maroon}(} a,b{\color{DarkGreen}]} = \mathopen{\color{Maroon}]}a,b\mathclose{\color{DarkGreen}]} &= {x\in\R\mid a{\color{Maroon}{}<{}}x{\color{DarkGreen}{}\le{}} b}, \{} {\color{DarkGreen}[}a,b{\color{DarkGreen}]} = \mathopen{\color{DarkGreen}[} a,b\mathclose{\color{DarkGreen}]} &= {x\in\R\mid a{\color{DarkGreen}{}\le{}} x{\color{DarkGreen}{}\le{}} b}. \end{align} Each interval {{open-open|''a'', ''a''}}, {{closed-open|''a'', ''a''}}, and {{open-closed|''a'', ''a''}} represents the [[empty set]], whereas {{closed-closed|''a'', ''a''}} denotes the singleton set {{math|{''a''}{{null}}}}. When {{math|''a'' > ''b''}}, all four notations are usually taken to represent the empty set.

As you can see, in the absence of a native TeX- or MathML-based way to annotate intent, people came up with custom templates such as closed-open. However, those templates are hard for authors to discover. E.g., the closed-open template is used only 45 times within English Wikipedia. On the other hand, I think the effort people spend writing and rewriting Wikipedia articles is much higher than the effort to write a paper once and upload it to arXiv. Feel free to look at the statistics of the interval example. Just to quote one of many impressive numbers: the average time between edits is 8.2 days. While the TeX code within the wikitext tag <math> (not to be confused with the HTML5 element <math> ;-) produces MathML output right now, the templates in double curly brackets generate text, e.g., for the closed-open example, <span class="texhtml">[<i>a</i>, <i>a</i>)</span>. Certainly one could change the implementation of the templates to also output MathML. So overall, I see it as a big advantage that we have those semantic templates. However, there are several thousand math templates, and one would need to give people good reasons to spend effort modifying them. I could imagine that improved accessibility would be a convincing argument. To make a long story short: if we find a convenient and intuitive way to specify the intent, I expect there is a good chance that it will be implemented in Wikipedia. However, having users change the MathML code either directly or via a WYSIWYG editor is not a solution, because it would be too frustrating if a minor change in the TeX source and the subsequent regeneration of the MathML code reset the intent properties.

physikerwelt commented 2 years ago

> I fear that a useful list will have to spend a lot more writing in talking about tree structure, than the individual leaf elements. Still worth starting, but I'd expect it to hit 50-100 pages in size pretty quickly if we include that area of consideration.

@dginev I think it would be useful to write this, at least for everyone implementing tex2mathml converters, which could be 20 people or even more. I am afraid I might have co-authored papers that are read by fewer people ;-) Maybe we can just start something and see how it goes. Can you recommend an authoring tool?

dginev commented 2 years ago

> Can you recommend an authoring tool?

I think for this type of collaborative writing, HackMD may be my current default choice. They have highlighting of TeX and XML snippets (similar to GitHub issues), and also have native MathML rendering (which I had asked of them some time back; <math>...</math> markup will render as regular HTML-in-markdown). But it's just one idea - maybe there's something better.

davidfarmer commented 2 years ago

@physikerwelt made what I think is a key point:

"having users change the MathML code either directly or via a WYSIWYG editor is not a solution"

What is needed, especially in an environment like Wikipedia, is either

a) A human-readable, human-writable source format which automatically converts to the desired MathML output.

b) An editing program which lets a person create the content, and which all potential authors can use.

My hope is to support option a). The key point @physikerwelt made is a warning about possible failures of option b).

Both options can coexist if the editing program of b) can output the source format of a).

I like option a) because it provides an archival format which can adapt to future changes in the recommended MathML output.

physikerwelt commented 2 years ago

@davidfarmer exactly. Just to link it back to Wikimedia terminology: a) corresponds to wikitext and b) corresponds to VisualEditor.

> Both options can coexist if the editing program of b) can output the source format of a).

I would like to mention that the development of the VisualEditor was extremely challenging due to the constraint that the wikitext output should remain human-readable and editable. This constraint did not just add a bit of effort; it put the project into a whole new class of problems and increased the effort by several orders of magnitude.

physikerwelt commented 2 years ago

A quick update on that: @hyper-node and I have now converted the LaTeX (subset) parser to PHP and have an AST of the LaTeX representation. We are now looking for ideas on how to generate the MathML output from that AST. @hyper-node is going to look into the MathJax source code to come up with a high-level design for generating the MathML output from that tree. I will generate the lists mentioned before.
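For discussion, one possible shape of the AST-to-MathML step is a recursive walk that maps each node type to one MathML Core element. This is only a sketch in Python rather than PHP; the node classes are invented for the example and do not reflect the actual texvc AST or the MathJax design.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Ident:
    name: str


@dataclass
class Num:
    value: str


@dataclass
class Op:
    symbol: str


@dataclass
class Row:
    children: list


@dataclass
class Sup:
    base: object
    exponent: object


@dataclass
class Frac:
    numerator: object
    denominator: object


def to_mathml(node):
    """Recursively serialize a (hypothetical) LaTeX AST node to MathML."""
    if isinstance(node, Ident):
        return f"<mi>{escape(node.name)}</mi>"
    if isinstance(node, Num):
        return f"<mn>{escape(node.value)}</mn>"
    if isinstance(node, Op):
        return f"<mo>{escape(node.symbol)}</mo>"
    if isinstance(node, Row):
        return "<mrow>" + "".join(to_mathml(c) for c in node.children) + "</mrow>"
    if isinstance(node, Sup):
        return f"<msup>{to_mathml(node.base)}{to_mathml(node.exponent)}</msup>"
    if isinstance(node, Frac):
        return f"<mfrac>{to_mathml(node.numerator)}{to_mathml(node.denominator)}</mfrac>"
    raise TypeError(f"unsupported node: {node!r}")


# AST for the LaTeX input 0 < x^2
tree = Row([Num("0"), Op("<"), Sup(Ident("x"), Num("2"))])
print(to_mathml(tree))
# prints <mrow><mn>0</mn><mo>&lt;</mo><msup><mi>x</mi><mn>2</mn></msup></mrow>
```

A walk like this only works if the parser has already decided the tree structure, for instance what the base of a superscript is, which is where the earlier discussion about idiomatic trees comes in.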

NSoiffer commented 2 years ago

Would people like this to be the focus of the MathML Full meeting this week?

Neil


NSoiffer commented 1 year ago

Move this to mathml-docs.