w3c / mathml

MathML4 editors draft
https://w3c.github.io/mathml/

Adding semantics to presentation MathML using symbol names #141

Closed samdooley closed 2 years ago

samdooley commented 5 years ago

Several options for adding semantics to presentation markup were discussed on the Sep 10 MathML General call. A common thread seems to be a need for a shared vocabulary of mathematical symbols/operators/names.

https://docs.google.com/spreadsheets/d/1ebOkl7Gckfk5g6Dc4C8bpGZtSxLnGwpOHqAwwON0-nI/edit?usp=sharing

I have collected 1749 symbols into a Google sheet as an initial starting point for such a list. The list still needs lots of work, but enough is there to illustrate how one could add semantic information via a role attribute to encode content markup within the presentation markup:

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <mrow>
      <msup role="power">
        <mi role="ci">a</mi>
        <mn role="cn">2</mn>
      </msup>
      <mo role="plus">+</mo>
      <msup role="power">
        <mi role="ci">b</mi>
        <mn role="cn">2</mn>
      </msup>
    </mrow>
    <mo role="eq">=</mo>
    <msup role="power">
      <mi role="ci">c</mi>
      <mn role="cn">2</mn>
    </msup>
  </mrow>
</math>
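As a rough illustration of how a consumer might read these annotations back out, here is a minimal Python sketch that walks a fragment of the markup above and collects every role attribute. The function name `collect_roles` and the shortened `SAMPLE` fragment are assumptions made for this example; nothing here is part of any MathML specification.

```python
import xml.etree.ElementTree as ET

# Shortened fragment of the example above (only the c-squared term).
SAMPLE = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup role="power">
    <mi role="ci">c</mi>
    <mn role="cn">2</mn>
  </msup>
</math>
"""


def collect_roles(mathml_text):
    """Return (local tag name, role, text) for each element that has a role attribute."""
    root = ET.fromstring(mathml_text)
    found = []
    for element in root.iter():  # document order, root first
        role = element.get("role")
        if role is not None:
            local_name = element.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            text = (element.text or "").strip()
            found.append((local_name, role, text))
    return found


roles = collect_roles(SAMPLE)
for local_name, role, text in roles:
    print(local_name, role, text)
```

A converter to content markup could use such a list as its starting point, with the role values looked up in the shared vocabulary.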

The goal for this list is to define unique short identifiers for as many mathematical symbols as possible from widely used sources, including Unicode, Content MathML, LaTeX, Nemeth Braille, and SI units.

These identifiers are intended to be suitable for use in as many markup contexts as possible, including presentation MathML role attributes, content MathML element names, LaTeX macro names, and JSON property names/values.

Each row in the table defines a single symbol, with its unique identifier (ID), a short description (Symbol), a mnemonic example (Example), and a Unicode character (Unicode).

The symbols are listed by type, which gives a rough classification of the symbols according to their syntactic form: symbol, operator, unit, function, large operator, special forms, fences, and scripts.

While the universe of math symbols is necessarily unbounded, this list should include the more common Unicode math symbols, the Content MathML 3.0 element names, the Nemeth braille patterns, and the more common SI units.

This first version is missing lots of symbols, and I know I need to check coverage of the Content MathML elements and the braille patterns. But let me know if there are vocabularies that deserve special attention. Statistics, chemistry, and multi-variable calculus, for example, could clearly use some work, among others.

fred-wang commented 5 years ago

Just FYI, I think this is very similar to issue #64 (there is also #9 for a more generic native a11y implementation issue).

davidcarlisle commented 5 years ago

It might be worth cross referencing against the OpenMath list especially as all the Content MathML element names are already cross referenced to OM.

https://www.openmath.org/symbols/

davidfarmer commented 5 years ago

I have examined an introductory calculus textbook (Active Calculus, by Matt Boelkins) with the goal of determining what is required to make all the math markup unambiguous and semantic.

A draft of my report is here:

https://docs.google.com/document/d/1cZnff5_fi_ucNyZ1ex2msmJLE55FAZD-QInkLYe8xiE

NSoiffer commented 5 years ago

@samdooley: that's a long list, so thanks for all the effort and getting the ball moving along!!!

My comments:

I hope these comments are helpful to start a discussion on your list.

NSoiffer commented 5 years ago

@davidfarmer -- thanks for the list. It seems there are a few things that break "the Soiffer hypothesis", but not many. If the computer could know what the functions are, then distinguishing between function application and multiplication would not be needed. But doing that requires reading the text and can't really be known by just knowing the subject area. So putting that aside (which I don't really think breaks my hypothesis), I see the following as problematic:

Is there more to add to that list?

I don't think it is hard to distinguish between definite and indefinite integrals, but maybe that's just me, as @samdooley makes a distinction in his list (but he also calls out more distinctions).

Note: probably most people in this project can read TeX, but I strongly suspect some people have trouble reading it. For your examples, it would probably be helpful if you included images showing the notation in 2D form.

davidcarlisle commented 5 years ago

I think the lists are a useful starting point for assigning roles, although I'm a bit confused about the TeX-centred description. I don't think we should be specifying a TeX syntax in this group. We should be assigning roles to use on MathML elements. Individual systems or individual users can define TeX macros to produce that markup, but as that is just surface syntax that's expanded out by TeX or JavaScript or whatever, I'm not sure it needs to be standardised.

I'd agree with Neil that integral forms can be distinguished by the presence of limits (that is, the integral operator is wrapped in msubsup or msub) so I'm not sure that more specific roles are needed for integrals.
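The rule described here can be sketched in a few lines of Python: an integral sign that sits as the base of an msub or msubsup is treated as having limits, and any other integral sign as having none. The function name `integral_kind` and the return labels are assumptions for illustration only, and the sketch deliberately ignores other ways limits might be attached.

```python
import xml.etree.ElementTree as ET

INTEGRAL = "\u222B"  # the integral sign


def local(tag):
    """Strip an optional "{namespace}" prefix from an ElementTree tag."""
    return tag.split("}", 1)[-1]


def is_integral_mo(element):
    return local(element.tag) == "mo" and (element.text or "").strip() == INTEGRAL


def integral_kind(fragment):
    """Return "with-limits", "without-limits", or None if there is no integral sign."""
    root = ET.fromstring(fragment)
    # First look for an integral operator that is the base (first child) of msub/msubsup.
    for element in root.iter():
        if local(element.tag) in ("msub", "msubsup") and len(element) > 0:
            if is_integral_mo(element[0]):
                return "with-limits"
    # Otherwise any bare integral operator counts as an integral without limits.
    for element in root.iter():
        if is_integral_mo(element):
            return "without-limits"
    return None


definite = "<mrow><msubsup><mo>&#x222B;</mo><mn>0</mn><mn>1</mn></msubsup><mi>f</mi></mrow>"
indefinite = "<mrow><mo>&#x222B;</mo><mi>f</mi></mrow>"
print(integral_kind(definite), integral_kind(indefinite))
```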

Also in @davidfarmer's list, I'm slightly sceptical that authors will want to use prefix forms for invisible times and function application (Content MathML as an author format suffers from this). Since the presentation forms are infix, I think a TeX infix markup like 3 \invisibletimes x is easier to map to <mn>3</mn><mo>&InvisibleTimes;</mo><mi>x</mi>.
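To make the point about infix input concrete, here is a toy Python sketch of such a mapping. It handles only whitespace-separated tokens, and the function name `infix_tex_to_mathml` is an assumption for this example; a real TeX front end would be far more involved.

```python
def infix_tex_to_mathml(tex):
    """Map a whitespace-separated infix TeX string to a flat run of MathML tokens."""
    parts = []
    for token in tex.split():
        if token == r"\invisibletimes":
            parts.append("<mo>&InvisibleTimes;</mo>")
        elif token.isdigit():
            parts.append(f"<mn>{token}</mn>")
        elif token.isalpha():
            parts.append(f"<mi>{token}</mi>")
        else:
            # Anything else is outside the scope of this sketch.
            raise ValueError(f"unsupported token: {token!r}")
    return "".join(parts)


print(infix_tex_to_mathml(r"3 \invisibletimes x"))
```

Because the input order already matches the presentation order, each token maps to one element with no reordering, which is the advantage being argued for over prefix forms.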

If we do use TeX markup for symbols in any descriptions, I think we should use the unicode-math markup (as that works in TeX). These are all listed in unicode.xml in our git repository, for instance

<mathlatex set="unicode-math">\oint</mathlatex> \oint for ∮

and

<mathlatex set="unicode-math">\mbffrakA</mathlatex> \mbffrakA for 𝕬

in particular we shouldn't use commands like \bf (which is not defined by default in LaTeX).

davidfarmer commented 5 years ago
  • I don't think you included f^(4)(x) which could potentially be confused with power. Especially if you wrote f^(n+1)(x). I suppose knowing that it is functional application would be a good clue that it wasn't power, so maybe this isn't problematic...

There are macros \nthDerivative and \functionPower.

You are correct that f^{(n+1)} could be ambiguous. I'll add it to the writeup. (In Active Calculus, it always means derivative.)

davidfarmer commented 5 years ago

Below is a proposal for how to reorganize Sam's tables.

The underlying problem is similar to what one encounters when designing a database.

"multiplication" can be represented by \cdot, \times, or [space]

\times can mean "multiplication" or "cross product"

Thus, we have a many-to-many relationship.
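The many-to-many relationship can be sketched as a list of (notation, meaning) pairs indexed in both directions. This is a minimal Python illustration using only the two examples above; the names `PAIRS` and `build_indexes` are assumptions, and "[space]" simply stands for juxtaposition as written above.

```python
from collections import defaultdict

# One entry per (notation, meaning) pair, as in a database join table.
PAIRS = [
    (r"\cdot", "multiplication"),
    (r"\times", "multiplication"),
    ("[space]", "multiplication"),
    (r"\times", "cross product"),
]


def build_indexes(pairs):
    """Index the pairs by notation and by meaning."""
    by_notation = defaultdict(set)
    by_meaning = defaultdict(set)
    for notation, meaning in pairs:
        by_notation[notation].add(meaning)
        by_meaning[meaning].add(notation)
    return by_notation, by_meaning


by_notation, by_meaning = build_indexes(PAIRS)
print(sorted(by_notation[r"\times"]))       # meanings that \times can carry
print(sorted(by_meaning["multiplication"]))  # notations that can express multiplication
```

Neither index alone is enough: one notation fans out to several meanings, and one meaning fans out to several notations.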

An additional complication is that Sam wants to encode both the form and the meaning, so that he can convert between different representations of the content (presentation MathML, content MathML, and his editing program).

The conclusion I reach is that two attributes are needed: one that encodes meaning, and one that encodes presentation. Note that some may argue that in many cases it is not necessary to encode presentation (because the content is the representation), but recording the presentation with an ASCII id can be useful.

So we need both the current "ID" column in Sam's table and a new column for "Meaning". I propose that specific column name so that it is clear to us what should go there. (Probably we need two tables: one which has the ID column, and another, much smaller table which has the Meaning column.)

In the HTML, the Meaning will be recorded in the 'role' or 'math-role' or some other attribute to be determined later. We don't need to know that in order to develop the table. The ID can go in the HTML as a 'data-id' attribute. In HTML it is always legal to have an attribute beginning with "data-". It is okay if Sam uses the data-id attribute and others do not.
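A minimal Python sketch of the two-attribute idea follows. The attribute name for the meaning is left as a parameter because it is still to be determined; the helper name `annotate` and the values "cross-product" and "times" are hypothetical and are not taken from Sam's table.

```python
import xml.etree.ElementTree as ET


def annotate(tag, text, meaning, presentation_id=None, meaning_attr="role"):
    """Build an element with the Meaning in `meaning_attr` and an optional data-id."""
    element = ET.Element(tag)
    element.text = text
    element.set(meaning_attr, meaning)
    if presentation_id is not None:
        # The ID is optional: tools that do not need it can leave it out.
        element.set("data-id", presentation_id)
    return element


# Hypothetical values: a multiplication sign used as a cross product.
mo = annotate("mo", "\u00D7", "cross-product", presentation_id="times")
markup = ET.tostring(mo, encoding="unicode")
print(markup)
```

Keeping the attribute name as a parameter means the table can be developed now and the final name ('role', 'math-role', or something else) swapped in later.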

@samdooley : I tried to email you, but it bounced. Has your email address changed?

davidcarlisle commented 4 years ago

I was trying to understand @samdooley's spreadsheet before the call. We seem to have been talking round it for a while, with, I think, two viewpoints leading to a certain amount of disconnect, so I tried refactoring it.

I started by removing all rows that did not share an entry in the Unicode column (F), as they are uniquely identified by their presentation MathML markup, then further removed any rows that did not have multiple meanings after removal of synonyms such as ngt; notgt.
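The filtering steps just described can be sketched roughly as follows in Python. The rows are hypothetical stand-ins, not copied from the spreadsheet: only the ngt/notgt synonym pair and the U+222D triple integral come from this thread, and the (id, character, meaning) layout is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical rows: (id, Unicode character, meaning).
ROWS = [
    ("ngt", "\u226F", "not greater than"),
    ("notgt", "\u226F", "not greater than"),  # synonym of ngt: same character, same meaning
    ("times", "\u00D7", "multiplication"),
    ("cross", "\u00D7", "cross product"),
    ("iiint", "\u222D", "triple integral"),   # character used by one row only
]


def ambiguous_rows(rows):
    """Keep rows whose character carries more than one distinct meaning.

    Collecting meanings into a set merges synonyms, and a character that
    appears in only one row can never have more than one meaning, so both
    filtering steps reduce to this single test.
    """
    meanings_by_char = defaultdict(set)
    for _id, char, meaning in rows:
        meanings_by_char[char].add(meaning)
    return [row for row in rows if len(meanings_by_char[row[1]]) > 1]


kept = ambiguous_rows(ROWS)
print([row[0] for row in kept])
```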

I then did a bit of hand cleanup and ended up with the 29 entries in the attached table (html but attached as .txt for this site)

sym2.txt

Note I blanked out all the units rows to a single UNITS row as I think we do need to specify some markup for (any) unit use.

Note I think the data in the original table is needed, just not in the MathML spec. That is, if you are inferring semantics (or Content MathML or OpenMath) from presentation and hit a U+222D, then you need to know that's a triple integral. Sam's spreadsheet has that data, and any converter (in either direction) needs that information, which is where it came from :-) But I would argue that it should not be in the MathML: you should only put a (math)-role="wibble" on an <mo>&#x222d;</mo> if it is not standing for a triple integral.
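A minimal Python sketch of that rule: the converter keeps a table of default meanings and writes the attribute only when the intended meaning differs from the default. The table contents, the identifier spelling "triple-integral", the attribute name "math-role", and the function name `mo_markup` are all assumptions for illustration.

```python
# Default meanings known to the converter, not stored in the MathML itself.
# Only the U+222D entry comes from the discussion; the identifier spelling is assumed.
DEFAULT_MEANING = {"\u222D": "triple-integral"}


def mo_markup(char, meaning, attr="math-role"):
    """Serialise an <mo>, adding the role attribute only for non-default meanings."""
    code = f"&#x{ord(char):X};"
    if DEFAULT_MEANING.get(char) == meaning:
        return f"<mo>{code}</mo>"
    return f'<mo {attr}="{meaning}">{code}</mo>'


print(mo_markup("\u222D", "triple-integral"))
print(mo_markup("\u222D", "wibble"))
```

The full symbol table still has to exist somewhere, but it lives in the converter rather than being repeated on every element.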

dginev commented 2 years ago

I just discovered this issue today and am particularly interested in @samdooley's list:

I have collected 1749 symbols into a Google sheet as an initial starting point for such a list.

It appears that Google doc has disappeared. Is there a new location?

NSoiffer commented 2 years ago

@samdooley: is the list still around so @dginev can look at it? If it isn't, please close this issue.

NSoiffer commented 2 years ago

No action, so closing issue.