Open DuffJohnson opened 2 months ago
There are also a few other details that are undecided concerning the use and content of the "math" element.
For example, what is the content model for the MathML SE type "mtext": can it contain text SE types from a PDF namespace?
@car222222 I don't believe we ever contemplated use of PDF's standard SE types within
Recall that
No, but it would still be useful to know how the inventors of this methodology intend PDF mark-up to handle non-math material embedded in math material. This may of course need an extension.
PDF markup can handle text-in-Figure, so why not text-in-Formula? Preferably including text-in-math-in-formula.
Noting that MathML does provide the "mtext" SE type, but it cannot itself define the content model for it, since its content is explicitly of a textual nature, and so it needs to have (only) text-related structure elements within it: without these, how could "mtext" ever get used? And without the use of "mtext", how can a PDF contain text properly embedded within "math"?
I believe our intent was that if "non-math material" was embedded within a < Formula > then either...
(a) such content could be effectively accommodated via mathML elements, or (b) correct tagging would result in two (or more) < Formula > elements with standard SE types "in-between".
I can't speak to the limitations (or not) of < mtext >.
Now retreating to my bunker...
Retreat seems very wise!
But someone needs to come up with the recommendations: the problem with "put the text in between two "math" is that the tagging will then no longer represent the true structure/semantics of the math.
In the base MathML spec, <mtext>, like all token elements (mi, mn, mo), just takes text. However, they are explicitly allowed to be extended by any document schema that is incorporating MathML, and this flexibility is used in what is effectively the reference use of MathML: HTML5.
In tex you can use arbitrary text structures within math so
$ x = y = \parbox{3cm}{some text with \cite{..} bibliographic reference
and \ref{..} ref an earlier section} = z$
This would end up in html with an HTML div embedded in MathML mtext
<math>
<mi>x</mi> <mo>=</mo> <mi>y</mi> <mo>=</mo>
<mtext>
<div>
some text with <a href="bib...">cite</a>and bibliographic reference
and <a href="#sec:...">7</a> ref an earlier section
</div>
</mtext>
<mo>=</mo>
<mi>z</mi>
</math>
Note that the embedded text needs access to the ID and linking mechanisms of the outer document, which isn't really possible to do purely in MathML: the reference here is to the ID of a section in the textual part of the document, not an ID set by MathML.
It is currently unclear if it is possible to tag such a formula using either the mathml structure elements or a mathml associated file.
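To make the point about the HTML example concrete, here is a minimal sketch in Python that parses a copy of that markup and lists the non-MathML elements found inside mtext. It is only an illustration: the `MATHML_ELEMENTS` set is a small hand-picked subset, the function name `foreign_in_mtext` is hypothetical, and the `href` values are placeholders standing in for the elided ones in the example above.

```python
import xml.etree.ElementTree as ET

# Small illustrative subset of MathML element names (an assumption, not the full list).
MATHML_ELEMENTS = {"math", "mi", "mn", "mo", "mtext", "mrow"}

# Copy of the example above; the href values are placeholders.
SAMPLE = """
<math>
  <mi>x</mi> <mo>=</mo> <mi>y</mi> <mo>=</mo>
  <mtext>
    <div>
      some text with <a href="#bib">cite</a> bibliographic reference
      and <a href="#sec">7</a> ref an earlier section
    </div>
  </mtext>
  <mo>=</mo>
  <mi>z</mi>
</math>
"""


def foreign_in_mtext(root):
    """Return the tag names of non-MathML elements nested inside any mtext."""
    found = []
    for mtext in root.iter("mtext"):
        for el in mtext.iter():
            if el is not mtext and el.tag not in MATHML_ELEMENTS:
                found.append(el.tag)
    return found


print(foreign_in_mtext(ET.fromstring(SAMPLE)))  # ['div', 'a', 'a']
```

The two `a` elements are exactly the parts that depend on the outer document's ID and linking mechanisms, which is why they cannot be expressed by MathML alone.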
But someone needs to come up with the recommendations: the problem with "put the text in between two "math" is that the tagging will then no longer represent the true structure/semantics of the math.
A more serious problem with that suggestion is that it's only possible in trivial cases.
A more realistic example is
\documentclass{article}
\usepackage{mathtools}
\begin{document}
\section{On negative values}\label{something}
xxx
\section{foo}
\[
f(x)=
\begin{cases*}
-1 & if $x<0$ see section \ref{something}\\
1 & if $x \geq0$
\end{cases*}
\]
\end{document}
where the outer math has two nested text blocks, each with a nested inline math. You cannot split the text to the outer level as they both need to be nested within the same stretchy { delimiter.
@davidcarlisle While remaining safely in my bunker what I see in your example (in PDF terms) is (or so I thought)...
< H1 > "On negative values" < math > xxx < H1 > "foo" < math > f(x)= ...blah, blah...
If the example is, in fact, a single continuous "chunk" of math (please excuse my technical terminology :-), doesn't mathML provide the necessary structures to represent the "headings" within?
...or am I just not getting it at all?
Thanks @davidcarlisle for going further into this.
I had not meant to imply that there is any simple solution!
The PDF provisions for embedding MathML SEs will clearly need to be extended.
We do not even have a formal definition of what it means for the content of an mtext to "be text"! The normal way to do this in PDF is "to use a content item", as this is formally defined. But I think that anything like that would anyway need an extension to the specification since simple "text" is not formally defined.
@DuffJohnson No, the problem is not with the "whole thing" being math.
It is with these two, very typical, bits of math-in-text-in-math:
if $x<0$ see . . .
if $x \geq0$
Since these are both "mtext" embedded in the math display but also containing "inline math" (another Formula > math structure) within the text.
@DuffJohnson the question is how to markup the sentence
if $x<0$ see Section [link}1{/link]
That is a textual sentence with an inline math $x<0$ and a link to an ID specified by a heading elsewhere in the document.
If that was all there was, you would tag the x<0 with /Math and the Section 1 hypertext link with whatever you tag links with
But now place that sentence within the math block inside a { or a fraction or some other unbreakable construct.
How do you tag it now?
So this discussion seems to have pivoted to the question of whether content that can be represented as a single block of MathML means that the PDF equivalent is the identical MathML, or whether it needs to be broken down into multiple constituent PDF "Formula with math" pieces, each representing only "a formula" - but this would then be the responsibility of apps.
And if mtext is allowed to be Tagged PDF, then do things get recursive (can that Tagged PDF contain another Formula?)?
@petervwyatt asked: can that Tagged PDF contain another Formula?
Not sure what you mean by "can" here, but the answer is yes, in the sense that this is one of the major uses of such "text-in-math": i.e., to support such "math-in-text-in-math".
@petervwyatt
do things get recursive (can that Tagged PDF contain another Formula?)?
yes that can already be seen in the comment above: https://github.com/pdf-association/pdf-issues/issues/467#issuecomment-2341385458
In the TeX source, \[ starts display math (<math display="block">); within that is an aligned text block & If... (HTML <div>); and within that $ starts a nested inline math (<math display="inline">).
Note this isn't a weird test case it happens all the time in mathematical documents.
If using an Associated File on the outer Formula, we can of course attach some MathML+HTML that encodes the entire block; the main issue is managing links into or out of the formula to the rest of the document, but such links are less common.
If using the direct PDF mathml structure elements then the question is whether the pdf structure elements may be nested in that way as that is directly a matter of the pdf specifications, not something that MathML can specify.
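As a rough way to picture the nesting in question, the following Python sketch models the cases example as a tree of structure elements and measures how deeply Formula elements nest. Everything here is an assumption for illustration only: the `SE` class and `max_formula_depth` are hypothetical helpers, and the choice of P and Link inside mtext is just one conceivable tagging, not something the PDF specification is known to permit.

```python
from dataclasses import dataclass, field


@dataclass
class SE:
    """Hypothetical stand-in for a PDF structure element: a type plus kids."""
    type: str
    kids: list = field(default_factory=list)


def max_formula_depth(se, depth=0):
    """Return the deepest level of Formula-within-Formula nesting at or below se."""
    if se.type == "Formula":
        depth += 1
    return max([depth] + [max_formula_depth(k, depth) for k in se.kids])


# Inline math $x<0$ inside the text of the first cases row.
inline = SE("Formula", [SE("math", [SE("mrow")])])
# Text "if ... see section ..." with the inline math and a link (assumed tagging).
row_text = SE("mtext", [SE("P", [inline, SE("Link")])])
# Outer display math holding the cases table.
display = SE(
    "Formula",
    [SE("math", [SE("mtable", [SE("mtr", [SE("mtd", [row_text])])])])],
)

print(max_formula_depth(display))  # 2
```

A depth of 2 is the "math-in-text-in-math" case discussed above; whether a structure tree of this shape is allowed is precisely the question for the PDF specification.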
So... there are clearly questions to be answered regarding...
However, these questions are not the exact subject of this Issue... so I'll propose a straw-man for the posed question... and encourage others to open new issues for these other worthy questions.
PROPOSED:
Add a sentence to the end of the 3rd paragraph in 14.8.6.3: "MathML structure elements may be Grouping, Block or Inline."
@DuffJohnson yes sorry for hijacking your issue
The proposed sentence seems fine to me (after checking what "grouping" meant here (14.8.4.4)).
I agree that putting it into all three categories is a Good Thing.
(Not that I really understand why all SE types cannot be so categorised . . . down with categories!)
I am not sure what is best way to move forward at this stage regarding the more general questions.
We probably need some less public discussion, of what is needed and what is feasible, before making any suggestions in an issue here.
I have moved the larger discussion on math, mtext, "math-in-text-in-math", etc. to PDF Reuse TWG Issue #19.
This errata is now only for discussion on @DuffJohnson's proposed change (above) to define the categories for math.
This image shows how I would tag the structure and content for David's example. It comes from an actual, valid PDF which is also attached here.
What is missing in this PDF is the MathML Associated File, which would carry the full semantics. It needs to be built to contain knowledge of the Structure ID (i.e., name) of the referenced target. This can be deduced from a previous LaTeX run, using stored information concerning the structure destination of the hyperlink. Hence this missing piece should be able to be automated reliably, at least in this relatively straight-forward example, after the 2nd or 3rd LaTeX run.
When I have actual coding for this, I'll provide a link to an updated example.
Hopefully this will help in future discussions of this important aspect of tagging complicated mathematical structures, having structured textual sub-parts.
We introduced the "math" structure element type in PDF 2.0 via namespaces (14.8.6.3).
Although
Add a new para to clause 14.8.4.8.6 stating that
Also... can a Formula have >1
Please note a closely-related change, already ISO-approved: https://pdf-issues.pdfa.org/32000-2-2020/clause14.html#H14.8.6.3