pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

"math" structure element has no assigned (or implied!) category #467

Open DuffJohnson opened 2 months ago

DuffJohnson commented 2 months ago

We introduced the "math" structure element type in PDF 2.0 via namespaces (14.8.6.3).

Although <math> is specified to occur as a child of <Formula>, no guidance is provided regarding Category in that context.

Add a new para to clause 14.8.4.8.6 stating that <math> gets a Category of "Grouping, Block or Inline".

Also... can a Formula have >1 <math> elements as children? Asking because the PDF spec, in defining <Formula>, refers to "a formula" (singular).

Please note a closely-related change, already ISO-approved: https://pdf-issues.pdfa.org/32000-2-2020/clause14.html#H14.8.6.3

car222222 commented 2 months ago

There are also a few other details that are undecided concerning the use and content of the "math" element.

For example, what is the content model for the MathML SE type "mtext": can it contain text SE types from a PDF namespace?

DuffJohnson commented 2 months ago

@car222222 I don't believe we ever contemplated use of PDF's standard SE types within <math>.

Recall that <math> occurs (in PDF) within <Formula>; I think the idea was that it was up to the mathML namespace to provide all semantics within the <math> context.

car222222 commented 2 months ago

No, but it would still be useful to know how the inventors of this methodology intend PDF mark-up to handle non-math material embedded in math material. This may of course need an extension.

PDF markup can handle text-in-Figure, so why not text-in-Formula? Preferably including text-in-math-in-formula.

car222222 commented 2 months ago

Noting that MathML does provide the "mtext" SE type, but it cannot itself define the content model for it, since its content is explicitly of a textual nature, and so it needs to have (only) text-related structure elements within it: without these, how could "mtext" ever get used? And without the use of "mtext", how can a PDF contain text properly embedded within "math"?

DuffJohnson commented 2 months ago

I believe our intent was that if "non-math material" was embedded within a <Formula> then either...

(a) such content could be effectively accommodated via mathML elements, or (b) correct tagging would result in two (or more) <Formula> elements with standard SE types "in-between".

I can't speak to the limitations (or not) of <mtext>.

Now retreating to my bunker...

car222222 commented 2 months ago

Retreat seems very wise!

But someone needs to come up with the recommendations: the problem with "put the text in between two 'math'" is that the tagging will then no longer represent the true structure/semantics of the math.

davidcarlisle commented 2 months ago

In the base MathML spec, <mtext>, like all token elements (mi, mn, mo), just takes text. However, the token elements are explicitly allowed to be extended by any document schema that incorporates MathML, and this flexibility is used in what is effectively the reference use of MathML: HTML5.

In TeX you can use arbitrary text structures within math, so:

$ x = y  = \parbox{3cm}{some text with  \cite{..} bibliographic reference
                                      and \ref{..} ref an earlier section}  = z$

This would end up in HTML with an HTML div embedded in MathML mtext:

<math>
<mi>x</mi> <mo>=</mo> <mi>y</mi> <mo>=</mo>
<mtext>
<div>
some text with <a href="bib...">cite</a>and bibliographic reference
  and <a href="#sec:...">7</a> ref an earlier section
</div>
</mtext>
<mo>=</mo>
<mi>z</mi>
</math>

Note that the embedded text needs access to the ID and linking mechanisms of the outer document, which isn't really possible purely in MathML: the reference here is to the ID of a section in the textual part of the document, not to an ID set by MathML.

It is currently unclear whether it is possible to tag such a formula using either the MathML structure elements or a MathML associated file.
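
As a rough illustration of the point about token elements (a sketch only, not part of any specification), the following Python snippet parses a copy of the example above and reports token elements that contain child elements rather than plain text. The function name `tokens_with_element_children` and the placeholder `href` values `#bib` and `#sec` are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Token elements named in the comment above.
TOKEN_ELEMENTS = {"mi", "mn", "mo", "mtext"}

# Copy of the example; the href values are placeholders invented for this sketch.
SAMPLE = """
<math>
<mi>x</mi> <mo>=</mo> <mi>y</mi> <mo>=</mo>
<mtext>
<div>
some text with <a href="#bib">cite</a> bibliographic reference
  and <a href="#sec">7</a> ref an earlier section
</div>
</mtext>
<mo>=</mo>
<mi>z</mi>
</math>
"""


def tokens_with_element_children(xml_text):
    """Return (token name, [child element names]) for every token element
    that contains markup rather than only text."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter():
        # len(el) counts direct child elements; text-only tokens have none.
        if el.tag in TOKEN_ELEMENTS and len(el) > 0:
            found.append((el.tag, [child.tag for child in el]))
    return found


print(tokens_with_element_children(SAMPLE))
```

Under the base MathML schema this list would be empty; here it reports the mtext with its embedded div, which is exactly the extension that a PDF tagging rule would have to allow or forbid.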

davidcarlisle commented 2 months ago

@car222222 wrote: But someone needs to come up with the recommendations: the problem with "put the text in between two 'math'" is that the tagging will then no longer represent the true structure/semantics of the math.

A more serious problem with that suggestion is that it's only possible in trivial cases.

A more realistic example is

\documentclass{article}

\usepackage{mathtools}

\begin{document}

\section{On negative values}\label{something}
xxx
\section{foo}
\[
  f(x)=
  \begin{cases*}
    -1  & if $x<0$ see section \ref{something}\\
    1   & if $x \geq0$
  \end{cases*}
\]

\end{document}

[Image: typeset output of the example above]

Here the outer math has two nested text blocks, each with a nested inline math. You cannot split the text out to the outer level, as both blocks need to be nested within the same stretchy { delimiter.
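
To make the nesting concrete, here is a small Python sketch that models the cases example as a tree and lists every math node that sits inside another math node. The node names (`math`, `row`, `text`, `link`, `value`) and the tuple layout are labels invented for this illustration; they are not PDF structure element types or MathML element names.

```python
# Each node is (name, [children]). Hypothetical model of the cases example.
CASES_EXAMPLE = (
    "math", [                                   # outer display math
        ("row", [
            ("value", []),                      # -1
            ("text", [                          # "if x<0 see section ..."
                ("math", []),                   # inline math x<0
                ("link", []),                   # cross-reference to a section
            ]),
        ]),
        ("row", [
            ("value", []),                      # 1
            ("text", [("math", [])]),           # "if x >= 0" with inline math
        ]),
    ],
)


def nested_math_paths(node, ancestors=()):
    """Return the path to every 'math' node that has a 'math' ancestor."""
    name, children = node
    path = ancestors + (name,)
    hits = []
    if name == "math" and "math" in ancestors:
        hits.append(" > ".join(path))
    for child in children:
        hits.extend(nested_math_paths(child, path))
    return hits


for p in nested_math_paths(CASES_EXAMPLE):
    print(p)
```

Both inline formulas come out as `math > row > text > math`, so neither text block can be hoisted out of the outer math without breaking the rows that share the delimiter.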

DuffJohnson commented 2 months ago

@davidcarlisle While remaining safely in my bunker what I see in your example (in PDF terms) is (or so I thought)...

<H1> "On negative values" <math> xxx <H1> "foo" <math> f(x)= ...blah, blah...

If the example is, in fact, a single continuous "chunk" of math (please excuse my technical terminology :-), doesn't mathML provide the necessary structures to represent the "headings" within?

...or am I just not getting it at all?

car222222 commented 2 months ago

Thanks @davidcarlisle for going further into this.

I had not meant to imply that there is any simple solution!

The PDF provisions for embedding MathML SEs will clearly need to be extended.

We do not even have a formal definition of what it means for the content of an mtext to "be text"! The normal way to do this in PDF is "to use a content item", as this is formally defined. But I think that anything like that would anyway need an extension to the specification since simple "text" is not formally defined.

car222222 commented 2 months ago

@DuffJohnson No, the problem is not with the "whole thing" being math.

It is with these two, very typical, bits of math-in-text-in-math:

if $x<0$ see . . .

if $x \geq0$

Both are "mtext" embedded in the math display, but each also contains "inline math" (another Formula > math structure) within the text.

davidcarlisle commented 2 months ago

@DuffJohnson the question is how to markup the sentence

if $x<0$ see Section [link]1[/link]

That is a textual sentence with an inline math $x<0$ and a link to an ID specified by a heading elsewhere in the document.

If that was all there was, you would tag the x<0 with /Math and the Section 1 hypertext link with whatever you tag links with.

But now place that sentence within the math block, inside a { or a fraction or some other unbreakable construct.

How do you tag it now?

petervwyatt commented 2 months ago

So this discussion seems to have pivoted to a different question: if content can be represented as a single block of MathML, is the PDF equivalent the identical MathML, or does it need to be broken down into multiple constituent PDF "Formula with math" pieces so that each represents only "a formula"? The latter would then be the responsibility of apps.

And if mtext is allowed to be Tagged PDF, then do things get recursive (can that Tagged PDF contain another Formula?)?

car222222 commented 2 months ago

@petervwyatt asked: can that Tagged PDF contain another Formula?

Not sure what you mean by "can" here, but the answer is yes, in the sense that this is one of the major uses of such "text-in-math": i.e., to support such "math-in-text-in-math".

davidcarlisle commented 2 months ago

@petervwyatt

do things get recursive (can that Tagged PDF contain another Formula?)?

Yes, that can already be seen in the comment above: https://github.com/pdf-association/pdf-issues/issues/467#issuecomment-2341385458

In the TeX source, \[ starts display math (<math display="block">); within that is an aligned text block & If... (HTML <div>); and within that, $ starts a nested inline math (<math display="inline">).

Note this isn't a weird test case; it happens all the time in mathematical documents.

If using an Associated File on the outer Formula, we can of course attach some MathML+HTML that encodes the entire block. The main issue is then managing links into or out of the formula to the rest of the document, but such links are less common.

If using the direct PDF MathML structure elements, then the question is whether the PDF structure elements may be nested in that way, as that is directly a matter for the PDF specifications, not something that MathML can specify.

DuffJohnson commented 2 months ago

So... there are clearly questions to be answered regarding...

However, these questions are not the exact subject of this Issue... so I'll propose a straw-man for the posed question... and encourage others to open new issues for these other worthy questions.

PROPOSED:

Add a sentence to the end of the 3rd paragraph in 14.8.6.3: "MathML structure elements may be Grouping, Block or Inline."

davidcarlisle commented 2 months ago

@DuffJohnson yes, sorry for hijacking your issue.

The proposed sentence seems fine to me (after checking what "grouping" meant here (14.8.4.4)).

car222222 commented 2 months ago

I agree that putting it into all three categories is a Good Thing.

(Not that I really understand why all SE types cannot be so categorised . . . down with categories!)

car222222 commented 2 months ago

I am not sure what is the best way to move forward at this stage regarding the more general questions.

We probably need some less public discussion, of what is needed and what is feasible, before making any suggestions in an issue here.

petervwyatt commented 2 months ago

I have moved the larger discussion on math, mtext, "math-in-text-in-math", etc. to PDF Reuse TWG Issue #19.

This errata is now only for discussion on @DuffJohnson's proposed change (above) to define the categories for math.

ozross commented 1 month ago

This image shows how I would tag the structure and content for David's example. It comes from an actual, valid PDF which is also attached here.

[Image: screenshot of the tagged structure, 2024-10-02]

tag-cases.pdf

What is missing in this PDF is the MathML Associated File, which would carry the full semantics. It needs to be built to contain knowledge of the Structure ID (i.e., name) of the referenced target. This can be deduced from a previous LaTeX run, using stored information concerning the structure destination of the hyperlink. Hence it should be possible to automate this missing piece reliably, at least in this relatively straightforward example, after the 2nd or 3rd LaTeX run.

When I have actual coding for this, I'll provide a link to an updated example.

Hopefully this will help in future discussions of this important aspect of tagging complicated mathematical structures, having structured textual sub-parts.