Open u-fischer opened 2 months ago
regarding adding a file for mathml, please take math tagging and security #708
https://github.com/latex3/tagging-project/issues/708 as input for discussion
Thanks, @u-fischer. Some interesting questions. It would be ideal to give each suggestion an identifier to be sure we remain on the same page... I'll take them in order.
Point taken; this restriction has the appearance of treating the Formula content as a graphic... but does that matter? I defer to @mrbhardy.
I like the re-write, but why isn't "should" reasonable in place of the "may"?
Overall this seems reasonable to me... but I'm not a fan of relaxing the "should" for Alt. Any team making software to deal with associated files on structure elements can (and should) be expected to also deal with Alt on the same element.
On your remark.. I think this idea fits neatly into the current ambitions of the PDF/UA Processor LWG for a new processor specification.
I totally agree with your suggested addition in principle; it just needs a little word-smithing, IMHO. You've also found a typo in the original - I'm pretty sure that the word "that" is missing between "formula" and "has" in the NOTE before your addition. :-)
This can be considered following resolution of suggestion 2 (or, resolve this first, and then suggestion 2).
My concern here (and I feel as if I must be missing something) is that these restrictions are appropriate for cases of < Formula > that lack substructure (e.g., a < Formula > with an AF)...?
@DuffJohnson
(I will edit my issue above to add the numbers)
Point taken; this restriction has the appearance of treating the Formula content as a graphic... but does that matter?
Well, as it is written now ("The standard structure type Formula shall not appear between the BT and ET operators") it makes, imho, no sense at all, as structure types do not appear in the content stream. If the sentence is a rewording of the PDF 1.7 sentence about the illustration types (figure, formula) and should actually read as "The content of standard structure type Formula shall not appear between the BT and ET operators", then it is clearly wrong, as the content of Formula is normally text.
I like the re-write, but why isn't "should" reasonable in place of the "may"?
Because breakable text should not require a BBox. The purpose of a BBox is (for example) to allow an html derivation to take a screenshot and move the picture around, or to stop reflow. But if you have a mathml formula you no longer have a picture. You wouldn't require a BBox for Ruby or Warichu or a Span, would you? So why for an inline mathml formula using a Unicode math font?
but I'm not a fan of relaxing the "should" for Alt
Again: you are not requiring an Alt on Ruby or Warichu or a Span, as their meaning is intrinsic, so what is its purpose on a mathml formula? It is not a picture where a blind user would miss the point without an Alt, it is text. We could easily duplicate the content of the mathml AF into the Alt key as well and so fulfill the requirement, as mathml is clearly an adequate description of a formula, but what would be the point? Alt is for the description of non-text content; it shouldn't repeat text content.
... that these restrictions are appropriate for cases of < Formula > that lack substructure (e.g., a < Formula > with an AF)...?
In my view a Formula with a proper mathml AF as described in UA-2 should be equivalent to a Formula with mathml substructure.
These remarks are mainly for @u-fischer but they may be of more general interest.
Some primary comments Concerning Sections 2, 3 and 6:
Re 3 (and 2): I am not sure that inline math is always simply “text” in the sense that this term is often currently used in connection with the content of PDF files.
But equally, it is not clear that math needs a compulsory Alt key since there is often, for complex higher-level math, no obvious "plain text" equivalent to put in there.
Noting that it seems very unlikely that "Alt text" is intended to contain anything that is "essentially code" rather than simple text.
Note also that it is conventional in math typesetting for a single (long) inline math element to be typeset over two or more lines: in that sense inline math can be just like “text in a paragraph”.
It is not clear whether the PDF rules and conventions currently allow for this common convention for typesetting inline math.
Thus some extensions may be required here.
In 6: what do you mean here by “should be equivalent to”? This is rather vague.
Maybe it could (or will at some stage) mean that there is a precisely defined method for transforming each form into the other: if such a possibility is currently a fact, or even an achievable goal, then we need to define formally this inverse pair of transformations.
But this will not make math become “text” since a file of mathml code is not, in current PDF terminology, a “text file” or even a “representation of some text”.
@car222222 I used the term "text" as a counterpart to "image". That is, as something that consists of characters, has an intrinsic meaning as a language, and can be translated (and not only described). The whole issue is about not viewing a formula as an image. Neither the TeX meaning of text versus math mode nor text files versus binary files is meant here.
All true. I was only pointing out that in PDF-speak, "text" has a different meaning, that does not include code-like strings of characters: so you need to be careful.
I was definitely not suggesting anything contrary to your aim of clearly distinguishing math from anything related to pictures or images.
I may be wrong, but I think that the "PDF meaning of text" does not, in practice, include programs or other code.
Thanks @u-fischer.
On Suggestion 1:
Ok, so I agree that "content" is missing from the sentence; that's an erratum in itself, IMO, assuming that we don't simply remove this provision.
As to that... I agree that the restriction does not appear (to me) to make sense today, even if once it did... but I am innocent as to the technical implications of this change. @mrbhardy?
Assuming your point is valid and has no negative implication, it seems as if Suggestion 1 can be stated as: "Delete the second sentence of 14.8.4.8.5, para 2". Is that accurate?
On Suggestion 2:
My sense of your remark is that the value of the BBox is conditional on whether the Formula element is Block or Inline. If it's Block then BBox is reasonable ("should")... but a BBox does NOT make sense if the < Formula > is Inline.
If that's your meaning then can you propose specification language to this effect?
On Suggestion 3:
I don't think it's pragmatic to assume that all users encountering a Formula will prefer the mathML (or LaTeX, for that matter) to an Alt. A formula may be encountered casually, or investigated. A user may not be sufficiently literate to understand the formula, but might appreciate, e.g., "An example of the factored form of a quadratic equation".
Why rob an author who elects to include an Alt of the possibility that users who wish to read the Alt will get that option? Why not lean into the notion that software can and should be able to represent all the alternatives that an author might legitimately wish to employ?
On Suggestion 6:
In my view a Formula with a proper mathml AF as described in UA-2 should be equivalent to a Formula with mathml substructure.
Can you help me by spelling out what's at issue here? Why is it bad to require an Inline < Formula > element to have a Width attribute? Or a Block to have a Height?
Or is your problem with "This value shall be the sole source of information about the element’s extent in the block-progression direction." ? If so, perhaps we can rewrite it to indicate other sources of information?
I do agree, though, that 14.8.5.4.6 should be entirely rewritten, as it's basically a hang-over from the "Illustration" group of SEs in PDF 1.7. If others agree this should be a stand-alone erratum.
Assuming your point is valid and has no negative implication, it seems as if Suggestion 1 can be stated as: "Delete the second sentence of 14.8.4.8.5, para 2". Is that accurate?
Formula is 14.8.4.8.6 not 5 (at least in my version), and the sentence is in a paragraph of its own, so delete the second paragraph of 14.8.4.8.6.
My sense of your remark is that the value of the BBox is conditional on whether the Formula element is Block or Inline. If it's Block then BBox is reasonable
Why? You are again viewing the Formula as an image. Would you also claim that a BBox is reasonable on other block elements, e.g. a List or a Blockquote or a Table? I'm not saying that there aren't cases where it could be reasonable to add a BBox, but why should it be needed by default? Reflow can handle a table without a BBox, so why shouldn't it be able to handle a matrix?
I don't think it's pragmatic to assume that all users encountering a Formula will prefer the mathML (or LaTeX, for that matter) to an Alt. A formula may be encountered casually, or investigated. A user may not be sufficiently literate to understand the formula, but might appreciate, e.g., "An example of the factored form of a quadratic equation".
I didn't ask to forbid adding an /Alt; I just don't think that it should be obligatory. I mean, your argument is valid for a table too: a table may be encountered casually, or investigated. A user may not be sufficiently literate to understand all the table data but might appreciate, e.g., "This table shows the failure rates of various experiments". But that doesn't mean that a table or a table cell or some other structure should always have an Alt key.
Why is it bad to require an Inline < Formula > element to have a Width attribute? Or a Block to have a Height?
Because you are not requiring that for other text like an inline Span or a Blockquote or a Table cell. We really do not want to measure every math formula in simple sentences like "if 𝑥, 𝑦, and 𝑧 are larger than 0 then 𝑓(𝑥,𝑦,𝑧) will have a maximum".
Suggestion 1
Formula is 14.8.4.8.6 not 5 (at least in my version), and the sentence is in a paragraph of its own, so delete the second paragraph of 14.8.4.8.6.
Agreed! My mistake.
Suggestion 2
My sense of your remark is that the value of the BBox is conditional on whether the Formula element is Block or Inline. If it's Block then BBox is reasonable
Why? You are again viewing the Formula as an image. Would you also claim that a BBox is reasonable on other block elements, e.g. a List or a Blockquote or a Table? I'm not saying that there aren't cases where it could be reasonable to add a BBox, but why should it be needed by default? Reflow can handle a table without a BBox, so why shouldn't it be able to handle a matrix?
WTPDF uses "BBox on a table" as one example of the "semantic" use of BBox on that SE (see Table B.2.)
The "semantically significant" use of BBox for Formula is given as: "A formula that lies on a single page and occupies a single rectangle." Is this unreasonable? These things don't reflow in the same ways as e.g. lists (which have their own reflow problems!). If it's not unreasonable, and if this is a major use case (which is fair)... then maybe we simply need to be clearer about when the "should" is applicable, and when it should be "may" instead....?
Suggestion 3
I don't think it's pragmatic to assume that all users encountering a Formula will prefer the mathML (or LaTeX, for that matter) to an Alt. A formula may be encountered casually, or investigated. A user may not be sufficiently literate to understand the formula, but might appreciate, e.g., "An example of the factored form of a quadratic equation".
I didn't ask to forbid adding an /Alt; I just don't think that it should be obligatory. I mean, your argument is valid for a table too: a table may be encountered casually, or investigated.
"Should" is not "obligatory".
If we relax "should" we will encourage a situation in which a predictably-substantial number of users will get a worthless (to them) result when they encounter a < Formula > instead of an alt that an author may have otherwise provided.
A table provides many more clues as to its content, not least from its structure, which doesn't require SMEs to understand. Tables are frequently captioned to accommodate the use case you mention.
Suggestion 6:
As stated, I think we should totally rewrite this clause, and I would fully agree to pulling these SEs apart when we do so. I'm ok (I think) with removing those requirements from < Formula > elements whose content is not an image.
Care to take a shot at rewriting that clause as a new Erratum?? :-D
Proposed Solution:
Proposal 1: Strike shall sentence from ISO 32000-2:2020.
Proposal 2: Keep a should statement for BBox.
Proposal 3: No action needed, UA-2 and WTPDF for Accessibility don't mandate Alt or ActualText for mathematical formulae.
Proposals 4 & 5: Do nothing
Proposal 6: Remove clause 14.8.5.4.6 and fix attributes in Table 379 (check if any other tables need fixing - Matthew). Needs review by Forms TWG...
PDF Reuse TWG agrees.
Applied fixes for Proposals 1 and 2. Not doing anything for Proposals 3, 4 and 5
Proposal 6 - waiting for Reuse TWG confirmation
Some historical background: Math formulas are viewed as images ...
In the (free) PDF 1.7 reference from November 2006 both the Formula and the Figure tags are classified as Illustration Elements. They are viewed as graphics and various requirements are based on this assumption (starting from page 911):
... but MathML should now be preferred
Historically it is quite understandable that math expressions are treated as illustration elements, since in HTML they have also often been presented as images because of the difficulty of presenting equations and special math symbols.
However, MathML is emerging as the preferred presentation of accessible math on the Web and elsewhere and viewing math as images should be avoided:
PDF 2.0 acknowledged this change and added the MathML namespace and associated Files. But in various places the spec still contains requirements that view a math formula always as an image and basically "hide" MathML associated files and MathML tags and so contradict the idea that for math formulas MathML can and should be the preferred method.
Suggested Spec changes
Suggestion 1
14.8.4.8.6 Formula structure type
I never understood that sentence, but looking at the first citation above from the historical PDF 1.7 I guess it still assumes that math is some graphical image. This is plainly wrong: most math is nowadays set with fonts and Unicode symbols. Imho this sentence should be removed completely.
Suggestion 2
Suggested change: A Formula element may have logical substructure, including other Formula elements. It may have a BBox attribute (see 14.8.5, "Standard structure attributes") and can then for repurposing purposes be treated as visually static, without examining its internal contents.
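As a rough illustration of this wording, a repurposing tool could branch on whether the optional BBox is present. This is only a sketch in Python under assumptions that are not in the spec text: the structure element is modelled as a plain dict, its attributes sit under a hypothetical "A" key, and the function name and return labels are made up for the example.

```python
def repurposing_strategy(elem):
    """Pick how a reflow tool might treat a Formula element.

    `elem` is a simplified, hypothetical dict model of a structure
    element: "S" is the structure type, "A" holds its attributes.
    """
    if elem.get("S") != "Formula":
        raise ValueError("expected a Formula structure element")
    attributes = elem.get("A", {})
    if "BBox" in attributes:
        # BBox present: the element may be treated as visually static,
        # without examining its internal contents.
        return "static-box"
    # No BBox: fall back to the element's contents or substructure,
    # as a tool would for a Span or other breakable text.
    return "examine-contents"


display = {"S": "Formula", "A": {"BBox": [72, 600, 300, 640]}}
inline = {"S": "Formula"}
print(repurposing_strategy(display))  # static-box
print(repurposing_strategy(inline))   # examine-contents
```

With "may" instead of "should", the second case is not a deviation from a recommendation; it is simply the normal path for breakable text.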
Suggestion 3
Suggested change: For repurposing and accessibility purposes, a Formula element that doesn't have a logical substructure describing the semantic meaning should have either an appropriate Associated File, an Alt entry or an ActualText entry in its structure element dictionary (see 14.13.6 Associated files linked to structure elements, 14.9.3, "Alternate descriptions" and 14.9.4, "Replacement text").
Remark: I refrain from adding a processing requirement, but in my opinion a mathml associated file should be preferred over an Alt entry.
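To make the condition in the suggested change concrete, here is a minimal sketch in Python of how a checker might test it. Everything about the data model is an assumption for illustration: the structure element is a plain dict, children under "K" count as logical substructure only if they are dicts with "Type" equal to "StructElem", and the function names and the file name are hypothetical. Whether the substructure really describes the semantic meaning is not something this sketch can decide.

```python
def has_substructure(elem):
    """True if any kid is itself a structure element (simplified model)."""
    kids = elem.get("K", [])
    if not isinstance(kids, list):
        kids = [kids]
    return any(isinstance(k, dict) and k.get("Type") == "StructElem"
               for k in kids)


def satisfies_suggestion_3(elem):
    """Check the 'should' of the suggested change for one element."""
    if elem.get("S") != "Formula":
        return True  # the sentence only concerns Formula elements
    if has_substructure(elem):
        return True  # e.g. MathML structure elements below the Formula
    # Otherwise any one of the three entries is enough.
    return any(elem.get(key) for key in ("AF", "Alt", "ActualText"))


with_af = {"S": "Formula", "K": [0], "AF": [{"F": "formula.mml"}]}  # hypothetical file name
bare = {"S": "Formula", "K": [0]}
print(satisfies_suggestion_3(with_af))  # True
print(satisfies_suggestion_3(bare))     # False
```

Note that the check treats the three entries as alternatives, so a Formula with only a MathML associated file passes without an Alt.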
Suggestion 4
Suggested change: NOTE Alt is a description of the content enclosed by the Formula element, whereas ActualText gives the exact text equivalent of a formula has the appearance of text. An Associated File can associate content in other formats like for example MathML with a Formula.
Suggestion 5
Table 374 — Standard structure type Formula
Change should to may.
Suggestion 6
14.8.5.4.6 Figure, Form and Formula attributes
Remove Formula from the title and the text. Or restrict all the "shalls" to "Formulas with mainly graphical content".