pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Confusion with ActualText vs. Alt. #60

Closed PaulRayius closed 1 year ago

PaulRayius commented 3 years ago

For Context: 32000-2 - 14.7.2 - Table 355

ActualText: "(Optional; PDF 1.4) Text that is an exact replacement for the content enclosed by the structure element and its children. This replacement text (which should apply to as small a piece of content as possible) is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes."

Alt: "(Optional) An alternative description of the structure element and its children in human-readable form, which is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes."

14.9.4… "The ActualText value shall be used as a replacement, not a description, for the content, providing text that is equivalent to what a person would see when viewing the content. The value of ActualText is a character substitution for the content enclosed by the structure element or marked-content sequence. If each of two (or more) consecutive structure or marked-content sequences has an ActualText entry, they shall be treated as if no word break is present between them."

"NOTE 2 The treatment of ActualText as a character replacement is different from the treatment of Alt, which is treated as a whole word or phrase substitution."

Example Scenario: You have five Paths that, together, visually create the word "Hello" in a document. Should each Path be placed in its own Span and given ActualText that corresponds to its letter? Should they all be placed in one Span and given the ActualText of "Hello"? Or, should they all be placed in one Span and given Alt text of "Hello"?

The Issue: The above definitions, and Note 2, could be taken to imply that, in the scenario above, each Path should be placed in its own Span and given the ActualText of its intended letter, as opposed to placing all of them in a single Span and using Alt text. In addition, because Note 2 states that Alt is intended for "whole words or phrases" and ActualText is intended for "character replacement", it implies that Alt text would be more appropriate than ActualText - except that the definition of Alt text is "an alternative description". But a "description" of the word that was created with Paths isn't what is really needed.

Recommendation: Clarify that ActualText should be for the smallest semantically accurate "piece" of content (obvious word-smithing needed), clarify that "character replacement" doesn't mean "one character," and/or provide other clarification.
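
For illustration, the three tagging options in the example scenario can be compared with a small Python sketch. It is a toy model only: the `Span` class and the `extract` policy are hypothetical, not taken from ISO 32000-2 or from any real PDF library. It assumes one possible reading of the quoted text: consecutive ActualText values are joined with no word break (14.9.4), while each Alt value is treated as a whole word or phrase (NOTE 2).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for a Span wrapping one or more Paths (no real text)."""
    actual_text: Optional[str] = None  # models /ActualText: replacement for the content
    alt: Optional[str] = None          # models /Alt: alternate description


def extract(spans: List[Span]) -> str:
    """Assumed extraction policy for this sketch, not a normative algorithm.

    Consecutive ActualText values are concatenated with no word break;
    an Alt value is separated from its neighbours by a space.
    """
    out = ""
    prev_was_actual = False
    for span in spans:
        if span.actual_text is not None:
            # Only insert a break when the previous piece was not ActualText.
            if out and not prev_was_actual:
                out += " "
            out += span.actual_text
            prev_was_actual = True
        elif span.alt is not None:
            if out:
                out += " "
            out += span.alt
            prev_was_actual = False
    return out


# Option 1: one Span per Path, each with the ActualText of its letter.
per_letter_actual = [Span(actual_text=c) for c in "Hello"]
# Option 2: all Paths in one Span with ActualText "Hello".
one_span_actual = [Span(actual_text="Hello")]
# Option 3: all Paths in one Span with Alt "Hello".
one_span_alt = [Span(alt="Hello")]
# For contrast: one Span per Path, each with Alt set to its letter.
per_letter_alt = [Span(alt=c) for c in "Hello"]

for name, spans in [
    ("per-letter ActualText", per_letter_actual),
    ("single-Span ActualText", one_span_actual),
    ("single-Span Alt", one_span_alt),
    ("per-letter Alt", per_letter_alt),
]:
    print(f"{name}: {extract(spans)!r}")
```

Under these assumptions, all three options from the scenario extract the same string, and only per-letter Alt breaks the word apart. This suggests the open question is less about the extracted characters than about which entry matches the stated intent (replacement vs. description).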

car222222 commented 2 years ago

I am not sure that text such as these will be permitted (in the standards):

petervwyatt commented 2 years ago

Discussed in PDF TWG - existing text for the originally reported issue is sufficiently clear in ISO 32000-2:2020 so no text change.

However, the TWG wants the discussion of potential processor requirements (for resolving confusion over text/content extraction and reuse when marked-content sequences and structure elements both exist) to continue in a much broader context - T.B.D.

The last bullet of clause 9.10.1 "Extraction of text content" of ISO 32000-2:2020 has a loose processor requirement for text extraction with regard to ActualText (using "may" language). Cf. 14.9.3 "Alternate descriptions".

PaulRayius commented 1 year ago

From Matthew Hardy's comment on July 7, 2021, "The proposed text doesn’t clarify any of this that I can see and it is wrong to say that it is intended to replace or do something. That is up to the processor." I disagree with this statement, though, in that - whether intended or not - some remediators, for example, will look to the spec. to answer the question "which should I use here, Alt or ActualText?" So, while I do agree that it's up to the processor for how to handle Alt, ActualText, E, etc., I don't think we can say that it's only up to the processor.

That said, I think some clarification would be valuable. As such, I would go with Duff's contribution (July 30, 2021) and the subsequent minor edits immediately below Duff's post in this thread (until we get to the detour that we took talking about processor requirements; more on that in a second).
However, Duff proposed: "14.9.3, P5, replace: 'When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a complete (or whole) word or phrase substitution for the current element.' With 'When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a string of characters representing the current element.'" Instead of saying "representing the current element," perhaps that should read "representing (or even "describing") the contents of the current element." After all, the Alt is not describing the "tag" (or the marked content); the Alt is describing the contents (of the structure element, MC, etc.).

Regarding the direction in which this started to go, pertaining to processor requirements: since that was first mentioned in this thread, a Processor Requirements LWG has been formed within the PDF Association, and I think it would be appropriate to bring that part of this conversation over there to further work out how processors and AT should handle Alt, ActualText, E, etc. Perhaps we could work on some clarifying verbiage for 32K here, and leave the "processor part" to the work of that group? Unless, of course, as noted by Peter on November 11, 2021 (the last post before this one), there is no need for change in the spec. and we really only need to worry about the "processor part."

mrbhardy commented 1 year ago

Let's discuss this at the next UA meeting @PaulRayius. I didn't feel that the text clarifying things actually addressed the issues.

mrbhardy commented 1 year ago

@u-fischer I was going over your sample file one last time and I did notice a slight mistake in the syntax for what you are doing with the file. I doubt it has any real impact, but ActualText isn't supposed to be present on a marked-content sequence that is part of a structure element. So when you do:

/Span << /MCID 0 /ActualText (Foo) >> BDC

That isn't actually in compliance with the spec. If you look at ISO 32000-2:2020, 14.9.4:

  • A structure element (see 14.7.2, "Structure hierarchy"), by means of the optional ActualText entry (PDF 1.4) of the structure element dictionary.
  • (PDF 1.5) A marked-content sequence (see 14.6, "Marked content"), through an ActualText entry in a property list attached to the marked-content sequence with a Span tag

The second doesn't mean "tag" in the colloquial sense of "structure element", but is referring to the mechanism in marked content to denote a tagged sequence as a span with properties.

/Figure << /MCID 0 >> BDC
    /Span <</ActualText (Foo) >> BDC
        ...
    EMC
EMC

Again, I highly doubt this is going to change your findings, but just for correctness...

u-fischer commented 1 year ago

@mrbhardy Hm. While I agree that this means that one can use /ActualText only in a /Span, do you mean that it also implies that one can't combine it with an MCID? So it has to be

/Span << /MCID 0 >> BDC  
    /Span <</ActualText (Foo) >> BDC

instead of

/Span << /MCID 0 /ActualText (Foo) >> BDC

?

I'm not sure if I really see that in the spec ;-). What would be the reason to force such a split?
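
To make the two forms under discussion easy to tell apart in a content stream, here is a minimal Python sketch that reports BDC operators whose inline property list carries both /MCID and /ActualText. Whether that combination is actually non-compliant is exactly the open question above, so the sketch only locates it. The function name `find_combined` and the regex are hypothetical, and the matching is deliberately naive: it assumes inline property lists and handles neither nested dictionaries, nor named property resources, nor `>>` appearing inside a string.

```python
import re
from typing import List

# Matches an inline property list on a BDC operator, e.g.
#   /Span << /MCID 0 /ActualText (Foo) >> BDC
# Simplification: a real content-stream parser would tokenize properly.
BDC_RE = re.compile(r"/(\w+)\s*<<(.*?)>>\s*BDC", re.DOTALL)


def find_combined(content: str) -> List[str]:
    """Return the tags of BDC operators whose inline property list
    contains both /MCID and /ActualText."""
    hits = []
    for tag, props in BDC_RE.findall(content):
        if "/MCID" in props and "/ActualText" in props:
            hits.append(tag)
    return hits


# The combined form: MCID and ActualText on the same marked-content sequence.
combined = "/Span << /MCID 0 /ActualText (Foo) >> BDC ... EMC"

# The split form: an outer sequence with the MCID, an inner Span with ActualText.
split = (
    "/Span << /MCID 0 >> BDC\n"
    "    /Span <</ActualText (Foo) >> BDC\n"
    "        ...\n"
    "    EMC\n"
    "EMC\n"
)

print(find_combined(combined))  # tags that use the combined form
print(find_combined(split))     # empty when MCID and ActualText are kept apart
```

A check like this could help survey how often the combined form occurs in existing files before deciding whether the split is worth requiring.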

car222222 commented 1 year ago

@mrbhardy wrote: "I didn't feel that the text clarifying things actually addressed the issues."

Which text is that?

Note that my original rewording suggestions were intended only to clarify the distinctions between the three cases, whilst making the text in the three descriptions more uniform.

There was no intention to introduce anything that could be interpreted as "processor requirements", beyond what is in the wording of the current texts.

PaulRayius commented 1 year ago

PDF/UA TWG has discussed this. No need to change the 32000-2 spec. but some clarifying examples, best practices, etc., would be helpful. Those things, hopefully, will be taken on by the PDF/UA TWG and/or the PDF/UA LWG.