pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Confusion with ActualText vs. Alt. #60

Closed PaulRayius closed 1 year ago

PaulRayius commented 3 years ago

For Context: 32000-2 - 14.7.2 - Table 355
ActualText: "(Optional; PDF 1.4) Text that is an exact replacement for the content enclosed by the structure element and its children. This replacement text (which should apply to as small a piece of content as possible) is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes."

Alt: "(Optional) An alternative description of the structure element and its children in human-readable form, which is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes."

14.9.4… "The ActualText value shall be used as a replacement, not a description, for the content, providing text that is equivalent to what a person would see when viewing the content. The value of ActualText is a character substitution for the content enclosed by the structure element or marked-content sequence. If each of two (or more) consecutive structure or marked-content sequences has an ActualText entry, they shall be treated as if no word break is present between them."

"NOTE 2 The treatment of ActualText as a character replacement is different from the treatment of Alt, which is treated as a whole word or phrase substitution."

Example Scenario: You have five Paths that, together, visually create the word "Hello" in a document. Should each Path be placed in its own Span and given ActualText that corresponds to its letter? Should they all be placed in one Span and given the ActualText of "Hello"? Or, should they all be placed in one Span and given Alt text of "Hello"?

The Issue: The above definitions, and Note 2, could be taken to imply that, in the scenario above, all of those Paths should be placed in individual Span tags, for example, and given the ActualText of their intended letter, as opposed to placing all of them in a single Span and using Alternative text. In addition, because Note 2 states that Alt is intended for "whole words or phrases" and ActualText is intended for "character replacement", it implies that Alt text would be more appropriate than ActualText, except that the definition of Alt text is "an alternative description". But a "description" of the word that was created with Paths wouldn't be what's really needed.

Recommendation: Clarify that ActualText should be for the smallest semantically accurate "piece" of content (obvious word-smithing needed), clarify that "character replacement" doesn't mean "one character," and/or provide other clarification.
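
To make the scenario concrete, here is a minimal Python sketch of a hypothetical text extractor. It is not taken from ISO 32000-2 or from any PDF library; the `Span` class and `extract_text` function are illustrative names only. It assumes the reading of 14.9.4 quoted above, under which consecutive ActualText values are joined with no word break between them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for a Span structure element enclosing Path content."""
    paths: int                         # number of Path objects enclosed
    actual_text: Optional[str] = None  # value of the ActualText entry, if any


def extract_text(spans: List[Span]) -> str:
    """Join consecutive ActualText values with no word break between them."""
    return "".join(s.actual_text for s in spans if s.actual_text is not None)


# Option 1: one Span per Path, each carrying a single letter.
per_letter = [Span(1, c) for c in "Hello"]
# Option 2: one Span enclosing all five Paths.
single = [Span(5, "Hello")]

print(extract_text(per_letter))  # Hello
print(extract_text(single))      # Hello
```

Under this assumption both tagging options extract to the same string. The Alt option is not modelled, because Table 355 defines Alt as a description rather than a replacement.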

DuffJohnson commented 3 years ago

I think this point can be addressed by simply correcting NOTE 2 to be consistent with the definition of Alt given in Table 355. If so, NOTE 2 would read:

"The treatment of ActualText as a character replacement is different from the treatment of Alt, which is treated as descriptive text."

car222222 commented 3 years ago

Duff's suggestion looks good to me.

mrbhardy commented 3 years ago

Reading through this very carefully, I'm not sure we should make this change. While I acknowledge that the wording isn't perfect, it is consistent across ISO 32000-1 and 32000-2. The substitution isn't saying that an action has to be performed, but how a processor might choose to use the information.

For the ActualText, it is a character replacement, meaning that for all the characters in a given ActualText, those characters would be expected to be present to a sighted user consuming the document, even when not represented using glyphs. Alt text is a descriptive substitution, intended to describe the content. This can still be at a single object level (e.g. "five sided star") or at a higher level (e.g. "five sided star representing the ...").

I'm fine with tweaking this if that isn't clear, but that is the intent. The change proposed by Duff wouldn't provide that clarity I believe. @DuffJohnson @PaulRayius

DuffJohnson commented 3 years ago

I'm not sure how the text being consistent across both editions is an argument that the NOTE is as clear as it could or should be. The proposed change to the note appears (to me) to better-align the provision with the Alt property's definition in Table 355 so I don't see the problem per se.

However, regarding this...

Example Scenario: You have five Paths that, together, visually create the word "Hello" in a document. Should each Path be placed in its own Span and given ActualText that corresponds to its letter? Should they all be placed in one Span and given the ActualText of "Hello"?

Either is fine; there's no real distinction from a specification point of view.

Or, should they all be placed in one Span and given Alt text of "Hello"?

This would NEVER be correct...

The Issue: The above definitions, and Note 2, could be taken to imply that, in the scenario above, all of those Paths should be placed in individual Span tags, for example, and given the ActualText of their intended letter, as opposed to placing all of them in a single Span and using Alternative text.

I do not understand why you would ever think of Alt for the example given.

In addition, because Note 2 states that Alt is intended for "whole words or phrases" and ActualText is intended for "character replacement", it implies that Alt text would be more appropriate than ActualText, except that the definition of Alt text is "an alternative description". But a "description" of the word that was created with Paths wouldn't be what's really needed.

IMO, the reader is expected to work with the definition and not be distracted by less-than-perfect wording in a Note.... but that said I don't see the harm in the proposed tweak to the Note, above.

mrbhardy commented 3 years ago

Maybe I wasn't quite clear. I was suggesting that we should be consistent between 14.9.3 and 14.9.4 and I'm suggesting the change would be inconsistent. A second issue is that I don't think the new text clears up the issue. I think the confusion is around substitution and what that means. Both Alt and ActualText are "substitutions", but the intent of that substitution is different (and always optional).

DuffJohnson commented 3 years ago

I now get your point and have to concur; thanks. Need to look at this some more.

DuffJohnson commented 3 years ago

So... to resolve along the lines of addressing the confusion around substitution... perhaps change the NOTE to:

"...which is treated as a whole word or phrase substitution for descriptive purposes (see Table 355)." ?

car222222 commented 3 years ago

Here is some further textual analysis related to this possible confusion, with some suggestions for improvement of the language used.

I understand the reluctance to change any of the existing text, so maybe an extra informative and clarifying paragraph could be added that explains in one place (using straightforward and consistent language) the correct interpretation of all the relevant phrases and sentences (see below for details of which these are).

-- As Matthew explained, the problems stem from the usage in these clauses of the word “substitution”, which occurs only in the following three sentences (plus NOTE 2 itself):

A: “the alternate description text . . . is a complete (or whole) word or phrase substitution for the current element.”

B: “The value of ActualText is a character substitution for the content enclosed by the structure element . . .”

C: “The E value (a text string) is a word or phrase substitution for the tagged text . . .”

In all these three cases it would be much better (correct, even!) English to not use the word “substitution” but instead to write: “to be substituted”
or maybe just
“substitute”.

Notes/Reasoning: The word “substitution” is normally understood to refer to the act of substituting/replacing (a player in a soccer team, for example) rather than to the material (the new player) that gets used as the substitute/replacement.

Additionally -- There are other problems with the above sentences:

  1. the phrase “a character substitution” is very misleading as it suggests just a single character!
    This would be better as follows: “The value of ActualText is a string of characters to be substituted . . .” or even better (as in the E case): “The value (a text string) of ActualText is a string of characters to be substituted . . .”
  2. in the sentence concerning Alt, the phrase “complete (or whole) word or phrase” is again misleading (and why does it need the alternate for “complete”?) since Alt can contain whole paragraphs of (natural language) text, so long as it is presented as a pure ‘text string’ with no formatting, mark-up, tags etc.

-- Following this route we would get the following more complete and consistent version of NOTE 2 (I prefer the use of “replacement” to “substitute” here):

“The treatment of ActualText, in which the substitute (replacement) text is simply a string of characters, is different from the treatment of Alt, in which the substitute (replacement) text consists of whole words, phrases or more, including multiple sentences and paragraphs, that form a complete description.”

BUT, as noted above, I am not suggesting this as a direct replacement for NOTE 2.

car222222 commented 3 years ago

I also noted another point of dubious (incorrect??) grammar in these texts.

This sentence is from 14.9.4, just after NOTE 1: "The ActualText value shall be used as a replacement, not a description, for . . . "

This could be ‘more correctly’ written thus: "The ActualText value shall be used as a replacement for, not as a description of, . . . "

DuffJohnson commented 3 years ago

Great suggestions, Chris, from my PoV. @mrbhardy ?

DuffJohnson commented 3 years ago

Summarizing the proposed changes... (I think)...

14.9.3, P5, replace:

"When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a complete (or whole) word or phrase substitution for the current element."

With

"When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a string of characters to be substituted for the current element."

14.9.4, P2, replace:

"The ActualText value shall be used as a replacement, not a description, for the content, providing text that is equivalent to what a person would see when viewing the content. The value of ActualText is a character substitution for the content enclosed by the structure element or marked-content sequence."

With

"The ActualText value shall be used as a replacement for, not as a description of, the content, providing text that is equivalent to what a person would see when viewing the content. The value of ActualText is a string of characters to be substituted for the content enclosed by the structure element or marked-content sequence."

14.9.5, P2, replace

"The E value (a text string) is a word or phrase substitution for the tagged text and therefore shall be treated as if a word break separates it from any surrounding text."

with:

"The E value (a text string) is a string of characters to be substituted for the tagged text and therefore shall be treated as if a word break separates it from any surrounding text."

14.9.4, NOTE 2, replace

"The treatment of ActualText as a character replacement is different from the treatment of Alt, which is treated as a whole word or phrase substitution."

with

"The treatment of ActualText, in which the substitute (replacement) text is simply a string of characters, is different from the treatment of Alt, in which the substitute (replacement) text consists of whole words, phrases or more, including multiple sentences and paragraphs, that form a complete description.”

car222222 commented 3 years ago

That last NOTE would better end as: ". . . description of the item."

mrbhardy commented 3 years ago

I have a couple of concerns with the proposed text, but most of it centers around the phrasing "to be substituted". This is concerning, because who is doing this substituting and under what conditions? We do not have processor instructions around this and nothing must be done with the ActualText or Alt text entries from a 32000-2 perspective.

There are two things we ARE trying to say here. The first is that ActualText represents a text equivalent for content and that Alt provides a description of the content to which it applies. The second is that when consuming elements where ActualText or Alt text is provided, a processor may choose to substitute that element's content with the text provided in the ActualText or Alt entries.

DuffJohnson commented 3 years ago

To my eyes the proposed text reflects deference to the author's intent rather than processor instructions per se.

Perhaps this would be clearer if the new text was "intended to be substituted"...?

DuffJohnson commented 3 years ago

Summarizing the proposed changes once again. Changes to the previous summary indicated via italics:

14.9.3, P5, replace:

"When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a complete (or whole) word or phrase substitution for the current element."

With

"When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a string of characters intended to be substituted for the current element."

14.9.4, P2, replace:

"The ActualText value shall be used as a replacement, not a description, for the content, providing text that is equivalent to what a person would see when viewing the content. The value of ActualText is a character substitution for the content enclosed by the structure element or marked-content sequence."

With

"The ActualText value shall be used as a replacement for, not as a description of, the content, providing text that is equivalent to what a person would see when viewing the content. The value of ActualText is a string of characters intended to be substituted for the content enclosed by the structure element or marked-content sequence."

14.9.5, P2, replace

"The E value (a text string) is a word or phrase substitution for the tagged text and therefore shall be treated as if a word break separates it from any surrounding text."

with:

"The E value (a text string) is a string of characters intended to be substituted for the tagged text and therefore shall be treated as if a word break separates it from any surrounding text."

14.9.4, NOTE 2, replace

"The treatment of ActualText as a character replacement is different from the treatment of Alt, which is treated as a whole word or phrase substitution."

with

"The treatment of ActualText, in which the substitute (replacement) text is simply a string of characters, is different from the treatment of Alt, in which the intended substitute (replacement) text consists of whole words, phrases or more, including multiple sentences and paragraphs, that form a complete description of the item.”

mrbhardy commented 3 years ago

I just don’t think we should change this text as described. This has nothing to do with author intent. It is the specification’s intent to provide two different (three if you count E) methods to provide alternative information for content.

One mechanism is a means of providing a textual equivalent to content that would be visibly understood to be text by a sighted user or consumed as text (ActualText).

One provides a description of a unit of content that is intended to be complete in its own right (Alt).

One provides an expansion for an abbreviation if a user requires it (E).

None of the above describe how these should be presented or used and there is no control of that by the author. It is just whether they are provided and then what action is being taken by the processor (e.g. text extraction might use ActualText, whereas a screen reader might use Alt). They could both be present on the same node.

The proposed text doesn’t clarify any of this that I can see and it is wrong to say that it is intended to replace or do something. That is up to the processor.

car222222 commented 3 years ago

@mrbhardy But the original need for some change here was not in any way related to "author intent".

So I hope you agree that some changes in wording are nevertheless needed to clarify the similarities and distinctions between these three.

car222222 commented 3 years ago

@mrbhardy originally wrote: "I have a couple of concerns with the proposed text, but most of it centers around the phrasing "to be substituted"."

So would most of these concerns be alleviated by changing "to be substituted" to something less prescriptive, such as "as a (possible) substitute" or (better grammar) "that can/may act as a substitute"?

car222222 commented 3 years ago

Another approach would be to completely remove the word "substitute" throughout. (I never like it much, anyway.)

DuffJohnson commented 3 years ago

Here's another oar in the water to keep the boat moving, and (notionally) incorporates Chris's approach while removing (as appropriate) processor instructions as per Matthew.

14.9.3, P5, replace:

"When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a complete (or whole) word or phrase substitution for the current element."

With

"When applied to structure elements, the alternate description text (see 7.9.2.2, "Text string type") is a string of characters representing the current element."

14.9.4, P2, replace:

"The ActualText value shall be used as a replacement, not a description, for the content, providing text that is equivalent to what a person would see when viewing the content. The value of ActualText is a character substitution for the content enclosed by the structure element or marked-content sequence."

With

"The ActualText value is a representation of the content enclosed by the structure element or marked-content sequence, and shall be used as a replacement, not a description, for the content, providing text that is equivalent to what a person would see when viewing the content."

14.9.5, P2, replace

"The E value (a text string) is a word or phrase substitution for the tagged text and therefore shall be treated as if a word break separates it from any surrounding text."

with:

"The E value (a text string) is a string of characters representing the tagged text and therefore shall be treated as if a word break separates it from any surrounding text."

14.9.4, NOTE 2, replace

"The treatment of ActualText as a character replacement is different from the treatment of Alt, which is treated as a whole word or phrase substitution."

with

"The treatment of ActualText, in which the representation is simply a string of characters, is different from the treatment of Alt, in which the text consists of whole words, phrases or more, including multiple sentences and paragraphs, that form a complete description of the item.”

car222222 commented 3 years ago

A small improvement to the grammar:

“. . . and shall be used as a replacement for, not a description of, the . . .”

Also, maybe we want to make the use here non-mandatory:

“. . . shall be used as . . .” => “. . . that is a . . .” or “. . . that can be used as . . .”

car222222 commented 3 years ago

A more substantial point: should the Alt entry also cover Marked Content, as does the ActualText entry? Thus:

"When applied to structure elements or marked content, the alternate description text (see 7.9.2.2, "Text string type") is a string of characters representing the enclosed content."

PaulRayius commented 3 years ago

A small improvement to the grammar:

“. . . and shall be used as a replacement for, not a description of, the . . .” Paul: Yes, I agree with the above.

Also, maybe we want to make the use here non-mandatory:

“. . . shall be used as . . .” => “. . . that is a . . .” or “. . . that can be used as . . .” Paul: I think we should stick with the "shall." As for "can be used": I don't think we're allowed to use "can" in ISO-ese, are we? (Of course, I realize we can replace "can" with "may." But, still, I think this should be more "mandatory" in nature.)

u-fischer commented 2 years ago

From the point of view of a non-native English speaker: it would help a lot if there were not only exact English wording but also some concrete examples of the actions in which the keys are typically used or useful. E.g., which of the keys are normally considered for copy&paste, reading by a screen reader, or export to HTML?

faceless2 commented 2 years ago

I'm all for actual examples too, although I'm not sure they should include describing how a screenreader is to handle these attributes as that's a process issue. Forgive my slightly naff effort, but perhaps something like

image ... which I would mark up as

<P>
 Very
 <Figure ActualText="Good" Alt="the word Good, with the o's drawn as a smiley face"/>
</P>
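
As a rough illustration of how the two attributes in this markup differ, the following Python sketch parses the pseudo-markup above with the standard library's `xml.etree.ElementTree` and flattens it twice: once preferring ActualText and once preferring Alt. The `flatten` helper is hypothetical, and joining pieces with a single space is a simplification made for this sketch. Which key a real processor consults, and when, is left to the processor.

```python
import xml.etree.ElementTree as ET

MARKUP = """<P>
 Very
 <Figure ActualText="Good" Alt="the word Good, with the o's drawn as a smiley face"/>
</P>"""


def flatten(elem: ET.Element, key: str) -> str:
    """Collect text in document order, using attribute `key` where an element has it."""
    parts = []
    if key in elem.attrib:
        # The attribute stands in for the element and everything inside it.
        parts.append(elem.attrib[key])
    else:
        if elem.text and elem.text.strip():
            parts.append(elem.text.strip())
        for child in elem:
            parts.append(flatten(child, key))
            if child.tail and child.tail.strip():
                parts.append(child.tail.strip())
    return " ".join(p for p in parts if p)


root = ET.fromstring(MARKUP)
print(flatten(root, "ActualText"))  # Very Good
print(flatten(root, "Alt"))         # Very the word Good, with the o's drawn as a smiley face
```

The first result reads as the text a sighted user would see; the second reads as a description of the figure.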

re. other points:

There are already issues with ActualText being allowed on the tag AND the Marked Content sequence (what if it's on both and they disagree?). I think in general we shouldn't be putting attributes here unless they're useful (better: required) in an untagged context. They're harder to edit and lead to ambiguity if they're allowed in both places.

car222222 commented 2 years ago

I agree with @faceless2 : I'd go with alt "describing the current element" rather than "representing the current element".

This is an important distinction.

car222222 commented 2 years ago

@faceless2: "I think a definite no to applying "alt" to MarkedContent".

I would not disagree with this sentiment, but "alt" is allowed there, is it not? Should this be changed?

faceless2 commented 2 years ago

I would not disagree with this sentiment, but "alt" is allowed there, is it not? Should this be changed?

Ugh. Yes it is allowed, and explicitly listed in 32K. So I suppose, for consistency, your earlier suggested change on 31 July has to stand.

Whether we continue to recommend its use in PDF is another matter, I'm for "no" myself for the reasons stated. Technically you could set any properties on a marked sequence, whether they're useful depends on whether anything consumes them. I did some screenreader testing a while back and tested ActualText on a marked-sequence is definitely honoured, but I didn't check "Alt" unfortunately.

romantoda commented 2 years ago

Also, Lang is used/allowed in the property list of marked content, and screen readers are reading it. And of course, as Mike mentioned, any other private data is allowed. It is also the only way to add "properties/attributes" to untagged content items (in 1.7 we didn't have the Artifact structure element, for example). I think in 2.0 we should not close the door; for PDF/UA I can see the reasoning.

Roman

car222222 commented 2 years ago

Please note that Paul’s original concerns, and my efforts at suggesting suitable language, are both confined to clarification of the text, and are not related to changing anything in the intent of the language.

My extra point is that a good way to clarify the language is to make it consistent across the texts describing each of these three keys.

car222222 commented 2 years ago

Today's session did not discuss any of the many clarifications needed in the detailed language.

faceless2 commented 2 years ago

A bit of a peripheral observation based on the call today: Frank observed that the LaTeX logo visually includes a capital "A" but logically a lower-case "a". Matthew noted that if you were exporting this to (e.g.) InDesign you might want to preserve the visual characteristics rather than the logical ones, but that this might not be the case when exporting for other reasons, e.g. accessibility.

There is an extremely long and well-informed thread on the same topic from the CSS working group at https://github.com/w3c/csswg-drafts/issues/3775. I'm hesitant to even attempt to sum it up, but what I took away from it was that if they had to choose between the logical and visual value, the screen readers generally want the visual value; they want the same information as a sighted user (ideally they'd get both, but the a11y APIs typically don't expose this).

It's much less of an issue in PDF than in HTML+CSS; in HTML the logical value is stored, so you might have <p style="text-transform:small-caps">LaTeX</p>, while in PDF we work with glyphs, so it's stored "pre-transformed" as LATEX in the file. But if the ActualText was set to "LaTeX" we'd theoretically face the same decision as in HTML.

Again, we work with glyphs in PDF, so ActualText will typically be applied to images or glyph-sequences where reconstructing the Unicode value isn't possible - it's not like we have a choice in those cases. But specifically when two equally valid versions of the content could be exported, forming early conclusions about which is the correct one for accessibility is a bit risky. I don't think any action is required to the text to clarify this, I just wanted to make everyone aware of the issue.

u-fischer commented 2 years ago

I did some screenreader testing a while back and tested ActualText on a marked-sequence is definitely honoured,

I just made some tests with this, and it looks as if it is actually required to add the /ActualText on the marked content; on a structure element it is ignored. And even on marked content one can't be sure that it is used: I added Ä here in various places.

actualtext.pdf https://github.com/pdf-association/pdf-issues/files/7190763/actualtext.pdf

Result of copy&paste in adobe is

some test Ä more text
some test B more text
some test more text
some test more text

So only in one case the actualtext is honored.

mrbhardy commented 2 years ago

Did you try it on a Span tag or Figure?

u-fischer commented 2 years ago

Did you try it on a Span tag or Figure?

The pdf I attached used Figure, but at first I also tried Span and it wasn't better.

Note also that the behaviour depends on the type of the graphic: the first two are PDFs (the B is from example-image-B.pdf), and there the MC ActualText works; the other two are PNGs, and there ActualText is ignored.

mrbhardy commented 2 years ago

@u-fischer I only realized on re-reading that you are using copy+paste to test, but there's no expectation that ActualText in the structure tree impact copy+paste. As we've described before, ActualText is metadata for a textual equivalent to content (either in the page content stream with an MCS or referenced from the structure tree).

It isn't surprising that a plain copy+paste uses the MCS-based replacement but ignores structure, because copy+paste has no reason to look at the structure tree. I might even expect different behaviour between plain copy+paste and copy+paste with formatting (though I haven't tested).

If you use a screen reader with AT to consume this, I'd expect both to behave the same. It is up to a processor to determine when to use this information and to expose it. PDF/UA offers guidance when providing the information to an AT system.

petervwyatt commented 2 years ago

This was previously discussed in the PDF-UA TWG. Matthew will provide a record of those discussions at the next PDF TWG.

u-fischer commented 2 years ago

there's no expectation that ActualText in the structure tree impact copy+paste

@mrbhardy Well, personally I didn't expect it. But then why did @faceless2 add /ActualText to the Figure above? Does it have any purpose?

It is up to a processor to determine when to use this information and to expose it

Well, one certainly can't force a processor to use the information at all. But if the processor uses it, there should be a certain agreement on how it is used.

mrbhardy commented 2 years ago

@u-fischer the reason @faceless2 added it to the figure is because users of AT have the document read to them from the structure tree (as in, reading is driven by walking the tree in a depth-first pre-order traversal). For a speech engine, the ActualText is used in both cases.
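
The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `StructElem` and `reading_order` are hypothetical names, the tree is a simplified stand-in for a real structure tree, and the sketch assumes that an ActualText entry stands in for the element's content and that of its children, as in the Table 355 definition.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class StructElem:
    """Hypothetical, simplified structure element (not a real PDF library type)."""
    tag: str
    text: str = ""                      # text of the enclosed content, if any
    actual_text: Optional[str] = None   # ActualText entry, if present
    kids: List["StructElem"] = field(default_factory=list)


def reading_order(elem: StructElem) -> Iterator[str]:
    """Depth-first, pre-order walk: visit the element, then its kids left to right."""
    if elem.actual_text is not None:
        # ActualText stands in for the element's content and its children.
        yield elem.actual_text
        return
    if elem.text:
        yield elem.text
    for kid in elem.kids:
        yield from reading_order(kid)


doc = StructElem("Document", kids=[
    StructElem("H1", text="Greeting"),
    StructElem("P", kids=[
        StructElem("Span", text="Very"),
        StructElem("Figure", actual_text="Good"),
    ]),
])

print(list(reading_order(doc)))  # ['Greeting', 'Very', 'Good']
```

Because the walk is driven by the tree, ActualText attached to the Figure element is found even when nothing in the page content stream carries it.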

car222222 commented 2 years ago

@mrbhardy Thanks for this information about 'how AT works (in this case)'.

Are there official sources for such information about 'how AT works'? Are these generic requirements, or just recommendations, concerning how AT shall/should process a document? Or just empirical facts, or 'widely believed statements' about AT.

Is the term AT itself clearly defined somewhere?

Is this a requirement on all AT processors? Where does it come from? : "users of AT have the document read to them from the structure tree" and what about this very precise directive? : "reading is driven by walking the tree in a depth-first pre-order traversal".

Is there somewhere a list of such facts(?) or expectations of how AT processors should (or do) behave?

u-fischer commented 2 years ago

@mrbhardy OK, I made another test and let Adobe Reader read the text. I also tried the HTML export with ngpdf.com and Adobe Pro. The PDF I used is attached, and here are the results I get.

Image is a PDF with own text inside

| /ActualText on (Struct, MC) | reading | copy & paste | html (ngpdf) | html (adobe) |
|---|---|---|---|---|
| x | OK | OK | bad (only (wrong) image) | bad (only image) |
| x | bad (text in pdf is read too) | bad (actualtext ignored) | OK | bad (only image) |
| x x | bad (actualtext read twice) | OK | OK | bad (only image) |

Image is a PNG

| /ActualText on (Struct, MC) | reading | copy & paste | html (ngpdf) | html (adobe) |
|---|---|---|---|---|
| x | OK | bad (actualtext ignored) | bad (image) | bad (nothing) |
| x | OK | bad (actualtext ignored) | OK | bad (nothing) |
| x x | bad (actualtext read twice) | bad (actualtext ignored) | OK | bad (nothing) |

exa-0004-actualtext-on-graphics-test-reading.pdf

So please what should a pdf producer or a user do here? Which of the bad results are expected and which are implementation errors which will be resolved at some time?

DuffJohnson commented 2 years ago

So... this is another wrinkle. :-(

Adobe's "Read Aloud" feature - if that's what you are referring to - isn't a screen-reader, and isn't expected to perform as a screen-reader would.

To test the AT experience it's necessary to use an AT consumer, such as a screen-reader, that leverages the respective accessibility API - a consumer such as NVDA, JAWS, WindowEyes.

u-fischer commented 2 years ago

@DuffJohnson I'm not very good with NVDA but made a short test, and as far as I can see it has the same reading behaviour as Adobe's Read Aloud: I got a duplicated "hello world", and the "don't read this" is audible too. And isn't that the expected output? I thought they would use the same API.

lrosenthol commented 2 years ago

@car222222 @u-fischer @DuffJohnson

There are requirements for an AT device in PDF/UA, but nothing in 32K proper. Any integration between a standard PDF processor and an AT device is "implementation dependent" and "out of scope".

If someone wished to develop a new technical specification for such operations - I am sure that ISO TC 171/SC 2/WG 9 would welcome their contributions.

DuffJohnson commented 2 years ago

@u-fischer The duplication in your 3rd case is expected - if both the MC and structure element include ActualText then both iterations will (likely) be represented through any given accessibility API. In the real world we don't expect this case.

I don't know the details of API usage in varying cases. @bdoubrov would be able to shed light on what ngpdf is doing here.

car222222 commented 2 years ago

I like this suggestion from @lrosenthol for such a technical specification document. Even if it does not contain much in the way of mandates or recommendations, it would provide a very useful language and background for reasoning about this area and analysing what is (and could be) the actions and provisions of processors, together with their APIs, etc.

PaulRayius commented 2 years ago

I like this suggestion, too. If this technical specification were to happen then maybe, over time, there would be less frustration from end users about how their AT handles PDFs. (Assuming that it actually was implemented.)

Paul Rayius Vice-President of Training CommonLook

zauguin commented 2 years ago

@DuffJohnson

The duplication in your 3rd case is expected - if both the MC and structure element include ActualText then both iterations will (likely) be represented through any given accessibility API.

Personally I consider this to be rather unexpected. The spec says about ActualText:

Text that is an exact replacement for the content enclosed by the structure element and its children

Since the MC is a child of the structure element, I would expect the ActualText to provide a replacement for all its MC children too. I don't see why the text replacement for elements which have already been replaced by something else should have any effect.

In the real world we don't expect this case.

I would see at least two use cases in the real world:

  1. If there is a (relatively) big structure element which can only be represented by completely replacing it with ActualText, but the document creator additionally wants to provide replacements for smaller components to at least approximate their meaning when a user wants to look only at parts of the text or if the PDF reader does not look into the structure tree.
  2. Ulrike's use case of improving compatibility: since neither of the options is guaranteed to be seen by every PDF reader, a PDF writer might want to ensure that particularly important replacements are always applied by providing them in both forms.

In any case, I think that if the intended meaning is that the replacement text of the structure element and its children (all children or only MCs?) should be combined, it would be good to add a note in the definition of ActualText to clarify that only children without ActualText will be replaced.
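The two readings discussed here can be put side by side in a small sketch. It assumes a simplified, hypothetical `Node` type standing in for both structure elements and marked-content sequences; neither function is taken from the spec or from an existing implementation, and the sample strings in the usage note are borrowed from the test file described earlier in the thread:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical node: a structure element (has kids) or a marked-content leaf."""
    text: str = ""                     # drawn text, only meaningful on leaves
    actual_text: Optional[str] = None  # /ActualText, if present
    kids: List["Node"] = field(default_factory=list)


def speak_additive(node: Node) -> List[str]:
    """Emit every ActualText met on the way down; parent and child are both used."""
    out: List[str] = []
    if node.actual_text is not None:
        out.append(node.actual_text)
    if node.kids:
        for kid in node.kids:
            out.extend(speak_additive(kid))
    elif node.actual_text is None and node.text:
        out.append(node.text)
    return out


def speak_replacing(node: Node) -> List[str]:
    """Treat ActualText as a replacement for the node and all of its children."""
    if node.actual_text is not None:
        return [node.actual_text]  # do not descend: the children are replaced
    if node.kids:
        out: List[str] = []
        for kid in node.kids:
            out.extend(speak_replacing(kid))
        return out
    return [node.text] if node.text else []
```

For a Figure carrying ActualText "hello world" whose only child is a marked-content sequence that also carries "hello world", `speak_additive` returns the text twice and `speak_replacing` returns it once.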

DuffJohnson commented 2 years ago

@DuffJohnson

The duplication in your 3rd case is expected - if both the MC and structure element include ActualText then both iterations will (likely) be represented through any given accessibility API.

Personally I consider this to be rather unexpected. The spec says about ActualText:

Text that is an exact replacement for the content enclosed by the structure element and its children

As I understand it implementations are not obliged to reuse tagged PDF using the tags in all cases. As @mrbhardy has often reminded us (please correct me if I'm getting this wrong), a copy-paste function (for example) might elect to use ActualText from the MC level without considering the structure.
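A rough sketch of such a structure-unaware copy function, assuming the page content has already been reduced to a list of marked-content sequences in content-stream order. The `(drawn_text, actual_text)` tuple layout and the single-space word break between ordinary sequences are assumptions made for illustration; real processors derive word breaks from glyph geometry. The one rule taken from the spec text quoted at the top of this issue (14.9.4) is that consecutive sequences that each have ActualText are joined without a word break:

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (text drawn by the sequence, its /ActualText or None)
Sequence = Tuple[str, Optional[str]]


def copy_text(sequences: List[Sequence]) -> str:
    """Build copied text from marked-content sequences, ignoring the structure tree."""
    out = ""
    prev_had_actual = False
    for drawn, actual in sequences:
        has_actual = actual is not None
        piece = actual if has_actual else drawn
        if not piece:
            continue  # nothing to contribute (e.g. an untagged image)
        # No word break between two consecutive sequences that both have ActualText.
        if out and not (prev_had_actual and has_actual):
            out += " "
        out += piece
        prev_had_actual = has_actual
    return out
```

With five path-only sequences carrying ActualText "H", "e", "l", "l", "o", this yields "Hello", which is the per-letter option from the scenario that opened this issue.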

Since the MC is a child of the structure element, I would expect the ActualText to provide a replacement for all its MC children too. I don't see why the text replacement for elements which have already been replaced by something else should have any effect.

And yet the MC exists independently of the structure element (ISO 32000, clause 14.7), so processing the structure element is only one processing option... HOWEVER, my comment did refer to using the accessibility API, which implies use of tags... so you are right.

In the real world we don't expect this case. I would see at least two use cases in the real world:

  1. If there is a (relatively) big structure element which can only be represented by completely replacing it with ActualText, but the document creator additionally wants to provide replacements for smaller components to at least approximate their meaning when a user wants to only look at parts of the text or if PDF reader does not look into the structure tree.

Agreed, this is a good case; Matthew has often made a similar point.

  2. Ulrike's use case of improving compatibility: since neither of the options is guaranteed to be seen by every PDF reader, a PDF writer might want to ensure that particularly important replacements are always applied by providing them in both forms.

Fair enough... so as per Chris, some processing rules (which we try to avoid in this file-format spec) are implied here...

In any case, I think that if the intended meaning is that the replacement text of the structure element and its children (all children or only MCs?) should be combined, it would be good to add a note in the definition of ActualText to clarify that only children without ActualText will be replaced.

I fear that the fix may be more extensive as per @lrosenthol's suggestion. Building on what @u-fischer has done, perhaps we simply need to grid this out in detail in order to establish consensus on how it should work. We can then hopefully get such theoretical rules compared to extant implementations and draw some conclusions...
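One way to start gridding this out is to enumerate the cases mechanically and fill in the agreed behaviour afterwards. The sketch below is only a scaffold: the dimension values are copied from @u-fischer's tables, while the field names and the `expected` slot are hypothetical placeholders:

```python
from itertools import product

# Dimensions taken from the test tables earlier in this thread.
IMAGE_KINDS = ["PDF with own text", "PNG"]
ACTUALTEXT_ON = [("Struct",), ("MC",), ("Struct", "MC")]
CONSUMERS = ["reading", "copy & paste", "html (ngpdf)", "html (adobe)"]


def build_grid():
    """Return one record per combination, with the expected result left open."""
    return [
        {
            "image": image,
            "actualtext_on": placement,
            "consumer": consumer,
            "expected": None,  # to be filled in once consensus is reached
        }
        for image, placement, consumer in product(IMAGE_KINDS, ACTUALTEXT_ON, CONSUMERS)
    ]
```

Each record can then be compared against what extant implementations actually do for the same combination.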

car222222 commented 2 years ago

@DuffJohnson suggested:

"perhaps we simply need to grid this out in detail in order to establish consensus on how it should work"

Can we clarify the referents here: i.e., precisely what is meant by “this” and “it” here?

This sounds like a good way forward, assuming that they both refer to something like this:

"the examples and analysis done by @u-fischer on the treatment of /ActualText in a variety of use-cases".

I am unsure how simple this will turn out to be!