latex3 / tagpdf

Tagging support code for LaTeX

Your PDF examples do not have a valid ParentTree #4

Closed ozross closed 6 years ago

ozross commented 6 years ago

Hi Ulrike. Clearly you have done a lot of exploratory work so far, preparing this package. But you have left out one vital part of the Tagged PDF structure, the Parent Tree. None of your examples include this information, yet the PDF Spec (p. 557) says:

Table 322 – Entries in the structure tree root

ParentTree — number tree (Required if any structure element contains content items)

A number tree (see 7.9.7, “Number Trees”) used in finding the structure elements to which content items belong. Each integer key in the number tree shall correspond to a single page of the document or to an individual object (such as an annotation or an XObject) that is a content item in its own right. The integer key shall be the value of the StructParent or StructParents entry in that object (see 14.7.4.4, “Finding Structure Elements from Content Items”). The form of the associated value shall depend on the nature of the object:

- For an object that is a content item in its own right, the value shall be an indirect reference to the object’s parent element (the structure element that contains it as a content item).
- For a page object or content stream containing marked-content sequences that are content items, the value shall be an array of references to the parent elements of those marked-content sequences.

See 14.7.4.4, “Finding Structure Elements from Content Items” for further discussion.
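As a rough illustration of the quoted definition, here is a minimal Python sketch of a parent tree. It is only a model, not how any PDF library stores the data: the function names `build_nums` and `find_parent`, the string placeholders for structure elements, and the example keys are all assumptions made for illustration.

```python
def build_nums(parent_tree):
    """Flatten {key: value} into a /Nums-style list [key, value, key, value, ...],
    with keys in ascending order, as a number tree requires."""
    nums = []
    for key in sorted(parent_tree):
        nums.extend([key, parent_tree[key]])
    return nums


def find_parent(parent_tree, key, mcid=None):
    """Look up the parent structure element of a content item.

    key  -- the StructParents value (page or content stream) or the
            StructParent value (annotation, XObject, ...)
    mcid -- the MCID of a marked-content sequence, or None for an object
            that is a content item in its own right
    """
    value = parent_tree[key]
    if mcid is None:
        return value      # a single reference to the parent element
    return value[mcid]    # an array of parents, indexed by MCID


# Hypothetical document: key 0 is a page with two marked-content
# sequences, key 1 is an annotation that is a content item itself.
parent_tree = {0: ["P_elem", "P_elem"], 1: "Link_elem"}
```

With this model, `find_parent(parent_tree, 0, 1)` gives the paragraph element for MCID 1 on the page, while `find_parent(parent_tree, 1)` gives the parent of the annotation directly.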

Without a valid ParentTree you cannot create anything but the simplest of documents.

I suspected a problem of this kind when I looked at ex-mc-manual-para-split.pdf using Acrobat Pro DC. When I asked to look at the Tags tree, it crashed the whole application. The reason for this, I think, is that you have a paragraph splitting across 2 pages. It displays OK, but the tagging is quite wrong.

viz.

1 0 obj << /Type /StructTreeRoot /K [<</Type /MCR /Pg 4 0 R /MCID 1>> <</Type /MCR /Pg 4 0 R /MCID 1>>] /ParentTree 2 0 R /RoleMap 3 0 R>> endobj

This refers to 2 Kids, both on the 1st page (object 4) and both having MCID 1. But the only content on page 1 has MCID 0. The 2nd page stream (object 11, associated with Page object 10) does have an MCID 1. But that is still wrong, as MCIDs are normally reset to start from 0 on each page. I am not sure if this is a hard and fast rule, but it is certainly a practical one.
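This kind of mismatch can be detected mechanically. The following Python sketch assumes the marked-content references and the MCIDs found in each page stream have already been extracted from the file; the function name `dangling_mcrs` and the data layout are hypothetical, and the values are the ones described above.

```python
def dangling_mcrs(kids, mcids_on_page):
    """Return the marked-content references (page, mcid) that point at
    an MCID which does not exist on the page they name."""
    return [
        (page, mcid)
        for page, mcid in kids
        if mcid not in mcids_on_page.get(page, set())
    ]


# Pages are identified by the object number of their Page object.
root_kids = [(4, 1), (4, 1)]         # both MCRs say /Pg 4 0 R /MCID 1
mcids_on_page = {4: {0}, 10: {1}}    # page 1 has MCID 0, page 2 has MCID 1

print(dangling_mcrs(root_kids, mcids_on_page))  # prints [(4, 1), (4, 1)]
```

Both references are reported, because neither names an MCID that is actually present on page object 4.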

The role of the Parent Tree is to record the parent object numbers of each MCID; there are examples on page 566 of the PDF 1.7 Specification.

Besides, you really shouldn't have content directly as a Kid of the /StructTreeRoot. Normally there would be an /Art (article), /Document or /Part structure there (see Table 333 – Standard structure types for grouping elements), with perhaps a /P (paragraph) as a kid, with the MCIDs then as kids of this /P.

This is all very tricky stuff to get correct, initially. But once you have grasped all the concepts, they fit together very well, allowing quite complicated tagging structures to be created in a fully consistent and robust way.

Hope this helps.

See you in Rio. Ross

u-fischer commented 6 years ago

Hello Ross,

No, I didn't forget the parent tree. ex-mc-manual-para-split shows only how to insert the BDC/EMC markers in the page stream. There is no structure, so there are no parents for the MCIDs and nothing to put into the parent tree. Check one of the examples in the structure folder here on GitHub for an example which includes structure objects and a parent tree, e.g. https://github.com/u-fischer/tagpdf/blob/master/source/examples/structure/ex-patch-sectioning-koma.tex.

I will add some comments to this example to make this clear (and change activate-all to activate-mc; activate-all is wrong here).

u-fischer commented 6 years ago

Corrected with commit https://github.com/u-fischer/tagpdf/commit/259995d6d6d3b93c94862a8c2409516fa3af5880.

ozross commented 6 years ago

Hi Ulrike,

On 14/07/2018, at 17:54, "u-fischer" wrote:

> Hello Ross,
>
> No, I didn't forget the parent tree. ex-mc-manual-para-split shows only how to insert the BDC/EMC markers in the page stream. There is no structure, so there are no parents for the MCIDs and nothing to put into the parent tree.

Well, clearly this document is broken. It crashed a PDF reader, so I was curious to try to find out what was wrong with it. One can learn a lot from errors.

There were several things that I found.

Firstly, you have 2 pages. Each should have an entry in the parent tree. Each such entry should be a reference to an array object. Those arrays might be empty, but there should still be such an object.
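Following that suggestion, a parent tree could be built so that every page gets an array, even an empty one. This is a minimal Python sketch under that assumption; the function name `page_parent_arrays`, the zero-based page keys and the string placeholders for structure elements are hypothetical.

```python
def page_parent_arrays(page_count, parents_by_page):
    """Return one parent array per page, using an empty array for any
    page that has no tagged marked-content sequences."""
    return {
        page: list(parents_by_page.get(page, []))
        for page in range(page_count)
    }


# Two pages; only the first one has a tagged snippet (MCID 0).
tree = page_parent_arrays(2, {0: ["P_elem"]})
print(tree)  # prints {0: ['P_elem'], 1: []}
```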

Secondly, although you say above that there is no structure, you actually have a /K array for the /StructTreeRoot, so that is defining structure. Both of those entries point to an MCID which doesn't exist on the stated page.

Thus although there may be no syntax errors here, the structure is certainly inconsistent. So it's not surprising that a PDF reader gets confused when trying to build a representation of the structure tree. Of course a viewer that isn't looking at structure is unaffected by this.

The real lesson here is that a valid Tagged PDF document cannot be built up by considering parts of the various trees and arrays in isolation. All must be done together. At least that's the way it has to appear in the final document.

That is, when you assign an MCID number to a snippet of textual content, you also need to

  1. add an element to the parent tree of that page;
  2. add an entry to the parent structure's /K array.

Thus there are 3 things that need to be done together: assigning the MCID, plus the two steps above. If not, things can easily get out of whack, with unpredictable results when the document is viewed. And this is without also considering whether there is a need for /Alt, /ActualText or /E attributes.

Item 2 above can be tricky, especially when the parent structure was introduced with a /Pg key for a page different to where the current text snippet occurs; e.g., for a paragraph that splits across different pages.
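The three steps can be sketched in Python as one operation, so that they cannot drift apart. This is only a model under stated assumptions: the names `StructElem`, `Tagger` and `add_content` are hypothetical, pages are keyed by a plain page number rather than a StructParents value, and MCIDs restart at 0 on every page. A kid on the element's own /Pg page is stored as a bare MCID; a kid on another page is stored as a marked-content reference carrying its own page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StructElem:
    tag: str                      # structure type, e.g. "P"
    page: Optional[int] = None    # the element's /Pg, set by its first content item
    kids: list = field(default_factory=list)  # entries of the /K array


class Tagger:
    def __init__(self):
        self.next_mcid = {}    # page -> next free MCID on that page
        self.parent_tree = {}  # page -> list of parent elements, indexed by MCID

    def add_content(self, elem, page):
        """Tag one snippet of content on `page` as a kid of `elem`."""
        # Step 0: assign the MCID (numbering restarts at 0 on each page).
        mcid = self.next_mcid.get(page, 0)
        self.next_mcid[page] = mcid + 1
        # Step 1: record the parent in that page's parent tree array.
        self.parent_tree.setdefault(page, []).append(elem)
        # Step 2: record the content item in the parent's /K array.
        if elem.page is None:
            elem.page = page
        if elem.page == page:
            elem.kids.append(mcid)
        else:
            # The paragraph continues on a different page than its /Pg.
            elem.kids.append({"Type": "MCR", "Pg": page, "MCID": mcid})
        return mcid


tagger = Tagger()
para = StructElem("P")
tagger.add_content(para, page=1)   # first half of the paragraph
tagger.add_content(para, page=2)   # second half, on the next page
print(para.kids)  # prints [0, {'Type': 'MCR', 'Pg': 2, 'MCID': 0}]
```

Because all bookkeeping happens in a single call, every MCID that exists in a page stream also has a parent tree entry and a matching kid in its structure element.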

> Check one of the examples in the structure folder here on GitHub for an example which includes structure objects and a parent tree, e.g. https://github.com/u-fischer/tagpdf/blob/master/source/examples/structure/ex-patch-sectioning-koma.tex.
>
> I will add some comments to this example to make this clear (and change activate-all to activate-mc; activate-all is wrong here).

OK. I'll check out the internals of the PDF to see what is the effect of this.

BTW, this wasn't the only PDF with an empty /Nums array. But it was the only one which crashed Acrobat. :–)


Cheers,

Ross

u-fischer commented 6 years ago

Hello Ross,

Yes, the example was broken. I corrected this.

Regarding some of your other comments:

ozross commented 6 years ago

Hi Ulrike,

On 15/07/2018, at 0:26, "u-fischer" wrote:

> Hello Ross,
>
> Yes, the example was broken. I corrected this.
>
> Regarding some of your other comments:

These are certainly valid points. Of course the ultimate aim is to produce documents conforming to published standards, e.g., for accessibility and/or archivability. This imposes more stringent conditions than just being partially tagged. For accessibility, every snippet of text must be tagged, either with an MCID or as Artifact.
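That last condition can be sketched as a check over a page's content stream. The Python below is a simplified model, not a PDF parser: it assumes the stream has already been turned into a list of `(operator, tag, properties)` tuples, and the function name `untagged_text_ops` is hypothetical. It reports text-showing operators that are not inside any marked-content sequence that either carries an MCID or is tagged Artifact.

```python
# Text-showing operators in a PDF content stream.
TEXT_OPS = {"Tj", "TJ", "'", '"'}


def untagged_text_ops(ops):
    """Return the indices of text-showing operators that are neither
    inside an MCID-bearing sequence nor inside an Artifact."""
    stack = []     # one flag per open marked-content sequence
    untagged = []
    for index, (name, tag, props) in enumerate(ops):
        if name in ("BMC", "BDC"):
            stack.append(tag == "Artifact" or "MCID" in props)
        elif name == "EMC":
            stack.pop()
        elif name in TEXT_OPS and not any(stack):
            untagged.append(index)
    return untagged


ops = [
    ("BDC", "P", {"MCID": 0}),
    ("Tj", None, {}),           # tagged paragraph text
    ("EMC", None, {}),
    ("Tj", None, {}),           # text outside any marked content
    ("BMC", "Artifact", {}),
    ("Tj", None, {}),           # e.g. a page number, marked as Artifact
    ("EMC", None, {}),
]
print(untagged_text_ops(ops))  # prints [3]
```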

As you say, your package is for experimenting with tagging more generally, so is useful for this, by revealing some of the concepts and structures that ultimately need to be kept 'under the hood', so to speak.

My approach, as you'll see in Rio, is to tackle the issue from an author's point of view. That is, the author has already encoded what they want to say in their LaTeX source. This has to be the bulk of the input that leads to a fully tagged, conforming document. So the question becomes: what else needs to be done when processing this document source? What problems arise due to typographical considerations, whether at the LaTeX level or due to TeX itself, and due to the need to get valid and conforming PDF output? What extra must an author add to allow the desired result to be achieved? How can this extra be kept minimal?

The TeX community, through its developers and users, needs to build a richer understanding of tagged PDF in all aspects. Both you and I are contributing to that.

Cheers.

Ross
