pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Tagged PDF - should PDF 2.0 Artifacts be ignored for child ordering rules? #35

Closed faceless2 closed 3 years ago

faceless2 commented 3 years ago

In PDF2, Artifacts can be added to the document Structure Tree at any point in the tree. There are rules about tag ordering in section 14.8.4 which don't take this into account, and I'm not sure if this is an oversight (as this didn't apply in PDF1.7) or by design.

Specifically: a Caption has to be "the first or last structure element inside its parent structure element". If a Caption is preceded by an Artifact StructureElement in the tree (as it might well be if that Artifact was used to wrap the draw operations for the table background or border, for example) it's going to fail this rule.
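
To make the failure concrete, here is a minimal sketch of a literal reading of the rule. `Elem` and `caption_ok_strict` are hypothetical names made up for illustration, not part of any PDF library, and the structure tree is reduced to bare structure types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Elem:
    """Hypothetical stand-in for a structure element: a type plus its kids."""
    stype: str
    kids: List["Elem"] = field(default_factory=list)


def caption_ok_strict(parent: Elem) -> bool:
    """Literal reading: every Caption child must be the first or last child."""
    last = len(parent.kids) - 1
    return all(
        i in (0, last)
        for i, kid in enumerate(parent.kids)
        if kid.stype == "Caption"
    )


# An Artifact wrapping the border drawing operations precedes the Caption.
table = Elem("Table", [Elem("Artifact"), Elem("Caption"), Elem("TR"), Elem("TR")])
# The same table without the Artifact structure element.
plain = Elem("Table", [Elem("Caption"), Elem("TR"), Elem("TR")])

print(caption_ok_strict(table))  # False
print(caption_ok_strict(plain))  # True
```

Under this reading the only difference between a passing and a failing table is the presence of an element that carries no real content.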

Obviously in PDF1 this wasn't an issue as Artifacts were never part of the tree. Given that Artifacts represent "not real content" and can pop up anywhere (largely depending on the technical requirements of the tool creating the document), it seems to me their position relative to anything else shouldn't really matter; they should be transparent to any sort of restrictions on ordering of children. I am pretty certain the only time this restriction occurs is this rule for the Caption element.

I'll be sure to raise this in the PDF/UA-2 WG too, but as the restriction comes from the wording in ISO32K it really needs to be resolved there too. If it's intentional, it's not an insurmountable problem I'm sure, but wanted to flag it up in case it slipped through.

mrbhardy commented 3 years ago

@faceless2 I'd agree that it shouldn't matter, though I'd suggest a best practice is to follow the ordering rules for Captions to ensure correct processing.

faceless2 commented 3 years ago

Sure, agreed. But from a validation point-of-view, of course we have to decide one way or another.

In that case it might be an idea to change the text in table 372 from "The Caption shall be the first or the last structure element inside its parent structure element" to "The Caption shall be the first or the last structure element inside its parent structure element, except for Artifacts"

PaulRayius commented 3 years ago

As I read the example given above, I'm interpreting the "content" to be artifacted as being the border of the table (or maybe banding in rows, etc.), not related specifically to the caption. ISO 32K (Table 375) says the Artifact (structure element) "Encloses content for which semantics require a reference in the structure tree." So, if the artifact in question is referring to the borders of the table, one could argue that the table borders don't have semantic significance and so they should not be in the structure tree but instead should be artifacted "by enclosing it in a marked-content sequence with the tag Artifact" (14.8.2.2.2).

If, however, the artifact in question is a shaded box, for example, around the text of the Caption, and does have some semantic significance, then, referring, at least, to the versions of ISO 32K and the Megatable/Annex L that I have, the Artifact structure element can basically have anything (other than Ruby or Warichu, and maybe Part or Div, depending) as its parent. So, for reading order purposes (the original intent of this comment), the Caption tag could contain, inside it, an Artifact tag in addition to the text of the caption.

Or, maybe I'm misunderstanding the whole point of "used to wrap the draw operations" (above).

faceless2 commented 3 years ago

I believe all the options you describe are equally valid - it really depends on whether the table border/background is considered significant or not. Much like for the background-color etc attributes in UA, that's an author decision, and as ISO32K has no comment on this, that aspect isn't a concern. It's specifically the restriction on order in table 372 I have an issue with.

(although I'm surprised Ruby/Warichu can't have an Artifact child! That doesn't seem right)

mrbhardy commented 3 years ago

I think there's also a question as to whether the borders of the table can validly be included in the table. In reality, having them inside is valuable in the reuse case because if I want to reuse that table elsewhere, I have sufficient information to "extract" that piece of content, including a fixed visual appearance.

mrbhardy commented 3 years ago

@faceless2 for the Ruby, we just didn't feel confident enough that we were sufficiently expert on how Ruby was used to understand whether we should or should not allow other tags inside when we had no way to understand how best to move them to a different form. We erred on the side of caution for Ruby and Warichu.

mrbhardy commented 3 years ago

Thinking about this some more and the use cases I'm aware of for this, I would say that we should honor caption order no matter what. The rule about first or last applies and Artifacts don't get a free pass. That is consistent with how the specification is written today, since that is what it says specifically.

I would then say it is perfectly acceptable to put the border drawing operations in the structure tree as Artifacts or to leave them out and use Artifact MCS. Ordering in the tree, other than for captioning, probably isn't that critical, because the primary use case for including them is for extracting the visual appearance of the table excluding any elements that do not participate in its drawing (so, for example, if I have a different background that isn't material to the viewing of the table, it would likely be excluded for extraction; this is of course a processor choice).

PaulRayius commented 3 years ago

"I think there's also a question as to whether the borders of the table can validly be included in the table. In reality, having them inside is valuable in the reuse case because if I want to reuse that table elsewhere, I have sufficient information to 'extract' that piece of content, including a fixed visual appearance."

This is a good point, Matthew. But what about this... Should the borders, etc., be "attached" (or "associated") with the whole table? Or, should the cell borders be in Artifacts within their own rows? What if someone wants to reuse part of the table but not the whole thing - but they need to reuse the cell borders, etc., in addition to that row's data, etc.? Maybe the right way to handle this is to artifact them by row? Also, regarding reading order when it comes to line numbers, etc., 14.8.2.2.2 says "The purpose of the Artifact structure element type is to accommodate artifact content in cases that have positional context relative to real content within the structure tree." And it gives the example of line numbers. But, maybe one could argue that the borders in a table fit this case, too. And, if so, perhaps, then, they should go in their respective rows, not in the table as a whole?

Of course, as I was writing this (above) Matthew posted another comment...

faceless2 commented 3 years ago

As a question of best practice perhaps for the reuse group, sure. But this particular issue was raised solely as a query over the text in ISO32K. I presume everyone agrees that there's nothing in ISO32K that disallows table backgrounds/borders etc. inside an Artifact.

FWIW I'm not sure I agree with the reasoning. We can put an Artifact inside a <Table> alongside the <Tr> elements, so it's pretty clear it has no impact whatsoever on the table layout algorithm and should be transparent to it. It seems very arbitrary that the only restrictions on Artifacts are that they can't be inside a Ruby, nor precede or follow a Caption.

But I only raised this because it seemed inconsistent and therefore possibly an oversight, not because it's hugely important. If it's this way by design, feel free to close this issue.

petervwyatt commented 3 years ago

@mrbhardy Could you please summarize the outcome of discussions - is there an identified proposed resolution for ISO 32000-2:2020? (could be as mild as an informative note!)

DuffJohnson commented 3 years ago

IMO artifact structure elements should indeed be ignored for child ordering rules and this is by design; we never intended to constrain their use in ordering terms at all. So "...except for Artifacts." should indeed, per Mike, be added to table 372 for Captions only.

petervwyatt commented 3 years ago

So summarizing for everyone and so we can hopefully conclude this issue with proposed solutions:

Hopefully this will help all future readers understand both these subtle points.

PaulRayius commented 3 years ago

"Table 372 for Captions, change to "The Caption shall be the first or the last structure element inside its parent structure element, except for Artifacts.""

However, if, for example, a remediator says, "Ok, I can put this Caption wherever I want inside this Artifact..." how would that impact reflow, for example?

If it won't, negatively, then I'm fine with it. I just want to make sure we aren't forgetting about that kind of situation (or others).

faceless2 commented 3 years ago

@PaulRayius it took me a few readings of your comment before this clicked.

The intent of "The Caption shall be the first or the last structure element inside its parent structure element, except for Artifacts." is actually "The Caption shall be the first or the last structure element inside its parent structure element, skipping over any Artifacts that are siblings of the Caption for this test".

And I think you've read it as "The Caption shall be the first or the last structure element inside its parent structure, unless that parent is an Artifact in which case it can go wherever". Which hadn't occurred to me at all, but yeah, it's an equally valid reading in English. Clearly a language designed by a non-programmer.
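
The two readings can be sketched as two different predicates. This is only an illustration: `Elem`, `reading_skip_siblings` and `reading_exempt_artifact_parent` are hypothetical names, and whether a Caption may appear under an Artifact at all is outside the scope of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Elem:
    """Hypothetical stand-in for a structure element: a type plus its kids."""
    stype: str
    kids: List["Elem"] = field(default_factory=list)


def _first_or_last(kids: List[Elem]) -> bool:
    """True if every Caption in `kids` sits at the first or last position."""
    last = len(kids) - 1
    return all(i in (0, last) for i, k in enumerate(kids) if k.stype == "Caption")


def reading_skip_siblings(parent: Elem) -> bool:
    """Intended reading: ignore Artifact siblings, then apply first-or-last."""
    return _first_or_last([k for k in parent.kids if k.stype != "Artifact"])


def reading_exempt_artifact_parent(parent: Elem) -> bool:
    """Alternative reading: the rule is waived when the parent is an Artifact."""
    return parent.stype == "Artifact" or _first_or_last(parent.kids)


# Caption preceded by an Artifact sibling inside a Table.
a = Elem("Table", [Elem("Artifact"), Elem("Caption"), Elem("TR")])
# Caption in the middle of an Artifact parent.
b = Elem("Artifact", [Elem("P"), Elem("Caption"), Elem("P")])

print(reading_skip_siblings(a), reading_exempt_artifact_parent(a))  # True False
print(reading_skip_siblings(b), reading_exempt_artifact_parent(b))  # False True
```

The two predicates disagree on both trees, which is why the wording matters for a validator.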

I don't think either affects reflow; for the second case, as I understand it, an Artifact is effectively just a "box of stuff", with no useful meaning assigned to its content. How it's reflowed is equally undefined, I presume, and if that's a problem I'd suggest it belongs in another issue rather than this one.

petervwyatt commented 3 years ago

To ensure compatibility with PDF 1.7:

NOTE If an Artifact structure element is present, and needs to be associated with a Caption, then the Artifact structure element needs to be a descendent of the Caption.
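
As a rough illustration of what the NOTE implies for a tree, the sketch below compares an Artifact placed as a sibling of the Caption with one nested inside it, using a literal first-or-last check. `Elem` and `caption_ok_strict` are hypothetical names, and the assumption that a checker applies the rule literally is mine, not something stated in the NOTE.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Elem:
    """Hypothetical stand-in for a structure element: a type plus its kids."""
    stype: str
    kids: List["Elem"] = field(default_factory=list)


def caption_ok_strict(parent: Elem) -> bool:
    """Assumed literal check: every Caption child is the first or last child."""
    last = len(parent.kids) - 1
    return all(
        i in (0, last)
        for i, kid in enumerate(parent.kids)
        if kid.stype == "Caption"
    )


# Artifact as a sibling in front of the Caption.
sibling = Elem("Table", [Elem("Artifact"), Elem("Caption"), Elem("TR")])
# Artifact moved inside the Caption, as the NOTE describes.
nested = Elem("Table", [Elem("Caption", [Elem("Artifact")]), Elem("TR")])

print(caption_ok_strict(sibling))  # False
print(caption_ok_strict(nested))   # True
```

Nesting keeps the Caption in first position, so the ordering requirement is met without any exception for Artifacts.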

Approved by PDF TWG.

petervwyatt commented 3 years ago

Above NOTE is added only to the first row in Table 372, but I also noticed that the requirement "If a Caption is present, it shall be either the first or last child..." is also repeatedly stated in Table 370 (List) and Table 371 (Table). The new NOTE was not added to either of these.