pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Annex L, ISO TS 32005 and all related XLSX/PDFs have undocumented "c" entry in megatable. #440

Closed petervwyatt closed 4 months ago

petervwyatt commented 5 months ago

Everything derived from the Annex L megatable has an undocumented "c" entry (that is, one not described in "Table Annex L.1 - Legend for Table L.2") where Structure Type = WP and Child = Figure, and correspondingly where Structure Type = Figure and Parent = WP.

This issue is in many many places:

Other Ruby and Warichu related cells have "[a]" and "[b]" which are documented, but "c" (which probably should be "[c]"!) is undocumented everywhere. So what is "c" supposed to note? Is it the same as "[b]"?

@DuffJohnson noted that the megatable did NOT have that “c” back in January 2019… but by May 2019 it had it. He could not find any mention of this change in any of the Comments records of that era.
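
A check for entries like this can be sketched in Python. This is a minimal sketch, not the real Annex L data: the `LEGEND` set and the sample cells are assumptions made up for illustration (only the WP/Figure "c" cell is taken from this report), and `find_undocumented` is a hypothetical helper name.

```python
# Hypothetical legend symbols; the real Table L.1 may list different ones.
LEGEND = {"∅", "0..n", "1..n", "[a]", "[b]"}

# Tiny made-up excerpt of the megatable: (parent, child) -> cell value.
# Only the ("WP", "Figure") cell reflects what this issue reports.
SAMPLE_CELLS = {
    ("Warichu", "WP"): "[b]",
    ("WP", "Em"): "0..n",
    ("WP", "Figure"): "c",
}

def find_undocumented(cells, legend):
    """Return sorted (parent, child, value) triples whose value is not in the legend."""
    return sorted(
        (parent, child, value)
        for (parent, child), value in cells.items()
        if value not in legend
    )

for parent, child, value in find_undocumented(SAMPLE_CELLS, LEGEND):
    print(f"{parent} -> {child}: undocumented entry {value!r}")
```

Run against a full export of the spreadsheet, a scan like this would also surface any other symbols that are used in cells but missing from the legend.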

petervwyatt commented 5 months ago

I cannot find any wording anywhere that marries Figure and WP. Is "c" the same as "[b]" or is it not needed at all?

petervwyatt commented 5 months ago

Given what this cell is, and that WP can only be below Warichu, I think the correct answer is that "c" should be "∅"

car222222 commented 5 months ago

But here is a far more radical observation:

In order to be compatible with Table 369 in Clause 14.8.4.7.3, the content model for WP should surely be restricted to “exactly one content item” (this also applies to RP).

petervwyatt commented 5 months ago

I think you must be misreading Annex L: all the "[a]" and "[b]" footnotes indicate those special rules from 14.8.4.7.3 for the children of Ruby and Warichu structure elements, and the parents of the RB, RT, RP, WP, and WT structure elements on pp. 962-964 of ISO 32000-2:2020.

The issue here is that WP can have 0..n children of types NonStruct, Sub, Em, Strong, etc., as well as content items. That all seems fair enough, as I cannot see anything in 14.8.4.7.3 that otherwise limits what can occur inside a WP (which is itself inside a Warichu). This list strangely states "c" when Figure is a child of WP, so this will either be 0..n, if that makes sense, or Figure should not be mentioned (meaning it's not allowed).

PS. I am slightly assuming that the reasoning for Form also applies to Figure (although I also note that Formula is not mentioned as a viable child of WP).
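
To make the two candidate readings concrete, here is a minimal Python sketch of how an occurrence entry could be checked against an actual child count. The function name `count_allowed` is hypothetical, and reading a missing entry (or "∅") as "not allowed" is an assumption based on this thread rather than on the legend's exact wording.

```python
import re

def count_allowed(entry, count):
    """Check a child count against an occurrence entry such as "0..n" or "1".

    entry is None when the table has no entry for the child type, which
    (like "∅") is read here as "not allowed".
    """
    if entry is None or entry == "∅":
        return count == 0
    match = re.fullmatch(r"(\d+)(?:\.\.(\d+|n))?", entry)
    if match is None:
        # Footnote markers such as "[a]", or the stray "c", carry prose
        # rules that a simple range check cannot express.
        raise ValueError(f"cannot interpret entry {entry!r}")
    low = int(match.group(1))
    upper = match.group(2)
    if upper is None:
        high = low    # a bare number means exactly that many
    elif upper == "n":
        high = None   # unbounded
    else:
        high = int(upper)
    return count >= low and (high is None or count <= high)
```

Under this sketch, a WP with one Figure child passes if the cell is "0..n" and fails if the cell is absent, while the current "c" raises an error because it cannot be interpreted at all.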

car222222 commented 5 months ago

Yes, I can see that it is not essential to restrict the content of WP.

Maybe the problem is the lack of clarity in Table 369. Is it consistent that WP can contain any such substructures, when the result must be "only punctuation"?
I guess it depends on what is meant by "punctuation".

So should Table 369 make it clearer that the "punctuation" that forms the sole content of a WP can, structurally, consist of this range of "inline substructures", just like WT? Note that WT (in contrast with RT) explicitly states that the content model includes such a range of inline material.

u-fischer commented 5 months ago

A Figure element can be either block or inline. According to 14.8.4.1 General, when used inside an inline element like WP it is by default a block element (odd, but that seems to be what the spec says), and that is probably not wanted here. So perhaps the "c" was meant to clarify that Figure is allowed, but only as an inline element, with a forced Placement=inline attribute and without block substructures like P etc.

To determine the category that is applicable to a structure element that may either be a block level structure element or an inline level structure element, the following applies:
• If the structure element is used inside a block level element, it is an inline level structure element
• In all other cases it is a block level structure element.
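
The quoted rule can be written as a tiny Python function. This is only a sketch: modelling the parent's category as a plain string is an assumption, `resolve_category` is a hypothetical name, and the function covers the default rule as quoted, not any override via an explicit Placement attribute.

```python
BLOCK = "block"
INLINE = "inline"

def resolve_category(parent_category):
    """Category of a dual-nature element (such as Figure), given its parent's category."""
    # Quoted rule: inside a block level element it is inline level;
    # in all other cases it is block level.
    return INLINE if parent_category == BLOCK else BLOCK

# WP is treated as an inline level element here, as in the comment above.
print(resolve_category(INLINE))  # prints "block"
```

This reproduces the odd result noted above: a Figure directly inside an inline-level WP defaults to block level.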

car222222 commented 5 months ago

Note also that "∅" is never, in fact, used in The Table, since this case is indicated by the absence of any entry for the type.

car222222 commented 5 months ago

There are other problems with the provisions for both Warichu and Ruby: e.g., "what is punctuation, and what is not?"; and "what is the semantic meaning of a content item as a child of either? Should this possibility be removed from The Table?"

But these need separate issues.

petervwyatt commented 4 months ago

> Note also that "∅" is never, in fact, used in The Table, since this case is indicated by the absence of any entry for the type.

And neither is "1..n". And "1" (by itself, for StructTreeRoot) is used and is not listed in Table Annex L.1.

car222222 commented 4 months ago

Agreed! I too have a list of such small strangenesses, and also some that may be significant.

Of greater concern are some of the current discrepancies between the provisions (both explicit and implied) for structure types in Clause 14.8.4 and the details of The Table in Annex L (and in 32005).

But these need separate issues.

petervwyatt commented 4 months ago

The wording "Table L.1 provides a legend for use in interpreting Table L.2." needs to be expanded to include the embedded PDF and spreadsheet since those artifacts do use "∅" and "1" (but not "1..n").

@car222222 - please create those issues ASAP since it would be good to sort out all Annex L megatable issues at one time!

car222222 commented 4 months ago

OK, I shall get onto these RSN, I hope!

Some will be in issues I post related to the upcoming revisions to 32005.

Others I shall post here.

DuffJohnson commented 4 months ago

In today's PDF/UA TWG we discussed the "c" at the intersection of WP (Parent) and Figure (child); "0..n" was proposed for the correction and there was no dissent.
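
Because the same cell appears in both the parent view and the child view, the correction has to land in both. A minimal Python sketch of applying it consistently follows; the two dictionaries are a made-up excerpt (only the WP/Figure cells come from this issue), and `set_cell` and `views_agree` are hypothetical helper names.

```python
# Made-up two-view excerpt; only the WP/Figure "c" cells come from this issue.
children_of = {"WP": {"Em": "0..n", "Figure": "c"}}
parents_of = {"Figure": {"WP": "c"}, "Em": {"WP": "0..n"}}

def set_cell(parent, child, value):
    """Write the same value into both views so they cannot drift apart."""
    children_of.setdefault(parent, {})[child] = value
    parents_of.setdefault(child, {})[parent] = value

def views_agree():
    """True when the parent view and the child view contain identical cells."""
    forward = {(p, c): v for p, kids in children_of.items() for c, v in kids.items()}
    backward = {(p, c): v for c, parents in parents_of.items() for p, v in parents.items()}
    return forward == backward

# Apply the correction proposed in the PDF/UA TWG.
set_cell("WP", "Figure", "0..n")
```

The same consistency check would apply to every derived artifact (the XLSX and the embedded PDFs), since each repeats the cell.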

petervwyatt commented 4 months ago

If PDF/UA TWG has agreed then PDF TWG does not need to re-agree. Thanks.

petervwyatt commented 4 months ago

Attachments still TBD: