pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Annex L and the Table Structure element #83

Closed PaulRayius closed 3 years ago

PaulRayius commented 3 years ago

Per Annex L, the Table structure element is excluded from the list of structure elements that can be children of the P structure element and P is excluded from the list of structure elements that can be a parent of Table.

I think these omissions should be addressed and Table should be allowed to be a child of the P structure element.

As such, in the P section of Annex L, Table should be indicated as a valid child of P and, in the Table section, P should be listed as a valid parent with 0..n in both instances.

OR, in Annex L, use the "double dagger" symbol indicating "for containment rules, refer to the respective structure element type’s description." But, if that's the solution we use then please refer to my closed ticket #67.
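To make the request concrete, here is a minimal Python sketch of the two-sided bookkeeping involved. The dictionary is a hypothetical, heavily abridged stand-in for the Annex L table (it is not a transcription of it), the helper names are invented for illustration, and the `0..n` string simply mirrors the occurrence notation mentioned in this ticket.

```python
# Hypothetical, abridged stand-in for the Annex L containment table.
# Keys are (parent, child) pairs; values are the permitted occurrence range.
# Only relationships mentioned in this thread are modelled.
ALLOWED = {
    ("P", "Figure"): "0..n",
    ("P", "Formula"): "0..n",
}


def propose(rules, parent, child, occurrence="0..n"):
    """Return a copy of the rules with one extra parent/child pair allowed."""
    updated = dict(rules)
    updated[(parent, child)] = occurrence
    return updated


def children_of(rules, parent):
    """The 'valid children' view for one parent type."""
    return sorted(c for (p, c) in rules if p == parent)


def parents_of(rules, child):
    """The 'valid parents' view for one child type."""
    return sorted(p for (p, c) in rules if c == child)


proposed = propose(ALLOWED, "P", "Table")
print(children_of(proposed, "P"))
print(parents_of(proposed, "Table"))
```

Because each rule is stored once as a pair, the "children of P" view and the "parents of Table" view are derived from the same entry and cannot drift apart, which is the consistency this ticket asks for in both sections.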

DuffJohnson commented 3 years ago

I don't recall ever seeing one of these. Can you provide an example of a "table within a paragraph"?

car222222 commented 3 years ago

Another way of looking at this is: why should a Table element be treated differently to a Figure or Formula element?

Aside: in fact the rules allow one to 'hide a Table' inside a Figure and hence 'smuggle it into' a P.
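The aside can be illustrated with a small reachability check. This is only a sketch under stated assumptions: `RULES` is a hypothetical three-entry excerpt that models just the relationships named in this comment, and `nesting_path` is an invented helper, not part of any PDF library.

```python
from collections import deque

# Hypothetical, abridged containment rules: parent -> permitted children.
# Annex L itself is far larger and is not reproduced here.
RULES = {
    "P": {"Figure"},
    "Figure": {"Table"},
    "Table": set(),
}


def nesting_path(rules, ancestor, descendant):
    """Breadth-first search for a chain of permitted parent/child steps."""
    queue = deque([[ancestor]])
    seen = {ancestor}
    while queue:
        path = queue.popleft()
        for child in sorted(rules.get(path[-1], ())):
            if child == descendant:
                return path + [child]
            if child not in seen:
                seen.add(child)
                queue.append(path + [child])
    return None  # no permitted nesting exists


print("Table" in RULES["P"])              # is Table a direct child of P?
print(nesting_path(RULES, "P", "Table"))  # is there an indirect route?
```

Under these assumed rules the direct check fails while the search still finds a route through Figure, which is the sense in which a Table can be "smuggled" into a P.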

PaulRayius commented 3 years ago

I don't recall ever seeing one of these. Can you provide an example of a "table within a paragraph"?

Not without showing client documents but it does happen. If we require a visual aid then I could make something up but is this necessary?

PaulRayius commented 3 years ago

Another way of looking at this is: why should a Table element be treated differently to a Figure or Formula element?

Right. Or List?

Aside: in fact the rules allow one to 'hide a Table' inside a Figure and hence 'smuggle it into' a P.

Sneaky! There are, of course, other issues with this but, with this concept, you're right. One could put the Table inside some "other" tag (that doesn't require Alt text, for example) and then "smuggle" it in. I applaud the "outside the box" thinking but, for the record, I vote "No" on making this the "solution"!

car222222 commented 3 years ago

Or List?

Well maybe List has a somewhat different feel to it?
And there are already plenty of questions about the complexity of the many relationships between lists and paragraphs! So maybe best not to go down that path?

My "solution" of hiding Tables was not a serious suggestion but just an aside to help forward the argument that Table in P is semantically similar (or even identical) to Figure or Formula (as far as the P is concerned).

DuffJohnson commented 3 years ago

Or List?

Well maybe List has a somewhat different feel to it?

This is my sense. I have no difficulty conceiving of a list "within" a para. But I've never seen a table within a para and I just... don't believe in it, conceptually. A table is too "block-like" for that. The notion makes me queasy in the same way as if we were discussing a Table as a child of an H2.

And there are already plenty of questions about the complexity of the many relationships between lists and paragraphs! So maybe best not to go down that path?

I'd put it like this: that a list seems reasonable within a paragraph doesn't imply, to me, that a table is likewise reasonable, any more than, say, a heading.

My "solution" of hiding Tables was not a serious suggestion but just an aside to help forward the argument that Table in P is semantically similar (or even identical) to Figure or Formula (as far as the P is concerned).

How would you describe the relevant characteristics for determining "semantic similarity as far as the P is concerned"?

DuffJohnson commented 3 years ago

I don't recall ever seeing one of these. Can you provide an example of a "table within a paragraph"?

Not without showing client documents but it does happen.

With a little (not that much) searching I have been unable to find either examples or discussion of people trying to do this.

If you have an example customer file handy, perhaps you could redact the text sufficiently to make it shareable...?

bdoubrov commented 3 years ago

I think we are confusing two things here: a table being part of a paragraph, and a table having inline placement.

I went through a few LaTeX-based articles and quickly found an example of a list with block placement, which is a part of a paragraph, at least from the LaTeX point of view. See

https://arxiv.org/pdf/2104.09188.pdf

the bulleted list on page 11. One could argue that the text after the list starts a new paragraph. But there is no paragraph indent present. So, I believe that from the author's point of view this list is a part of a big paragraph.

And I think one can easily imagine a similar example of a table with block placement, which is a part of a paragraph.

DuffJohnson commented 3 years ago

https://arxiv.org/pdf/2104.09188.pdf

the bulleted list on page 11. One could argue that the text after the list starts a new paragraph. But there is no paragraph indent present. So, I believe that from the author's point of view this list is a part of a big paragraph.

And I think one can easily imagine a similar example of a table with block placement, which is a part of a paragraph.

For me, it's very easy to see that list as part of the paragraph on the basis that its content could easily have been included in paragraph text; the chosen appearance is purely cosmetic. I'm unconvinced that tables are sufficiently similar to lists to be regarded the same way. Perhaps because tables aren't intended to be consumed in a linear fashion, they break (for me) some basic sense of "paragraph". This is why I'd love to see an example.

bdoubrov commented 3 years ago

arxiv.org provides an unlimited number of examples. Here is one with a table inside a paragraph:

https://arxiv.org/pdf/2105.08377.pdf - page 14.

DuffJohnson commented 3 years ago

Thanks! I'm not sure how I would know whether this table occurs within the preceding paragraph or following it... but the example is interesting.

What is the value of including the table within the P in this case? I can see an argument that the preceding para and the table are a piece for some reuse purpose... but if so, is the suggestion that ALL cases in which paragraph text concludes with a colon and a following table should ALWAYS be tagged this way?!?

In other words, what's correct and how would we know? And is it worth a provision from a PDF/UA-2 POV?

car222222 commented 3 years ago

@DuffJohnson Boris' example is a rather special case where the table element occurs at the end of a sentence (and the paragraph).

If you look for subject matter that contains other "tables", such as matrices or Young Tableaux, these will be naturally represented/tagged as Table elements in mid-sentence positions. These tables appear mid-sentence because, just like the many uses of Figure or Formula as semantic constituents of a sentence, their role is identical to that of a word or phrase.

Thus excluding Table from sentences and paragraphs is pretty much the same as excluding Figure or Formula or even words (maybe with a Span, Strong etc if tags are mandatory).

car222222 commented 3 years ago

Further notes:

Table is not in '4.8.4.5 Block level structure types' so in our context they should not be considered as block-like, either inherently or legalistically.

Some tables are indeed intended to be consumed in a linear fashion (or, more usefully in practice, using multiple distinct linearisations). These tables at least will appear naturally in sentences and "basic paragraphs".

DuffJohnson commented 3 years ago

If others are telling me that tables can be sensibly considered as semantically identical to words or phrases in a sentence... then who am I to argue! :-)

OK, so the proposed change is to amend Annex L to allow Table elements as children of P, yes? If you think this is right, vote for this comment.

MatthiasValvekens commented 3 years ago

+1

Maybe it's just my LaTeX bias talking, but I consider tables and figures mostly interchangeable for the purposes of document structure. If figures can be part of a P, then we should treat tables the same way IMO.

MatthiasValvekens commented 3 years ago

Let me echo one of @PaulRayius's points made at the PDF Reuse TWG meeting today, because I agree with it: it's a fact of life that some authors include tables in paragraphs, or even mid-sentence. We're not just talking about floating tables (that's an entirely different discussion), but about cases where the position of the table within the paragraph actually matters.

We can argue all day long about whether that's good editorial practice, but it's something that authors do. That being the case, we should allow it to be encoded as such in the structure tree (esp. from a remediation point of view, where rewriting the text may not be an option).

In addition to that, I also don't believe that the nonlinearity of tables is a good argument against allowing tables inside Ps. Formulas can occur inside paragraphs, and those are often nonlinear as well. Here's a somewhat involved example from my own work: https://i.imgur.com/rAtsZbf.png. Not saying that such use cases are relevant to a significant portion of our user base, but they're by no means rare. You'll find many more extreme examples in any homological algebra textbook ;)

faceless2 commented 3 years ago

+1 from me too. Here's another example, clearly mid-sentence:

image

Matthew's correct that you can't place a <table> inside a <p> in HTML, but it's worth noting that this decision was made some 30 years ago. It predates display: inline-table, which has been around for at least 20 years, and the (fairly common) use of the float property on a table to move it to one side of a paragraph. I'm not sure that decision would be made the same way today.

It's a classic trap for young players in HTML, and anyone looking for advice is told to use a <div> instead of a <p>, which is what everyone does. Semantically it's a worse choice, and it's also not an option available to us in PDF, because <div> is not a general block container; it has special meaning. I don't think we should repeat this mistake if we can avoid it.

DuffJohnson commented 3 years ago

OK, I concede!

In the end I can't come up with much better than non-linearity to explain why I feel table differs from list, figure and formula, in this context, nor do I find comfort in the rather spare definitions of the respective structure element types. The examples have been instructive; thank you.

So, fine, change Annex L....

petervwyatt commented 3 years ago

@DuffJohnson since it looks like everyone has converged (???), do you mind summarizing the actual precise wording change(s) that will get done at the end of this rather long thread? Then we can review and hopefully resolve at the next PDF TWG. Thanks.

PS. I found the real-world examples extremely helpful for my understanding so thanks @faceless2 and @MatthiasValvekens !

car222222 commented 3 years ago

Just a perhaps unnecessary repetition that is relevant to Matthew's raising the comparison with HTML.

Note that in PDF, Table is explicitly not categorised as a block-level element. Thus comparison with its usage rules in HTML is not so straightforward.

To be more precise: Table is not listed in '4.8.4.5 Block level structure types'.

car222222 commented 3 years ago



As I commented earlier, there is already a major difference in Table between PDF and HTML: in PDF, Table is not categorised as a block-level element.

On 11 Jun 2021, at 00:13, Matthias Valvekens @.***> wrote:

You'll find many more extreme examples in any homological algebra textbook ;)

car222222 commented 3 years ago

Two more points to emphasize those by @MatthiasValvekens :

Formulas can occur inside paragraphs, and those are often nonlinear as well.

Also, formulas can contain Table elements.

You'll find many more extreme examples in any homological algebra textbook ;)

Also, even larger numbers of such examples occur in high-school textbooks on computing, economics, business and all of STEM. Often they are called 'matrices' in the surrounding text.
Such usage is simply ‘common parlance’ in any form of ‘technical language’.

Not directly relevant, but I would like to reiterate that small examples of such tables are normally rendered visually 'inline' with the words of the paragraph, rather than in a block- or display-style. Do we need to exhibit real world examples of such inline tables?

faceless2 commented 3 years ago

I was unable to find an example of a visually inline table, and I had a good look. Probably a good thing - it's a terrible way to lay out content in my opinion, but I agree it's certainly possible.

Re. the point that Table is not a block-level element, I was stumbling over my words on the call the other day but what I was trying to get at is that these terms are, unfortunately, hugely overloaded.

CSS (my other working group) has the concept of an "outer" and "inner" layout model: how an element appears to its parent, and how it lays out its children. The table in my example above is a "block table" - externally it's a block, laid out full width below its previous sibling, but internally a table with children in rows/columns. And as Chris said, it's equally possible to put a table on the same line as surrounding text, making it an "inline table".

HTML+CSS need this distinction for layout, and the term block-level element in CSS is specifically describing the outer display aspect. So yes, tables are block-level elements by default, but <table style="display:inline-table"> is not - it's a flow (aka inline) level element.

TL;DR is that we need to be careful comparing the word "block" between HTML and PDF.

PDF doesn't care about layout, so any distinction between group, block and inline feels arbitrary to me. In fact, on this topic I'd like to make an argument for aside to be allowed as a valid child of p as well, for the same reason we allow tables, figures, formulae and lists.

image

That, to me, looks like an aside ("content that is distinct from other content within its parent structure element" and "A callout element") inside a p, but Annex L does not allow this relationship. I'm happy to raise this as another issue to keep this one on topic, but I believe all the arguments made here for table apply equally to aside.

MatthiasValvekens commented 3 years ago

Hmmm, wouldn't that particular example be better with the surrounding paragraph and the aside as sibling elements in the structure tree? In this case, the fact that the text floats around the aside is an artifact of the layout in my opinion. It doesn't represent the way the surrounding text was intended to be read, for one. The precise position of the aside in the surrounding text is also not really relevant to the meaning the author is trying to convey.

I think this falls into the same category as "floating" tables and figures, and I don't think those are quite the same as the ones we've been discussing in this thread (but YMMV).

MatthiasValvekens commented 3 years ago

On the issue of inline tables: some algebra textbooks occasionally inline expressions involving matrices. I don't particularly like that either, since it messes up the inter-line spacing. Additionally, I'm not quite convinced that Table is the right structure element for those, actually, but that's beside the point IMO.

The broader issue here is that we're tagging semantics, not layout, so whether the table is "truly" inlined, or rendered in "display style" (to borrow a term from the TeX world, cf. the examples @faceless2 and I referenced a few posts ago) shouldn't be directly relevant to the kind of tags you put in the structure tree. For layout attributes: sure, but I don't think it matters for the purposes of the way we select & nest structure element tags. (EDIT: on reflection, I think @faceless2's last comment actually says the same thing --- I'm not disagreeing with that aspect of it).

faceless2 commented 3 years ago

(ah yes, display and inline from TeX was the other terminology I was reaching for. Thanks!)

re. is it a sibling, yes I agree that's the alternative. But if the position within the paragraph is relevant to the aside, as it might be if you had (for example) a "fun fact" or "did you know?" type of callout - a digression relating to a particular sentence in the text - then forcing it to be a sibling would lose that context of where in the paragraph it relates to.

I agree we should not be dictated to by the layout, but we shouldn't ignore it either. We have FENote, for example, which specifically refers to a similar type of semantic content but which is clearly based on a layout concept, and which is allowed inside p. We could use FENote here for the type of case I'm describing too, I suppose, but to me, Aside seems to be a better fit.

MatthiasValvekens commented 3 years ago

OK, I think I see what you're getting at. But you're right, it's perhaps better to factor out that discussion into another thread. This one is getting long enough already ;)

DuffJohnson commented 3 years ago

As I commented earlier, there is already a major difference in Table between PDF and HTML: in PDF, Table is not categorised as a block-level element.

Table was block-level in PDF 1.7 and earlier. PDF 2.0 is where this was changed.

To answer Peter's question, the exact change is simply:

"Change the mega table in Annex L to allow Table as a child of P."
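As a rough illustration of what that one-line change means for a tagged document, the sketch below validates a tiny structure tree against two rule sets. Everything here is an assumption made for illustration: `CURRENT` and `PROPOSED` are hypothetical, abridged excerpts rather than the Annex L table itself, the `(type, children)` tuples are an invented stand-in for a real structure tree, and `violations` is not an API of any PDF tool.

```python
# Hypothetical, abridged rules: parent -> permitted child structure types.
# TR and TD are the standard table row and cell types; everything else in
# Annex L is omitted for brevity.
CURRENT = {
    "P": {"Figure", "Formula"},
    "Table": {"TR"},
    "TR": {"TD"},
    "TD": set(),
}
PROPOSED = {**CURRENT, "P": CURRENT["P"] | {"Table"}}


def violations(node, rules):
    """Walk a (type, children) tree and list disallowed parent/child pairs."""
    kind, children = node
    found = []
    for child in children:
        if isinstance(child, str):  # plain content, not a structure element
            continue
        if child[0] not in rules.get(kind, set()):
            found.append((kind, child[0]))
        found.extend(violations(child, rules))
    return found


# A paragraph whose sentence runs into a table and continues after it.
table = ("Table", [("TR", [("TD", ["a"]), ("TD", ["b"])])])
paragraph = ("P", ["The matrix ", table, " is invertible."])

print(violations(paragraph, CURRENT))
print(violations(paragraph, PROPOSED))
```

With the assumed current rules the mid-sentence table is flagged as a disallowed child of P; with the proposed rules the same tree passes unchanged, so nothing about the tagging itself has to be rewritten.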

petervwyatt commented 3 years ago

PDF TWG recommends passing this to the ISO 32005 Discussion Group to make this change in the 32005 extension document (as the TWG acknowledges this is a normative correction). Note that the XLSX and PDF attachments will also need updating.

@mrbhardy - please speak with @DuffJohnson regarding the PDF TWG discussion

DuffJohnson commented 3 years ago

I don't understand why this change only goes to 32005 and doesn't also stay as correction-worthy errata for 32k; after all, there are other normative corrections. But if it does go there then @mrbhardy is the plank-owner... :-)

petervwyatt commented 3 years ago

Marking as "proposed solution" so this gets discussed in the next PDF TWG Meeting - specifically, whether anything needs to be done for this in the ISO 32000-2:2020 errata process, or only as part of ISO 32005.

petervwyatt commented 3 years ago

PDF TWG discussed this and, although it is acknowledged as a "breaking change", this erratum should be fixed and taken forward to ISO for the next part of ISO 32000. PDF TWG approves adding an informative note.

petervwyatt commented 3 years ago

The XLSX attachment that is also hosted as part of these errata (see Issue #64) will also be updated as well as the note being added to Annex L.