Elements to ignore by TOC extraction algorithm?

iherman commented 5 years ago

The HTML TOC structure and extraction ignores sectioning content and hidden elements from the TOC. Is there a need to ignore others?

rdeltour commented 5 years ago

For one thing, I'm not even convinced all sectioning content elements should be ignored, especially given that section is more often than not used as a glorified div by web developers…

mattgarrish commented 5 years ago

especially given that section is more often than not used as a glorified div by web developers

For those who may not follow this issue, if a nav is subsectioned like this:

<nav role="doc-toc">
   <h2>Contents</h2>
   <section>
      <h3>Part 1</h3>
      ...
   </section>
   ...
</nav>

The algorithm will not extract anything, as we don't account for partitioning of navs.

It's not that complicated to treat each nested section/nav element as a branch of the toc (i.e., like an a tag without an href attribute), but it does bring in additional complexity. What happens if heading numbers aren't sequential? What happens if a section has multiple headings? What happens if there are headings and no sections? We almost have to implement the outline algorithm to extract a table of contents.

I don't like that we don't descend into subsections, but if we descend into them and ignore them as branches of the table of contents (i.e., we just go in search of lists), then do we concatenate the lists we find as though they all represent children? Does that make sense if their grouping heading is gone?

iherman commented 5 years ago

I am afraid this would make all this way too complicated for no major gain. While the section elements are indeed very important in general, the current TOC algorithm is not aimed at using the main body of the content as the source of the TOC, but using a dedicated structure.

A general, or even a somewhat restrictive, outlining algorithm goes, in my view, beyond what we should define. Authoring tools may impose their own content structure restrictions, and use a corresponding TOC generating algorithm, which can then be used as a WPUB TOC if needed (this is what respec does, after all).

iherman commented 5 years ago

(Interestingly, my previous comment may be used in favor of a TOC structure in JSON, see issue #376...)

mattgarrish commented 5 years ago

the current TOC algorithm is not aimed at using the main body of the content as the source of the TOC, but using a dedicated structure

I'm not suggesting we try to parse the contents of the publication, only that partitioning a toc into subsections isn't uncommon. It's often done inside a list with an unlinked label, but that's just one pattern. Inside each section we'd still expect a list of links:

<nav role="doc-toc">
   <h2>Contents</h2>
   <section>
      <h3>Part 1</h3>
      <ol>
          <li><a href="c1.html">Chapter 1</a></li>
          <li><a href="c2.html">Chapter 2</a></li>
          ...
      </ol>
   </section>
   <section>
      <h3>Part 2</h3>
      <ol>
          <li><a href="c10.html">Chapter 10</a></li>
          <li><a href="c11.html">Chapter 11</a></li>
          ...
      </ol>
   </section>
</nav>

Where it gets more complex is if people start using only headings, and don't use them consistently:

<nav role="doc-toc">
   <h1>Contents</h1>
   <h3>Part 1</h3>
   <ol>
       <li><a href="c1.html">Chapter 1</a></li>
       <li><a href="c2.html">Chapter 2</a></li>
        ...
   </ol>
   <h3>Part 2</h3>
   <ol>
      <li><a href="c10.html">Chapter 10</a></li>
      <li><a href="c11.html">Chapter 11</a></li>
      ...
   </ol>
</nav>

But just as we restrict what we accept now, we could probably restrict partitioning to the first example, and if you stray from that you can't expect a usable toc. That includes the case where you insert a section element and don't provide a heading: you just get a placeholder (like an li without an a but with a nested list of links).

The latter example requires constructing an outline first before the intention can be meaningfully extracted.

HadrienGardeur commented 5 years ago

(Interestingly, my previous comment may be used in favor of a TOC structure in JSON, see issue #376...)

Indeed and this is just the beginning of a very very long list of edge cases that we'll encounter...

If we move away from a sub-set of HTML like in EPUB, I see no end in sight to these discussions.

rdeltour commented 5 years ago

@iherman

I am afraid this would make all this way too complicated for no major gain.

In both Matt's examples above, making sections "transparent" would mean the algo would still extract something instead of bluntly ignoring the markup.

A general, or even a somewhat restrictive, outlining algorithm goes, in my view, beyond what we should define.

Yes, 👍 of course.

@mattgarrish

I'm not suggesting we try to parse the contents of the publication, only that partitioning a toc into subsections isn't uncommon.

Could we in this case merge the list descendants, as if they were one big list? I.e., we would totally ignore sections and headings, but at least the list content would get extracted? I'm aware this may not fully represent the author's intent in 100% of cases, but maybe better than nothing?
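
As a rough illustration of the "transparent sections" idea: the Python sketch below is not the algorithm in the draft, only a minimal stand-in that ignores section and heading elements and merges every link found inside a list item into one flat list. The names MergedListParser and merged_entries are made up for this example, it assumes well-formed markup with explicit closing tags, and it deliberately drops any nesting between lists to stay short.

```python
from html.parser import HTMLParser


class MergedListParser(HTMLParser):
    """Hypothetical helper: collects (label, href) for every <a> inside an <li>.

    <section> wrappers and headings are never inspected, so they are
    effectively transparent; text outside of links is dropped.
    """

    def __init__(self):
        super().__init__()
        self.entries = []      # merged result, in document order
        self._li_depth = 0     # > 0 while inside at least one <li>
        self._in_link = False  # True while inside a link being recorded
        self._href = None
        self._label = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._li_depth += 1
        elif tag == "a" and self._li_depth > 0:
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._label = []

    def handle_endtag(self, tag):
        if tag == "li" and self._li_depth > 0:
            self._li_depth -= 1
        elif tag == "a" and self._in_link:
            self.entries.append(("".join(self._label).strip(), self._href))
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self._label.append(data)


def merged_entries(html):
    """Return all list links in the markup as one flat list of (label, href)."""
    parser = MergedListParser()
    parser.feed(html)
    parser.close()
    return parser.entries


SECTIONED_NAV = """
<nav role="doc-toc">
   <h2>Contents</h2>
   <section>
      <h3>Part 1</h3>
      <ol>
          <li><a href="c1.html">Chapter 1</a></li>
          <li><a href="c2.html">Chapter 2</a></li>
      </ol>
   </section>
   <section>
      <h3>Part 2</h3>
      <ol>
          <li><a href="c10.html">Chapter 10</a></li>
      </ol>
   </section>
</nav>
"""

for label, href in merged_entries(SECTIONED_NAV):
    print(label, href)
```

With the sectioned example, the Part 1 and Part 2 groupings disappear from the result, but every chapter link survives in document order, which is the "better than nothing" trade-off described above.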

@HadrienGardeur

If we move away from a sub-set of HTML like in EPUB

What the algorithm does right now isn't much different from what EPUB Nav Doc was doing, it's just slightly more permissive and is specified for the UA rather than for the Author, which is better for interoperability.
But I'm not sure I understand what you mean by "moving away from a sub-set of HTML like in EPUB"; I think the question here is precisely about what subset of HTML we want to extract?

iherman commented 5 years ago

Trying to second-guess @HadrienGardeur's statement:

moving away from a sub-set of HTML like in EPUB

I guess the issue is whether we move further away from the (relatively) simple HTML structure that the current draft specifies for TOC. If we are not careful, we may end up with an unbounded number of structural variations, i.e., very complex structures and algorithms. As a general statement (if this is what @HadrienGardeur meant, that is), I agree with this.

Whether @mattgarrish's first example is within the bounds: maybe. The current algorithm would have to become a bit more complicated, because we have to account for the heading elements within the sections, too, which would probably mean that a heading element within a simple <li> would have to be accepted, too (otherwise we will have to keep track of whether we are within a section or not). If that can be done easily then... maybe it is fine.

But I would think we should not go in the direction of the second example. I.e., we should definitely stop there unless real and widely used cases come to the fore.

I must admit, however, that I was surprised by @mattgarrish's statement, whereby:

... partitioning a toc into subsections isn't uncommon.

I have never seen such a TOC structure myself so far on the Web (but it may be used in e-books, I cannot say). I definitely yield to others if such structures are really common. But if they are not, we should not complicate our lives further...

mattgarrish commented 5 years ago

I have never seen such a TOC structure myself

I don't follow here. EPUB used span to represent headings, so is there really any difference between the example I gave above and this:

<nav role="doc-toc">
   <h1>Contents</h1>
   <ol>
       <li>
          <div><a>Part 1</a></div>
          <ul>
             <li><a href="c1.html">Chapter 1</a></li>
             <li><a href="c2.html">Chapter 2</a></li>
          </ul>
        </li>
        ...
   </ol>
</nav>

EPUB forced you to follow the above pattern, but should we continue to rule out doing it with nested section/nav elements? It's certainly debatable that we should allow only one representation.

But, to be clear, all I'm saying is if we do this, we should maintain strict rules on what we allow. I would argue for the following if we do:

the author may partition the toc using nested section/nav elements
in these cases, each section/nav represents a branch (similar to an li)
each section/nav is labelled by the first numbered heading it contains (similar to how a is applied)
each section/nav may contain either one list or 1+ further subsections (same with li)

And that's all we'll recognize.

It will pose a question of whether to apply a or h# as the name, I agree, as I already ran into that in an earlier iteration of the working code. I'm not sure if the answer is just to accept either the first heading or a as the name, assuming headings inside list items will be a rarity, or just flag what is currently being processed (we wouldn't need a new stack or anything, since the name has to be found before any nested sections/lists begin).
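
To make the four rules above concrete, here is a minimal Python sketch. It is not the working code mentioned in this thread and not the algorithm in the draft. It assumes well-formed markup, treats every section, nav and li as a branch, and names a branch after the first heading or a element that is not inside a nested branch (one of the two naming options just described). The names BranchTocParser and extract_toc and the name/url/children dictionary shape are hypothetical, and the "one list or 1+ subsections" restriction is not enforced.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BRANCHES = {"nav", "section", "li"}  # each of these opens a toc branch


def new_node():
    return {"name": None, "url": None, "children": []}


class BranchTocParser(HTMLParser):
    """Hypothetical sketch: section/nav/li elements become nested branches."""

    def __init__(self):
        super().__init__()
        self.container = new_node()      # synthetic parent of the outer nav
        self._stack = [self.container]   # currently open branches
        self._capture = None             # text buffer while reading a name
        self._capture_tag = None         # element whose text is being read

    def handle_starttag(self, tag, attrs):
        if tag in BRANCHES:
            node = new_node()
            self._stack[-1]["children"].append(node)
            self._stack.append(node)
        elif tag in HEADINGS or tag == "a":
            current = self._stack[-1]
            # Only the first naming element of a branch is used.
            if self._capture is None and current["name"] is None:
                self._capture = []
                self._capture_tag = tag
                if tag == "a":
                    current["url"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if self._capture is not None and tag == self._capture_tag:
            self._stack[-1]["name"] = "".join(self._capture).strip()
            self._capture = None
            self._capture_tag = None
        elif tag in BRANCHES and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        if self._capture is not None:
            self._capture.append(data)


def extract_toc(html):
    """Return the branch for the outermost nav, or None if there is none."""
    parser = BranchTocParser()
    parser.feed(html)
    parser.close()
    children = parser.container["children"]
    return children[0] if children else None


PARTITIONED_NAV = """
<nav role="doc-toc">
   <h2>Contents</h2>
   <section>
      <h3>Part 1</h3>
      <ol>
          <li><a href="c1.html">Chapter 1</a></li>
          <li><a href="c2.html">Chapter 2</a></li>
      </ol>
   </section>
   <section>
      <h3>Part 2</h3>
      <ol>
          <li><a href="c10.html">Chapter 10</a></li>
          <li><a href="c11.html">Chapter 11</a></li>
      </ol>
   </section>
</nav>
"""

toc = extract_toc(PARTITIONED_NAV)
for part in toc["children"]:
    print(part["name"], [chapter["url"] for chapter in part["children"]])
```

In this sketch a section without a heading simply keeps a name of None, which corresponds to the placeholder branch described earlier. Because the name is only taken while no nested branch is open, no additional stack is needed to tell a section heading from a link inside a list item.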

TzviyaSiegman commented 5 years ago

@mattgarrish do we need to raise this to WG or is this for you to solve?

mattgarrish commented 5 years ago

This is a question of how many possible representations of the markup we want the toc algorithm to account for, but no one has complained (yet) about the existing algorithm's expectations so maybe it's a non-issue.

TzviyaSiegman commented 5 years ago

Thanks. We'll add it to discussion on Monday.

iherman commented 5 years ago

As the person raising the original issue: I am perfectly fine if the answer to the question:

The HTML TOC structure and extraction ignores sectioning content and hidden elements from the TOC. Is there a need to ignore others?

is a "no", and we close this issue with no further action.

iherman commented 5 years ago

This issue was discussed in a meeting.

No actions or resolutions
Tzviya Siegman: Issue 378
Tzviya Siegman: The last comment - how many possible ways do we want the TOC to account for. I apologize, I didn’t add this to the agenda so people may need time to think.
Ivan Herman: I plead guilty - I was the one who raised this issue, but looking at all the discussion, I am perfectly fine closing the issue with “no further action” and my question should be deemed as - unnecessary
Benjamin Young: my question is about document structure - generally related to the TOC and processing and where those should live. The answer could be “tune in next week.” Is the core piece - the manifest thing - is the data model, and how do you get the spec?
… how do you end up with the data model?
Tzviya Siegman: We’re going to talk about the overall document structure next week. As for this - let’s bring it back to GitHub and discuss when Matt has a microphone.
Dave Cramer: This seems like the classic issue where we’re not going to know what needs to happen until we try a bunch of stuff and things go wrong. It’s hard to imagine a bunch of theoretical TOCs.
Ivan Herman: That’ll take a bit of time. The other way around would be to close this to give us peace of mind, and if we hit problems later, take them as they come…
Dave Cramer: this is why we have CR and implementation experience.
Matt Garrish: What we have right now mimics closely what we have in epub. Do we need to expand it more? It has worked well so far. Maybe we can live with it - it’s something we need actual implementation data on…
… it’s probably something we can close off until we have something specific to deal with or let it go dormant.
Tzviya Siegman: I have anecdotal evidence that people want more, but I can put that in the issue.

TzviyaSiegman commented 5 years ago

Anecdotal evidence about restrictions of EPUB (note that I am not saying we should include these things in WPUB, just offering stories). Some of these can be addressed by best practices.

Authors want to include math in TOC
Authors want a standardized approach to subtitles in TOC
Authors want a standardized approach to displaying component-level metadata, such as author of a chapter, in TOC

mteixeira-wwn commented 5 years ago

@TzviyaSiegman, to your note (though tangential to the conversation in general; apologies to Ivan):

Authors want to include math in TOC

We have real use cases where in-line chemistry and music (!) notation are also important and irreplaceable. Norton used SVGs in the past, against better judgment, but it worked in our reader, which parses and sets the TOC as HTML. I imagine other publishers would also expect stuff like MathML.

More on topic: I would be happy to contribute samples of many of our various TOC "types", if that's a way forward to iterating on the parsing logic. Norton Anthologies, for example, could make good use of sectionable navigation docs.

iherman commented 5 years ago

@mteixeira-wwn such examples would be really useful. We should use them to test the current extraction algorithm (@mattgarrish has running implementation code, afaik). Thx!

iherman commented 5 years ago

This issue was discussed in a meeting.

RESOLVED: overwhelming support for closing #378, and Mateus will add comment to #414
ToC algorithm
Tzviya Siegman: https://github.com/w3c/wpub/issues/378
Tzviya Siegman: issue 378
… the issue is “what goes into ToC?”
… the proposal is to leave things as is, unless we have evidence it needs to be adjusted
… mateus said they need extra types (chemistry, music) in the ToC
Mateus Teixeira: I can provide examples from NN
Matt Garrish: there are two issues, one is allowing markup within the ToC labels (#414), but #378 is more about the various structures of ToC
… what kind of different structuring of the ToC should we try to account for
… maybe we should wait and see
Ivan Herman: my impression is that the possibility of putting advanced markup in label is different from #378
… the reason I raised it back then is that some structural things (e.g. section elements) are ignored by the algo, and I was wondering if other things should be ignored too
… my feeling is that the answer is no; but it doesn’t mean we can’t allow MathML
… it seems we can close #378 without much of a problem, regardless of what we decide for the markup of labels
Tzviya Siegman: I agree with that
Mateus Teixeira: +1, but I’ll share examples either way
Tzviya Siegman: mateus can provide examples for #414 then
Matt Garrish: right, we’re waiting for evidence of more table of contents structures
… we can close it and raise specific issues, specific kind of ToC when we have evidence or examples
Tzviya Siegman: the proposal is to leave the algo as is and close #378
Ivan Herman: yes, I have the impression that the structure of the ToC, as currently described, should be fine as-is.
Mateus Teixeira: true
Avneesh Singh: +1
Tzviya Siegman: ok, so overwhelming support for closing #378, and mateus will add comment to #414
Romain Deltour: +1
Charles LaPierre: +1
Wendy Reid: +1
Luc Audrain: +1
Mateus Teixeira: +1
Joshua Pyle: +1
Ivan Herman: +1
Resolution #3: overwhelming support for closing #378, and Mateus will add comment to #414
