w3c / wpub

W3C Web Publications
https://w3c.github.io/wpub/

Elements to ignore by TOC extraction algorithm? #378

Closed iherman closed 5 years ago

iherman commented 5 years ago

The HTML TOC structure and extraction ignores sectioning content and hidden elements from the TOC. Is there a need to ignore others?

rdeltour commented 5 years ago

For one thing, I'm not even convinced all sectioning content elements should be ignored, especially given that section is more often than not used as a glorified div by web developers…

mattgarrish commented 5 years ago

especially given that section is more often than not used as a glorified div by web developers

For those who may not follow this issue, if a nav is subsectioned like this:

<nav role="doc-toc">
   <h2>Contents</h2>
   <section>
      <h3>Part 1</h3>
      ...
   </section>
   ...
</nav>

The algorithm will not extract anything, as we don't account for partitioning of navs.
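To make that failure mode concrete, here is a minimal sketch in Python. It is not the WPUB extraction algorithm: `SkippingTocParser`, `extract_links`, and the sample markup are hypothetical names invented for illustration, and the only rule modelled is "skip nested sectioning content".

```python
from html.parser import HTMLParser

# Elements skipped wholesale in this sketch (HTML sectioning content).
SECTIONING = {"section", "article", "aside", "nav"}


class SkippingTocParser(HTMLParser):
    """Collects (href, text) links from a toc nav, skipping nested sectioning content."""

    def __init__(self):
        super().__init__()
        self.in_toc = False   # inside the outer <nav role="doc-toc">
        self.skip_depth = 0   # > 0 while inside nested sectioning content
        self.href = None      # href of the <a> currently being read, if any
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.in_toc:
            if tag == "nav" and attrs.get("role") == "doc-toc":
                self.in_toc = True
            return
        if tag in SECTIONING:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0 and "href" in attrs:
            self.href, self.text = attrs["href"], []

    def handle_endtag(self, tag):
        if not self.in_toc:
            return
        if tag in SECTIONING and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "nav":
            self.in_toc = False  # closing the outer toc nav
        elif tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.text).strip()))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def extract_links(html):
    parser = SkippingTocParser()
    parser.feed(html)
    parser.close()
    return parser.links


SECTIONED = """
<nav role="doc-toc">
  <h2>Contents</h2>
  <section>
    <h3>Part 1</h3>
    <ol><li><a href="c1.html">Chapter 1</a></li></ol>
  </section>
</nav>
"""

FLAT = """
<nav role="doc-toc">
  <h2>Contents</h2>
  <ol>
    <li><a href="c1.html">Chapter 1</a></li>
    <li><a href="c2.html">Chapter 2</a></li>
  </ol>
</nav>
"""
```

With these samples, `extract_links(FLAT)` finds both chapter links, while `extract_links(SECTIONED)` returns an empty list because every link sits inside a skipped section.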

It's not that complicated to treat each nested section/nav element as a branch of the toc (i.e., like an a tag without an href attribute), but it does bring in additional complexity. What happens if heading numbers aren't sequential? What happens if a section has multiple headings? What happens if there are headings and no sections? We almost have to implement the outline algorithm to extract a table of contents.

I don't like that we don't descend into subsections, but if we descend into them and ignore them as branches of the table of contents (i.e., we just go in search of lists), then do we concatenate the lists we find as though they all represent children? Does that make sense if their grouping heading is gone?

iherman commented 5 years ago

I am afraid this would make all this way too complicated for no major gain. While the section elements are indeed very important in general, the current TOC algorithm is not aimed at using the main body of the content as the source of the TOC, but using a dedicated structure.

A general, or even a somewhat restrictive, outlining algorithm goes, in my view, beyond what we should define. Authoring tools may impose their own content structure restrictions, and use a corresponding TOC generating algorithm, which can then be used as a WPUB TOC if needed (this is what respec does, after all).

iherman commented 5 years ago

(Interestingly, my previous comment may be used in favor of a TOC structure in JSON, see issue #376...)

mattgarrish commented 5 years ago

the current TOC algorithm is not aimed at using the main body of the content as the source of the TOC, but using a dedicated structure

I'm not suggesting we try to parse the contents of the publication, only that partitioning a toc into subsections isn't uncommon. It's often done inside a list with an unlinked label, but that's just one pattern. Inside each section we'd still expect a list of links:

<nav role="doc-toc">
   <h2>Contents</h2>
   <section>
      <h3>Part 1</h3>
      <ol>
          <li><a href="c1.html">Chapter 1</a></li>
          <li><a href="c2.html">Chapter 2</a></li>
          ...
      </ol>
   </section>
   <section>
      <h3>Part 2</h3>
      <ol>
          <li><a href="c10.html">Chapter 10</a></li>
          <li><a href="c11.html">Chapter 11</a></li>
          ...
      </ol>
   </section>
</nav>

Where it gets more complex is if people start using only headings, and don't use them consistently:

<nav role="doc-toc">
   <h1>Contents</h1>
   <h3>Part 1</h3>
   <ol>
       <li><a href="c1.html">Chapter 1</a></li>
       <li><a href="c2.html">Chapter 2</a></li>
       ...
   </ol>
   <h3>Part 2</h3>
   <ol>
      <li><a href="c10.html">Chapter 10</a></li>
      <li><a href="c11.html">Chapter 11</a></li>
      ...
   </ol>
</nav>

But just as we restrict what we accept now, we could probably restrict partitioning to the first example and if you stray from that you can't expect a usable toc. That includes if you insert a section element and don't provide a heading, you just get a placeholder (like a li without an a but with a nested list of links).

The latter example requires constructing an outline first before the intention can be meaningfully extracted.
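A rough sketch of the restricted, section-based pattern might look like the following Python. It is an illustration under stated assumptions, not the working code mentioned in this thread: `BranchingTocParser`, `extract_tree`, and the dictionary shape are hypothetical, the first heading found in a branch is taken as its name, and lists nested inside a branch are flattened.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BranchingTocParser(HTMLParser):
    """Builds a toc tree in which each nested <section> becomes an unlinked branch."""

    def __init__(self):
        super().__init__()
        self.root = {"name": None, "href": None, "children": []}
        self.stack = []       # open branches; stack[0] is the toc nav itself
        self.capture = None   # "heading" or "link" while collecting text
        self.text = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not self.stack:
            if tag == "nav" and attrs.get("role") == "doc-toc":
                self.stack.append(self.root)
            return
        if tag == "section":
            # Assumption: a section is a branch, like an <li> without a link.
            branch = {"name": None, "href": None, "children": []}
            self.stack[-1]["children"].append(branch)
            self.stack.append(branch)
        elif tag in HEADINGS and self.stack[-1]["name"] is None:
            # Assumption: only the first heading of a branch names it.
            self.capture, self.text = "heading", []
        elif tag == "a" and "href" in attrs:
            self.capture, self.text, self.href = "link", [], attrs["href"]

    def handle_endtag(self, tag):
        if not self.stack:
            return
        if tag in HEADINGS and self.capture == "heading":
            self.stack[-1]["name"] = "".join(self.text).strip()
            self.capture = None
        elif tag == "a" and self.capture == "link":
            entry = {"name": "".join(self.text).strip(), "href": self.href, "children": []}
            self.stack[-1]["children"].append(entry)
            self.capture = None
        elif tag == "section" and len(self.stack) > 1:
            self.stack.pop()
        elif tag == "nav":
            self.stack.clear()

    def handle_data(self, data):
        if self.capture is not None:
            self.text.append(data)


def extract_tree(html):
    parser = BranchingTocParser()
    parser.feed(html)
    parser.close()
    return parser.root


TOC = """
<nav role="doc-toc">
  <h2>Contents</h2>
  <section>
    <h3>Part 1</h3>
    <ol>
      <li><a href="c1.html">Chapter 1</a></li>
      <li><a href="c2.html">Chapter 2</a></li>
    </ol>
  </section>
  <section>
    <ol>
      <li><a href="c10.html">Chapter 10</a></li>
    </ol>
  </section>
</nav>
"""
```

Here `extract_tree(TOC)` yields a root named "Contents" with two branches: one named "Part 1" and one whose name is `None`, which is the placeholder case for a section without a heading. The heading-only markup would come out as a single flat branch under this sketch, which is why it would need an outline step first.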

HadrienGardeur commented 5 years ago

(Interestingly, my previous comment may be used in favor of a TOC structure in JSON, see issue #376...)

Indeed and this is just the beginning of a very very long list of edge cases that we'll encounter...

If we move away from a sub-set of HTML like in EPUB, I see no end in sight to these discussions.

rdeltour commented 5 years ago

@iherman

I am afraid this would make all this way too complicated for no major gain.

In both Matt's examples above, making sections "transparent" would mean the algo would still extract something instead of bluntly ignoring the markup.

A general, or even a somewhat restrictive, outlining algorithm goes, in my view, beyond what we should define.

Yes, 👍 of course.

@mattgarrish

I'm not suggesting we try to parse the contents of the publication, only that partitioning a toc into subsections isn't uncommon.

Could we in this case merge the list descendants, as if they were one big list? I.e., we would totally ignore sections and headings, but at least the list content would get extracted? I'm aware this may not fully represent the author's intent in 100% of cases, but maybe better than nothing?
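As a rough illustration of this "transparent sections" idea, the Python sketch below descends into sections, ignores headings, and concatenates every linked list item it finds. The names `FlatteningTocParser` and `extract_flat` and the sample markup are hypothetical, and this is a simplification rather than the draft's algorithm.

```python
from html.parser import HTMLParser


class FlatteningTocParser(HTMLParser):
    """Merges all list links in a toc nav into one flat list; sections are transparent."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0    # nesting depth of <nav>, counted from the toc nav
        self.list_depth = 0   # > 0 while inside <ol> or <ul>
        self.href = None      # href of the <a> currently being read, if any
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.nav_depth == 0:
            if tag == "nav" and attrs.get("role") == "doc-toc":
                self.nav_depth = 1
            return
        if tag == "nav":
            self.nav_depth += 1
        elif tag in ("ol", "ul"):
            self.list_depth += 1
        elif tag == "a" and self.list_depth > 0 and "href" in attrs:
            self.href, self.text = attrs["href"], []
        # <section> and headings have no branch here: they are simply ignored.

    def handle_endtag(self, tag):
        if self.nav_depth == 0:
            return
        if tag == "nav":
            self.nav_depth -= 1
            if self.nav_depth == 0:
                self.list_depth = 0
        elif tag in ("ol", "ul") and self.list_depth > 0:
            self.list_depth -= 1
        elif tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.text).strip()))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def extract_flat(html):
    parser = FlatteningTocParser()
    parser.feed(html)
    parser.close()
    return parser.links


WITH_SECTIONS = """
<nav role="doc-toc">
  <h2>Contents</h2>
  <section>
    <h3>Part 1</h3>
    <ol><li><a href="c1.html">Chapter 1</a></li></ol>
  </section>
  <section>
    <h3>Part 2</h3>
    <ol><li><a href="c10.html">Chapter 10</a></li></ol>
  </section>
</nav>
"""

HEADINGS_ONLY = """
<nav role="doc-toc">
  <h1>Contents</h1>
  <h3>Part 1</h3>
  <ol><li><a href="c1.html">Chapter 1</a></li></ol>
  <h3>Part 2</h3>
  <ol><li><a href="c10.html">Chapter 10</a></li></ol>
</nav>
"""
```

Both samples produce the same two-entry list, so something is always extracted, but the "Part 1" and "Part 2" grouping is lost, which is the trade-off being weighed here.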

@HadrienGardeur

If we move away from a sub-set of HTML like in EPUB

What the algorithm does right now isn't much different from what EPUB Nav Doc was doing, it's just slightly more permissive and is specified for the UA rather than for the Author, which is better for interoperability.
But I'm not sure I understand what you mean by "moving away from a sub-set of HTML like in EPUB"; I think the question here is precisely about what subset of HTML we want to extract?

iherman commented 5 years ago

Trying to second-guess @HadrienGardeur's statement:

moving away from a sub-set of HTML like in EPUB

I guess the issue is whether we move further away from the (relatively) simple HTML structure that the current draft specifies for the TOC. If we are not careful, we may end up with an endless number of structural variations, i.e., very complex structures and algorithms. As a general statement (if this is what @HadrienGardeur meant, that is), I agree with this.

Whether @mattgarrish's first example is within the bounds: maybe. The current algorithm would have to become a bit more complicated, because we would have to account for the heading elements within the sections, too, which would probably mean that a heading element within a simple <li> would have to be accepted as well (otherwise we would have to keep track of whether we are within a section or not). If that can be done easily then... maybe it is fine.

But I would think we should not go in the direction of the second example. I.e., we should definitely stop there unless real and widely used cases come to the fore.


I must admit, however, that I was surprised by @mattgarrish's statement, whereby:

... partitioning a toc into subsections isn't uncommon.

I have never seen such a TOC structure myself so far on the Web (but it may be used in e-books, I cannot say). I definitely yield to others if such structures are really common. But if they are not, we should not complicate our lives further...

mattgarrish commented 5 years ago

I have never seen such a TOC structure myself

I don't follow here. EPUB used span to represent headings, so is there really any difference between the example I gave above and this:

<nav role="doc-toc">
   <h1>Contents</h1>
   <ol>
       <li>
          <div><a>Part 1</a></div>
          <ul>
             <li><a href="c1.html">Chapter 1</a></li>
             <li><a href="c2.html">Chapter 2</a></li>
          </ul>
       </li>
       ...
   </ol>
</nav>

EPUB forced you to follow the above pattern, but should we continue to rule out doing it with nested section/nav elements? It's certainly debatable that we should allow only one representation.

But, to be clear, all I'm saying is if we do this, we should maintain strict rules on what we allow. I would argue for the following if we do:

And that's all we'll recognize.

It will pose a question of whether to apply a or h# as the name, I agree, as I already ran into that in an earlier iteration of the working code. I'm not sure if the answer is just to accept either the first heading or a as the name, assuming headings inside list items will be a rarity, or just flag what is currently being processed (we wouldn't need a new stack or anything, since the name has to be found before any nested sections/lists begin).

TzviyaSiegman commented 5 years ago

@mattgarrish do we need to raise this to WG or is this for you to solve?

mattgarrish commented 5 years ago

This is a question of how many possible representations of the markup we want the toc algorithm to account for, but no one has complained (yet) about the existing algorithm's expectations so maybe it's a non-issue.

TzviyaSiegman commented 5 years ago

Thanks. We'll add it to discussion on Monday.

iherman commented 5 years ago

As the person raising the original issue: I am perfectly fine if the answer to the question:

The HTML TOC structure and extraction ignores sectioning content and hidden elements from the TOC. Is there a need to ignore others?

is a "no", and we close this issue with no further action.

iherman commented 5 years ago

This issue was discussed in a meeting.

TzviyaSiegman commented 5 years ago

Anecdotal evidence about restrictions of EPUB (note I am not saying we should include these things in WPUB, just offering stories). Some of these can be addressed by best practices.

mteixeira-wwn commented 5 years ago

@TzviyaSiegman, to your note (though tangential to the conversation in general; apologies to Ivan):

Authors want to include math in TOC

We have real use cases where in-line chemistry and music (!) notation are also important and irreplaceable. Norton used SVGs in the past, against better judgment, but it worked in our reader, which parses and sets the TOC as HTML. I imagine other publishers would also expect stuff like MathML.

More on topic: I would be happy to contribute samples of many of our various TOC "types", if that's a way forward to iterating on the parsing logic. Norton Anthologies, for example, could make good use of sectionable navigation docs.

iherman commented 5 years ago

@mteixeira-wwn such examples would be really useful. We should use them to test the current extraction algorithm (@mattgarrish has a running implementation, afaik). Thx!

iherman commented 5 years ago

This issue was discussed in a meeting.