Closed iherman closed 5 years ago
For one thing, I'm not even convinced all sectioning content elements should be ignored, especially given that section is more often than not used as a glorified div by web developers…
especially given that section is more often than not used as a glorified div by web developers
For those who may not follow this issue, if a nav is subsectioned like this:
<nav role="doc-toc">
<h2>Contents</h2>
<section>
<h3>Part 1</h3>
...
</section>
...
</nav>
The algorithm will not extract anything, as we don't account for partitioning of navs.
It's not that complicated to treat each nested section/nav element as a branch of the toc (i.e., like an a tag without an href attribute), but it does bring in additional complexity. What happens if heading numbers aren't sequential? What happens if a section has multiple headings? What happens if there are headings and no sections? We almost have to implement the outline algorithm to extract a table of contents.
I don't like that we don't descend into subsections, but if we descend into them and ignore them as branches of the table of contents (i.e., we just go in search of lists), then do we concatenate the lists we find as though they all represent children? Does that make sense if their grouping heading is gone?
I am afraid this would make all this way too complicated for no major gain. While the section elements are indeed very important in general, the current TOC algorithm is not aimed at using the main body of the content as the source of the TOC, but using a dedicated structure.
A general, or even a somewhat restrictive, outlining algorithm goes, in my view, beyond what we should define. Authoring tools may impose their own content structure restrictions, and use a corresponding TOC generating algorithm, which can then be used as a WPUB TOC if needed (this is what respec does, after all).
(Interestingly, my previous comment may be used in favor of a TOC structure in JSON, see issue #376...)
the current TOC algorithm is not aimed at using the main body of the content as the source of the TOC, but using a dedicated structure
I'm not suggesting we try to parse the contents of publication, only that partitioning a toc into subsections isn't uncommon. It's often done inside a list with an unlinked label, but that's just one pattern. Inside each section we'd still expect a list of links:
<nav role="doc-toc">
<h2>Contents</h2>
<section>
<h3>Part 1</h3>
<ol>
<li><a href="c1.html">Chapter 1</a></li>
<li><a href="c2.html">Chapter 2</a></li>
...
</ol>
</section>
<section>
<h3>Part 2</h3>
<ol>
<li><a href="c10.html">Chapter 10</a></li>
<li><a href="c11.html">Chapter 11</a></li>
...
</ol>
</section>
</nav>
Where it gets more complex is if people start using only headings, and don't use them consistently:
<nav role="doc-toc">
<h1>Contents</h1>
<h3>Part 1</h3>
<ol>
<li><a href="c1.html">Chapter 1</a></li>
<li><a href="c2.html">Chapter 2</a></li>
...
</ol>
<h3>Part 2</h3>
<ol>
<li><a href="c10.html">Chapter 10</a></li>
<li><a href="c11.html">Chapter 11</a></li>
...
</ol>
</nav>
But just as we restrict what we accept now, we could probably restrict partitioning to the first example, and if you stray from that you can't expect a usable toc. That includes the case where you insert a section element and don't provide a heading: you just get a placeholder (like a li without an a but with a nested list of links).
The latter example requires constructing an outline first before the intention can be meaningfully extracted.
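To make the restricted partitioning concrete, here is a minimal sketch of how the first example could be handled. It is not the WPUB extraction algorithm, only an illustration under stated assumptions: the nav is well-formed XML (so the standard library's ElementTree can parse it), and the function names `extract_toc` and `entries_from_list` and the dictionary layout are hypothetical. Each nested section/nav becomes an unlinked branch named by its first heading; a section without a heading yields a placeholder with no name.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def entries_from_list(list_el):
    """Turn each <li> of an <ol>/<ul> into a TOC entry (hypothetical layout)."""
    entries = []
    for li in list_el.findall("li"):
        link = li.find("a")  # direct child <a> only
        nested = li.find("ol") if li.find("ol") is not None else li.find("ul")
        entries.append({
            "name": "".join(link.itertext()).strip() if link is not None else None,
            "url": link.get("href") if link is not None else None,
            "children": entries_from_list(nested) if nested is not None else [],
        })
    return entries

def extract_toc(parent):
    """Walk the direct children of a nav/section; each nested section is a branch."""
    entries = []
    for child in parent:
        if child.tag in ("ol", "ul"):
            entries.extend(entries_from_list(child))
        elif child.tag in ("section", "nav"):
            # The first heading child names the branch; no heading means a placeholder.
            heading = next((c for c in child if c.tag in HEADINGS), None)
            entries.append({
                "name": "".join(heading.itertext()).strip() if heading is not None else None,
                "url": None,  # like an <a> without an href
                "children": extract_toc(child),
            })
    return entries
```

Headings that are direct children of the nav itself (such as "Contents") are skipped here, and anything outside the list/section pattern is ignored, which mirrors the "stray from that and you can't expect a usable toc" position.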
(Interestingly, my previous comment may be used in favor of a TOC structure in JSON, see issue #376...)
Indeed and this is just the beginning of a very very long list of edge cases that we'll encounter...
If we move away from a sub-set of HTML like in EPUB, I see no end in sight to these discussions.
@iherman
I am afraid this would make all this way too complicated for no major gain.
In both Matt's examples above, making sections "transparent" would mean the algo would still extract something instead of bluntly ignoring the markup.
A general, or even a somewhat restrictive, outlining algorithm goes, in my view, beyond what we should define.
Yes, 👍 of course.
@mattgarrish
I'm not suggesting we try to parse the contents of publication, only that partitioning a toc into subsections isn't uncommon.
Could we in this case merge the list descendants, as if they were one big list? I.e., we would totally ignore sections and headings, but at least the list content would get extracted? I'm aware this may not fully represent the author's intent in 100% of cases, but maybe better than nothing?
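A rough sketch of this "transparent sections" idea, for comparison. It assumes well-formed XML input parsed with the standard library's ElementTree; `merged_toc` and `list_entries` are hypothetical names, and this is an illustration rather than the specified algorithm. Sections and headings are ignored entirely, and the lists found inside them are concatenated as though they were siblings.

```python
import xml.etree.ElementTree as ET

def list_entries(list_el):
    """Convert the <li> children of one list into entries, keeping nested lists."""
    entries = []
    for li in list_el.findall("li"):
        link = li.find("a")
        nested = [c for c in li if c.tag in ("ol", "ul")]
        entries.append({
            "name": "".join(link.itertext()).strip() if link is not None else None,
            "url": link.get("href") if link is not None else None,
            "children": [e for n in nested for e in list_entries(n)],
        })
    return entries

def merged_toc(el):
    """Concatenate every list reachable through section/nav wrappers."""
    entries = []
    for child in el:
        if child.tag in ("ol", "ul"):
            entries.extend(list_entries(child))
        elif child.tag in ("section", "nav"):
            # Sections are "transparent": descend, but create no branch.
            entries.extend(merged_toc(child))
        # Headings and any other elements are skipped.
    return entries
```

With this approach the sectioned example and the headings-only example produce the same flat sequence of chapters; the "Part 1" / "Part 2" grouping is lost, which is the trade-off raised earlier in the thread.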
@HadrienGardeur
If we move away from a sub-set of HTML like in EPUB
What the algorithm does right now isn't much different from what EPUB Nav Doc was doing, it's just slightly more permissive and is specified for the UA rather than for the Author, which is better for interoperability.
But I'm not sure I understand what you mean by "moving away from a sub-set of HTML like in EPUB"; I think the question here is precisely about what subset of HTML we want to extract?
Trying to second-guess @HadrienGardeur's statement:
moving away from a sub-set of HTML like in EPUB
I guess the issue is whether we move further away from the (relatively) simple HTML structure that the current draft specifies for TOC. If we are not careful, we may end up with an endless number of structural variations, i.e., very complex structures and algorithms. As a general statement (if this is what @HadrienGardeur meant, that is), I agree with this.
Whether @mattgarrish's first example is within the bounds: maybe. The current algorithm would have to become a bit more complicated, because we have to account for the heading elements within the sections, too, which would probably mean that a heading element within a simple <li> would have to be accepted, too (otherwise we would have to keep track of whether we are within a section or not). If that can be done easily then... maybe it is fine.
But I would think we should not go in the direction of the second example. I.e., we should definitely stop there unless real and widely used cases come to the fore.
I must admit, however, that I was surprised by @mattgarrish's statement, whereby:
... partitioning a toc into subsections isn't uncommon.
I have never seen such a TOC structure myself so far on the Web (but it may be used in e-books, I cannot say). I definitely yield to others if such structures are really common. But if they are not, we should not complicate our lives further...
I have never seen such a TOC structure myself
I don't follow here. EPUB used span to represent headings, so is there really any difference between the example I gave above and this:
<nav role="doc-toc">
<h1>Contents</h1>
<ol>
<li>
<div><a>Part 1</a></div>
<ul>
<li><a href="c1.html">Chapter 1</a></li>
<li><a href="c2.html">Chapter 2</a></li>
</ul>
</li>
...
</ol>
</nav>
EPUB forced you to follow the above pattern, but should we continue to rule out doing it with nested section/nav elements? It's certainly debatable that we should allow only one representation.
But, to be clear, all I'm saying is if we do this, we should maintain strict rules on what we allow. I would argue for the following if we do:
And that's all we'll recognize.
It will pose a question of whether to apply a or h# as the name, I agree, as I already ran into that in an earlier iteration of the working code. I'm not sure if the answer is just to accept either the first heading or a as the name, assuming headings inside list items will be a rarity, or just flag what is currently being processed (we wouldn't need a new stack or anything, since the name has to be found before any nested sections/lists begin).
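One way the "first heading or a" option could look, as a hedged sketch only. It assumes well-formed XML input and uses a hypothetical helper name, `branch_name`; nothing here comes from the working code mentioned above. The scan stops at the first nested list or section, since the name has to appear before those begin, so no extra stack is needed.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
CONTAINERS = {"ol", "ul", "section", "nav"}

def branch_name(el):
    """Return the text of the first heading or <a> child found before any
    nested list/section, or None if there is no such child."""
    for child in el:
        if child.tag in CONTAINERS:
            break  # the name must precede nested content
        if child.tag == "a" or child.tag in HEADINGS:
            return "".join(child.itertext()).strip()
    return None
```

The same helper can be applied to a li or to a section, so a heading inside a list item and a heading inside a section are treated alike without tracking which context is being processed.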
@mattgarrish do we need to raise this to WG or is this for you to solve?
This is a question of how many possible representations of the markup we want the toc algorithm to account for, but no one has complained (yet) about the existing algorithm's expectations so maybe it's a non-issue.
Thanks. We'll add it to discussion on Monday.
As the person raising the original issue: I am perfectly fine if the answer to the question:
The HTML TOC structure and extraction ignores sectioning content and hidden elements from the TOC. Is there a need to ignore others?
is a "no", and we close this issue with no further action.
This issue was discussed in a meeting.
Anecdotal evidence about restrictions of EPUB (note I am not saying we should include these things in WPUB, just offering stories). Some of these can be addressed by best practices.
@TzviyaSiegman, to your note (though tangential to the conversation in general; apologies to Ivan):
Authors want to include math in TOC
We have real use cases where in-line chemistry and music (!) notation are also important and irreplaceable. Norton used SVGs in the past, against better judgment, but it worked in our reader, which parses and sets the TOC as HTML. I imagine other publishers would also expect stuff like MathML.
More on topic: I would be happy to contribute samples of many of our various TOC "types", if that's a way forward to iterating on the parsing logic. Norton Anthologies, for example, could make good use of sectionable navigation docs.
@mteixeira-wwn such examples would be really useful. We should use them to test the current extraction algorithm (@mattgarrish has a running implementation, afaik). Thx!
This issue was discussed in a meeting.
RESOLVED: overwhelming support for closing #378, and Mateus will add comment to #414