Do we need a more detailed definition for the HTML TOC format?

iherman commented 6 years ago

(This issue was originally discussed in #285, but needs to be migrated to a separate issue.)

@HadrienGardeur

I think that the toc is unique, in the sense that the UA will have to fetch and parse HTML to properly populate this info. I know that you have your doubts about that, but this is definitely something that every UA will need to do and I don't see any good reason why it can't be included in the same IDL.

@iherman

The problem with toc: we then have to spec exactly how the structure of the toc should be expressed in HTML, mainly when it comes to hierarchical toc-s.

@HadrienGardeur

this is IMO unavoidable.

@llemeurfr

if not, UAs won't be able to do anything interesting with the HTML ToC. The alternative we proposed was a predefined machine readable json ToC but it was dismissed by the group.

@TzviyaSiegman

We could consider restricting the way that the HTML ToC can be written, as is done in EPUB. See https://w3c.github.io/publ-epub-revision/epub32/spec/epub-packages.html#sec-package-nav-def.

@dauwhe

If we start profiling HTML in web publications, we will likely alienate the browser community as well as massively confuse authors.

HadrienGardeur commented 6 years ago

If the author wants a table of contents that isn't for use by machines, why are we concerned about marking it available for machine use? The machine-processable table of contents can be something that simply lives in a document that isn't reachable in the reading order.

That's not what I've suggested.

Based on my post above (https://github.com/w3c/wpub/issues/291#issuecomment-416352888), such a ToC would not be marked as "machine-processable", but the UA would know that this contains a renderable ToC.

The UA could use this to display a ToC button that simply jumps directly to that resource, which would work fine for the SVG example.

If you have a separate machine processable ToC, you'd keep it out of the reading order (resources) and mark it as machine processable using a different rel value.

Or rather, I could understand us allowing a JSON TOC in the manifest and an HTML TOC, with the latter being used to generate the first if it's missing, but a solution consisting of two HTML TOCs really puzzles me.

I also have a slight preference for having the machine processable ToC in JSON as well. I don't think that the arguments for an HTML machine-processable ToC are very strong:

any kind of restriction in syntax is confusing for authors and I wouldn't expect them to handle them manually, no matter if it's based in HTML or JSON
while an HTML ToC can handle things like Ruby or MathML, in practice it is very difficult for a UA to handle anything more than plain text in their native UI for a ToC

iherman commented 6 years ago

I wonder whether we are not breaking the 80/20 rule or, to use a more colloquial form of it: "the perfect is the enemy of the good"...

I think that most publication authors, if they produce a table of contents, will produce what we call, so far, a "machine readable ToC", ie, some sort of a hierarchical list using nav|ul|ol|li. There are a large number of small javascript tools that do that automatically from properly structured HTML, HTML editors do that for you, the ToC-s generated by jekyll (see our own minutes on the Web site) do that, etc. To serve this majority of potential WP-s, it would be crazy not to make use of these easily in the manifest. Yes, there may be some that will be wrong, and user agents may glean an erroneous structure from them. So be it: it is the author's fault, and it is not much different from what happens in various places on the Web. The current draft aims at the situation where this ToC is available in HTML, it is easy to extract a machine readable version of it if the User Agent wants to produce some sort of a ToC popup, but if the user agent does not do anything the ToC makes sense as is (that is why some even wanted to require that the ToC would be in the primary entry page). I think that is the majority of our use cases, close to the 80%.

Yes, there are more unusual cases; the cases described by @deborahgu, or a table of contents in an SVG file, are examples thereof. The essence of the two-toc approach is, I believe, to make those esoteric cases possible, ie, to provide a fall back if the author does not want the user agent to do anything special, "just" display his/her own table of contents, unchanged, in the browser. Another possibility is that there is indeed only one TOC but there is some extra information somewhere telling the User Agent "don't try to interpret this, just display it!"

I personally do not believe this would instigate most authors to produce two different ToC-s. If the restrictions are simple and the ToC is in HTML, ie, the author can add fancy CSS to his/her heart's content, why would that be the case?

mattgarrish commented 6 years ago

Based on my post above (#291 (comment)), such a ToC would not be marked as "machine-processable", but the UA would know that this contains a renderable ToC.

Yes, but you're still expecting the UA to use it for some purpose, that was my point. You don't need a second toc to jump to if the machine-processable toc can't be parsed. Fallback to presenting it instead of parsing it.

The case for a second table of contents was that the author wanted one purely for the content that had no restrictions, not that we needed to also wire it up for use by the UA somehow.

mattgarrish commented 6 years ago

I personally do not believe this would instigate most authors to produce two different ToC-s.

No, but it requires them to list the table of contents twice in the manifest to ensure that whichever approach the UA takes to rendering it knows the toc can be used. Why not list the toc once and let the UA determine how to present it based on whether it can obtain the information it needs?

mattgarrish commented 6 years ago

Sorry, "list" should be "identify it twice" since they'll have to use two toc semantics.

iherman commented 6 years ago

I personally do not believe this would instigate most authors to produce two different ToC-s.

No, but it requires them to list the table of contents twice in the manifest to ensure that whichever approach the UA takes to rendering it knows the toc can be used. Why not list the toc once and let the UA determine how to present it based on whether it can obtain the information it needs?

I am not sure I follow what you say...

Also, this approach would leave it pretty open what the UA would do. If I use SVG for some sort of a map, I may not want the UA to do anything with the content, I just want it to provide a visible link to my map.

HadrienGardeur commented 6 years ago

Yes, but you're still expecting the UA to use it for some purpose, that was my point. You don't need a second toc to jump to if the machine-processable toc can't be parsed. Fallback to presenting it instead of parsing it.

The case for a second table of contents was that the author wanted one purely for the content that had no restrictions, not that we needed to also wire it up for use by the UA somehow.

That's a fair point. There are many EPUB 3 files like what you've described, with a "renderable" TOC included in the reading order (but not marked as being a TOC) and a Navigation Document that's not included in the reading order.

Based on my list at https://github.com/w3c/wpub/issues/291#issuecomment-416352888 this would mean dropping:

a rel that identifies the "plain HTML TOC" (probably contents)

If we don't think that there's a use case at all for "jumping to a non-processable TOC", I'm fine with that. I'm just not sure we ever had this discussion at all.

iherman commented 6 years ago

This is a possible writeup of what we call, for now, a two-toc version. As agreed on the call last Monday I write it down only to provide a clear basis for the discussion. The text below has some speculative aspects (e.g., the usage of aria-roledescription; I just did not find a better way to differentiate things...). It also shows that spec-ing the two-toc version is not that easy; maybe somebody can come up with a better approach if we get there.

Also, in what follows, I am not sure of the term 'MAP', we can have some bike-shedding on those.

(B.t.w., I am just a go-between at this point:-) Personally, I am not yet clear what I think the best option is.)

The infoset contains two types of table of contents:

TOC, identified in the manifest by the rel value of contents (as defined by IANA)
MAP, identified in the manifest by the rel value of https://www.w3.org/ns/wp#map

Both are HTML elements in a resource that MUST be part of either the reading order or the list of resources. These elements MUST have role="doc-toc" as part of their (HTML) attributes.

A minimal structure of the HTML element is defined for a TOC. This should be a relaxed version of the EPUB 3 Navigation Document spec.

The infoset may contain 0 or 1 TOC element and 0 or 1 MAP element.

User agent behavior

If a user agent identifies a TOC, it SHOULD extract a (possibly hierarchical) list of links that it SHOULD display to the user in some application dependent way (e.g., as a pop-up, as a separate menu choice in the display area, etc.) This specification does not specify what should happen if such an extraction fails due to an incorrect TOC structure. One possibility is that the user agent considers it to be a MAP and not a TOC.
If a user agent identifies a MAP, it MAY provide a direct reference to that element from its own user interface (e.g., a button or menu item) and, if invoked, MUST display that MAP simply as HTML.

Getting hold of a TOC and a MAP

User agents MUST compute the TOC and MAP as follows:

Identify the MAP resource:
- If a resource in either the default reading order or resource list is identified with a rel value including https://www.w3.org/ns/wp#map, the corresponding url value identifies the MAP resource.
- Otherwise, the primary entry page is the MAP resource.
Identify the TOC resource:
- If a resource in either the default reading order or resource list is identified with a rel value including contents, the corresponding url value identifies the TOC resource.
- Otherwise, the primary entry page is the TOC resource.
If the MAP resource contains an HTML element with the role value doc-toc, and with the aria-roledescription value map, the user agent MUST use that element as the MAP.
If the TOC resource contains an HTML element with the role value doc-toc, and with the aria-roledescription either missing, or with different value than map, the user agent MUST use that element as the TOC.

If a MAP, respectively TOC, resource contains more than one MAP, respectively TOC, element, the first one in document order MUST be used. If the reading order and resource list, conceptually put together in this order, include more than one entry identifying a MAP, respectively TOC, resource, the first one of those MUST be considered.
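As an illustration only, the lookup rules in this write-up could be sketched as follows. The manifest is assumed to be a plain dictionary with `readingOrder` and `resources` lists of link objects carrying `url` and `rel`; that shape, and the helper names `find_resource` and `classify_element`, are assumptions made for the sketch, not something the draft defines.

```python
MAP_REL = "https://www.w3.org/ns/wp#map"
TOC_REL = "contents"


def find_resource(manifest, rel_value, primary_entry_page):
    """Return the URL of the first link whose rel includes rel_value.

    The reading order is searched before the resource list; if nothing
    matches, the primary entry page is used, as in the write-up.
    """
    links = manifest.get("readingOrder", []) + manifest.get("resources", [])
    for link in links:
        rel = link.get("rel", [])
        if isinstance(rel, str):  # tolerate a single string value
            rel = [rel]
        if rel_value in rel:
            return link["url"]
    return primary_entry_page


def classify_element(attrs):
    """Classify an element, given its attributes as a dict.

    Returns "MAP", "TOC", or None when the element is not role="doc-toc".
    """
    if "doc-toc" not in attrs.get("role", "").split():
        return None
    if attrs.get("aria-roledescription") == "map":
        return "MAP"
    return "TOC"


# Hypothetical manifest, for illustration only.
manifest = {
    "readingOrder": [
        {"url": "cover.html"},
        {"url": "map.html", "rel": [MAP_REL]},
    ],
    "resources": [
        {"url": "nav.html", "rel": "contents"},
        {"url": "other-nav.html", "rel": ["contents"]},
    ],
}

print(find_resource(manifest, TOC_REL, "index.html"))
print(find_resource(manifest, MAP_REL, "index.html"))
```

Picking the first matching element in document order inside the chosen resource is left out here; it would be a straightforward traversal that stops at the first element for which `classify_element` returns the wanted kind.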

HadrienGardeur commented 6 years ago

It's also worth pointing out that at least for the machine processable part, this issue extends beyond the TOC.

While EPUB had many different options, we know that at least "page-list" is a must have as well.

mattgarrish commented 6 years ago

I'm just not sure we ever had this discussion at all.

Right, this is what is confusing me. It's not where we started out, which was only to ensure that the machine-readable toc didn't have to be the one in the content.

Part of my concern is also what effect two tables of content has on the user experience. Are there going to be two buttons to reach the toc, one that brings up a custom widget of some sorts and another that hyperlinks into the document?

HadrienGardeur commented 6 years ago

Part of my concern is also what effect two tables of content has on the user experience. Are there going to be two buttons to reach the toc, one that brings up a custom widget of some sorts and another that hyperlinks into the document?

Frankly, I doubt that this would be the case. If there's a machine-processable TOC and the UA can support them, I expect them to always have a preference for their own UI over rendering the TOC. The rendered TOC would IMO be treated as a fallback.

rdeltour commented 6 years ago

Thanks for the writeup @iherman! I take it to confirm my fear that a two-TOC-in-HTML solution is more confusing than helpful 😃

The writeup says:

If a user agent identifies a TOC, it SHOULD extract a (possibly hierarchical) list of links

This is IMO the key point and the difficulty of this spec'ing task: how to define a robust extraction mechanism.

My position is that this mechanism wouldn't be much more complex or difficult to spec & implement for any flow content than for a relaxed flavor of the Nav Doc.

iherman commented 6 years ago

@rdeltour, a clarification...

I am not sure how I should interpret

My position is that this mechanism wouldn't be much more complex or difficult to spec & implement for any flow content than for a relaxed flavor of the Nav Doc.

in terms of the final spec. What should, in your view, the spec contain?

mattgarrish commented 6 years ago

I expect them to always have a preference for their own UI over rendering the TOC.

Sure, but it's still going to be a weird experience from one publication to the next. It seems like there should be two options to go with this approach, which is odd.

And I'd just mention here that, despite the perception, EPUB does not require reading systems to present the toc in a custom widget. That's just one option from having the restricted markup, but presenting the nav doc as HTML has always been an option. It wouldn't be any different than presenting the toc as HTML here if the links can't be parsed.

But if people are convinced that we need options to link to a table of contents in the content and provide a means for a UA custom widget, then I'd also start to lean in the direction of the latter being in JSON. At least that would reduce some potential confusion about which is which.

iherman commented 6 years ago

@mattgarrish I would have the same question for you as for @rdeltour, because I am really confused.

I realize (having written it down:-) that the two-toc spec is not simple. The current spec is simply silent over just about anything we discussed in this issue. What I would like to understand is what exactly you think should be in the spec?

rdeltour commented 6 years ago

in terms of the final spec. What should, in your view, the spec contain?

well, if we say that given an HTML TOC the UA should or must extract a collection of links, then we need to specify how they're supposed to do so (in the simplest naïve case, a flat list of links in DOM traversal order; in a more sensible case, some kind of hierarchical data structure).

My expectation is that our spec defines how to build this collection of links, presumably with an algorithm.
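The simplest naïve case mentioned above, a flat list of links in DOM traversal order, can be sketched with Python's standard `html.parser`. This is only an illustration: the names `FlatLinkExtractor` and `extract_flat_toc` are made up, and the input is assumed to be the markup of the TOC element itself, so every link found counts as a TOC entry.

```python
from html.parser import HTMLParser


class FlatLinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href> in document order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text of nested inline elements (em, span, ...) is flattened.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_flat_toc(html):
    parser = FlatLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<nav role="doc-toc">'
    '<p><a href="c1.html">One</a></p>'
    '<p><a href="c2.html">Two <em>b</em></a></p>'
    "</nav>"
)
print(extract_flat_toc(sample))
```

A hierarchical variant would need the same traversal plus some rule for deriving a level for each link, which is exactly the part that has to be specified.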

iherman commented 6 years ago

@rdeltour

if we say that given an HTML TOC the UA should or must extract a collection of links,

at this moment the spec does not say anything. That was the reason of the original issue question...

rdeltour commented 6 years ago

@iherman

at this moment the spec does not say anything. That was the reason of the original issue question...

Right, but there seems to be an agreement that RS would like to access TOC data to render in their own UI. This data can come from:

HTML, in which case we need to define an extraction algorithm
a JSON document, in which case we need to define the structure and where to define it (probably in the manifest)
some kind of API (?)

iherman commented 6 years ago

@rdeltour

We did have, in the past, something like that, see

https://www.w3.org/TR/2018/WD-wpub-20180315/#processing-reading-order

I do not remember why but this was removed from subsequent versions. Is this what you are looking for under (1)?

rdeltour commented 6 years ago

@iherman

Is this what you are looking for under (1)?

I'm not sure I understand your question… Spec-authoring-wise, yes, what I expect would look like this old section you pointed at, i.e. some kind of algorithm to extract a data structure out of HTML content. The steps of the algorithm of course would differ.

Again, I expect this if and only if: (a) we want to specify toc data for a UA to render in its own UI and (b) this data comes from HTML. If we revisit our past decision and say that this data comes from JSON or from some kind of API instead of HTML, then defining an extraction algorithm becomes moot.

I believe it's important to keep in mind that our spec is more helpful if it defines how a UA must process a Web Pub, rather than how an author must code a Web Pub. The latter is what we did historically in EPUB, and it caused a swarm of interop issues. So rather than defining content model restrictions, I'd like to define how a UA is to process some flow content identified as TOC data. Best practice documents, articles, or even a11y guidelines, can invite the authors to use clean nested lists, but that's not something that I see as absolutely required.

HadrienGardeur commented 6 years ago

I'm trying to summarize your suggestion @rdeltour: I think your main point is that we should define in the specification an algorithm for extracting the TOC from HTML rather than restricting how content should be structured (the NavDoc approach in EPUB 3).

This would mean that:

the UA identifies which resource contains the TOC using a rel value (contents)
it attempts to locate role="doc-toc" in that resource and to use the algorithm to extract a TOC
if the UA managed to extract a TOC, it can display this information in its own UI
if it failed to extract a TOC, it can provide an affordance to jump directly to this resource (which gets rendered)
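The four steps above can be sketched as a small decision function. Everything here is an assumption made for illustration: the dictionary shape of the manifest (with `rel` given as a list), the `load_html` callback that returns the markup of a resource, the return values, and the requirement that the markup is well balanced (no omitted end tags), which keeps the depth counting simple.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TocLinkCollector(HTMLParser):
    """Collect hrefs of <a> elements inside the first role="doc-toc" element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._depth = 0      # > 0 while inside the doc-toc element
        self._done = False   # True once the first doc-toc element is closed

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get an end tag; ignore them
        attrs = dict(attrs)
        if self._depth > 0:
            self._depth += 1
            if tag == "a" and attrs.get("href"):
                self.hrefs.append(attrs["href"])
        elif not self._done and "doc-toc" in (attrs.get("role") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            self._done = True


def toc_affordance(manifest, load_html):
    """Decide how a UA could expose the TOC: own UI, jump link, or nothing."""
    links = manifest.get("readingOrder", []) + manifest.get("resources", [])
    url = next((l["url"] for l in links if "contents" in l.get("rel", [])), None)
    if url is None:
        return ("none", None)
    collector = TocLinkCollector()
    collector.feed(load_html(url))
    collector.close()
    if collector.hrefs:
        return ("native-ui", collector.hrefs)  # extraction succeeded
    return ("jump-to-resource", url)           # fall back to rendering it


# Hypothetical resources, for illustration only.
docs = {
    "nav.html": '<nav role="doc-toc"><ol><li><a href="c1.html">One</a></li>'
                '<li><a href="c2.html">Two</a></li></ol></nav>',
    "map.html": "<svg><text>Map</text></svg>",
}
print(toc_affordance({"resources": [{"url": "nav.html", "rel": ["contents"]}]},
                     docs.__getitem__))
```

With the SVG resource marked as `contents` instead, nothing is extracted and the function returns the jump-to-resource fallback, which matches the behaviour described for complex TOCs.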

In the examples listed by @deborahgu, this means that the author could author the WP as:

1. Single non-machine processable ToC

the resource containing these complex TOCs may or may not use the rel value (contents)
nothing gets extracted, which means that depending on the presence of the rel value, there might be an affordance to jump to these resources

2. Dual TOC approach

the complex TOC is not marked as such in readingOrder
an additional TOC is created by the author, it gets listed in resources with the proper rel value
the UA can extract a machine-processable TOC from it

From a UX perspective, that's barely different from what I've suggested before. It's mostly from a spec perspective that we take a different approach (specifying an algorithm + a single rel).

iherman commented 6 years ago

Continuing what @HadrienGardeur said: if the TOC is in SVG, the algorithm would (I presume) fail, in which case the UA would 'just' display the TOC. Which is of course fine, too.

I am o.k. with this approach (writing down the extraction algorithm which may fail for complex cases). Who takes the first shot at it?

HadrienGardeur commented 6 years ago

Quick side-note for @iherman :

we'll need to edit our WebIDL again to re-introduce the machine-processed TOC as well
but this doesn't replace the current element, since in some cases that's all we'll have
the current element should be updated to a PublicationLink (even if the TOC shows up in the entry page, we can reference the entry page + a fragment id with a PublicationLink)

iherman commented 6 years ago

@HadrienGardeur, sure. The algorithm to be defined should also go (I presume) to the collection of algorithms in the lifecycle section. But I would prefer to do that when this issue is closed and the lifecycle is finalized (see separate PR #318). Otherwise it becomes an editorial mess...

mattgarrish commented 6 years ago

Who takes the first shot at it?

I'm not necessarily offering to take the first shot at it, but it seems like we've been touching on what it needs to do already: extract all the anchor tags, excluding any that are the descendant of ancillary content (aside, others?), and attempt to construct a hierarchy based on positioning within an identifiable structure, when available (ol, ul, or role=list).

I can't picture an algorithm that would attempt to parse its way into the links, at any rate, as there are too many markup possibilities to consider.

Does that sound right?
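A rough sketch of that idea, under stated assumptions: only `ol` and `ul` nesting is used to derive levels (`role=list` is not handled), links inside `aside` are skipped, and the entry shape (`title`, `href`, `children`) and the function names are invented for the example. When no list structure is present, all links end up at the same level, which gives the flat fallback.

```python
from html.parser import HTMLParser


class ListTocParser(HTMLParser):
    """Record (list depth, href, text) for each link outside <aside>."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._list_depth = 0
        self._aside_depth = 0
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self._list_depth += 1
        elif tag == "aside":
            self._aside_depth += 1
        elif tag == "a" and self._aside_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("ol", "ul"):
            self._list_depth = max(0, self._list_depth - 1)
        elif tag == "aside":
            self._aside_depth = max(0, self._aside_depth - 1)
        elif tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.entries.append((self._list_depth, self._href, title))
            self._href = None


def build_tree(entries):
    """Turn (level, href, title) tuples into nested dictionaries."""
    root = {"children": []}
    stack = [(-1, root)]  # (level, node); the root sits below every level
    for level, href, title in entries:
        node = {"title": title, "href": href, "children": []}
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


def extract_list_toc(html):
    parser = ListTocParser()
    parser.feed(html)
    parser.close()
    return build_tree(parser.entries)


sample = (
    '<nav role="doc-toc"><ol>'
    '<li><a href="c1.html">One</a>'
    '<ol><li><a href="c1.html#a">A</a></li></ol></li>'
    '<li><a href="c2.html">Two</a></li>'
    '</ol><aside><a href="ad.html">Ad</a></aside></nav>'
)
toc = extract_list_toc(sample)
print([entry["title"] for entry in toc])
```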

iherman commented 6 years ago

@mattgarrish right. But that means, in fact, that the pure algorithm approach, ie, without any prior knowledge on the HTML structure, seems to be overly simplistic: get all the links, use them in document order and there you have a top level TOC. Attempting to construct a hierarchy using ol or ul or anything similar means... defining a TOC format, exactly what @rdeltour proposed to avoid doing.

I agree with @rdeltour that even if we define a format, à la EPUB3, we must have a clear algorithmic description of exactly what we do, to ensure interoperability. But I do not think we can avoid having such a format (maybe in a very general format like what @llemeurfr proposed, though I have misgivings about using microdata).

Can somebody prove me wrong? Please?

(I agree with @mattgarrish that using some micro parsing rules in the URL is to be avoided.)

iherman commented 6 years ago

B.t.w., following the approach of @llemeurfr but using the ARIA list and listitem may be a better alternative. Somebody with a knowledge of ARIA should tell us if there are hidden pitfalls using those...

mattgarrish commented 6 years ago

Attempting to construct a hierarchy using ol or ul or anything similar means... defining a TOC format, exactly what @rdeltour proposed to avoid doing.

I'm not suggesting we define a toc format, only that there are logical structures from which a hierarchy can be easily identified, namely lists. We have to leverage the markup that has been used. Without doing that, you're not going to get anything more than a flat table of contents, which would be the fallback when structure can't be identified.

As for ARIA, the danger is that people will make the regular markup a mess randomly sprinkling lists where there aren't lists. The author needs to understand that when they put role="list" on an element they're actually creating a list in the accessibility tree, so it can only have descendant role=listitem elements. Applying it to a table, like in the microdata proposal, would make a mess. You'd have lists and pieces of tables intermingled, making regular navigation difficult.

The problem I have with using microdata/RDFa is that isn't it an abuse of them? They aren't designed for extracting elements but generating graphs, or am I being overly literal in my interpretation? It seems like the right idea to use attributes to be non-invasive, but the wrong technology. It's one of the rare cases where a microformat tends to make more sense, but I thought we were trying to avoid any authoring demands?

iherman commented 6 years ago

I'm not suggesting we define a toc format, only that there are logical structures from which a hierarchy can be easily identified, namely lists. We have to leverage the markup that has been used. Without doing that, you're not going to get anything more than a flat table of contents, which would be the fallback when structure can't be identified.

Which does invalidate the example of @dauwhe: https://github.com/w3c/wpub/issues/291#issuecomment-416313251

I am not sure what the difference is between a toc format and "there are logical structures from which a hierarchy can be easily identified, namely lists"...

As for ARIA, the danger is that people will make the regular markup a mess randomly sprinkling lists where there aren't lists. The author needs to understand that when they put role="list" on an element they're actually creating a list in the accessibility tree, so it can only have descendant role=listitem elements. Applying it to a table, like in the microdata proposal, would make a mess. You'd have lists and pieces of tables intermingled, making regular navigation difficult.

You are probably right :-(

The problem I have with using microdata/RDFa is that isn't it an abuse of them? They aren't designed for extracting elements but generating graphs, or am I being overly literal in my interpretation?

Certainly for RDFa I agree with you.

Although there is a mapping from microdata to RDF, I do not know whether the SW community really considers that as valid. However, using schema.org terms for something like that may be inappropriate, because it would generate statements in a knowledge graph. If we used microdata, we would have to end up using our own vocabulary, meaning that we would probably be the only group on the globe using microdata for something other than schema.org:-)

A very liberal reading of the HTML spec may make it possible to use data-* attributes, because they are used for internal purposes only (ie, for the RS), but that is also a slippery slope. We probably shouldn't do that.

HadrienGardeur commented 6 years ago

I agree with @iherman that by defining an algorithm, we also implicitly define which HTML elements will work with our machine-processable TOC.

The key difference is that we won't have any kind of validation and we can also expect some UAs to probably go beyond what we define. I don't know if this is a bug or a feature, but the end result will be that some UAs are capable of extracting a machine-processable TOC from an HTML document, while others won't be.

As for microdata/RDFa/ARIA/data-* attributes to semantically enhance such a TOC, I'm against the idea of requiring anything more than role="doc-toc". If we require anything else, we might as well restrict the syntax like we did in EPUB3, since there wouldn't be any benefit from defining an algorithm rather than a syntax anymore.

mattgarrish commented 6 years ago

@iherman

Which does invalidate the example of @dauwhe: #291 (comment)

Sorry, I don't follow. What is invalid about extracting the links from that example and having a flat list because there isn't an identifiable hierarchy to them? You still get a table of contents.

I am not sure what the difference is between a toc format and "there are logical structures from which a hierarchy can be easily identified, namely lists"...

One requires structure, the other attempts to discover what structure it can.

@HadrienGardeur

we also implicitly define which HTML elements will work with our machine-processable TOC

Right, I don't think there's any way around this. If we want to retain a hierarchy, it has to be gleaned somehow. I mentioned lists as the easy sources, but if there's a way of making sense of tables, styled div/p/etc. all the better. I'm just sceptical what we can really glean from unstructured data.

iherman commented 6 years ago

Sorry, I don't follow. What is invalid about extracting the links from that example and having a flat list because there isn't an identifiable hierarchy to them? You still get a table of contents.

Wrong formulation on my part. I presume @dauwhe meant this example as a hierarchical TOC, and we wouldn't get it.

I am not sure what the difference is between a toc format and "there are logical structures from which a hierarchy can be easily identified, namely lists"...

One requires structure, the other attempts to discover what structure it can.

By defining/specifying what structure can be discovered, we do define a structure... Unless we set ourselves the probably impossible task of extracting a hierarchical TOC from any possible HTML content.

iherman commented 6 years ago

@mattgarrish

If we want to retain a hierarchy, it has to be gleaned somehow. I mentioned lists as the easy sources, but if there's a way of making sense of tables, styled div/p/etc. all the better. I'm just sceptical what we can really glean from unstructured data.

"sceptical" : I am happy to see that Canada has kept up with British-style understatements:-)

dauwhe commented 6 years ago

Wrong formulation on my part. I presume @dauwhe meant this example as a hierarchical TOC, and we wouldn't get it.

I thought that the simplest-possible algorithm (extract a elements in DOM order) would result in a somewhat-useful data structure.

iherman commented 6 years ago

@dauwhe @mattgarrish oops, sorry I misread your example. I thought it was more complex...

Before changing glasses:-) I thought the example was something like:

  <nav role="doc-toc">
    <h1>Contents.</h1>
    <h2>Stave One.</h2>
    <p><a href="chapter1.html">Marley’s Ghost</a></p>
    <h2>Stave Two.</h2>
    <p><a href="chapter2.html">The First of the Three Spirits</a></p>
    <h3>Stave two-and-half</h3>
    <p><a href="chapter2.5.html">something nice here</a></p>  
    <h2>Stave Three.</h2>
    <p><a href="chapter3.html">The Second of the Three Spirits</a></p> 
    <h2>Stave Four.</h2>
    <p><a href="chapter4.html">The Last of the Spirits</a></p> 
    <h2>Stave Five.</h2>
    <p><a href="chapter5.html">The End of it</a></p> 
  </nav>

Note the addition of a <h3> which, for a user, would mean a hierarchical TOC.
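One possible reading of such markup, sketched here only to make the difficulty concrete: each link takes the level of the most recent heading before it, and a link is nested under the closest preceding link with a smaller level. This interpretation, the entry shape, and the function names are assumptions of the sketch; nothing in the thread settles on them.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"h([1-6])")


class HeadingTocParser(HTMLParser):
    """Record (heading level in effect, href, text) for each link."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._level = 0  # level of the most recent h1-h6, 0 if none yet
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        match = HEADING.fullmatch(tag)
        if match:
            self._level = int(match.group(1))
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.entries.append((self._level, self._href, title))
            self._href = None


def build_tree(entries):
    """Nest each entry under the closest preceding entry of lower level."""
    root = {"children": []}
    stack = [(-1, root)]
    for level, href, title in entries:
        node = {"title": title, "href": href, "children": []}
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


def extract_heading_toc(html):
    parser = HeadingTocParser()
    parser.feed(html)
    parser.close()
    return build_tree(parser.entries)


SAMPLE = """
<nav role="doc-toc">
  <h1>Contents.</h1>
  <h2>Stave One.</h2>
  <p><a href="chapter1.html">Marley's Ghost</a></p>
  <h2>Stave Two.</h2>
  <p><a href="chapter2.html">The First of the Three Spirits</a></p>
  <h3>Stave two-and-half</h3>
  <p><a href="chapter2.5.html">something nice here</a></p>
  <h2>Stave Three.</h2>
  <p><a href="chapter3.html">The Second of the Three Spirits</a></p>
</nav>
"""
for entry in extract_heading_toc(SAMPLE):
    print(entry["href"], [child["href"] for child in entry["children"]])
```

Under this reading the link that follows the `<h3>` ends up as a child of the chapter 2 entry. Other readings are equally defensible, for example turning the headings themselves into unlinked section entries, which is part of why heading-based extraction is harder to pin down than nested lists.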

mattgarrish commented 6 years ago

I presume @dauwhe meant this example as a hierarchical TOC, and we wouldn't get it.

Ah, sorry, I didn't look closely at the example. I wasn't intending that lists were the only way to get structure; they're just what came to mind as I inquired about how this algorithm would work - the parens weren't intended to be comprehensive.

Headings are indeed another effective structuring element, so could also be used to partition/structure the toc. I don't have any issue with that.

mattgarrish commented 6 years ago

I thought that the simplest-possible algorithm (extract a elements in DOM order) would result in a somewhat-useful data structure.

I should have read all the way through, but I think this is always the end product when reliable structuring information isn't available. Implied sectioning via headings complicates things (versus being able to look straight up the ancestor chain of the element), but that's where the fun begins in terms of what can actually be extracted.

iherman commented 6 years ago

I would still like to see an algorithm that we could codify and that would be simple enough for such a spec. Retrieving hierarchy from a set of nested elements like ol/ul/li type elements is fundamentally different from retrieving hierarchy from the extra semantic knowledge of h1-h6 elements. Unless we can formulate such an algorithm properly I am a bit afraid we are back to square one...

@rdeltour @HadrienGardeur @dauwhe any idea how to move ahead?

iherman commented 6 years ago

To attempt answering my own question, a possible simplification of the two-toc approach (which, as shown by my own experimentation, is not simple) can be something like:

We define a simple format (something based on ul/ol/nav/li...) that can express a simple, widely used hierarchical TOC in HTML. We accompany this with a precise algorithm on how this format must be used to extract a TOC. (Alternatively, we define two alternatives, one based on h1-h6 elements to define a hierarchy)
The rules to find the TOC are those in the current draft
The general rule for the User agent is (in case a doc-toc element is found, that is):
1. If running the algorithm on the element yields a hierarchical TOC, use it for the UA's purposes like pop-up TOC or the like
2. If the algorithm fails, provide some user interface of the UA's choice to link to the doc-toc entry which is displayed as a run-of-the-mill HTML content

(This is an early morning sketch, probably a bunch of details are too vague.)

rdeltour commented 5 years ago

@rdeltour @HadrienGardeur @dauwhe any idea how to move ahead?

I'd definitely like to try to spend some time on sketching a rough algorithm (or helping anyone doing it), but I'm swamped in other projects right now, so I can't before a few days at least and no promise 😊

rdeltour commented 5 years ago

What I think would be tremendously helpful is to collect various real-world examples of TOC on the Web (web books, long stories, etc), to try and pave the cowpath of how web devs are coding TOCs in the real world. If we can pull off an algorithm that extracts reasonable data from these real-world TOCs, we'll go a long way towards something truly useful!

iherman commented 5 years ago

@rdeltour, just some very anecdotal examples, i.e., no systematic review:

Our own minutes [1]: the TOC is generated, from Markdown, by Jekyll, which is the main offline web site generator used on GitHub. It uses the <ul>/<li> hierarchy.
[2] is a typical Wikipedia page that uses a similar structure, although with lots of microformat-like class names, but I do not think that is relevant for us.
I have looked at the output of some random jQuery-based plugins via some pretty random collections like [3] or simply search results; not all the plugin pages are alive, but what I could see is that all follow the same <ul>/<li> structure encapsulated in, say, a <div> element. (Care should be taken that any TOC extraction happens on the DOM of the HTML content after loading, i.e., after these plugins execute, because their output is not part of the original source; this is something we have to specify in the algorithm.) On the other hand, one could look at these scripts as being part of the RS that extracts the TOC based on the <h1>-<h6> elements of the main content. The borderline is fuzzy.
Looked at an online scholarly paper[4] which has a TOC (tagged 'outline'), same <ul>/<li> enclosed in a <div>
An ACM proceedings paper in HTML[5] which does have a 'navigation' entry probably generated offline, and which is just a list of <a> elements with the hierarchies "built in" via spaces. But I would think that this is more in the category of (in our terminology) the reading system which probably extracted the TOC using the <h1>-<h6> elements.

mattgarrish commented 5 years ago

The one potential challenge is print-replica tables of contents. Opening the first book I saw on gutenberg resulted in this: https://www.gutenberg.org/files/57803/57803-h/57803-h.htm#CONTENTS

I see table markup used for their books quite often, at least in the HTML versions.

I don't know that print-replicas should be our concern, though. They also create nav docs for their EPUB versions, so we shouldn't worry about every possible case as that path will lead to madness.

dauwhe commented 5 years ago

I think the first step is to write some code and see how it works in the world. I already have JS that just produces a flat list from a random TOC. Getting it to recognize hierarchy is next. But I wouldn't even try to re-implement the outlining algo.

dauwhe commented 5 years ago

The naive flat version:

 // Collect every link inside the doc-toc nav into a flat list.
 const links = document.querySelectorAll("nav[role='doc-toc'] a");
 const spine = [];
 for (const link of links) {
   spine.push({
     href: link.href,
     text: link.innerHTML
   });
 }

Would be fun to see how much hierarchy you could get just from how deep in the DOM each a element is.
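
A rough sketch of that idea, under the assumption that "depth" means the number of ol/ul ancestors between each link and the TOC root. The helper names `listDepth` and `flatTocWithDepth` are made up for this example, and it only relies on the standard `parentNode`, `tagName`, `href` and `textContent` properties.

```javascript
// Count the ol/ul ancestors that sit between a node and the TOC root.
function listDepth(node, root) {
  let depth = 0;
  for (let p = node.parentNode; p && p !== root; p = p.parentNode) {
    const tag = (p.tagName || "").toLowerCase();
    if (tag === "ol" || tag === "ul") depth += 1;
  }
  return depth;
}

// Same flat list as before, with a depth hint attached to every entry.
function flatTocWithDepth(root, links) {
  return Array.from(links, (link) => ({
    href: link.href,
    text: link.textContent,
    depth: listDepth(link, root),
  }));
}
```

In a page this could be called as `flatTocWithDepth(nav, nav.querySelectorAll("a"))`, where `nav` is the `nav[role='doc-toc']` element. Links that are not inside any list come out with depth 0.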

ghost commented 5 years ago

Thanks for the snippet.

The problems of this approach in real world (which we ran into sometimes) are:

The creator may want a TOC with a tree layout of chapters/sections, and the current approach ignores that hierarchy.
Sometimes the creator adds a TOC item purely for layout purposes (one that does not have an "a" element), and the current approach ignores it as well.

llemeurfr commented 5 years ago

I see that this discussion was held without any mention of the HTML5 Document Outline Algorithm. This is described and criticized in many articles on the web (intro here, Mozilla article here etc.) and if the single HTML ToC + algorithm route is followed, the issues the Web community met with this algorithm should be studied in order to specify a "better" one in the WP context.

Some will say that we shouldn't try to do things differently from what the W3C has already done. But the problem is that the HTML5 document outline has issues that are still not solved (e.g. Document Outline Dilemma).

baldurbjarnason commented 5 years ago

There are a few key differences between what (I think) is being proposed in this thread (parse for <a> tags in a hierarchy of <li> tags) and the HTML5 Document Outline Algorithm:

The HTML5 Outline Algorithm was a completely new invention, not based on a pre-existing or commonly adopted pattern, whereas nested a and li elements are what most ToC generation tools output on the web at the moment.
The Outline Algorithm had no realistic adoption pathway from here (how things are now) to there (a brave new world with HTML5 outlines everywhere). Adding support for it in Accessibility Tech, AFAICT, would have meant having to maintain two outline algorithms (HTML5 + heading levels) in perpetuity and a lot of end user confusion.

A few things would make a new ToC generation outline thingamabob much easier to handle than EPUB3 Nav from a content author's perspective:

Support both ul and ol
Be agnostic about what each li actually contains. Just look for a and li descendants.
Use aria-label/aria-labelledby if they are there on nested ul/ol elements to create unlinked sub-sections instead of the hard-coded li > span + ol structure that EPUB3 requires.

These changes would let us use many pre-existing ToC generators which we can't use for EPUB3.

Pretty sure there is some way of specifying this so that both existing EPUB3 nav files and commonly used ToC patterns on the web generate useful ToCs using a single algorithm going forward.
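
A minimal sketch of what such an algorithm might look like, to make the three points above concrete. To keep it self-contained it runs on a simplified tree of `{ tag, attrs, text, children }` objects rather than on real DOM nodes; that shape and the names `nearest`, `textOf` and `extractToc` are assumptions for this example. Only aria-label is handled; aria-labelledby would need an id lookup and is left out.

```javascript
const isList = (n) => n.tag === "ol" || n.tag === "ul";

// Nearest descendants matching `pred`, without descending into nodes
// that match `stop` (or into the matches themselves).
function nearest(node, pred, stop) {
  const hits = [];
  for (const child of node.children || []) {
    if (pred(child)) hits.push(child);
    else if (!stop(child)) hits.push(...nearest(child, pred, stop));
  }
  return hits;
}

// Concatenated text content of a node in the simplified tree.
function textOf(node) {
  if (typeof node.text === "string") return node.text;
  return (node.children || []).map(textOf).join("");
}

function extractToc(list) {
  // Accept both ul and ol; li wrappers such as div are simply skipped over.
  const items = nearest(list, (n) => n.tag === "li", isList);
  return items.map((li) => {
    // First link of the item that is not part of a nested list.
    const [link] = nearest(li, (n) => n.tag === "a", isList);
    // First nested list of the item, if any.
    const [sub] = nearest(li, isList, (n) => n.tag === "li");
    return {
      // Unlinked sub-section: fall back to the nested list's aria-label.
      label: link ? textOf(link).trim() : (sub?.attrs?.["aria-label"] ?? null),
      href: link?.attrs?.href ?? null,
      children: sub ? extractToc(sub) : [],
    };
  });
}
```

An li with neither a link nor a labelled nested list comes out with a null label; whether that should count as a failure of the algorithm is left open here.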

rdeltour commented 5 years ago

I agree with @baldurbjarnason. For the record, the outline algorithm was briefly mentioned on our call of August 27. What I said then was that invoking its complexity and absence of implementation was mostly a red herring for our ToC issue: one of the biggest reasons why the outline algorithm wasn't implemented is the new semantics given to h1-h6 elements ("h1 everywhere and section-defined heading levels"), which was quite far away from the "paving the cowpath" approach, and was very problematic for accessibility… we're in a very different problem space here.

Of course, I also think that ensuring that RS/UA are interested and willing to implement such a solution is important.

iherman commented 5 years ago

This issue was discussed in a meeting.

No actions or resolutions
Garth Conboy: Last week, Monday, there was no meeting due to US Labor Day. There was discussion on the call on issue #291 - we got to a semi-consensus that we’d have two - one for referencing and one for machine processing. Those, not including me, hung on the call and that rough consensus evaporated. There has been more discussion…
… the most recent was that this was a proposal that Ivan put together on having a single point of context that could be used to process the renderable one into something that is more machine readable. And getting hierarchy around the lists. That’s sort of where we are…
Garth Conboy: https://github.com/w3c/wpub/issues/291#issuecomment-417554685
Ivan Herman: In some sense, I think we’re in a deadlock. One approach is to have a clear algorithm in the spec that says “this is how - from this HTML over there - the hierarchical menu is created” - which is fine. For the time being, we don’t have that algorithm. My personal belief is that just having an algorithm without some sort of definition of what the toc looks like would be incredibly difficult. If we define some sort of structure then the algorithm should be put in the spec. Then my proposal goes that the way it would operate from the user is that the algorithm would then be the toc; if the algorithm fails, the user agent should simply play the HTML content that is there.
Dave Cramer: I’m an advocate for ‘define an algorithm’ but I’m not a JS expert, although I’m sure we can create an algorithm that creates the hierarchy. I would really like to see what such an algorithm would do to some real-life table of contents before we go much further…
Ivan Herman: Having a JavaScript algorithm that could take any HTML would be terribly complicated and would be about the organization of the DOM… I looked at some examples that are more anecdotal about what Table of Contents are out there, I looked at ones that were generated and not human-created. I think there was only one exception - most are essentially a NAV or a UL or OL with the LI as the children, which is the natural thing to do with a table of contents. The other possibility is to use H1, H2, H3 to specify hierarchy.
Jeff Buehler: I wrote a number of scripts (node JS) that parse out TOCs and create TOCs, generally from spreadsheets and CSVs. There’s some relationship there that might be helpful. It generates TOCs from CSVs.
… The hierarchy is pretty brute force - I’m not using DOM parsing, I’m using regular expressions. There are some tags required for specific elements, but it might be helpful for anyone working on this. It’s been tested over a few years, so it definitely works.
Garth Conboy: I tend to agree with ivan that I am very much in the camp that one of the successful things we did in epub was the NAV file, but it doesn’t sound that view of the world is going to get traction here, but we might revisit in epub4 land. I don’t see any other way forward other than accepting ivan’s suggestion that we’ll have a placeholder for, but there doesn’t seem to be consensus support to have 2 TOCs.
… It seems to me that is where we are. But maybe we do a brief poll for consensus for accepting Ivan’s language with an algorithm still to be defined and experimented with.
Ivan Herman: I think we need to modify a bit - we should leave it open and give ourselves a certain amount of time to see if such an algorithm can be created. If it cannot be created, then we need a combination of what’s in epub3, making a reference to this issue - that for me seems to be the best option.
Garth Conboy: The best thing you said is that come back in a month - which happens to be TPAC - so that could be an interesting approach.
Matt Garrish: I was going to agree with Ivan - I am not of the belief that completely random HTML will become structured TOC. The middle ground we might find here is that there is a recommended way of doing a TOC - so YMMV if you don’t go with the recommended structure for TOC…
Joshua Pyle: +1 to recommended structure for TOCS
Matt Garrish: I want to sit on the fence that it’s required of epub but not necessarily epub, and just note there are better ways to do it.
George Kerscher: Would it be helpful for us to gather a collection of recommendations on how to do it? For example, if you’ve got a collection of content documents - how to traverse those files and extract the things you want to put in the TOC?
Garth Conboy: That sort of strikes me as - building a TOC from whole cloth - this discussion is about how free can we be with an actual doc-TOC but without looking at it from the reading order
George Kerscher: I don’t know how this will be created. I’ve seen TOCs where the names in the doc-toc do not align with the content in your document.
Garth Conboy: We do have a requirement in the current spec that there must be a TOC identified, and it must point to an element identified as doc-toc. If it’s wrong and points to things that don’t exist, that’s the document’s fault.
Dave Cramer: The key here is that it’s the author’s responsibility to create a TOC. They know how to best facilitate access to their content. We’re spending time assuming the User Agent needs to do anything beyond displaying the document’s table of contents. I’m not sure how this fits into the web publication world, if we take the word web seriously. It’s not standard for a UA to search for nav elements.
Garth Conboy: I think it’s very arguable on either side on the WP side. When we get to epub4, we’ll want the reading system to be able to do things with the TOC, but let’s not argue that thing now.
Ivan Herman: I want to remind everyone that the TOC is not a required element. If the author doesn’t want the User Agent to do something, they won’t create a DOC-TOC element. There is no contradiction. There is no problem with a creator doing what they want.
Garth Conboy: https://github.com/w3c/wpub/issues/291#issuecomment-417554685
Garth Conboy: I would like to propose consensus on Ivan’s comments, but proposes an algorithm and/or set of constraints we might place on the table of contents if the algorithm is successful - take a few weeks to TPAC, and if we’ve gotten somewhere…
Garth Conboy: +1
Garth Conboy: We would have some constraints, and if the algorithm is not successful then just display the HTML. If that is acceptable to folks, I will go ahead and be the token +1
Jeff Buehler: +1
Ivan Herman: +1
Joshua Pyle: +1
Ric Wright: +1
Zheng Xu: +1
Juan Corona: +1
Marisa DeMeglio: +1
Nick Ruffilo: +1
Dave Cramer: +1
