Closed iherman closed 5 years ago
The question was discussed in issue #291, and there was no consensus. Some relevant comments, and other issues:
Tables of contents are designed to be presented to end users. Over the last thousand or so years of print publishing, and the last few decades of web publishing, we have developed certain techniques to help authors express themselves. At the very simplest level, these include things like italics, superscripts and subscripts, bold text, etcetera. HTML handles these things effortlessly.
But what about JSON? It is truly common for a TOC entry to contain an italic phrase. How would we express that in a JSON TOC? The solution I've seen is allowing embedded HTML in the JSON:
"toc": [
  {
    "url": "http://www.example.com/part1/index.html",
    "name": "Part 1",
    "children": [
      {
        "url": "http://www.example.com/ch1/index.html",
        "name": "The Building of the <i>Titanic</i>"
      }]
...
This was a major issue with EPUB's NCX.
It is truly common for a TOC entry to contain an italic phrase. How would we express that in a JSON TOC?
It's worth pointing out that in EPUB, most UAs sanitize strings extracted from the HTML Navigation Document. When the Navigation Document is not directly rendered (for example in the UI of an app, even a Web App), all of these tags are ignored.
I don't see any reason why this would be any different with WP: most EPUB UAs are already built on top of a webview.
Moving away from NCX did not solve that problem with EPUB3 UAs.
@dauwhe if I understand well, this issue isn't about the visual TOC. It's about the TOC when it has to be in the manifest; there are use cases that need it, for instance to achieve accessibility. BTW, this was one of the goals of NCX.
@dauwhe if I understand well, this issue isn't about the visual TOC. It's about the TOC when it has to be in the manifest; there are use cases that need it, for instance to achieve accessibility. BTW, this was one of the goals of NCX.
My understanding is that the "machine-readable" TOC is made available to the user, but the manner of presentation is controlled by the user agent and not the author, much the way the NCX worked.
I think it's important that this TOC still support some inline styling, as it does convey semantic meaning which should be available to all users. If many current EPUB reading systems strip out inline elements from the nav, I think that is a bug and not a feature. We should not limit the ability of document authors to express information, and we should not prevent user agents from displaying richer information. File formats should adapt to human needs, rather than humans adapting to the limitations of file formats.
The interest for a JSON ToC comes mainly from the Audiopub TF, as it may well be that audiobook publishers will be worried if they have to create HTML documents as ToC. A simple authoring tool would help them add metadata + a simple hierarchical ToC to their work (with links to audio fragments).
If many current EPUB reading systems strip out inline elements from the nav, I think that is a bug and not a feature.
There's a good reason for them to do that:
Sure, you can just blame it all on reading systems, but understanding why they sanitize strings is IMO quite a bit more constructive.
Just for the sake of curiosity, are you aware of any RS that can display such tags @dauwhe?
It seems that even Edge is sanitizing these strings and stripping HTML out of it for its own UI.
This is something that will also be useful to test in the EPUB CG as part of testing support for EPUB 3.2 across RS.
I think audiobooks can be fine with pulling the TOC from HTML. One approach is better than two.
Just for the sake of curiosity, are you aware of any RS that can display such tags @dauwhe?
I made a test. AZARDI preserves italic. iBooks, Kobo, Google Play, and Kindle/Mac strip out the italic.
There's a good reason for them to do that:
- they can't trust the content, which means that they need to white list tags and sanitize strings
In the case of EPUB, many people also put the nav doc in the spine, and so the reading system is obligated to display it. Are you saying that HTML that undergoes further processing requires a higher level of trust? It certainly makes sense to strip out JS here:
<li><a href="chapter2.html" onclick=alert(9)>The Building of the <i>Titanic</i></a></li>
But I don't understand the security risk posed by ordinary HTML phrasing content like i, em, sup, etc.
- these styles can interfere with their own (for Web Apps) or can't be easily rendered (for native apps)
I expect most text rendering facilities used by apps of any sort can handle some simple things like italic. And if the web app has complete control over the presentation of this content, how is there a conflict?
Sure, you can just blame it all on reading systems, but understanding why they sanitize strings is IMO quite a bit more constructive.
What I'm trying to express is a use case: it is much easier for readers to understand certain kinds of text when basic inline formatting, of the type supported by HTML, is available. And I'm concerned that this possibility will be absent in a JSON table of contents.
Our goal as a working group is not to make something that works exactly like EPUB. Our goal should be to make something better—something that better serves the needs of end users. Having an italic word in a TOC entry is admittedly a small thing, but I think it's a real benefit to readers, and I am not yet convinced it's so difficult that the burden on implementors outweighs the benefit to end users.
Thanks, @dauwhe. I could provide several examples of scholarly publications that include math in TOC heads. They don't make much sense in EPUB. We hack around it badly.
I would just caution that EPUB was intended to serve a wide variety of reading systems. An auditory system, for example, would only extract/use text labels. A low-power text reader similarly isn't going to manage more than simple display of the text content. For others, implementing a subset of HTML within their tree views is not as simple as just demanding it be done. The more rigidly we say what a reading system has to do, the more difficult we make it for these to conform. It may be something we don't care about for WP, but apples and oranges are getting compared at times in this thread.
I'm all for giving the user agent the choice to use the descendant HTML of an a tag, but ruling out a user agent from generating a text-only label strikes me as a bad idea.
Also, the primary motivator for moving to the nav doc, beyond the complexity of the ncx, was to improve support for internationalization: http://github.com/w3c/publ-epub-revision/blob/wiki/Navigation.md
It might be good to expand the discussion beyond what North American/European reading systems make use of. If JSON can't support ruby, and ruby is critical for readers in Asia, then we definitely aren't making progress.
In the case of EPUB, many people also put the nav doc in the spine, and so the reading system is obligated to display it.
And that use case is perfectly fine. If you want to render a table of contents, there's no argument that HTML is the best option.
Are you saying that HTML that undergoes further processing requires a higher level of trust?
It actually does in many cases.
But I don't understand the security risk posed by ordinary HTML phrasing content like i, em, sup, etc.
That's exactly why white-listing would be the usual and necessary best practice, but it's just plain easier for RS to simply remove it all and only extract plain text from the navigation document instead.
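To make the white-listing option concrete, here is a minimal sketch using only Python's standard library. The allow-list, the `LabelSanitizer` class and the `sanitize_label` function are hypothetical names chosen for illustration; no reading system is known to work exactly this way. It keeps a few phrasing tags, drops all attributes (including `onclick`), and discards everything else.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list of phrasing elements; not taken from any spec.
ALLOWED_TAGS = {"i", "em", "b", "strong", "sub", "sup", "ruby", "rt", "rp"}
# Elements whose text content is dropped entirely.
SKIPPED_TAGS = {"script", "style"}


class LabelSanitizer(HTMLParser):
    """Keep allow-listed tags (without attributes) and escaped text; drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip += 1
        elif tag in ALLOWED_TAGS:
            # Attributes such as onclick are never copied to the output.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(escape(data))


def sanitize_label(fragment: str) -> str:
    """Return the fragment reduced to allow-listed tags and escaped text."""
    parser = LabelSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)
```

Applied to the `onclick` example quoted earlier in this thread, the link wrapper and its handler disappear while the italic survives. A real implementation would also need to balance unclosed tags, which this sketch does not attempt.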
If JSON can't support ruby, and ruby is critical for readers in Asia, then we definitely aren't making progress.
I suspect that the situation is the same as with other tags and that even in Asia, most RS simply extract plain text.
I am afraid we are engaging in a discussion which is not the subject of the current issue, and we may also be reopening the discussions we had in issue #291 on the "visual" TOC. We should try to avoid doing so.
Whether the TOC, as expressed in the draft, is used by the User Agent as is (i.e., just displaying the HTML), whether it is used to extract an internal data structure along the lines of the extraction algorithm and do what "traditional" user agents do, or anything in between is currently left to the User Agent. This is not under discussion in this issue. (If we want to discuss this, let us open a separate issue.)
B.t.w., looking at the details of the extraction algorithm, the labels for links are extracted through the accessible name. This means that, again in the current algorithm, HTML tags will be stripped, but the content will take into account accessibility features such as ARIA labels. This is good for accessibility (probably better than the EPUB 3.2 version) but may not be good for the types of effects that @dauwhe was talking about. If that detail of the algorithm must be re-discussed (e.g., by requiring the extraction of HTML text rather than the accessible name), let us open a new issue.
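As a rough illustration of what "extracting the label through the accessible name" implies, here is a heavily simplified sketch in Python. It is an assumption-laden stand-in, not the draft's algorithm: the real accessible name computation has many more steps. Here an `aria-label` on the link simply wins, and otherwise the text content is used with all tags stripped. `LinkLabelExtractor` and `link_label` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkLabelExtractor(HTMLParser):
    """Collect a plain-text label for the first <a> element in a fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.inside = False     # True while inside the first <a>
        self.done = False       # True once the first <a> has been closed
        self.aria_label = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.inside and not self.done:
            self.inside = True
            self.aria_label = dict(attrs).get("aria-label")

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        # Inline tags such as <i> are ignored; only their text is kept.
        if self.inside:
            self.text.append(data)


def link_label(fragment: str) -> str:
    """Simplified label: aria-label if present and non-empty, else text content."""
    parser = LinkLabelExtractor()
    parser.feed(fragment)
    parser.close()
    if parser.aria_label and parser.aria_label.strip():
        return " ".join(parser.aria_label.split())
    return " ".join("".join(parser.text).split())
```

With this simplification, a link containing "The Building of the" followed by an italic "Titanic" yields the plain string without the italic, which is the effect described above.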
The only question in this issue is whether authors MAY (not MUST) fill a JSON-LD manifest entry directly with a data structure that is to be extracted by the extraction algorithm.
We should try to concentrate on this and only this issue here...
I don't think the discussion was entirely out of scope either @iherman.
@dauwhe pointed out the lack of support for italics, superscript and similar tags as a major issue with the JSON approach.
I've pointed out that in practice, for a number of reasons, these tags are not extracted by UAs. As you've pointed out as well, they wouldn't be supported by our current take on the extraction algorithm either.
Now that this is out of the way, I think there are a number of pros and cons to saying that a manifest MAY contain a ToC in JSON-LD.
Pros
Cons
This issue was discussed in a meeting.
Garth Conboy: I’m fine with HTML and I’m also fine with only HTML, just because my take is that any reading system taking this packaged audiobook is likely to also be taking EPUB
@GarthConboy I think that's not a correct assumption. There are a lot of "audiobooks only" reading systems available and there are also dedicated audio devices (including smart speakers). Neither of them currently support EPUB or HTML.
Ivan Herman: I have a question to various reading system implementers: at the moment, the TOC is defined to be in HTML and we’ve spent an inordinate amount of time defining the format in HTML… … my favorite option is to say that that’s where we stop, realizing that this means reading systems must be able to parse an HTML file, extract the TOC out of it, even if it doesn’t use any styling… … what I have difficulty judging, is it really such a huge deal for reading systems, knowing that these days taking a public domain HTML parsing library and running it to extract the TOC is really not such a huge deal…
@iherman with the same approach as EPUB, it's not a big deal. But if any HTML is allowed, it becomes quite difficult to achieve properly.
Brady Duga: One of the uses for HTML: you could have ruby for eg Japanese chapter titles. This would be hard in JSON… … you can recreate the properties of HTML in JSON if you want to, but it’s hard…
That's the theory, but in practice I think that @dauwhe has only been able to identify one reading system capable of handling any kind of markup in its TOC and everywhere else only plain text is extracted.
Working with markup in any native UI element is difficult and most product owners working on reading apps would be against it anyway.
Here's a good example: https://twitter.com/micahsb/status/1093657329592033280
This issue was discussed in a meeting.
The choice seems to revolve around:
sol 1: HTML ToC only, with a precise structure and extraction algorithm. -> i18n friendly, optionally styled, but impossible to validate. If the HTML does not follow the rules, the ToC will be unusable for many UAs (those which don't display the HTML as-is, but rather sanitize the content and extract a simple string based structure for native display). To process the HTML structure, the UA has to load the DOM first (processing the serialized HTML would be a nightmare).
sol 2: HTML ToC (still highly structured) with a JSON fallback. -> The JSON structure is not styled, and has some limitations relative to mixed languages, but it is easy to validate and easy for a UA to process. If the HTML ToC is present, the UA will use it (see sol 1). If not, the JSON structure will be used instead. A UA which intends to present an HTML ToC will have to transform JSON to HTML first, using its own styling rules.
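For sol 2, the JSON-to-HTML step could look roughly like the sketch below. It assumes the `name` / `url` / `children` keys from the JSON example earlier in this thread (they are not defined in any draft) and treats `name` as plain text, so any markup inside it is escaped rather than rendered. `toc_to_html` is a hypothetical helper.

```python
from html import escape


def toc_to_html(entries):
    """Render a list of {"name", "url", "children"} dicts as a nested <ol>."""
    items = []
    for entry in entries:
        # Names are treated as plain text, so embedded markup is escaped.
        label = escape(entry["name"])
        url = entry.get("url")
        if url:
            link = f'<a href="{escape(url, quote=True)}">{label}</a>'
        else:
            link = f"<span>{label}</span>"
        children = entry.get("children") or []
        nested = toc_to_html(children) if children else ""
        items.append(f"<li>{link}{nested}</li>")
    return "<ol>" + "".join(items) + "</ol>"
```

The UA would then apply its own styling to the generated list. Note that this direction loses nothing, whereas going from styled HTML to plain JSON strings does.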
Whatever the solution is, Audiobook publishers will need an interactive tool to create a ToC out of a friendly UX (a ToC generator). Therefore I don't really see why Audiobook publishers should prefer one solution over the other. For UAs, my personal take is that because the UA must be able to process the HTML ToC, it's less of a burden to have one use case only, i.e. HTML only (as JSON only is not on the table anymore).
Proposal: Restricted HTML as described in current draft
This issue was discussed in a meeting.
RESOLVED: the TOC is encoded using the restricted HTML as defined in the WPUB spec, and that is the only way it can be done
The current (2018-12-08) draft includes a detailed algorithm for the retrieval of a TOC from HTML. The question is whether it should also be possible to add the TOC directly into the manifest (i.e., bypassing the HTML) using the same data structure as produced by that algorithm.
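To illustrate what "the same data structure as produced by that algorithm" might mean in practice, here is a simplified sketch of turning nested `li` / `a` markup into a tree of entries. It is not the draft's extraction algorithm: it assumes well-formed input, at most one link per list item, and the hypothetical `name` / `url` / `children` keys used in the JSON example earlier in this thread.

```python
from html.parser import HTMLParser


class TocExtractor(HTMLParser):
    """Build a tree of {"name", "url", "children"} entries from nested lists."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = {"children": []}
        self.stack = [self.root]  # entries for the currently open <li> elements
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            entry = {"name": "", "url": None, "children": []}
            self.stack[-1]["children"].append(entry)
            self.stack.append(entry)
        elif tag == "a" and len(self.stack) > 1:
            self.stack[-1]["url"] = dict(attrs).get("href")
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "li" and len(self.stack) > 1:
            self.stack.pop()
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only link text contributes to the label; inline tags are dropped.
        if self.in_link:
            self.stack[-1]["name"] += data


def extract_toc(nav_html: str):
    """Return the list of top-level TOC entries found in the markup."""
    parser = TocExtractor()
    parser.feed(nav_html)
    parser.close()
    return parser.root["children"]
```

Under these assumptions, the result can be serialized with `json.dumps` and would be the kind of structure an author might place in the manifest directly.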