w3c / wpub

W3C Web Publications
https://w3c.github.io/wpub/

Add content and processing requirements for machine-readable toc #371

Closed mattgarrish closed 5 years ago

mattgarrish commented 5 years ago

This PR contains an adaptation of the EPUB toc rules, but simplified in the following ways:

Feedback is, as always, welcome from everyone on these changes.



iherman commented 5 years ago

Just a note: as recorded in the minutes of the 26th of Nov., the question of whether a TOC in JSON should be added to WP will be considered separately. This PR only covers the resolutions of that meeting.

llemeurfr commented 5 years ago

Relaxing the ol/ul requirement is IMO the most useful change; we have heard about many EPUB files with this issue.

The other evolutions seem ok also, but I have an issue with backward compatibility with EPUB: the WP spec constrains processing a without href instead of a top heading (h*) and spans. It means that EPUB nav docs containing such headings will not be entirely processed (the content of h* and span will not be processed).

Re. details, in the ol/ul spec I see "in this order" -> li (one elt type only), which is not useful.

mattgarrish commented 5 years ago

> the WP spec constrains processing a without href instead of a top heading (h*) and spans

You can still provide a top-level heading. That's mentioned both in the content model and in the processing (see the first bullet about obtaining a title for the toc).

But spans are problematic to keep if we're loosening the content model and allowing any other tagging to be included. A span could appear for any number of reasons not related to the text label to use, while an a tag is always the link, or possible link.

It's also technically not incompatible with EPUB, as you can already include an a without an href instead of a span. So content can be back-translated without any issues if you strictly follow the content model. You're right that if you want to create a wpub out of an EPUB 3 you'll have to make some changes, but that's true regardless. There's still a reasonable path forward: you just have to transform the span tags to a.
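To illustrate the span-to-a transform described above, here is a minimal sketch. It assumes the EPUB navigation document is well-formed XHTML and that only span elements sitting directly inside an li are labels; the function name `spans_to_anchors` and the sample markup are hypothetical and are not taken from the PR.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize XHTML elements without a prefix (global side effect on ElementTree).
ET.register_namespace("", XHTML_NS)


def spans_to_anchors(nav_xhtml: str) -> str:
    """Rename each <span> that is a direct child of an <li> to <a> (no href added)."""
    root = ET.fromstring(nav_xhtml)
    for li in root.iter(f"{{{XHTML_NS}}}li"):
        for child in li:
            if child.tag == f"{{{XHTML_NS}}}span":
                # Only the element name changes; text and attributes are kept.
                child.tag = f"{{{XHTML_NS}}}a"
    return ET.tostring(root, encoding="unicode")


# Hypothetical nav fragment with a span used as an unlinked label.
sample = (
    '<nav xmlns="http://www.w3.org/1999/xhtml"><ol>'
    '<li><span>Part One</span>'
    '<ol><li><a href="c1.xhtml">Chapter 1</a></li></ol></li>'
    '</ol></nav>'
)
result = spans_to_anchors(sample)
print(result)
```

Spans nested deeper inside an a (for styling, say) are left alone, which matches the point that a span can appear for reasons unrelated to the label.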

mattgarrish commented 5 years ago

> Re. details, in the ol/ul spec I see "in this order" -> li (one elt type only), which is not useful.

Ya, that's odd. It's in the EPUB spec that way, but I can strip "in this order".

llemeurfr commented 5 years ago

@mattgarrish ok for the first issue (top-level heading), I missed it when reading.

For the second point, I agree that a simple transform (span to a) makes a round trip between EPUB and WP possible. It must be noted that when going from WP to EPUB 3, the extra / non-processable HTML information will have to be removed from the markup.

Therefore I'm ready to approve the PR.

mattgarrish commented 5 years ago

I'll see what I can do to clean up the processing @rdeltour. Then maybe we can get the group to decide on whether to keep the content model stuff or just reduce it to explanation and examples.

avneeshsingh commented 5 years ago

Curious to know the use case for providing anchor without meaningful href.

mattgarrish commented 5 years ago

> Curious to know the use case for providing anchor without meaningful href.

The two probably most relevant for publications:

There are other cases that aren't particularly relevant to a machine-readable table of contents, like removing the href from the link to the page the user is currently on so that they aren't presented with a redundant link.

danielweck commented 5 years ago

The "Children's Literature" EPUB3 sample has such a Navigation Document, with some list items used as "containers" (each with a span "heading") for sub-lists of items: https://github.com/IDPF/epub3-samples/blob/master/30/childrens-literature/EPUB/nav.xhtml#L24

TzviyaSiegman commented 5 years ago

@rdeltour made excellent suggestions. This looks really good.

mattgarrish commented 5 years ago

Sorry, I should have mentioned that we've been working on a complete overhaul to use a model like the one defined for the outline algorithm (not to compile a toc from headings or anything like that, but similar in terms of walking over the nodes of the toc nav to extract what is needed).

I have that plus a working model of the algorithm almost ready for review, so will hopefully be able to update this PR either tonight or tomorrow.

The revisions are ongoing in the toc-algo branch, for those interested, but will be merged into this branch: https://cdn.staticaly.com/gh/w3c/wpub/toc-algo/index.html?env=dev#app-toc-ua

mattgarrish commented 5 years ago

Further to my last comment, I've now pushed the new algorithm for extracting the table of contents: https://cdn.staticaly.com/gh/w3c/wpub/machine-processable-toc/?env=dev#app-toc-ua

The working implementation of this model is at: https://cdn.staticaly.com/gh/w3c/wpub/machine-processable-toc/experiments/toc_generator/ (The source is annotated to help match up the JavaScript to the specification.)

At this point, what we'd like to get is feedback on the technical approach, including if there are any obvious bugs in the algorithm. (Especially from anyone who has experience in this kind of extraction, if it needs saying. @danielweck?)

Content issues, like the structure of the table of contents, can be taken up later. This update is just a new implementation of the basic list structure parsing we've already agreed on.
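For readers who want a feel for the node-walking approach, here is a rough sketch. It is not the algorithm in the draft or the linked JavaScript implementation; it only assumes what the thread states: lists may be ol or ul, each li is an entry, the a element supplies the label and optional link, and other markup is ignored. The class name `TocExtractor` and the `name`/`url`/`entries` keys are hypothetical, and the toc title from the heading is skipped.

```python
from html.parser import HTMLParser


class TocExtractor(HTMLParser):
    """Walk the toc markup and build nested entries from ol/ul, li, and a."""

    def __init__(self):
        super().__init__()
        self.toc = []        # top-level entries
        self._lists = []     # stack of entry lists currently being filled
        self._items = []     # stack of open <li> entries
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            # A list inside an open <li> holds that entry's children.
            target = self._items[-1]["entries"] if self._items else self.toc
            self._lists.append(target)
        elif tag == "li" and self._lists:
            entry = {"name": "", "url": None, "entries": []}
            self._lists[-1].append(entry)
            self._items.append(entry)
        elif tag == "a" and self._items and not self._items[-1]["name"]:
            # The first <a> in an entry is its label; href may be absent.
            self._in_anchor = True
            self._items[-1]["url"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self._lists:
            self._lists.pop()
        elif tag == "li" and self._items:
            self._items.pop()
        elif tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Text outside an <a> (headings, spans, stray markup) is ignored.
        if self._in_anchor and self._items:
            self._items[-1]["name"] += data


# Hypothetical toc markup for illustration.
markup = """
<nav>
  <h2>Contents</h2>
  <ol>
    <li><a>Part One</a>
      <ul>
        <li><a href="c1.html">Chapter 1</a></li>
        <li><a href="c2.html">Chapter 2</a></li>
      </ul>
    </li>
    <li><a href="back.html">Afterword</a></li>
  </ol>
</nav>
"""
parser = TocExtractor()
parser.feed(markup)
toc = parser.toc
```

Here `toc[0]` is an unlinked "Part One" entry with two children, showing how an a without an href can act as a container label.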

iherman commented 5 years ago

(For some reason, the 'preview' of the original PR has not been updated with the new commit. Look at the date of the draft; it should say 6th of December!)

mattgarrish commented 5 years ago

Hm, looks like I committed to the wrong branch. I wonder what toc-updates was for?

mattgarrish commented 5 years ago

Yup, definitely a case of pebkac. That was a stale branch with my pre-squashed work the first time around. Need better housekeeping.

Anyway, the preview looks like it's updated now, but the direct links are:

(I'll adjust the urls in the previous comment and get to deleting branches.)

iherman commented 5 years ago

A minor extension: I wonder if keeping, if available, the value of the rel and type attributes of the 'a' element in the generated object wouldn't be valuable, eg, if the target is a media object.

mattgarrish commented 5 years ago

> A minor extension: I wonder if keeping, if available, the value of the rel and type attributes of the 'a' element in the generated object wouldn't be valuable, eg, if the target is a media object.

Those are simple enough to add in, as it's just a couple of additional attributes to inspect on the a tags.
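As a rough illustration of what inspecting those extra attributes could look like, the sketch below records href, rel, and type for each a element. The class name `AnchorAttributes` and the `url`/`rel`/`type` keys are hypothetical and do not reflect the property names in the draft.

```python
from html.parser import HTMLParser


class AnchorAttributes(HTMLParser):
    """Collect href, rel, and type from every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        found = dict(attrs)
        self.links.append({
            "url": found.get("href"),
            # rel is a space-separated token list, so keep it as a list.
            "rel": (found.get("rel") or "").split(),
            "type": found.get("type"),
        })


# Hypothetical link to a media object.
collector = AnchorAttributes()
collector.feed('<a href="audio/ch1.mp3" type="audio/mpeg" rel="alternate">Chapter 1</a>')
links = collector.links
```

Missing attributes come back as `None` (or an empty list for rel), so a consumer can tell "not provided" apart from an empty value.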

rdeltour commented 5 years ago

Looks good to me!

There are still a few issues that are worth discussing further IMO:

But all these can be treated in separate issues 🙂

iherman commented 5 years ago

> But all these can be treated in separate issues

In my view, s/can/should/ :-)

(It would be important to have a consistent version in the main branch, allowing further discussions...)

rdeltour commented 5 years ago

> In my view, s/can/should/ :-)

oh yes, I wasn't aware GitHub comments were also subject to RFC2119 conformance 😁

mattgarrish commented 5 years ago

In the interest of moving on to more specific issues, and as there hasn't been any new negative feedback, I'm going to merge this PR now.