Closed: mattgarrish closed this issue 3 years ago.
Why not simply stick with polyglot? Sigil actually uses a slightly modified version of Google's Gumbo HTML parser to do automatic HTML repair of both XHTML and HTML, injecting closing tags for non-void elements as needed, and intentionally serializes the result as polyglot markup (i.e. only recognized void elements are self-closed, etc.). In this fashion, only pure XML that adds a separate closing tag to a void element needs to be fixed before being passed to an HTML5-compliant parser like Gumbo. The resulting XHTML/HTML5 parses directly in any browser engine that follows the official HTML5 parsing rules, while at the same time being completely valid XHTML. Using our slightly modified Gumbo-based parser (lightweight, few dependencies, fast, C code), we can happily accept and fix EPUB author mistakes on the fly and support either downstream XML- or HTML-based tools. We would be happy to make our changes to Google's Gumbo available under any license needed, as I am the author of all of those changes.
This seems like the only sane solution for EPUB 3 support in Sigil. BTW, as a bonus, this approach allows standard JavaScript tools (jQuery, etc.) to work as expected, since the same DOM tree will be generated from either the HTML5 or the XHTML polyglot-serialized document. Dialect-specific serialization just makes no sense when you have HTML4, HTML5, XML 1.0, XHTML 1.1, XHTML5, and even old Mobi HTML 3.2 dialects running around in the ebook world. Simply throw a fast HTML5-compliant parser like Gumbo at it and polyglot-serialize the result. You get an XHTML version that works in any browser as HTML5.
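For readers unfamiliar with the approach, here is a minimal sketch of the repair-and-reserialize idea in Python, using only the standard library's lenient `HTMLParser` rather than a spec-compliant HTML5 parser like Gumbo, so the tree-building rules here are far cruder than what Sigil actually does:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML5 void elements: self-close these, never emit a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}

class PolyglotSerializer(HTMLParser):
    """Re-serialize lax HTML as well-formed polyglot XHTML: void
    elements are self-closed, every other open tag gets an explicit
    closing tag. Crude sketch: no attribute-value escaping and no
    HTML5 implied-end-tag rules (a real parser would close <p>
    before a sibling <p>, for example)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        a = "".join(f' {k}="{v or ""}"' for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{a}/>")
        else:
            self.out.append(f"<{tag}{a}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # drop bogus </br> and friends
        while self.open_tags and self.open_tags[-1] != tag:
            self.out.append(f"</{self.open_tags.pop()}>")  # repair nesting
        if self.open_tags:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data.replace("&", "&amp;").replace("<", "&lt;"))

    def result(self):
        while self.open_tags:  # close anything left open at EOF
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)

def to_xhtml(soup):
    parser = PolyglotSerializer()
    parser.feed(soup)
    parser.close()
    return parser.result()

fixed = to_xhtml("<div><p>one<br><p>two")
print(fixed)          # <div><p>one<br/><p>two</p></p></div>
ET.fromstring(fixed)  # parses as well-formed XML
```

The point of the sketch is the pipeline shape: lax input goes in once, and everything downstream only ever sees well-formed markup.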
Hi! I maintain the Standard Ebooks project, which produces ebooks using Epub 3.0.1 as the base format, with an emphasis on using as many of the rich opportunities for semantics and metadata that 3.0.1 provides as we can.
Switching to HTML5 syntax over XHTML is a great step forward. XHTML is a horror to work with and dropping it for a more flexible and friendly flavor of markup can't come soon enough.
I am concerned, however, about the loss of epub:type. If we're switching to using the role attribute, then the vocabulary afforded by the W3C spec is already much thinner than the Epub semantic inflection vocabulary. As new ebooks are produced we'll be losing opportunities to add significant semantic information, like which parts of the work are front, body, or backmatter, while retaining redundant semantics, like "title" (which can already be implied by an <h#> tag--isn't <h2 role="title"> a bit redundant?).
Another plus to epub:type (and a grudging nod to XHTML) is that we can use other semantic vocabularies not defined in the official Epub spec. For example, Standard Ebooks goes to great lengths to add extensive semantics to all of the books we produce. We prefer the standard Epub 3.0.1 semantic vocabulary if it contains what we need; if not, we look at the greatly expanded z3998 vocabulary; and finally we have our own custom vocabulary, which is used as a sort of transitional vocabulary until we sort out schema.org. This means that within a single ebook, we can use a variety of standardized vocabularies to inflect sections as "poems", "songs", "letters", and so on.
That's nice for two reasons:
It opens up a lot of fascinating possibilities for data crunching and machine processing ("find all books from 1890 that had letters in them"; "create a list of every unique ship name in Moby Dick"). This isn't a particularly practical goal in today's terms, but I think marking up ebooks as richly as we can is a noble nod towards our future.
The thing with ebooks is that it's easy to add semantics during the proofing process, but really, really hard and time consuming to go back and do it later. Because of that, losing the ability to include these kinds of semantics would, in a practical sense, lock the door and throw away the key for interesting machine processing via rich semantics for years or decades.
It gives us a hook for CSS styles. Consider the following snippet:
<div epub:type="z3998:letter">
<p>Dear sir...</p>
</div>
<p>That's all she wrote.</p>
Not only do we get some nice semantic inflection there, but we can hook CSS to it like so:
/* The epub| prefix must be declared before namespaced
   attribute selectors will match: */
@namespace epub url("http://www.idpf.org/2007/ops");
[epub|type~="z3998:letter"]{
margin: 1em;
}
[epub|type~="z3998:letter"] + p{
text-indent: 0;
}
Without that, we'd have to style with CSS classes, which nets us the same styling but without a semantic freebie, and leaves us at the mercy of unsemantic and unstandardized antipatterns like <p class="smcap">.
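As a hypothetical illustration of the machine-processing side of this (the fragment and names below are mine, not Standard Ebooks code), a query like "find every letter in this chapter" is only a few lines once the epub:type attribute is in the tree:

```python
import xml.etree.ElementTree as ET

# Made-up chapter fragment reusing the letter snippet from above.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
  <body>
    <div epub:type="z3998:letter"><p>Dear sir...</p></div>
    <p>That's all she wrote.</p>
    <div epub:type="z3998:poem"><p>Roses are red.</p></div>
  </body>
</html>"""

# epub:type in Clark notation, as ElementTree exposes namespaced attributes.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"
root = ET.fromstring(xhtml)
letters = [el for el in root.iter()
           if "z3998:letter" in (el.get(EPUB_TYPE) or "").split()]
print(len(letters))  # 1
```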
So ultimately, I applaud the move to HTML5, but doing so at the expense of getting rich semantics would be a big loss. Ideally we would have a way to get the ease-of-use of HTML5 along with the richness of semantic inflection and the ability to use non-epub-spec semantics that the current epub:type definition allows.
Closing this issue, as it was resolved not to add HTML in the 3.1 revision.
The issue was discussed in a meeting on 2020-12-18
Are you looking for input from developers of EPUB authoring software like Sigil?
If so, pure HTML5 has such lax parsing rules that it actually makes downstream XML-based tools harder to implement, as a full browser parsing engine would be needed just to parse the resulting code, due to the "flexible" state-based parsing rules used by pure HTML5.
Using a modified version of Google's Gumbo parser, Sigil can happily take pure HTML5 with its lax rules and automatically create strict XHTML5-based output at no real cost to the epub developer. With Sigil generating the stricter XHTML variant of HTML5, all existing downstream XML toolchains can still be used.
So simple open source tools and software already exist to take pure HTML5 with its lax parsing rules and create something more easily processed downstream. Not all downstream tools should need to implement the full HTML5 parser spec just to work properly.
So instead of moving the source for epubs to pure HTML5, simply use freely available epub developer tools to create the proper strict syntax that makes the entire toolchain work.
Five years later, I got an email about activity in this issue, and I wanted to pop in to say that my view of XHTML has changed since my previous comment. Previously, I was focused on the annoyance of authoring it, compared to the laxer HTML5 parsing rules. XML-isms like namespaces were a pain point.
But, after working on nearly 450 epubs at Standard Ebooks, my perspective has changed. Like @kevinhendricks stated above, XML is easier and faster to parse, and thus process by other programs. Since we already need an XML parser for the metadata file, programs that work with a whole epub would need to package an additional HTML5 parser instead of reusing the XML parser. The publishing industry is heavily invested in XML so being able to use a single parsing library to (for example) process both an OPDS feed and an epub is very valuable.
XHTML gives us nice things like xpath and xslt for free. Pretty-printing/canonicalization is easier and has good support in many libraries and programs.
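A quick standard-library sketch of the "for free" part: XPath-style querying and pretty-printing on any well-formed XHTML document (`ET.indent` needs Python 3.9+):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<section><h1>One</h1></section><section><h1>Two</h1></section>'
    '</body></html>')

# XPath-style querying with a namespace map works on any XHTML file.
ns = {"x": "http://www.w3.org/1999/xhtml"}
headings = [h.text for h in root.findall(".//x:section/x:h1", ns)]
print(headings)  # ['One', 'Two']

# Pretty-printing is one function call.
ET.indent(root)
print(ET.tostring(root, encoding="unicode"))
```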
I still think XML namespaces are an annoyance at the XML spec level, but in the context of the average epub, the annoyance is limited because 95% of the time the only non-default namespace will be the epub namespace.
Importantly, canonicalized XHTML forces uniformity on the output. In a canonicalized document we can always expect to see <br/>, not <br> or <br></br> or anything else. This is useful not just for presentation but for the rare and unpleasant times when a program must massage XHTML using regexes or other naive string operations.
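A minimal demonstration with Python's standard XML library: the two well-formed spellings produce identical trees, and a single serializer then picks one consistent output form (here `<br />` with a space, which is ElementTree's convention):

```python
import xml.etree.ElementTree as ET

# Both well-formed spellings of an empty br element round-trip to
# the same serialized form.
for src in ('<p>a<br/>b</p>', '<p>a<br></br>b</p>'):
    print(ET.tostring(ET.fromstring(src), encoding="unicode"))
# both lines print: <p>a<br />b</p>
```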
So, if my opinion is worth anything, five years later I would like to reverse my previous position and suggest sticking with XHTML, but expanding it to the HTML5 element vocabulary. In other words something like XHTML5. epubcheck already allows HTML5 vocabulary like <section>, so maybe the epub spec already allows for that; I don't have it in front of me right now.
(My previous opinion on epub:type still stands; it is very useful and it would be a pity to see it go in favor of a non-standard attribute and vocabulary.)
Speaking about our company's RS (named BinB):
A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?
No. Our RS does not use any web browser rendering engine; it uses our own custom rendering engine.
B. Is your reading system capable of ingesting such EPUBs? Does your toolchain depend on all content documents being well-formed XML?
Our RS does not use XML parsers, so it does not depend on well-formed XML. On the other hand, to fully ingest HTML-syntax EPUBs, it would need additional development for parsing.
And we have some satellite tools for processing EPUB files that use an XML parser. They require all content documents to be well-formed XML; for example, my preview (sample) file maker.
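The gate such XML-based satellite tools rely on can be as small as this (a generic sketch, not BinB's actual code):

```python
import xml.etree.ElementTree as ET

def is_well_formed(data):
    """Cheap gate used by XML-based toolchains: accept a content
    document only if it parses as XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>ok<br/></p>"))   # True
print(is_well_formed("<p>soup<br></p>"))  # False: <br> never closed
```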
@acabal,
So, if my opinion is worth anything, five years later I would like to reverse my previous position and suggest sticking with XHTML, but expanding it to the HTML5 element vocabulary. In other words something like XHTML5
That is already the case in 3.2. The relevant section §2.2 says:
An XHTML Content Document has to meet the following basic requirements:
- It MUST be an [HTML] document that conforms to the XHTML syntax.
And the references are to HTML5 XML syntax. The upcoming EPUB 3.3 draft takes this text over verbatim.
Speaking about Readium toolkits (Mobile and Desktop) and the desktop Thorium Reader app:
A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?
Yes, as all Readium toolkits are based on major web rendering engines. They belong to what could be called the "Open Web RS profile".
The Readium Mobile iOS toolkit relies on Webkit (WKWebView). The Readium Mobile Android toolkit relies on Chrome WebView. The Readium Desktop toolkit relies on Chromium (via Electron.js). And Thorium Reader relies on Readium Desktop.
B. Is your reading system capable of ingesting such EPUBs?
Yes they are.
To be sure, I created a very dirty EPUB from "wasteland" (which contains a single spine item) by replacing the XHTML content with some random HTML tag soup. It would be good to have a proper sample, but until then ... it works in Thorium like a charm.
(Copied from Laurent's email for an easier reference, with authorization.)
CC @llemeurfr
For Play Books.
A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?
Currently "no" -- epubcheck is part of the ingest pipeline and would fail such content. XML processing tools are currently used in the ingest pipeline.
B. Is your reading system capable of ingesting such EPUBs? Does your toolchain depend on all content documents being well-formed XML?
Such content would not currently get past the front door.
However, if EPUB 3.3 were to add HTML serialization, we would embark on the (non-trivial) effort to support non-XML content.
(Copied from Garth's email, with authorization.)
Cc @GarthConboy
Speaking about our Reading System (PUBLUS Reader for Android/iOS);
A. Is your reading system capable of rendering EPUBs that contain HTML that is not well-formed XML?
Yes. PUBLUS Reader uses Blink (Android) or WKWebView (iOS).
B. Is your reading system capable of ingesting such EPUBs? Does your toolchain depend on all content documents being well-formed XML?
Seems possible.
In my personal opinion, the various satellite tools needed for delivery have a greater impact than the RS. As mentioned in https://github.com/w3c/epub-specs/issues/636#issuecomment-748425162, I use an XML parser to develop the tool that creates a sample EPUB from a full EPUB. When dealing with "HTML syntax" instead of "XHTML syntax", the scope of additional development would be larger because an XML parser cannot be used. I think it's better to hear not only from RS vendors but also from bookstore system vendors.
I cannot speak for the Readium "mobile" implementations (iOS/Android), but here is my feedback from the Readium Desktop / Thorium point of view: support for non-XML HTML would require a thorough audit of a relatively large codebase, in order to identify code where we rely on the assumption that the markup is well-formed XML. This is by no means a complete analysis, but just from the top of my head:
- XML parsing based on the declared media type (application/xhtml+xml, for example, as authored in the EPUB OPF package manifest items), but sometimes we may rely on implicit / default XML handling, depending on the library used to process documents.
- XML namespace handling (e.g. epub:type attribute matching), with fallback on syntactical conventions (e.g. namespace prefix as plain string of characters).
To conclude: adding support for non-XML HTML is definitely within the realm of possibilities in Thorium / Readium-Desktop, but as with any non-trivial development task, this would require in-depth analysis and methodical regression testing (in other words, it wouldn't just be a case of "adding" HTML support; we would need to make sure that the existing XHTML support doesn't break when implementing dual support for XHTML / HTML in the various content processing modules).
This is by no means a complete analysis, but just from the top of my head:
You'd probably remember this better, but weren't there issues with CFIs and the HTML syntax? I seem to remember a discussion about having to use the parsed DOM for HTML but that being potentially unreliable.
You'd probably remember this better, but weren't there issues with CFIs and the HTML syntax? I seem to remember a discussion about having to use the parsed DOM for HTML but that being potentially unreliable.
I don't remember the details, but indeed there were concerns that the XML-centric CFI processing model wouldn't work reliably with HTML DOM, due to subtleties in text encoding, character offsets / text node normalisation, element boundaries (e.g. self closing tags), etc. As an implementer, I would certainly anticipate weird edge cases in the path resolution logic (i.e. when converting DOM Ranges to CFI, and vice-versa). In principle, "polyglot" (X)HTML5 helps mitigate this, but I suspect that in practice we would need to work around some XML / HTML discrepancies in web browsers.
"Polyglot Markup: A robust profile of the HTML5 vocabulary" W3C Working Group Note 29 September 2015 https://www.w3.org/TR/html-polyglot/
PS: in Thorium / Readium-Desktop we do not make internal use of CFI (unlike the first / original incarnation or Readium SDK), instead we use our own optimised DOM-Range (de)serialisation technique. We can (and do) generate equivalent CFI expressions for our bookmarks / annotations, but we do this in a "vacuum" in the sense that we have no consuming API at the moment (in a future software iteration, we may produce CFI references in the context of interoperable W3C content annotations, if the need arises).
This is by no means a complete analysis, but just from the top of my head:
You'd probably remember this better, but weren't there issues with CFIs and the HTML syntax? I seem to remember a discussion about having to use the parsed DOM for HTML but that being potentially unreliable.
that would definitely be problematic. I do not remember the details, but the HTML5 parser, possibly, adds some extra elements into the DOM for, e.g., enclosing other elements. Ie, the meaning of CFI can definitely be distorted.
Apart from RS developers' advice, it is important to take into account the pressure that could come from the publishers' side.
Big publishers mostly use internal XML-based workflows. I say mostly because we know that there is also a small production of carefully crafted (X)HTML-based ebooks. Small publishers willing to create reflowable EPUBs seem to use desktop tools that produce an EPUB via a save button. And the vast majority of fixed-layout (FXL) EPUB publishers are using InDesign.
As a consequence, from what I hear, the pressure is quasi-null.
Ie, the meaning of CFI can definitely be distorted.
Right, and I believe this is partly why we dropped reading system support for authored CFIs in 3.1 (aside from lack of anyone actually writing these things manually). It was preparatory for possibly adding HTML as they aren't webbish.
The only reason I raised it was that I know there's a possible resurrection of CFI in the works, and this could prove problematic for an HTML syntax. Not a blocker, but a consideration for the viability of CFIs as currently written.
that would definitely be problematic. I do not remember the details, but the HTML5 parser, possibly, adds some extra elements into the DOM for, e.g., enclosing other elements. Ie, the meaning of CFI can definitely be distorted.
I believe the parser would add some things even to the XML syntax of HTML. For example, tables without tbody get one.
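To make the tbody problem concrete, here is a rough sketch of CFI element steps (even numbers, 2 × the 1-based index among the parent's child elements, per the CFI spec; text-node steps, IDs, character offsets, and indirection are all omitted). The function and documents are illustrative only:

```python
import xml.etree.ElementTree as ET

def cfi_steps(root, target):
    """Element steps of an EPUB CFI path: each step is 2 * the
    1-based position among the parent's child *elements* (the CFI
    spec reserves odd numbers for text nodes)."""
    def walk(el, path):
        if el is target:
            return path
        for i, child in enumerate(el, start=1):
            found = walk(child, path + [2 * i])
            if found is not None:
                return found
        return None
    return walk(root, [])

# As authored: td sits at element steps [2, 2, 2] under body.
plain = ET.fromstring('<body><table><tr><td>x</td></tr></table></body>')
td = plain.find('.//td')
print(cfi_steps(plain, td))  # [2, 2, 2]

# The HTML5 parser would insert a <tbody> here, shifting every step
# beneath the table and breaking previously minted CFIs.
fixed_up = ET.fromstring(
    '<body><table><tbody><tr><td>x</td></tr></tbody></table></body>')
td2 = fixed_up.find('.//td')
print(cfi_steps(fixed_up, td2))  # [2, 2, 2, 2]
```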
that would definitely be problematic. I do not remember the details, but the HTML5 parser, possibly, adds some extra elements into the DOM for, e.g., enclosing other elements. Ie, the meaning of CFI can definitely be distorted.
I believe the parser would add some things even to the XML syntax of HTML. For example, tables without tbody get one.
Ouch. As usual, @dauwhe is right...
The issue was discussed in a meeting on 2021-05-27
The issue was discussed in a meeting on 2021-05-28
@wareid @dauwhe @shiestyle in accordance with the vF2F resolution, I have created a new label 'to-be-incubated-further' and used it for this issue before closing.
See also #2259, which raises similar issues and arguments.
And to be fair, Amazon is not an epub vendor and popup footnotes were specifically suggested in the original epub3 spec, so these arguments are specious at best.
Moving to HTML would invalidate all existing epub readers and publisher toolchains, and would allow spaghetti HTML code that requires a full browser WHATWG parser just to do anything at all. And finally, enforcing XML syntax is not difficult and can be done by any decent serializer. And all browser engines support XHTML.
EPUB 3 is finally taking off, especially internationally; the tools to create it are many, and free alternatives exist. Let's not break everything now. Slowly evolving the standard while nudging people in the right direction with epubcheck seems like the safest and best decision possible.
Part of the alignment process with the open web platform is to begin supporting the HTML syntax of HTML in addition to the XHTML syntax.
Details are available in the following document: https://docs.google.com/document/d/1m2XsQbYcYIRJ1CL2HojeU8XNXOluHk6g7AScM5hkrZg/edit
The proposal implemented for the first draft is to allow support for both HTML and XHTML syntax in content and require support for both syntaxes in reading systems.
The epub:type attribute will be superseded by the ARIA role attribute, but will remain available for backwards compatibility and for specifications whose semantics haven't been ported.
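For tools that want to emit both attributes during a transition, the mapping can be mechanical. The table below is a small hand-picked subset of the published EPUB-type-to-ARIA-role mapping, and the helper function is mine, illustrative only:

```python
import xml.etree.ElementTree as ET

EPUB_NS = "http://www.idpf.org/2007/ops"

# Small subset of the DPUB-ARIA mapping for epub:type values.
TYPE_TO_ROLE = {
    "chapter": "doc-chapter",
    "footnote": "doc-footnote",
    "toc": "doc-toc",
    "titlepage": "doc-titlepage",
    "bibliography": "doc-bibliography",
    "appendix": "doc-appendix",
}

def add_roles(root):
    """Copy a role attribute onto any element carrying a mapped
    epub:type, leaving epub:type in place for backwards compatibility."""
    for el in root.iter():
        etype = el.get(f"{{{EPUB_NS}}}type")
        if etype and "role" not in el.attrib:
            roles = [TYPE_TO_ROLE[t] for t in etype.split()
                     if t in TYPE_TO_ROLE]
            if roles:
                el.set("role", " ".join(roles))

doc = ET.fromstring(
    '<section xmlns:epub="http://www.idpf.org/2007/ops" '
    'epub:type="chapter"><p>Text</p></section>')
add_roles(doc)
print(doc.get("role"))  # doc-chapter
```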
This issue will remain open past the first draft for comments.