The term “XHTML” has no longer a meaning in the HTML5 specification. #2259

w3c / epub-specs

Shared workspace for EPUB 3 specifications.

Closed MicuentaGit closed 2 years ago

MicuentaGit commented 2 years ago

Please consider giving authors the opportunity to write an EPUB document in HTML syntax. The XML syntax of HTML documents no longer has a name of its own in the specification, and the examples given in the specification use the HTML syntax. EPUB specification is pushing authors to write documents with a syntax that is by far less popular than the alternative HTML syntax. In addition, the recommendation for HTML polyglot documents has been abandoned, which is a symptom of the future death of the XML syntax. With Microdata, the HTML specification has the means to add the semantics needed by EPUBs to HTML documents, and using these attributes gives authors the possibility to style based on them using CSS style rules. Authors should not need to change their HTML documents to write an EPUB document. Authors will prefer that their readers can read their documents in a browser. Adding the element <link rel="search" type="application/oebps-package+xml" href="../../OEBPF.opf" /> to the head of the HTML document allows browsers to access the information needed to add functionality for EPUB documents, if they find it useful.

iherman commented 2 years ago

@MicuentaGit this has been a (very) longstanding issue. There is a github issue https://github.com/w3c/epub-specs/issues/636 in this repository with some comments referring to discussions on various WG meetings. At the moment, the decision has been to not (yet?) switch to HTML5 due to the difficulties it would generate in existing EPUB workflows and reading system deployments.

Note that this issue is planned to be discussed and incubated further in the Publishing CG with the goal of possibly moving to a resolution in later releases of the standard. It would be great if you joined the CG to explore the issue further.

cc @mteixeira-wwn @Jeffxz @WSchindler

mattgarrish commented 2 years ago

EPUB specification is pushing authors to write documents with a syntax that is by far less popular than the alternative HTML syntax.

This isn't true in the traditional publishing market that EPUB has served, though. XML workflows are still prevalent and the tools professional publishers use, like InDesign, have been generating the XML flavour of HTML since the early days of EPUB. The ecosystem of EPUB 2 and 3 has been built up around supporting XML, so introducing regular HTML is not a trivial ask as it has the potential to break content on many platforms. That's been the major stumbling block to adding the HTML syntax to EPUB 3. Without a version change, we haven't been able to get consensus that the benefits would outweigh the drawbacks.

It's an issue we've argued over in every revision of EPUB 3 going back to the original 3.0.

And just to be pedantic about the title of this issue, we don't refer to xhtml as a term in html, either. We removed those references in this revision to match the HTML change.

"XHTML content document" is a name specific to epub that in the context of epub 3 refers to:

an [html] document that conforms to the XML syntax.

Until a time comes that we allow both syntaxes, we have to retain this distinction. The "XML syntax of HTML content documents" doesn't have the same ring to it... 😉
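The distinction between the two syntaxes can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the two sample strings and the helper names (`is_well_formed_xml`, `html_start_tags`) are hypothetical, and the standard-library `html.parser` is a lenient tokenizer, not the full HTML parsing algorithm a browser uses.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The same paragraph written in the two syntaxes of HTML (sample strings are illustrative).
HTML_SYNTAX = "<p>First line<br>second line</p>"   # void element left open
XML_SYNTAX = "<p>First line<br/>second line</p>"   # void element self-closed


def is_well_formed_xml(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Collects start-tag names as seen by a lenient HTML tokenizer."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup: str) -> list:
    """Return the start tags an HTML tokenizer reports for the markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    for label, markup in (("HTML syntax", HTML_SYNTAX), ("XML syntax", XML_SYNTAX)):
        print(label, "| XML well-formed:", is_well_formed_xml(markup),
              "| HTML start tags:", html_start_tags(markup))
```

An HTML tokenizer reads both strings the same way, while an XML parser rejects the first one because `<br>` is never closed. That asymmetry is why content written in the HTML syntax can break reading systems built around XML tooling.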

MicuentaGit commented 2 years ago

Dear @iherman, Thank you very much for your reply. I have read your previous discussion on issue #636, including the document referenced in the first post, and I think my answer can add relevant information to the discussion.

I would also like to thank you for your invitation to the Publishing CG, although it may be rhetorical. I am a simple author working on my first book without a funding source to cover my expenses. Because of this, my time for any collaboration with the group is very limited, and I need to decline your invitation.

While I was investigating the requirements for publishing my digital book in the most important digital stores, I found that their support for the EPUB format was not fully compliant with the EPUB specifications, even though they claim to support it fully. Let me give you two examples: Amazon has additional restrictions for SVG images, given in its publication guidelines; and Kobo handles a footnote link by displaying a pop-up window with the destination document, breaking the spirit of HTML by mixing presentation and meaning.

A specification that is not fully supported by the main players in the digital distribution of books is worth nothing, because readers will not get a consistent experience of the same book regardless of where they bought it. We should learn from the experience of the HTML specifications that a consistent experience is mandatory; for now, an author has no guarantee of giving the same experience to his readers by using the EPUB specification.

Things are getting worse because the HTML specifications are evolving very fast and EPUB readers will not implement them soon enough to keep pace with the standards.

Additionally, CSS is not giving support for namespaces, and EPUB readers need to hack the standard to support selecting elements by attributes that are prefixed by a namespace reference. Because the sign used to separate the namespace prefix from the attribute's name is the same as the one used for pseudo-classes, the resulting ambiguity is not easy to resolve.

Developers of EPUB readers can benefit from open-source rendering engines like Gecko, Blink, and WebKit. By pushing developers to follow this track, authors will benefit from a giant improvement in the consistency of readers' experience, and from a richer and up-to-date environment in which to express themselves in exactly the way they want to communicate with the reader.

Web authors do not use WYSIWYG tools anymore. I, personally, started with Sigil, but now I am writing my e-book using Brackets. A competent author needs to know the HTML languages, because conformance with the HTML specifications means obeying not only syntax rules but also semantic rules.

WYSIWYG tools are not fulfilling their goal of bringing more authors to the EPUB specification. This is because these tools offer limited help for any writing beyond a soup of plain text with a few images floating in it. Two facts are a shame: firstly, digital scientific journals ask authors to write in LaTeX and publish in PDF instead of asking them to write in HTML and produce an EPUB; secondly, the EPUB format is ignored by content management systems for education when they compose their digital books.

The transition should not be traumatic if the EPUB specification starts to define its markup using the Microdata feature of HTML. By doing this, EPUB documents will be a specific type of HTML document instead of documents that blend two markup languages and force the use of specific software to read them.

Authors using the XML syntax may choose to use both markups as a redundant description of the document, to ensure that old EPUB readers can render their documents too.

MicuentaGit commented 2 years ago

Dear @mattgarrish, Thank you very much for your reply. Below, I quote some fragments of it that I would like to reply to:

EPUB specification is pushing authors to write documents with a syntax that is by far less popular than the alternative HTML syntax.

This isn't true in the traditional publishing market that EPUB has served, though. XML workflows are still prevalent and the tools professional publishers use, like InDesign, have been generating the XML flavour of HTML since the early days of EPUB. ...

I was referring to the entire universe of HTML documents and not to the tiny EPUB world.

And just to be pedantic about the title of this issue, we don't refer to xhtml as a term in html, either. We removed those references in this revision to match the HTML change.

It might sound pedantic, but I am not the one who made that statement, the HTML specification team made it. If you read carefully what this section states, you might understand my warning. HTML does not declare an XML type; so, even if the XML document uses any arbitrary node's name, the document will still be in conformance with HTML5 specifications. The specification does not even require that the root element of the XML document be the HTML element. In addition, HTML5 does not declare any namespace for its vocabulary, which means that there is no way to reference its terms in an XML document. The namespace that EPUBs use to declare their markup names for HTML in their XML documents is “http://www.w3.org/1999/xhtml”, which is no longer updated, and the new terms of HTML5 are not in it. To comply with the XML specifications, authors should not use terms outside the vocabulary of the declared namespaces. Following the polyglot recommendation, EPUB documents declare their “XHTML documents” using <!DOCTYPE html>, but there isn't any registered declaration of such a document type; it was just a transitory patch to make the use of HTML5 markup possible. Browsers are doing their best to render these documents, but there is no guarantee that they do it consistently because it is not normative.

"XHTML content document" is a name specific to epub that in the context of epub 3 refers to:

an [html] document that conforms to the XML syntax.

Until a time comes that we allow both syntaxes, we have to retain this distinction. The "XML syntax of HTML content documents" doesn't have the same ring to it... 😉

You may understand now why this definition is not enough to ensure that every EPUB reader will understand the same thing from its description.

mattgarrish commented 2 years ago

It might sound pedantic, but I am not the one who made that statement

I think you misunderstand. I was saying that I was being pedantic in responding to the question about "xhtml" in your title. It's not strictly what you ended up writing about, but I wanted to be clear that we don't refer to the old "xhtml" terminology anymore.

HTML does not declare an XML type; so, even if the XML document uses any arbitrary node's name, the document will still be in conformance with HTML5 specifications.

Sorry, but could you cite specific passages of the html specification that confirm these assertions? The xhtml namespace hasn't changed, all elements belong to it, and writing arbitrary elements and calling it xhtml is not backed up by anything I can find. Specifically, I'd direct you to the xml compatibility section, which says among other things:

To ease migration from HTML to XML, user agents conforming to this specification will place elements in HTML in the http://www.w3.org/1999/xhtml namespace, at least for the purposes of the DOM and CSS. The term "HTML elements" refers to any element in that namespace, even in XML documents.

Except where otherwise stated, all elements defined or mentioned in this specification are in the HTML namespace ("http://www.w3.org/1999/xhtml"), and all attributes defined or mentioned in this specification have no namespace.

https://html.spec.whatwg.org/multipage/infrastructure.html#xml
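The two quoted statements can be observed with any namespace-aware XML parser. The following is a minimal Python sketch, not part of the specification; the sample document and the helper `split_name` are hypothetical, and ElementTree's `{namespace}local` naming is used only as a convenient way to display namespaces.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A small sample document in the XML syntax (contents are illustrative).
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body><section lang="en"><p>Hello</p></section></body>
</html>"""


def split_name(qualified: str) -> tuple:
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if qualified.startswith("{"):
        namespace, local = qualified[1:].split("}", 1)
        return namespace, local
    return "", qualified


root = ET.fromstring(DOC)

# Every element inherits the default namespace declared on the root.
element_namespaces = {split_name(el.tag)[0] for el in root.iter()}

# Unprefixed attributes such as lang stay in no namespace at all.
attribute_namespaces = {
    split_name(name)[0] for el in root.iter() for name in el.attrib
}

if __name__ == "__main__":
    print("element namespaces:", element_namespaces)
    print("attribute namespaces:", attribute_namespaces)
```

In this sample every element, including `section`, resolves to the `http://www.w3.org/1999/xhtml` namespace through the single `xmlns` declaration, and the `lang` attribute has no namespace, which matches the wording of the quoted passages.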

The big problem raised with only supporting the xml syntax has been in scripting support. There are incompatibilities in writing xml that not all the big scripting frameworks bother to handle.

mattgarrish commented 2 years ago

@iherman this is the lather, rinse and repeat of EPUB issues, so I wonder if just deferring it again is useful, especially since it duplicates #636. Given the varied feedback we got in the earlier issue over many years, reopening that one and leaving it deferred and to-incubate might be a better approach. This could be closed as a duplicate.

I know no one's going to want to sign up to do this, but it would also help to outline the issues that have proven the roadblocks so we don't restart from the beginning each time, or maybe this is for the CG to dive into.

The barriers to change have more often been within the ecosystem - getting reading systems to support; what we're doing to users with reading systems that will never be upgraded; authoring support for multiple grammars; vendor support for multiple grammars; publishers being invested in xml; etc. Those have been the more intractable problems to changing epub 3 mid-life than the technical ones.

Many of the earlier technical barriers are now gone or on their way out, too: the epub:* elements are deprecated, there may be alternatives to the ssml attributes in the future (not that the ssml attributes work now), epub:type has never fully panned out.

murata2makoto commented 2 years ago

Let's close this as a duplicate of #636.

iherman commented 2 years ago

I agree @mattgarrish and @murata2makoto. I had not realized that we had a separate 'to-be-incubated-further' label on #636; my mistake.

I have made an explicit link in a comment from #636 to this one to, logically, "bind" these two issues because they are indeed in the same space.

I have changed the label from 'deferred' to 'to-be-incubated-further'.

And I will close the issue. @MicuentaGit this does not mean we are dismissing your arguments; this only means that, at this moment, the WG has decided to keep XHTML, and we would not reopen the issue unless fundamentally new arguments were brought forward. Once EPUB 3.3 is published as a formal specification (planned for in about a year) all these deferred, to-be-incubated, etc, issues may be looked at by a future reincarnation of this WG.

MicuentaGit commented 2 years ago

Dear @iherman, @mattgarrish and @murata2makoto, I would like to thank you for considering my arguments. I hope you have found them helpful and what I have pointed out useful.

Unfortunately, your response does not suit my needs because I cannot wait that long to self-publish my book. I would like to give a brief description of what I am going to do now, because I need to ask you about it. I am going to share part of my future work with you, and I would like to ask whether there is a good chance that this work will be part of the future specification. Please be sincere, because this work will not make sense if there isn't any chance of it being in the future specification.

I'll follow the recommendation for HTML polyglot documents and I will declare EPUB namespace in the root HTML element. However, I am not going to mark the elements of the book that are required by the EPUB specification because I want to avoid having to search for them to delete once the current format is obsolete. Currently, there is no special feature in the specification, associated with these markups, that I would like to give the reader. Therefore, if the book conforms to the specification, no reader will notice the lack of these markings.

I would not like my book to miss the possibilities that a future EPUB specification will give authors. I believe that once this format takes advantage of all the power that HTML is currently giving to HTML authors, the format will be ubiquitous in all text documents. My research has led me to the conclusion that HTML is a complete solution for creating any type of document; I am amazed, for example, by the possibilities that HTML currently gives authors in the use of fonts, to the point of being able to typeset mathematical formulas in ways that I didn't think were possible.

To achieve my goal of composing my book with markup that will help me improve future reading experiences with the new possibilities that a future EPUB specification will give, I am considering adding annotations to it. By using microdata, HTML gives authors the chance to mark up everything that the EPUB standard recommends. Adapting the markup to the HTML recommendations will keep the EPUB standard from conflicting with HTML, and will spare EPUB implementors from dealing with those conflicts. Therefore, I would like to submit a proposal for an alternative to the current annotation so that it can be considered and included in the future specification.

If there is no good chance that the proposal will form part of the future specification, it makes no sense to make that effort. My work will be compensated by having my book correctly annotated and prepared to take advantage of the new features that a future EPUB specification might give.

I did a little of search to figure out what you mean by using SSML as a replacement of EPUB prefixes markup. I would like to warn you that the HTML specification does not mention this vocabulary, and embedding XML from foreign vocabularies is a practice that the specification discourages.

I would also like to correct my previous replies, because I found out that I was wrong about the namespace “http://www.w3.org/1999/xhtml”: the specification publishes all its terms in this namespace for compatibility purposes.

@mattgarrish, sorry about my misunderstanding. I did understand it on a second reading but did not want to make the effort to change my wording, because I attached no importance to having been considered pedantic.

iherman commented 2 years ago

I'll follow the recommendation for HTML polyglot documents and I will declare EPUB namespace in the root HTML element. However, I am not going to mark the elements of the book that are required by the EPUB specification because I want to avoid having to search for them to delete once the current format is obsolete. Currently, there is no special feature in the specification, associated with these markups, that I would like to give the reader. Therefore, if the book conforms to the specification, no reader will notice the lack of these markings.

I must admit I am not sure I fully understand what you mean here. I think you refer to features like epub:switch or epub:type. epub:switch is already marked as deprecated. epub:type is not formally deprecated, but is only to be used in very special situations.

I guess what I am saying is that the epub-specific elements and attributes are already on their way out, and you make the right decision not to use them. They are certainly not required. Furthermore, if you do not use them, then there is also no need to declare the corresponding namespace in the (X)HTML content. Ie, if you use the XML syntax, and your file starts with:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
...

you are perfectly fine.
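As a rough way to check this for a given file, the Python sketch below parses a content document as XML and reports whether any element or attribute actually uses the EPUB namespace (`http://www.idpf.org/2007/ops`, the namespace behind `epub:type`). The sample documents, their body content, and the function name `uses_epub_namespace` are hypothetical and only serve the illustration; this is not a conformance checker.

```python
import xml.etree.ElementTree as ET

EPUB_NS = "http://www.idpf.org/2007/ops"  # namespace of epub:type and related names

# Hypothetical content document without any EPUB-specific markup.
MINIMAL = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body><p>No EPUB-specific markup here.</p></body>
</html>"""

# Hypothetical content document that does use epub:type.
WITH_EPUB_TYPE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>Chapter 1</title></head>
  <body><aside epub:type="footnote"><p>A note.</p></aside></body>
</html>"""


def uses_epub_namespace(document: str) -> bool:
    """Return True if any element or attribute name is in the EPUB namespace."""
    root = ET.fromstring(document)  # raises ParseError if not well-formed XML
    prefix = "{" + EPUB_NS + "}"
    for element in root.iter():
        if element.tag.startswith(prefix):
            return True
        if any(name.startswith(prefix) for name in element.attrib):
            return True
    return False


if __name__ == "__main__":
    print("minimal document uses EPUB namespace:", uses_epub_namespace(MINIMAL))
    print("annotated document uses EPUB namespace:", uses_epub_namespace(WITH_EPUB_TYPE))
```

Both samples parse as well-formed XML with only the `<!DOCTYPE html>` line and the XHTML namespace on the root; the `xmlns:epub` declaration is needed only in the second one, where `epub:type` appears.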

As for using microdata to mark up your semantics: that is indeed a possible approach, both semantically and syntactically, except that, out there, microdata is almost exclusively used for schema.org properties only. In case you want to go beyond schema.org, then, with the current practice, you are a bit on your own. RDFa is an alternative, but it is not very widely used either. (I am talking about the browser usage, not even about EPUB readers'!)

That being said, I do not know what type of annotations you want to use all these for, so it is difficult to comment. If your goal is to improve the accessibility of the publication, you may want to look at ARIA, specifically DPUB ARIA, which is an already existing standard, and we advise using that...

mattgarrish commented 2 years ago

Furthermore, if you do not use them, then there is also no need to declare the corresponding namespace in the (X)HTML content. Ie, if you use the XML syntax, and your file starts with:

Technically, you don't even need to start with the HTML doctype, either.

I did a little of search to figure out what you mean by using SSML as a replacement of EPUB prefixes markup.

I wasn't suggesting them as a replacement, but as an example of the features that originally made compatibility with HTML difficult. Earlier discussions around allowing HTML focused on how it would lead to two different profiles of EPUB -- the XML syntax allowing features that authors wouldn't have in HTML. Those incompatibilities have slowly been removed over the years, or, in the case of SSML, alternatives are under development.

You can't judge EPUB 3.0 based on today's standards, either. It was the product of its time, which had to deal with the competing WHATWG HTML and W3C HTML5 specifications, competing microdata and rdfa specifications, an ARIA role attribute that was theoretically extensible but not practically, plus a history of reading systems that weren't strictly rooted in browser cores.

The philosophy then was to work with web technologies as much as possible, but to also add whatever EPUB needed to them to address specific concerns. There was an expectation that RMSDK, the major EPUB 2 rendering engine, would continue to have an outsized influence on EPUB 3, but in the end that didn't happen. The epub:* elements, like content switching and media triggers, plus bindings in the package document, were rooted in accommodating this alternative model.

Hindsight being what it is, sure we could have done things differently, but changing now is made difficult by the decade plus of legacy content, tools, reading systems, etc. that have been developed around what we did end up with.

Therefore, I would like to submit a proposal for an alternative to the current annotation so that it can be considered and included in the future specification.

We have gone over various alternatives, as @iherman has already mentioned. RDFa and microdata are arguably technical overkill for semantic inflection, and they are property-based metadata. ARIA role could work for a couple of the required navigation document elements, but doesn't help with epub's concept of landmarks. Misusing ARIA also has detrimental effects on users of assistive technologies, so I don't trust authors not to make a mess of content if we were to use it. Plain old microformats are another option, but they've been ruled out in the past because they collide with styling hooks. If we decide in the future that semantics are an internal workflow concern of publishers, using a data-* attribute with the vocabulary wouldn't be out of the question.

This doesn't capture all the discussions, as many have happened during previous revisions, but the same question was raised again earlier in this revision: https://github.com/w3c/epub-specs/issues/1291

The working group isn't oblivious to these problems, in other words, but finding solutions that work for everyone is the challenge.

If you have a new proposal, though, the place to start is the community group. The working group's mandate is to only add new features that have been proven to work, so the CG is now the incubator of new ideas.

mattgarrish commented 2 years ago

Technically, you don't even need to start with the HTML doctype, either.

Sorry, I missed the part about following the polyglot spec. Since HTML requires a doctype, you would need to declare it.

iherman commented 2 years ago

If you have a new proposal, though, the place to start is the community group. The working group's mandate is to only add new features that have been proven to work, so the CG is now the incubator of new ideas.

👍 to that.

MicuentaGit commented 2 years ago

Dear @iherman and @mattgarrish, I've been doing some research on your last two answers, and unfortunately, I'm absolutely devastated with what I've found. EPUB deserves that Amazon is the only player in the sale of electronic books with its proprietary format.

The publisher consortium, instead of doing the necessary work for the standard to become the dominant format for digital publishing, is blocking its progress by its incompetence in managing the project.

When someone only knows how to look at themselves in the mirror, they should not expect others to worry about their beauty.

I have investigated the CG at your insistence and, being ironic, the activity that exists in it is overwhelming. There are many open posts with no activity since 2018, for example this post is one of them, and it's assigned to someone who has no activity since the 2016. From what I've seen, there have been many initiatives related to EPUB whose authors, I guess, didn't find enough understanding from the project managers and decided to go it alone and search for followers.

It appears to me that this project breaks records in kicking the ass of those who are early adopters. The negligent vision of the editorial consortium has transformed a good idea into an instrument of wear and tear for everyone who is interested in it.

Isn't it sad that the project management prefers lamer authors working with a tool that is used to typeset documents with a fixed layout to authors with experience in composing texts in HTML? Is anyone embarrassed that the HTML spec publishers themselves don't offer an EPUB version of their document, offering it only as a single HTML page and as PDF? Has anyone tried reading the EPUB version of the draft of this specification in the Thorium reader, jumping from section to section using the navigation document? What a nice experience!

Please stop looking at yourself and start looking at what the end-user experience is with the technologies supported by your format. A piece of junk made with InDesign will never be device-independent.

Please, competent author, do not ask for better CSS media query support to customize the layout depending on the device, because InDesign does not need it. Nor should you ask for the possibility of marking the structural elements of the book in its content to give reading devices a hint that makes their job easier; knowing these structures makes no sense in the task of giving a user the experience of a book instead of a web page.

Throwing away people's good ideas sounds very smart. I'm going to be ironic, I loved the answer given to @Dauwhe's proposal back in 2019 which is very similar to mine.

Before signing this reply, I must say that all of the above is not addressed to you personally, but to the publishing industry (@WSchindler, @Jeffxz) who are not tech-savvy and think they are.

Thanks again for all your answers because they have really helped me to know how I have to write my book.

iherman commented 2 years ago

@MicuentaGit, you were looking at the wrong repository. The one I was referring to is at https://github.com/w3c/publishingcg/issues.

mattgarrish commented 2 years ago

There are many open posts with no activity since 2018, for example this post is one of them, and it's assigned to someone who has no activity since the 2016.

I'll take it you're not aware that repository is no longer used as it was for an earlier incarnation of the group (the publishing and epub community groups were joined not too long ago). The community group is using this one now: https://github.com/w3c/publishingcg

That you don't see activity at the old one doesn't mean much. The issue you referenced, for example, and all the others marked "best practice", were part of an effort to write documentation for MDN. That effort stalled but has since resumed. I don't know if those issues are even being used for the new effort, so maybe they can all be closed off?

I believe the new CG is slowly looking at the old issues as they can find people to take them up, but finding people with time has always been a challenge.

It would be good to put that old repository into archive mode and update the readme to push people forward, though. I'm more than happy to do that if the CG folks are okay with it?

iherman commented 2 years ago

It would be good to put that old repository into archive mode and update the readme to push people forward, though. I'm more than happy to do that if the CG folks are okay with it?

+1 to that

cc @Jeffxz @WSchindler @mteixeira-wwn