feature: improve support for epub/html styling

At the moment users have no practical means to add rich stylization to EPUB output, because it is impractical to write CSS selectors that apply specifically and uniquely to many of the document elements users might wish to select.

Adding class tags to logical document elements would easily solve this problem.

A small sampling of class names that might be useful follows:

title_page
title
author
chapter
chapter_title
chapter_label
chapter_number
chapter_body
chapter_paragraph
section1
section1_title
em
strong
break (i.e. thematic break)

Would a more comprehensive list of this kind be considered as a basis for adding a set of class tags to Pandoc's EPUB output? If so, I can draft one.

Let's see, don't we already have most of this?

titlepage, title, author:

<section epub:type="titlepage" class="titlepage">
  <h1 class="title">Pandoc User’s Guide</h1>
  <p class="author">John MacFarlane</p>
  <p class="date">July 6, 2019</p>

chapter:

<section id="synopsis" class="level1">

(Okay, so it's called level1, but it applies to the whole chapter section.)

chapter_title (now in HEAD):

<h1 class="chapter-title">Synopsis</h1>

OK, so focusing on what we don't already have:

chapter_number: We do have

<h1 class="chapter-title" data-number="1"><span class="header-section-number">1</span> Synopsis</h1>

But I suppose you object to having to say h1 > span.header-section-number if you want to target chapter numbers but not section numbers?
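
For concreteness, a minimal CSS sketch of the selector in question, written against the h1 markup shown above. The declarations are arbitrary placeholders, and the second rule assumes that section headings (h2) carry the same header-section-number span, which the excerpt above does not show.

```css
/* Chapter numbers only: a number span that is a direct child of an h1. */
h1 > span.header-section-number {
  font-variant: small-caps;  /* placeholder declaration */
}

/* Section numbers, ASSUMING h2 headings use the same span class. */
h2 > span.header-section-number {
  font-weight: normal;       /* placeholder declaration */
}
```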

chapter_label: I don't think pandoc generates anything that could be considered a chapter label (I assume you must mean the word "Chapter"). Since we don't generate this, it's irrelevant here.

chapter_paragraph: Are you saying you want all the paragraphs inside the main text to say <p class="chapter_paragraph">?

section1: I assume you mean sub-chapter sections:

<section id="general-options" class="level2">

There is currently no separate class on the h2 that comes directly under this as the section title. I understand that you object to .level2 > h2 or just h2 which would also work.

em, strong: we use em and strong elements, why add a class that simply duplicates the element name?

break: I assume you're looking for <hr class="break">? Again, this seems redundant: hr means "thematic break" in HTML5.

Since I'm not really sure what you are looking for, and some of what you suggest seems already to be there in pandoc's output, the most useful thing would be a direct comparison of pandoc output (please use HEAD, which incorporates changes motivated by your other issue) and the output you think would be better. And I would suggest bringing up this issue on pandoc-discuss.

So you only want me to suggest class names that are not redundant with existing ones in the output?

I think ultimately what is needed is a list of class names that are both described in documentation and used by the writer and template. I think the first step is to achieve a consensus on what the list of classes should be. It should be something that someone can review and affirm both of the following:

Yes, it looks like the writer and template can support this set of class names, with the missing ones being added in a future revision, and the list being placed somewhere in the documentation.
Yes, it looks like a style designer would have all the tools needed to customize fully the output.

Currently classes are used in the output in somewhat of an ad-hoc fashion, which is by no means unexpected given the state of ongoing development. But it becomes hard to sustain this model because to write a style sheet one has to look at the output and make observations about the current set of class names, and meanwhile template and writer designers are making guesses on-the-fly about what style designers might want in the future.

So everyone is second-guessing each other without really knowing what to do. Again, it's not unexpected while the software is still maturing, but it's also not the desired destination. This problem is solved if both parties refer to a design that serves as the handshake between them.

I could write a list that excludes currently existing classes, but then we don't have a list of classes that actually represents a design. We are then stuck in the model of the code is the documentation, which should be something we are trying to move past.

I know that in an issue tracker, the temptation is to ask simply for a minimum set of code changes, but sometimes it is better to think more broadly.

However, do you think it is reasonable first to try to agree to a set of all the classes that should be used?

But I suppose you object to having to say h1 > span.header-section-number if you want to target chapter numbers but not section numbers?

There is currently no separate class on the h2 that comes directly under this as the section title. I understand that you object to .level2 > h2 or just h2 which would also work.

Earlier the suggestion was made to select a logical document element by means of the physical structure that happens to constitute it. I find this confusing, obtuse, and unstable from a user standpoint, as you recall I mentioned.

I'm not categorically opposed, however, to using child selectors or descendant selectors when they are composed of constituent selectors that the user would recognize as corresponding to logical document elements. So while I do find h1, h2, and span inappropriate for this use case, we could consider something like .level2 > .section_number. The entire selector is then expressed in terms of logical significance and isolated from the physical structure. However, this could still create problems. Suppose the child selector works in an early version, until the output changes such that an intermediate element is inserted into the tree path. Then the child selector must be replaced with a descendant selector. The latter would work in both cases, but there is no means to enforce its use beforehand, meaning some users will create style sheets that are not future-proof. Fortunately, the problem is trivial to avoid simply by naming a class section2_number.
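
The three options can be put side by side in a short sketch. The class names section_number and section2_number are the hypothetical ones proposed in this comment, not classes pandoc currently emits, and the declarations are placeholders.

```css
/* Child selector: stops matching if a wrapper element is ever inserted
   between .level2 and .section_number. */
.level2 > .section_number {
  color: gray;
}

/* Descendant selector: survives an inserted wrapper, but also matches the
   numbers of any deeper sections nested inside .level2. */
.level2 .section_number {
  color: gray;
}

/* Dedicated class: matches regardless of where the element sits in the tree. */
.section2_number {
  color: gray;
}
```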

In this case the compound selectors are not terrible, but also more complicated and error-prone than necessary. It is not something I would passionately fight over.

em, strong: we use em and strong elements, why add a class that simply duplicates the element name?

break: I assume you're looking for <hr class="break">? Again, this seems redundant: hr means "thematic break" in HTML5.

Yes, but every selector in any style sheet in the EPUB document applies to all sections of the book. So the moment someone uses the <hr> tag in a new context within the total XHTML content, existing selectors for this element are applied to these new elements, causing an undesired effect. For example at some point a title page might be generated that employs <hr> elements for visual effect, even though such use does not logically correspond to a thematic break. Better to insulate against this possibility by using the break class to uniquely identify thematic breaks that follow from the logical document structure.
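
To make the scoping concern concrete, here is a minimal sketch. It assumes a hypothetical break class on thematic breaks (not something pandoc emits today), and the declarations are arbitrary placeholders.

```css
/* Element selector: matches every hr in every content document that links
   this style sheet, including a decorative rule on a title page. */
hr {
  border: 0;
  border-top: 1px solid;
  margin: 2em 30%;
}

/* Class selector (hypothetical class): matches only elements marked as
   thematic breaks, whatever element is used to render them. */
.break {
  margin: 2em 30%;
  text-align: center;
}
```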

Also, I would say at this point the user will be accustomed to using classes for selection. Asking the user to use element names in some places simply complicates matters for the user. I think a user is best served by a simple table of class names that can be used reliably, not a larger set of cases that look obtuse to the user. It would be common for the user to include or omit a period inadvertently when taxed with dealing with implementation details that ought to be transparent to him (e.g. "Is this one of the elements that Pandoc wants me to select using a class name or element name?").

Finally, note that the <hr> element is not the only means to represent a thematic break. Some designs entail character sequences or even images.

Utilizing classes offers safety and flexibility in an unpredictable environment.

And overall, using them as exclusively as possible makes the user's world straightforward and predictable.

The tight coupling you describe between element name and logical function might be accurate for simple, single page articles, but for books it is not.

chapter_label: I don't think pandoc generates anything that could be considered a chapter label (I assume you must mean the word "Chapter"). Since we don't generate this, it's irrelevant here.

There are a bunch of different items that compose a chapter header. At the moment, the word "Chapter" might not be output (which I didn't realize), but generally books do use it, and this is what was meant by label. There are other issues currently open related to how Pandoc renders chapter headers in EPUB. Of course we don't necessarily need to adopt classes to describe features not currently in existence, but it is best to be aware that such features might occur in the future.

Meanwhile, consider that the following are all parts of a chapter header:

The label, for example, "Chapter".
The number.
Both the label and number combined.
The title.
The punctuation separating the title from the label and number.
All of the above combined.

chapter_paragraph: Are you saying you want all the paragraphs inside the main text to say <p class="chapter_paragraph">?

Almost. The <p> tag is already heavily overloaded in the output. It is used for the author and date. It may be used for many other things in the future. Captions. Footnotes.

So all body text needs at least one class tag. But not all body text is the same. Some body text appears directly in a chapter. Some under a section. Some under a subsection. Style designers should be free to independently style all of these cases. Having said as much, it might also be wise to use a body_text class that includes all these contexts.

A simplified representation of how a chapter might look:

<html>
<body epub:type="bodymatter" class="bodymatter">
<section id="beginning" class="level1">
<h1 class="level1-header"><span class="level1-header-number">1</span><span class="level1-header-spacing"> </span><span class="level1-title">Beginning</span></h1>
<div class="level1-body">
<p class="level1-paragraph">First.</p>
<p class="level1-paragraph">Second.</p>
<div class="level2">
<h2 class="level2-header"><span class="level2-header-number">1.1</span><span class="level2-header-spacing"> </span><span class="level2-title">Continuing</span></h2>
<div class="level2-body">
<p class="level2-paragraph">A.</p>
<p class="level2-paragraph">B.</p>
</div>
</div>
</div>
</section>
</body>
</html>
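
With markup along those lines, a style sheet could address each context by class alone. A minimal sketch, using only class names from the example above; the declarations are placeholders, not recommendations.

```css
/* Paragraphs directly under the chapter heading. */
.level1-paragraph {
  text-indent: 1.5em;
}

/* Paragraphs inside a level-2 section, styled independently. */
.level2-paragraph {
  text-indent: 0;
  margin-left: 1em;
}

/* Chapter number and chapter title, each addressed without naming h1 or span. */
.level1-header-number {
  font-size: 2em;
}
.level1-title {
  font-style: italic;
}
```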

em, strong: we use em and strong elements, why add a class that simply duplicates the element name?

break: I assume you're looking for <hr class="break">? Again, this seems redundant: hr means "thematic break" in HTML5.

If the preference is strong to use native HTML tags for corresponding document elements, a hybrid approach is available: <hr class="doc-body"/> and similarly for em and strong.

The difference is sharing a common class among a variety of tags. And still the class designates the element for a specific purpose and context. Selectors can then filter for the combination: hr.doc-body. I'm not sure that this approach is easier, but some may like it better.
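
A short sketch of that hybrid, assuming the hypothetical doc-body class is attached to hr, em, and strong wherever they occur in body text; the declarations are placeholders.

```css
/* Element plus shared class: one rule per element type, limited to
   occurrences that belong to the document body. */
hr.doc-body {
  margin: 2em 30%;
}
em.doc-body {
  font-style: italic;
}
strong.doc-body {
  font-weight: bold;
}

/* The same elements without the class (for example, an hr on a generated
   title page) are not matched by any of the rules above. */
```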

But documentation will be invaluable to guide the user about which selection criteria may be safely used for which effects.

It's possible to conceive of cautiously omitting class tags on some elements, if done carefully and documented adequately.

In the example below, notice that .level1 p selects all paragraph text, not just level 1 text, because an element with class level1 is an ancestor of every paragraph. Further, while the selector .level1 > .level-header selects the top-level header text, the selector .level1 > p selects no elements, because no p elements are direct children of a level1 element. Equally, wrapping the header in any intermediary element at some future point will break the header selection that uses the child selector. This distinction opens the possibility of errors in style sheets not created with adequate care.

<html>
<body epub:type="bodymatter" class="bodymatter">
<section id="beginning" class="level1">
<h1 class="level-header"><span class="header-number">1</span><span class="header-spacing"> </span><span class="header-title">Beginning</span></h1>
<div class="level-body">
<p>First.</p>
<p>Second.</p>
<div class="level2">
<h2 class="level-header"><span class="header-number">1.1</span><span class="header-spacing"> </span><span class="header-title">Continuing</span></h2>
<div class="level-body">
<p>A.</p>
<p>B.</p>
</div>
</div>
</div>
</section>
</body>
</html>
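
Written out as rules against the example above (read as well-formed markup), the selectors behave as follows. This is only a sketch: the declarations are placeholders, and the last rule is an extra illustration rather than something proposed earlier.

```css
/* Descendant selector: matches First, Second, A and B, because the
   level1 section is an ancestor of all four paragraphs. */
.level1 p {
  text-indent: 1.5em;
}

/* Child selector: matches nothing, since no p is a direct child of
   the level1 section. */
.level1 > p {
  text-indent: 0;
}

/* Matches only the h1; the h2 is a child of the level2 div instead. */
.level1 > .level-header {
  page-break-before: always;
}

/* Spelling out the full path reaches First and Second only, but it stops
   matching as soon as a wrapper is added or removed along that path. */
.level1 > .level-body > p {
  margin-top: 0;
}
```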

How common is it to want to style paragraph text differently if it appears under a section heading, as opposed to immediately under the chapter heading? I don't recall having seen that in real books. (Sometimes you might start with, say, a "summary" block in larger text, but this would easily be achieved more explicitly, by putting it in a div with a class.)

Having read all of this discussion I still assert that adding classes to everything is officially an anti-pattern in HTML/CSS usage and clutters up the output. Any book themes written for such a style system would be incompatible with EPUBs generated with other tooling.

As it stands:

Pandoc's approach to every other document format is to use the closest available semantic markup.
There are already fairly established conversions for what content goes into what semantic place in HTML and conventions for EPUB as well. So far Pandoc has headed in the direction of following those. Adding a secondary naming scheme that only allows theming that avoids the cascading nature of CSS would be to take it in quite a different direction.

If there is anything that cannot be targeted using CSS2 selectors (i.e. that would require CSS3 to target) I would be happy to see improvised solutions so ebook reader support isn't eschewed. I would be unhappy to see the output cluttered with a redundant way to write selectors.

There are some examples above of data that could easily gain extra spans and classes to improve the level of content that is selectable and stylable independently. Those would be good things, but they should be simplified, for example:

<!-- too cluttered -->
<h2 class="level2-header"><span class="level2-header-number">1.1</span><span class="level2-header-spacing"> </span><span class="level2-title">Continuing</span></h2>

<!-- simplified, everything can be targeted with CSS2 selectors -->
<h2><span class="numbering">1.1 </span>Continuing</h2>
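
A sketch of how the simplified markup could be targeted using only the type, class, and child selectors available in CSS2. The numbering class is the one from the snippet above; the declarations are placeholders.

```css
/* The whole level-2 heading, number and title together. */
h2 {
  font-size: 1.3em;
  font-weight: bold;
}

/* Only the number (and its trailing space) inside a level-2 heading. */
h2 > .numbering {
  color: gray;
  font-weight: normal;
}
```

In this form the title text has no element of its own, so it is styled through the h2 rule, with the numbering rule overriding whatever should differ for the number.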

There are also quite a few examples of duplicating things that are already easily doable. Those would be bad:

<!-- entirely unnecessary -->
<div class="level2">
<div class="level2-body">
<p class="level2-paragraph">A.</p>

<!-- existing markup is fine, use child selector: .level2 > p -->
<section class="level2">
<p>A.</p>

I see both sides of the issue, and I would like to get feedback on this issue from other people who use pandoc to produce EPUBs. Posting on pandoc-discuss and linking to this issue might be a way to get more eyes on this.

How common is it to want to style paragraph text differently if it appears under a section heading, as opposed to immediately under the chapter heading?

It is uncommon, of course, in familiar practice, but why anticipate practice, if it's easy simply to support this case, and every other?

Any book themes written for such a style system would be incompatible with EPUBs generated with other tooling.

Are style sheets for EPUB documents currently mutually compatible? Is Pandoc currently participating in such a standard of mutual compatibility?

There are already fairly established conversions for what content goes into what semantic place in HTML and conventions for EPUB as well

That's great, but please, please provide a reference. Simply saying there are conventions doesn't help anyone make a good decision.

Also notice that giving all elements of a given type the same style is well suited to articles on a single page, since every use of a particular element has the same meaning in the document. Within complicated systems of documents, such as eBooks, the established general conventions for HTML may be inadequate unless they are expanded to account for the greater range of concerns and pitfalls that occur in HTML eBooks. Only conventions that have been used and proven in the context of EPUB should be considered.

There are some examples above of data that could easily gain extra spans and classes to improve the level of content that is selectable and stylable independently. Those would be good things, but they should be simplified...

There are also quite a few examples of duplicating things that are already easily doable. Those would be bad.

It seems that <section> might be indeed preferred, as suggested, in HTML5.

But many, perhaps most, of the simplifications that you made in your examples, @alerque, reduce functionality. I'm not sure whether you are unaware of such, or rather you are aware but opining that such functionality is unneeded.

The EPUB and HTML outputs share similarities, both in current as well as optimal structure. Design improvements for both outputs should be considered together, emphasizing appropriate reuse in both design and implementation.

Changed issue title accordingly.

jgm / pandoc

feature: improve support for epub/html styling #5749