nlbdev / nordic-epub3-dtbook-migrator

Tools for converting between a strict subset of DTBook and EPUB3.
http://nlbdev.github.io/nordic-epub3-dtbook-migrator/
GNU Lesser General Public License v2.1
8 stars 7 forks source link

General figure #432

Closed kalaspuffar closed 3 years ago

kalaspuffar commented 3 years ago

Hi @josteinaj

I've added a general figure case where we allow figures with any elements.

Also some minor bug fixes.

Best regards Daniel

josteinaj commented 3 years ago

Since all classes are allowed in the general figure element, then badly marked up image figures will be allowed according to that definition.

For instance, this is valid according to the generic definition:

<figure class="image">
    <img src="1.jpg"/>
    <img src="2.jpg"/>
    <img src="3.jpg"/>
</figure>

Would it work to use <exclude> to exclude image and image-series? I haven't tried if this works, or if this is the correct way to do it in RelaxNG, but something like this:

                    <optional>
                        <attribute name="class">
                            <list>
                                <ref name="classes">
                                    <exclude>
                                        <value>image</value>
                                        <value>image-series</value>
                                    </exclude>
                                </ref>
                            </list>
                        </attribute>
                    </optional>

If that doesn't work, maybe image figure validation needs to be moved to schematron?

kalaspuffar commented 3 years ago

Hi @josteinaj

I ran your example with the our current ruleset and got this validation issue:

[nordic253b] Image figures must contain exactly one img. (<figure class="image">) (EPUB/C00000-08-chapter.xhtml)

So that example is not allowed, at least.

I have not seen an exclusion feature in RelaxNG. I think you need to create another list to exclude them.

Best regards Daniel

josteinaj commented 3 years ago

Ok. Then maybe we should just define the third case in RelaxNG, and have more specific rules for figures with image/image-series in Schematron?

In the RelaxNG:

In Schematron:

What do you think? Would this work?

kalaspuffar commented 3 years ago

Hi @josteinaj

It seems correct. The only thing I'm a bit confused by is the page break. Do we really need/want to allow this?

You have a figure that only can contain an image, a description/caption of that image, and a production note about the image. Is it reasonable to have a page break in between any of these elements?

To clarify, these are the only cases that are allowed.

<figure class="image">
    <figcaption>
    <img>
    <prodnote>
</figure>
<figure class="image-series">
    <figcaption>
    <figure class="image">
        <figcaption>
        <img>
        <prodnote>
    </figure>
    <prodnote>
    <pagebreak>
    <figure class="image">
        <figcaption>
        <img>
        <prodnote>
    </figure>
</figure>

Or is there a case I've missed in my understanding of your ruleset above?

Best regards Daniel

josteinaj commented 3 years ago

Mostly correct. But for image-series, all the image figures must be grouped at the end according to the RelaxNG. Also, where prodnote and pagebreak is referenced, there's no restriction on the order or number of occurences.

So:

<figure class="image">
    <figcaption> <!-- optional -->
    <img>
    <prodnote> <pagebreak> <!-- any number of these two in any order -->
</figure>
<figure class="image-series">
    <figcaption> <!-- optional -->
    <prodnote> <pagebreak> <!-- any number of these two in any order -->
    <figure class="image">
        <figcaption> <!-- optional -->
        <img>
        <prodnote> <pagebreak> <!-- any number of these two in any order -->
    </figure>
    <figure class="image">
        <figcaption> <!-- optional -->
        <img>
        <prodnote> <pagebreak> <!-- any number of these two in any order -->
    </figure>
</figure>

The model stems from DTBook, where page breaks were added to the model in 2007:

jpritchett@rfbd.org: 2007-12-17. (Issue 200) Added pagenum to imggroup content model

I managed to dig up the issue in the waybackmachine: https://web.archive.org/web/20100906155555/http://www.daisy.org/z3986-issues-detail?ZedIssuesId=200

It was submitted by MTM, and argues that page breaks are needed in image series:

When a group of images spans several pages, there is no way to represent that in DTBook. The current restriction of use applied to <pagenum> elements within <imggroup> is leading to markup solutions that in practical coding terms, are at best long-winded and at worst may lead to error. In terms of semantics one can find oneself providing deeply ambiguous documents. Any content that a producer judges best represented as a suite of images suffers as a result of this restriction, if that group of images stretches over more than one page. While <caption imgref="..."> attribute application can be used to associate the images it cannot be considered as optimal as consistent and logical element markup. This issue resembles an earlier posted issue (#199) concerning the application of <pagenum> within <table> markup. Both issues could satisfactorily be resolved by the removal of the restriction on <pagenum> markup within the named elements.

It seems to be the last change made to the DTBook standard, together with allowing pagebreaks in tables.

In HTML, figcaption has to be the first or last element according to the standard. So when we converted the DTBook model of image series:

imggroup: (prodnote | img | caption | pagenum)+

We chose to wrap the img element in a figure, as well as making the imggroup into a figure, so that it would still be possible to have prodnote/img/caption/pagenum in any order, as in DTBook:

image-series: figcaption? (prodnote | pagenum)* image+
image: figcaption? img (prodnote | pagenum)*

If there's something to change, I agree that it doesn't make much sense to blend pagebreaks in between figcaption, prodnote and img. I think it should be at the very beginning of the <figure class="image">, but figcaption needs to be the first element when present (according to the HTML spec), and pagebreaks are allowed inside figcaptions already, so we could allow zeroOrMore pagebreaks right after figcaption, to allow for marking up that an image starts at a new page (I suppose the figcaption could be at the previous page in a paper book). The way it is today is that pagebreaks must be after the img element, which is not an optimal solution.

If we allow pagebreaks before the img, then it wouldn't be necessary to allow it after the img, as it could be put where it belongs instead; outside the figure or in the next figure. For the prodnote, I don't know what makes most sense (before or after the img, but maybe allow both for that one as well. Production notes are not part of the original paper book, and so it shouldn't be necessary to have pagebreaks between them I think. So maybe:

image-series: figcaption? pagenum* prodnote* image+
image: figcaption? pagenum* prodnote* img prodnote*

(CC @AndersEkl @martinpub - you might have an opinion here)

AndersEkl commented 3 years ago

For our purposes, I'd be fine with only allowing pagebreaks inbetween <figure class="image"> sets in an image series. There would be no case where we would want to have a page break between a figcaption and the image itself. I know there are cases where the caption is placed on a different page than the image, but I see no reason why we would specifically want to recreate that in the EPUB file. This is me talking from SPSM's point of view, though, so there may be opposing ideas.

kalaspuffar commented 3 years ago

Hi @josteinaj

I might have created a bird from a feather here. I have been thinking about the RelaxNG rules, and is there any reason to allow the class image and image-series for any other element than a figure? If not, we could remove them from the list of classes and have a working solution for all kinds of figures without any other changes.

Have I missed something?

Best regards Daniel

josteinaj commented 3 years ago

@kalaspuffar When we reference the "classes" definition, we're really saying that "anything goes". The list in that definition is just potentially useful for reference, and for autocomplete in XML editors. At the end it declares that any other class is also allowed, so removing image and image-series from the definition doesn't work (unless we figure out a way to exclude certian values):

https://github.com/nlbdev/nordic-epub3-dtbook-migrator/blob/a46aa82e1d15205c23b15bbe4010d1bcf6b06d7f/src/main/resources/xml/schema/2020-1/nordic-html5.rng#L4482-L4486

kalaspuffar commented 3 years ago

Hi @josteinaj

Could this be a solution?

https://github.com/nlbdev/nordic-epub3-dtbook-migrator/pull/432/commits/793e9b4f152995a10874ca09de71f2b798f5ee17

Best regards Daniel

josteinaj commented 3 years ago

I don't think the RelaxNG should limit the classes that can be used - we want to add custom classes later without the book becoming invalid.

I tried this test with <except>, which seems to work:

file:///tmp/test.xhtml

<!DOCTYPE html>
<?xml-model href="file:///tmp/test.rng"?>
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <figure class="asdf image foo"/>
    </body>
</html>

file:///tmp/test.rng

<?xml version="1.0" encoding="UTF-8"?>
<grammar 
    xmlns="http://relaxng.org/ns/structure/1.0"
    xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

    <start>
        <ref name="html"/>
    </start>

    <define name="html">
        <element name="html">
            <element name="head">
                <element name="title">
                    <text/>
                </element>
            </element>
            <element name="body">
                <ref name="figure"/>
            </element>
        </element>
    </define>

    <define name="figure">
        <element name="figure">
            <attribute name="class">
                <list>
                    <zeroOrMore>
                        <data type="string">
                            <except>
                                <value>image</value>
                                <value>image-series</value>
                            </except>
                        </data>
                    </zeroOrMore>
                </list>
            </attribute>
        </element>
    </define>

</grammar>

So I think the "classes" definition can be replaced with something like this:

<define name="classes">
    <a:documentation>Allow all classes, except certain classes that have a special meaning.</a:documentation>
    <a:documentation>Classes with a special meaning will be handled explicitly where relevant.</a:documentation>
    <zeroOrMore>
        <data type="string">
            <except>
                <value>frontcover</value>
                <value>rearcover</value>
                <value>leftflap</value>
                <value>rightflap</value>
                <value>line</value>
                <value>linenum</value>
                <value>title</value>
                <value>prodnote</value>
                <value>sidebar</value>
                <value>annotation</value>
                <value>byline</value>
                <value>dateline</value>
                <value>linegroup</value>
                <value>verse</value>
                <value>verse-author</value>
                <value>linegroup</value>
                <value>line</value>
                <value>line_indent</value>
                <value>line_longindent</value>
                <value>linenum</value>
                <value>acronym</value>
                <value>page-front</value>
                <value>page-normal</value>
                <value>page-special</value>
                <value>noteref</value>
                <value>annoref</value>
                <value>image-series</value>
                <value>image</value>
                <value>emptyline</value>
                <value>separator</value>
                <value>title</value>
                <value>docauthor</value>
                <value>bridgehead</value>
                <value>lic</value>
                <value>row</value>
                <value>col</value>
                <value>rowgroup</value>
                <value>colgroup</value>
                <value>table-summary</value>
            </except>
        </data>
    </zeroOrMore>
</define>
kalaspuffar commented 3 years ago

Hi @josteinaj

It seems to work fine. Have changed the "classes" definition to your suggestion and tested it.

This solution does not enable the feature of autocompletion if I understand it correctly, then again I don't know if that is important.

Best regards Daniel

josteinaj commented 3 years ago

If we want autocompletion (or maybe just for documentation purposes), I think it might work with:

    <zeroOrMore>
        <choice>
            <data type="string">
                <except>
                    <!-- … --->
                </except>
            </data>
            <value>class-1</value>
            <value>class-2</value>
            <value>etc</value>
        </choice>
    </zeroOrMore>

But I don't think autocompletion is important.

martinpub commented 3 years ago

For our purposes, I'd be fine with only allowing pagebreaks inbetween <figure class="image"> sets in an image series. There would be no case where we would want to have a page break between a figcaption and the image itself. I know there are cases where the caption is placed on a different page than the image, but I see no reason why we would specifically want to recreate that in the EPUB file. This is me talking from SPSM's point of view, though, so there may be opposing ideas.

I agree, I see no point in having that for MTM either. Allowing page breaks between image figures in an image-serious is enough.

josteinaj commented 3 years ago

So I guess:

image-series: figcaption? (pagenum* prodnote* image)+
image: figcaption? prodnote* img prodnote*
AndersEkl commented 3 years ago

@josteinaj Why is there a prodnote* before the img in the image case?

josteinaj commented 3 years ago

@AndersEkl Good catch, thanks.

The 2015-1 guidelines is:

image: img (pagebreak | prodnote)* figcaption?

So, updated suggestion for 2020-1:

image-series: figcaption? (pagenum* prodnote* image)+
image: img prodnote* figcaption?

I hope this DTD-ish syntax makes sense, I just thought it would be more compact while discussing.

By the way, I see now that the 2020-1 guidelines has a new element that we need to handle in the validator:

<aside class="fig-desc">…</aside>

Is this meant to replace prodnotes? Prodnotes are not mentioned.

<aside epub:type="z3998:production">
AndersEkl commented 3 years ago

@josteinaj Yes, we didn't want to include the z3998 semantic vocabulary as there is no support for it in any assistive technology. Assistive technology rather depends on ARIA instead of epub:type.

Shouldn't the order be: image: figcaption? img prodnote*

That is the order we have stated in the guidelines.

martinpub commented 3 years ago

Is this meant to replace prodnotes? Prodnotes are not mentioned.

@josteinaj There could be prodnotes that are not image descriptions. This has been raised for a guidelines revision with https://github.com/nlbdev/epub3-guidelines-update/issues/121

josteinaj commented 3 years ago

@AndersEkl Oh, right. I didn't notice that it had changed order in 2020-1.

So:

image-series: figcaption? (pagenum* prodnote* image)+
image: figcaption? img prodnote*

Note that this may result in two consecutive figcaptions, one for the image-series and one for the image. Screen readers might handle the association between img, figure and figcaption fine, I'm not sure. But visually, and when read aloud, it might be more difficult to separate the two figcaptions.

josteinaj commented 3 years ago

@martinpub So since prodnotes are not replaced by image descriptions, let's keep both in the RelaxNG I guess?

So we need a new definition of fig-desc, separate from <define name="prodnote"> (which should be left as in the 2015-1 guidelines for now until the fate of prodnotes has been decided).

So:

image-series: figcaption? (pagenum* prodnote* image)+
image: figcaption? img (prodnote | fig-desc)*
AndersEkl commented 3 years ago

@josteinaj I don't see any huge problem with having two consecutive captions. It can often be that way in the printed original as well, that the main caption is directly followed by the first individual caption. Often there is some sort of numbering that would clarify what each caption refers to.

josteinaj commented 3 years ago

@kalaspuffar is it clear what should be done here, or should I summarize?

kalaspuffar commented 3 years ago

Hi @josteinaj

A quick summation would be nice as there has been a lot of discussions back and forth.

Best regards Daniel

kalaspuffar commented 3 years ago

Hi @josteinaj

I'm still not sure what to change here.

Is this the last word?

image-series: figcaption? (pagenum* prodnote* image)+
image: figcaption? img (prodnote | fig-desc)*

Best regards Daniel

josteinaj commented 3 years ago

@kalaspuffar yes that is the last word.

Here's a suggestion, does it look right to you?

    <define name="figure.imggroup">
        <choice>
            <ref name="figure.image"/>
            <ref name="figure.image-series"/>
        </choice>
    </define>

    <define name="figure.image">
        <a:documentation>An image. The figure provides a container for the associated figcaption.<a:documentation>
        <element name="figure">
            <ref name="attlist.figure.image"/> <!-- @class, @id, @title, @xml:space, @xml:lang, @lang, @dir, @epub:type -->
            <optional>
                <ref name="caption.figure"/> <!-- figcaption -->
            </optional>
            <ref name="img"/> <!-- img -->
            <zeroOrMore>
                <choice>
                    <ref name="fig-desc"/> <!-- aside -->
                    <ref name="prodnote"/> <!-- aside -->
                </choice>
            </zeroOrMore>
        </element>
    </define>

    <define name="figure.image-series">
        <a:documentation>An image series. The figure provides a container for multiple images and a figcaption(s) associated with the series itself.</a:documentation>
        <element name="figure">
            <!-- In HTML there can be at most one figcaption per figure, and it must be either the first or last child of the figure.
                 As a result, we wrap each (img caption*)-group in nested figure elements. -->
            <ref name="attlist.figure.image-series"/> <!-- @class, @id, @title, @xml:space, @xml:lang, @lang, @dir, @epub:type -->
            <optional>
                <ref name="caption.figure"/> <!-- figcaption -->
            </optional>
            <oneOrMore>
                <zeroOrMore>
                    <choice>
                        <ref name="pagebreak.block"/> <!-- div -->
                        <ref name="prodnote"/> <!-- aside -->
                    </choice>
                </zeroOrMore>
                <ref name="figure.image"/>
            </oneOrMore>
        </element>
    </define>

    <define name="fig-desc">
        <element name="aside">
            <ref name="attlist.fig-desc"/>
            <ref name="flow"/>
        </element>
    </define>

    <define name="attlist.fig-desc" combine="interleave">
        <attribute name="class">
            <list>
                <ref name="classes"/>
                <value>fig-desc</value>
                <ref name="classes"/>
            </list>
        </attribute>
        <attribute name="attr.describedby"/>  <!-- or however this attribute is defined. See: #435 -->
        <ref name="attrsrqd.notype"/> <!-- @id, @title, @xml:space, @xml:lang, @lang, @dir, @class -->
    </define>
kalaspuffar commented 3 years ago

Hi @josteinaj

Great suggestion. I added it to this PR with some minor modifications.

Another thing we might want to implement later on is to check that if there is a fig-desc there should be an image that points to it and if there is a pointer aria-describedby then that should have a corresponding description within the same figure.

I think these rules should be implemented in Schematron and might not be in the scope of this PR.

Best regards Daniel

josteinaj commented 3 years ago

Another thing we might want to implement later on is to check that if there is a fig-desc there should be an image that points to it and if there is a pointer aria-describedby then that should have a corresponding description within the same figure.

I think these rules should be implemented in Schematron and might not be in the scope of this PR.

Good idea. Yes those should be in schematron.