sbsdev / pipeline-mod-sbs

SBS specific modules for the DAISY Pipeline 2

EPUB3/HTML support #9

Closed bertfrees closed 6 years ago

bertfrees commented 8 years ago

If SBS also intends to use EPUB3 as input format, pipeline-mod-sbs should also have an epub3-to-pef script and corresponding HTML translator.

mixa72 commented 8 years ago

Yes, it's quite likely that we will also use EPUB3 as input format in the long run.

bertfrees commented 8 years ago

Other agencies have done this by making some or all XSLT and CSS files work with both DTBook and HTML (EPUB). Maintainability and readability do seem to benefit from that, because DTBook is quite similar to HTML.

(The alternative to making XSLT and CSS files work with both DTBook and HTML is to have separate stylesheets for DTBook and HTML, but because they have a lot in common they could import a common stylesheet.)

See for example:

For us, it means the following files either have to be made to work with HTML or ported to HTML.

bertfrees commented 7 years ago

There are a number of SBS-specific extensions to DTBook for which we need to find an alternative in EPUB 3. I will create a table here with the mapping and we can discuss the problematic ones.

[table moved to wiki page]

bertfrees commented 7 years ago

@egli Can I get that pointer to the Nordic EPUB specification?

bertfrees commented 7 years ago

@egli The progress can be seen on branch sbs-9. I'm holding off on merging it because I still can't run all the tests at once, even after increasing the memory.

bertfrees commented 7 years ago

If at some point we want to let Mischa try it, and I haven't found a real solution to the problem yet, we could merge it but with the new tests disabled.

egli commented 7 years ago

We do have an EPUB test document that was produced in India.

bertfrees commented 7 years ago

@egli and @mixa72 Please have a look at my table above, I've updated it. Maybe you have some more ideas.

mixa72 commented 7 years ago

That looks pretty good. It is interesting how many standards there are for the different purposes. Compared to our current DTBook, EPUB3 will apparently involve a lot more namespaces and the terminology will be a varied mix. So whatever you decide for the elements brl:select, brl:running-line, brl:toc-line, brl:time is fine by me since it's not possible to find a uniform naming anyway. BTW: AFAIK @brl:class is indeed only used for SBSForm.

bertfrees commented 7 years ago

Coherence is indeed something we need to think about carefully. You need to work with this every day, so your opinion is important. At the same time, using standards is also important, and last but not least, so is compatibility with the Nordic guidelines. Changing the Nordic guidelines is possible but apparently a slow process.

The Nordic guidelines have apparently chosen to use "class" for some semantics instead of a custom "epub:type" prefixed with "nordic:". I'm not sure what the motives were. However they do use epub:types that are available in either the default or the z3998 vocabulary. Moreover, they do have a "nordic:" prefix but they only use it for some of the metadata, not for epub:types.

Nordic's use of class is not always appropriate in my opinion, but I think we have to live with this. It's also hard to avoid the mix of different attributes and prefixes because this is just how EPUB works, and because of the compatibility requirement with Nordic. What we could do to simplify things a bit is to not use our own "sbs:" prefix and use classes instead. This is semantically not optimal, but at least it creates some coherence with the Nordic guidelines. In addition, we can try to completely eliminate "brl:" elements and attributes.

egli commented 7 years ago

I would not take the Nordic guidelines as the be-all-end-all truth. While they are useful and most likely will define the shape of the EPUB we will get from our providers I would also be forward looking and improve things where you think it makes sense.

bertfrees commented 7 years ago

We could of course have a converter from "Nordic EPUB 3" to "SBS EPUB 3". But this makes interchanging files a bit difficult unless we have the conversion in the two directions.

mixa72 commented 7 years ago

Doing the markup with Oxygen is very user-friendly. DTBooks can be validated against both our inhouse minimal schema and the classic DTD. The most important feature is that the editor displays a list with all the possible elements at any place in the document (auto completion). If Oxygen also behaves like that with EPUB3 files then I don't see any problems for the users. It will take some time to learn and memorize the new markup, that's obvious, but after a while everybody will get used to it.

egli commented 7 years ago

I talked to @mixa72 about this yesterday and the consensus seems to be that the actual names of the elements that we will use in the EPUB are not so important to the transcribers, as long as oXygen does the auto completion.

bertfrees commented 7 years ago

Yes that's what Mischa said last time. But still we should think it through. What about the things where I have put question marks?

mixa72 commented 7 years ago

It's OK with me if you use the following for EPUB3:

brl:class --> @class (no prefix)

brl:select --> brl:select (or solution with span)
brl:when-braille --> brl:when-braille (or solution with span)
brl:literal[@brl:grade=...] --> brl:literal[@brl:grade=...] (or solution with span)
brl:otherwise --> brl:otherwise (or solution with span)

brl:running-line --> brl:running-line
brl:toc-line --> brl:toc-line
brl:volume[@brl:grade=...] --> br[@class='braille-volume-break-grade-...']

brl:time --> brl:time (if we keep brl:date; if we use sbs:date instead, I'd also prefer sbs:time)

But I'm open to accepting anything as long as there is no loss in functionality compared to the current system.

bertfrees commented 7 years ago

The choice for moving from brl:homograph to span[epub:type='z3998:homograph'], from brl:v-form to span[epub:type='z3998:v-form'], from brl:place to span[epub:type='z3998:place'], etc. was a no brainer, because all of these terms are defined in z3998. However for brl:date, brl:time and brl:name there are no obvious replacements.

For brl:name I have proposed span[epub:type='foaf:name'] because I saw that z3998:personal-name is derived from foaf:name. For brl:date I proposed span[epub:type='dc:date'] because I saw there is a term in z3998 that is derived from a dc term (namely z3998:fulltitle), from which I concluded it must be allowed to use dc.

Using the foaf and dc vocabularies for adding semantic structure feels a bit weird though because normally I associate these vocabularies with metadata (like in "this is the date of this event" or "this is the name of this person").

For brl:time I haven't got a solution yet. There doesn't seem to exist anything in the z3998 or dc vocabularies.

The reason I proposed the alternative span[epub:type='sbs:date'] was because I thought maybe this way we could impose a specific format (dd-mm-yyyy or whatever), but I don't even know if that makes sense (if it is possible, or even needed).

Is there a specific reason why you want to keep "date" and "time" uniform with each other, but not with the other elements?

Finally, something I'm still wondering is why we have brl:emph in addition to em and strong. All the attributes that are allowed on brl:emph also seem to be allowed on em and strong. Also I would like to use a class attribute instead of the brl:class attribute, however it appears that the class attribute is already allowed on brl:emph, em and strong. What is it used for?

mixa72 commented 7 years ago

1) As for sbs:date, that was a misunderstanding: I first thought you wanted to create an element sbs:date with a separate namespace sbs:... just for this element (date). That struck me as overkill. Then I saw that your alternative is in fact a span (span[epub:type='sbs:date']). That's OK for me ("time" does not have to be uniform with "date").

2) As for brl:emph and the class attribute on em, strong and brl:emph: brl:emph is used to render highlighted text other than em/strong (e.g. colored, underlined, capitalized, in a different font, etc.) with the same 4 possibilities as em/strong (brl:render = emph / quote / single quote / ignore). The class attribute on em, strong and brl:emph was originally created to group these elements semantically, e.g. em "foreignword", "onomatopoeia", "stressed", "propername", etc., in order to render them in a coherent way by means of the brl:render attribute: e.g. "foreignword" --> quote, "stressed" --> emph, "propername" --> ignore, etc. However, this practice has changed in the meantime: currently all em's are rendered with brl:render="emph" (default) regardless of the semantics (with some rare exceptions: educational/non-fiction books, which often have plenty of differently highlighted/colored words where each highlighting/color has a special meaning).

bertfrees commented 7 years ago

OK, got it now. Thanks!

Because in EPUB3 (HTML5) em and strong have been given semantic meanings (see http://html5doctor.com/i-b-em-strong-element), I still propose to drop brl:emph in favor of em or strong, and if you want to make clear that in the paper book they are styled differently (a different font or whatever) you indicate this with a class or several classes.

Because a class attribute can have more than one class, it shouldn't be a problem to combine all the requirements (except brl:continuation) in a single attribute.

So an emphasis element could look something like this:

<em class="propername capitalized braille-render-ignore">bla bla</em>

Note that foreignword, onomatopoeia, etc. are semantic information, so ideally we should try to capture this in an epub:type. However, because I assume you cannot make a predefined vocabulary of all possible groups, I think a class is more appropriate here.

Another possible improvement could be to base the rendering of em/strong in braille on the CSS value of text-transform instead of the brl:render attribute or braille-render-foo classes.

This way you can still use the braille-render-foo classes (or whatever you want to call them), if we define them in the default CSS:

.braille-render-ignore {}

.braille-render-quote {
    text-transform: quote;
}

.braille-render-singlequote {
    text-transform: singlequote;
}

.braille-render-emph {
    text-transform: emph;
}

@text-transform quote {
    system: -sbs-indicators;
    open: "(";
    close: ")";
}

@text-transform singlequote {
    system: -sbs-indicators;
    open: "'(";
    close: "')";
}

@text-transform emph {
    ...
}

(see issue https://github.com/sbsdev/pipeline-mod-sbs/issues/38 about how the @text-transform rule works)

but in addition you can also specify a custom mapping in CSS. For example:

.foreignword {
    text-transform: quote;
}

mixa72 commented 7 years ago

Great proposals, thanks a lot! That simplifies the whole em/strong/brl:emph story a lot. The reason why we currently use brl:emph and not em/strong is basically the large print: em/strong would italicize/bold the text, which is to be avoided. So I suppose that if we drop brl:emph and use em/strong instead, we also need a class for the large print (e.g. .lp-render-ignore or .lp-render-emph) to have control over the output, right? Or perhaps this could also be handled in a @media query in the CSS? Anyway, combining multiple classes in the same attribute gives the user much more flexibility. The only downside I see is that all these predefined classes possibly cannot be shown in a drop-down list in oXygen. The user has to be aware of all the available classes and their functions. Therefore we will also need exhaustive and user-friendly guidelines.

bertfrees commented 7 years ago

OK, so if we decide to drop brl:emph and we want to support EPUB3-to-largeprint (or if we want to drop brl:emph from DTBook too), we would indeed need additional classes like e.g. lp-render-ignore. It could indeed also be done with media queries, however that would mean we'd need to make the largeprint converter CSS-aware (which is a possibility, just some extra work).

I'm not sure how oXygen's auto-completion works when you allow multiple classes. I can't try it because I don't have oXygen on this computer. @egli or @mixa72 could you maybe try? Use e.g. this schema:

start = anyElement
anyElement =
  element * {
    attribute class { ("foo" | "bar" | string)+ }?,
    attribute * - class { text }*,
    (text | anyElement)* }

mixa72 commented 7 years ago

Christian converted it to .rng for me, but when I open and validate it in oXygen, there is an error at line 10 'E [Jing] repeat of "string" or "data" element'. Do I have to change something?

<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="">
  <start>
    <ref name="anyElement"/>
  </start>
  <define name="anyElement">
    <element>
      <anyName/>
      <optional>
        <attribute name="class">
          <oneOrMore>
            <choice>
              <value>foo</value>
              <value>bar</value>
              <data type="string"/>
            </choice>
          </oneOrMore>
        </attribute>
      </optional>
      <zeroOrMore>
        <attribute>
          <anyName>
            <except>
              <name>class</name>
            </except>
          </anyName>
        </attribute>
      </zeroOrMore>
      <zeroOrMore>
        <choice>
          <text/>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>
</grammar>

bertfrees commented 7 years ago

TBH I have no idea. I assumed that because trang (the tool for converting between rnc and rng) didn't complain, the schema was valid. I am trying to combine predefined classes ("foo", "bar") with any other classes (string). This is how I thought it was done in RelaxNG. I hope it is possible to do somehow. What happens if you remove <data type="string"/>?

bertfrees commented 7 years ago

Actually I think the problem might be the <oneOrMore>. What if you remove that (and leave the data type="string")?

bertfrees commented 7 years ago

I was hoping the oneOrMore inside attribute would be valid, and you could then use oXygen's auto-completion to insert one or more classes. But I guess it is not valid and so you'll probably only be able to auto-complete one class.
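For what it's worth, RELAX NG does have a `list` pattern for whitespace-separated values, which may be closer to what the `oneOrMore` was meant to express: the attribute value is split on whitespace and each token is matched separately. A minimal sketch under that assumption, reusing the placeholder classes "foo" and "bar" from the schema above. Whether oXygen offers auto-completion for more than one token with such a schema is untested here.

```xml
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="">
  <start>
    <ref name="anyElement"/>
  </start>
  <define name="anyElement">
    <element>
      <anyName/>
      <optional>
        <attribute name="class">
          <!-- list splits the value on whitespace; every token must match the choice -->
          <list>
            <oneOrMore>
              <choice>
                <!-- predefined classes (placeholders) -->
                <value>foo</value>
                <value>bar</value>
                <!-- any other single class name -->
                <data type="token"/>
              </choice>
            </oneOrMore>
          </list>
        </attribute>
      </optional>
      <zeroOrMore>
        <attribute>
          <anyName>
            <except>
              <name>class</name>
            </except>
          </anyName>
        </attribute>
      </zeroOrMore>
      <zeroOrMore>
        <choice>
          <text/>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>
</grammar>
```

With this, class="foo bar other" would validate as three tokens, whereas the version without `list` treats the whole attribute value as a single string.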

bertfrees commented 7 years ago

A workaround could be to enumerate all combinations of common classes. For example:

You can also add the classes that you define in the default (braille) CSS, for example:

And possibly you can also list the permutations:

And finally, because I guess most books are converted to either braille or large print, we should also list only the braille and only the large print classes:

mixa72 commented 7 years ago

Great! Removing oneOrMore already did the trick: the schema is valid. When I add a class attribute to an element oXygen pops up a list with the values "foo" and "bar". In turn, when I use a value other than "foo" and "bar" the xml document is still valid. It's exactly what we want.

<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="">
  <start>
    <ref name="anyElement"/>
  </start>
  <define name="anyElement">
    <element>
      <anyName/>
      <optional>
        <attribute name="class">
          <choice>
            <value>foo</value>
            <value>bar</value>
            <data type="string"/>
          </choice>
        </attribute>
      </optional>
      <zeroOrMore>
        <attribute>
          <anyName>
            <except>
              <name>class</name>
            </except>
          </anyName>
        </attribute>
      </zeroOrMore>
      <zeroOrMore>
        <choice>
          <text/>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>
</grammar>

bertfrees commented 7 years ago

An important question is also how uniform you want the DTBook and EPUB3/HTML5 markups to be. Did we bring up this issue already? As it looks now the new EPUB3 markup will differ considerably from the old DTBook markup, so what you could do is change the DTBook markup too. Except for the epub:type attributes, everything in the EPUB proposal can be applied to DTBook as well. The new schema could have a new version 2005-3-sbs-full/minimal-2.0 or something.

In addition you could also make a concession in the EPUB markup by not using any epub:types (replace with class, or maybe role?). I personally think using class in DTBook and epub:type in EPUB is not the biggest problem. As long as everything else is uniform it should be workable.

Or maybe you say the difference doesn't matter because in the future you will completely switch to EPUB anyway?

mixa72 commented 7 years ago

I've just talked with Manfred about that and we both think it is better not to change the current DTBook markup. Sooner or later we will for sure switch to EPUB3, but at the moment nobody knows when exactly this will happen. In view of the upcoming introduction of Braille-in-DAISY-Pipeline in 2017, which is a considerable change for the users, it also makes sense to go step by step and avoid too many changes (markup + formatter) at the same time.

bertfrees commented 7 years ago

Okay.

bertfrees commented 7 years ago

An important remark that was made in our call today is that what we use as authoring format does not need to be standards compliant, as long as what we distribute or exchange with Nordic countries is standards compliant. So it is no problem if the authoring format has really SBS-specific things such as brl:select in it, as long as we remove them when distributing/exchanging. The same can be said about the whole markup. In theory we could have two completely separate types of EPUB: one with all the brl:* that we are used to in DTBook, and one that is standards compliant, plus conversion scripts to go from one format to the other and back.
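As a rough illustration of what one direction of such a conversion script could look like, here is a minimal XSLT 1.0 sketch. It only covers the three elements for which the z3998 mapping was uncontroversial earlier in this thread (brl:homograph, brl:v-form, brl:place) and copies everything else unchanged. It assumes the authoring format keeps these as real brl: elements in the namespace shown; it is not the actual pipeline-mod-sbs code.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:brl="http://www.daisy.org/z3986/2009/braille/"
    xmlns:epub="http://www.idpf.org/2007/ops"
    xmlns="http://www.w3.org/1999/xhtml">

  <!-- Identity template: copy every node that has no more specific rule. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- brl:homograph, brl:v-form and brl:place become
       span[epub:type='z3998:...'], reusing the element's local name. -->
  <xsl:template match="brl:homograph | brl:v-form | brl:place">
    <span epub:type="z3998:{local-name()}">
      <!-- Attributes are carried over as they are in this sketch. -->
      <xsl:apply-templates select="@*|node()"/>
    </span>
  </xsl:template>

</xsl:stylesheet>
```

The reverse direction would need a matching template on span[@epub:type] for the same terms. Elements without an agreed mapping (brl:select, brl:time, etc.) are deliberately left out.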

bertfrees commented 7 years ago

Test suite works again (https://github.com/sbsdev/pipeline-mod-sbs/issues/52#issuecomment-290651862).

bertfrees commented 7 years ago

All the existing unit tests pass now. I'm going to merge the sbs-9 branch even though some things might not work yet, and even though the exact EPUB 3 format (see wiki page) hasn't been decided yet.

We can move the issue back to "Backlog" if Mischa finds issues, or if we want to make changes to the EPUB 3 format.

mixa72 commented 6 years ago

I found some issues in the EPUB3 output. I first created a file as DTB and an identical one as EPUB. Here are the differences I found. Possibly my markup is wrong, please take a look at it. test_epub3_html.zip

 Output from DTB                   Output from EPUB

                               |
         *H:TSV7Z34X           |           *H:TSV7Z34X
         -----------           |           -----------
                               |
           7]7 B+D             |             7]7 B+D
                               |
 TO'CL*E ................ #*A  |   TO'CL*E ................ #*A
                               |
          ZW3T7 B+D            |            ZW3T7 B+D
                               |
 TO'CL*E ............... #,,C  |   TO'CL*E ............... #,,C
                               |
         ::::::::::::          |           ::::::::::::
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
         *H:TSV7Z34X       >I  |           *H:TSV7Z34X       >I
p                              |  p
p                              |  p
       HEAD*G VOLUME #A        |         HEAD*G VOLUME #A
       ----------------        |         ----------------
                               |
 SPAN-+SW7 ---                 |   SPAN-+SW7 ---
 SPAN-+SW7'#A -                |   SPAN-+SW7'#A -
 SPAN-BO'X ---                 |   SPAN-BO'X ---
 LI-BRL-'CLA^                  |   LI-BRL-'CLA^                             <-- brl:class not working in EPUB (css was specified, but has no effect)
 'A-PA&REF #A                  |   'A-PA&REF #A
 BRL-HOMOGRAPH W<]UBE          |   BRL-HOMOGRAPH W<]UBE
 BRL-'V-F?M $S                 |   BRL-'V-F?M S                             <-- brl:v-form not working in EPUB
 BRL-NUM                       |   BRL-NUM                                  <-- brl:num not working in EPUB
   'C)D*AL #E                  |   'C)D*AL #E
   ?D*AL #?                    |   ?D*AL #E.                                <-- brl:num not working in EPUB
   ROMAN >II.                  |   ROMAN II.                                <-- brl:num not working in EPUB
   PHONE #JDC.CCC.CB.CB        |   PHONE #JDC !, #CCC #CB #CB               <-- brl:num not working in EPUB
   ISBN #IGH.C.DIB.BDJGB.G     |   ISBN #IGH-#C-#DIB-#BDJGB-#G              <-- brl:num not working in EPUB
   MEASURE #D'DL               |   MEASURE #D DL                            <-- brl:num not working in EPUB
   FRA'CTJ #C/                 |   FRA'CTJ #C!,#D                           <-- brl:num not working in EPUB
   MI'XED #H#A;                |   MI'XED #H #A!,#B                         <-- brl:num not working in EPUB
 BRL-PLA'CE M+NH3M             |   BRL-PLA'CE M+NH3M
 BRL-SYE'CT KZ                 |   BRL-SYE'CT BASIS Q KZ                    <-- brl:select not working in EPUB
 BRL-[PH                       |   BRL-[PH
   _[PH                        |   [PH                                      <-- brl:emph not working in EPUB
   '(S*GLE'QUO(')              |   S*GLE'QUO(                               <-- brl:emph not working in EPUB
   ('QUO()                     |   'QUO(                                    <-- brl:emph not working in EPUB
   IGN?E                       |   IGN?E
 _#I,)       R/N*GL*E      #A  |                                            <-- brl:running-line not working in EPUB (1 was selected)
p                              |  p
p                              |  p
         *H:TSV7Z34X           |           *H:TSV7Z34X
         -----------           |           -----------
                               |
          ZW3T7 B+D            |            ZW3T7 B+D
                               |
 TO'CL*E ............... #,,C  |   TO'CL*E ............... #,,C
                               |
         ::::::::::::          |           ::::::::::::
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
         *H:TSV7Z34X       >I  |           *H:TSV7Z34X       >I
p                              |  p
p                              |  p
       HEAD*G VOLUME #B        |         HEAD*G VOLUME #B
       ----------------        |         ----------------
                               |
 BRL-A'C'CCTS-SPAN R"EDUIT     |   BRL-A'C'CCTS-SPAN R"EDUIT
   D"%TAIQ"%                   |   D"%TAIQ"%
 BRL-'COMPUT7 '$WWW.SBS.CH     |   BRL-'COMPUT7 WWW.SBS.'4                  <-- brl:computer not working in EPUB
 BRL-DA( #,=AJ#BJJD            |   BRL-DA( #AG.AJ.BJJD                      <-- brl:date not working in EPUB
 BRL-TIME #E.AE                |   BRL-TIME #JE":#JE                        <-- brl:time not working in EPUB
 BRL-NAME K1FM+N               |   BRL-NAME K1FMN                           <-- brl:name not working in EPUB
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
                               |
 _#AJ,;      R/N*GL*E      #C  |
p                              |  p

bertfrees commented 6 years ago

OK thanks for the heads up!

bertfrees commented 6 years ago

@mixa72 What is supposed to happen with brl:class? As far as I remember this had something to do with macros in dtbook2sbsform, which I guess would translate to CSS in the new system. If you want to select an element with a brl:class in CSS you should do it like this:

@namespace brl url(http://www.daisy.org/z3986/2009/braille/);
[brl|class~='myclass'] {
   ...
}

bertfrees commented 6 years ago

OK I see what you are trying to do. You put this in the EPUB:

<style>
@namespace xml "http://www.w3.org/XML/1998/namespace";
@namespace brl url(http://www.daisy.org/z3986/2009/braille/);

li[brl|class='myclass'] {
   margin-left:2;
}
   </style>

The problem is that this CSS is not enabled unless you specify the "apply-document-specific-stylesheets" option (which is currently not available in the SBS version of the script).

mixa72 commented 6 years ago

@bertfrees Thanks for the hint with the syntax. However, it appears that any CSS instruction in the style element is ignored by the system. I even tried

      @namespace xml "http://www.w3.org/XML/1998/namespace";
      @namespace brl url(http://www.daisy.org/z3986/2009/braille/);
      li{
        margin-top:2 !important;
      }

but nothing changes. Is that possible?

bertfrees commented 6 years ago

Well, there are two problems. Firstly, as I said above, you need the "apply-document-specific-stylesheets" option. (I will add it.) Secondly, you need to add type="text/css" to the style element to make it work. (Preferably also add media="embossed" so that the style does not influence the rendering on screen.)
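Put together, the style element from the earlier comment would then look roughly like this. This is a sketch only: the selector and the unitless margin value are taken from the example above, and the ~= operator is used so that the rule still matches when brl:class holds more than one class.

```xml
<style type="text/css" media="embossed">
@namespace brl url(http://www.daisy.org/z3986/2009/braille/);

/* ~= matches one class among several; = would require the whole value to be 'myclass' */
li[brl|class~='myclass'] {
   margin-left: 2;
}
</style>
```

It still only takes effect once the "apply-document-specific-stylesheets" option is enabled.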

mixa72 commented 6 years ago

I seem to understand it now: as "apply-document-specific-stylesheets" is disabled now, I'll have to test the brl:class attribute via external stylesheet (scss), right?

bertfrees commented 6 years ago

Indeed.

bertfrees commented 6 years ago

However I think there is another issue, which might also explain why the elements like brl:v-form, brl:num etc. don't work. I'm investigating it now.

bertfrees commented 6 years ago

Never mind, forget that last comment.

bertfrees commented 6 years ago

OK so I've added the "apply-document-specific-stylesheets" option and that solves the brl:class issue.

All the other issues arise because brl: elements are not valid in HTML, and as a result the prefixes are removed in the load step. (brl: attributes are also invalid, but here the prefixes are retained.) One solution is to make the translator and the style sheets work regardless of whether the "brl:" prefix is present. But of course it is better to create valid HTML, for example by using epub:type or class attributes.
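To illustrate the valid-HTML route, a minimal sketch of how the brl:v-form test case could be marked up with epub:type, following the span[epub:type='z3998:v-form'] mapping proposed earlier in this thread. The epub:prefix declaration and the sample text are assumptions for illustration, not copied from the test EPUB.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops"
      epub:prefix="z3998: http://www.daisy.org/z3998/2012/vocab/structure/#">
  <head>
    <title>epub:type sketch</title>
  </head>
  <body>
    <!-- instead of <brl:v-form>Sie</brl:v-form>, which is not valid HTML -->
    <p><span epub:type="z3998:v-form">Sie</span></p>
  </body>
</html>
```

Because span and epub:type survive the load step unchanged, the translator no longer depends on whether a "brl:" prefix was stripped.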

Another issue I found in your EPUB is that it uses <list type="pl">. In EPUB use <ul style="list-style-type: none"> instead. NLB has a "list-style-type-none" class for it:

.list-style-type-none {
    list-style-type: none;
}

mixa72 commented 6 years ago

OK. I'll adjust my EPUB accordingly. Thanks!

mixa72 commented 6 years ago

BTW is the apply-document-specific-stylesheets option also visible in the GUI or just available in the background?

bertfrees commented 6 years ago

Yes it will be visible in the GUI.

bertfrees commented 6 years ago

Done.

I had to make some small adjustments to the EPUB in order to make it behave exactly as the DTBook: see chapter.xhtml.

mixa72 commented 6 years ago

EPUB3 to PEF Conversion works now. All the above mentioned inline elements are translated as in the DTB to PEF Conversion. CSS Support for stylesheets inside EPUB3 also works. Thanks.

The embedded braille rendition from the EPUB3 to EPUB3 conversion differs a bit from the output in the PEF in that some inline elements are not translated accordingly:

brl:num (ordinal, phone, isbn, measure, fraction, mixed)
em (strong) (brl:emph)
brl:date
brl:time
brl:name

The brl:select element should only render the braille in the corresponding grade (not each literal element). The rest looks good.