w3c / publishingcg

Repository of the Publishing Community Group
https://www.w3.org/community/publishingcg/

How to write the OPF of a book containing two or more translations of the same work? #47

Open dejudicibus opened 1 year ago

dejudicibus commented 1 year ago

I am developing an ePub. In the content.opf file I have to specify a series of metadata using the Dublin Core (DC) standard, for example dc:title and dc:creator.

However my book is a multilanguage book, that is, it contains three translations of the same text: Italian, English and Russian. The standard reference manual states that I can have more dc:language statements. For example:

    <dc:language>it</dc:language>
    <dc:language>en</dc:language>
    <dc:language>ru</dc:language>

but it does not say how to specify the other metadata for more than one language. Consider, for example, dc:creator. I tried

    <dc:creator xml:lang="it">Dario de Judicibus</dc:creator>
    <dc:creator xml:lang="en">Dario de Judicibus</dc:creator>
    <dc:creator xml:lang="ru">Дарио де Юдицибус</dc:creator>

I get an error from the distribution platform validator, which states that the format of ePub is not correct. It looks like I cannot use xml:lang in dc:creator even if, in theory, that is an XML attribute that can be used with any XML tag. Same for dc:title:

    <dc:title xml:lang="it">Il Titolo del mio Libro</dc:title>
    <dc:title xml:lang="en">My Book Title</dc:title>
    <dc:title xml:lang="ru">Название Mоей Kниги</dc:title>

Could someone who has faced the same problem, namely writing the OPF for an ePub that contains a text in multiple languages, tell me the correct way to do it? I have not been able to find any useful information on this in the OPF 3.x standards.

iherman commented 1 year ago

Per the EPUB spec (https://www.w3.org/publishing/epub32/epub-packages.html#sec-shared-attrs), the xml:lang attribute is allowed on both dc:title and dc:creator, i.e., what you did is correct imho. I do not know what validator your distribution platform uses, but it should not flag that as an error.

dejudicibus commented 1 year ago

Thank you very much.

-- Dario de Judicibus, Rome, Italy (EU) Site: https://www.dejudicibus.it, https://genealogia.dejudicibus.it Blog: https://www.lindipendente.eu Book: https://www.lalamanera.it, https://www.lasorgentedeimondi.it

mattgarrish commented 1 year ago

the xml:lang attribute is allowed on both dc:title and dc:creator

It's allowed on the elements, but that doesn't indicate that they are translations, only that the text is in the specified language. The author's name is likely to be listed three times in a bookshelf going by the example above.

We have the alternate-script property for expressing names, titles, etc. in other scripts, but no translation properties.
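As a rough sketch of the alternate-script approach mentioned above, the author's name could be given once, with the Cyrillic form attached to it through refines. The id value `creator` is an arbitrary choice for illustration, and the fragment assumes it sits inside a package element that declares the OPF namespace. Required elements such as dc:identifier, dc:title and dcterms:modified are left out for brevity. This only covers a name written in another script; it does not express that a title is a translation.

```xml
<!-- Fragment of content.opf; the id "creator" is illustrative. -->
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- The publication contains text in three languages. -->
  <dc:language>it</dc:language>
  <dc:language>en</dc:language>
  <dc:language>ru</dc:language>

  <!-- One creator entry, so a bookshelf should list the author once. -->
  <dc:creator id="creator">Dario de Judicibus</dc:creator>

  <!-- The same name in another script, tied to the entry above. -->
  <meta refines="#creator" property="alternate-script" xml:lang="ru">Дарио де Юдицибус</meta>
</metadata>
```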

dejudicibus commented 1 year ago

So, what if in the same ePub there are two versions of the same document? For example, a tale in Greek and its translation in English? This is quite common, especially for ancient languages, but used also for modern ones. For example, in my epub there is a tale in three languages. Of course I have to provide a title for each language and even the creator name could be written in different alphabets.

mattgarrish commented 1 year ago

We took this up in https://github.com/w3c/epub-specs/issues/1527 and https://github.com/w3c/epub-specs/issues/1553

The feedback we got from those issues was that publishers don't add translations to the package document metadata (they target to the region/primary audience) and reading systems don't support translations, so there wasn't a case for pursuing more metadata.

It'd be problematic to add a new property now, as being a new feature we'd need to show two reading systems that support translations. It's probably an issue for incubation by the CG to determine its viability.

dejudicibus commented 1 year ago

Did you check that all around the world, or only with an Anglo-Saxon audience? In Europe we care about translations. Furthermore, if you consider technical documentation such as manuals, especially for appliances, multi-language documents are very frequent. Do not focus only on fiction, novels and essays.

iherman commented 1 year ago

Yes, we do have non-Anglo-Saxon participants in the group, including RS manufacturers. That wasn't the issue...

mattgarrish commented 1 year ago

Shall we transfer this to the CG to look into?

wareid commented 1 year ago

Strongly suggest moving this to the CG for incubation, might also fit in well with some of the education work currently taking place.

mattgarrish commented 1 year ago

Okay, I'll do a transfer. Not like it's hard to bring back if we need.

Heads up to @Jeffxz and @WSchindler

To recap for that group: putting xml:lang on Dublin Core elements doesn't indicate a translation as there is nothing that links the translations together -- to any reading system they're just multiple instances of a property where the values are in different languages. Multiple renditions was partially meant to address this issue, but that's only if the translations are separate renditions.

The easy answer is to mint yet another property, similar to alternate-script, but someone should check if there's already a property in a known vocabulary we can use, or a pattern for doing this that we can follow. We also need to see that there's going to be uptake from publishers and reading systems.

sueneu commented 1 year ago

Could the validation issue be the multiple xml:lang tags?

The Daisy Knowledge Base explains the XML lang attribute like this:

Example 1 — Declaring the package language The xml:lang attribute is set to English on the package element to ensure the metadata in the package is correctly interpreted.

<package … xml:lang="en">

The reading system needs to know the language in which you are writing the metadata. In the OP's title example, the XML lang is always "en", since the metadata dc term is always "title" even when the value of the metadata is in Italian or Russian.

Il Titolo del mio Libro
My Book Title
Название Mоей Kниги

I'm struggling with this myself. It isn't very clear how to handle multilingual texts in a way that would give downstream access to the metadata in multiple languages.

mattgarrish commented 1 year ago

The reading system needs to know the language in which you are writing the metadata.

The xml:lang attribute only tells you the language of the element's value, regardless of whether it is specified globally or locally. That's the problem with trying to use it alone for translations.

For example, say I co-authored a work in English with someone with a French name. If I put xml:lang on so assistive technologies can pronounce that person's name correctly, which is what xml:lang is more properly designed for, that obviously doesn't make their name a translation into French of mine. That's why we can't just make xml:lang signal a translation.
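To illustrate that point, here is a small sketch with two co-authors. The second name and both id values are invented for the example, and the enclosing package element is assumed to carry xml:lang="en". The xml:lang on the second creator only tells a reading system or assistive technology which language the value is in.

```xml
<!-- Fragment of a package document; the package element is assumed to set xml:lang="en". -->
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Inherits the English default from the package element. -->
  <dc:creator id="creator01">Matt Garrish</dc:creator>

  <!-- Hypothetical co-author: xml:lang="fr" guides pronunciation of the name.
       It does not make this entry a French translation of the one above. -->
  <dc:creator id="creator02" xml:lang="fr">Jean Dupont</dc:creator>
</metadata>
```

A reading system processing this fragment would still see two distinct creators, which is the intended result here and the unwanted one in the three-title case.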

We solved this problem for the Publication Manifest specification by allowing arrays of values for each property, so you could group the translations of a name, title, or whatever together (you can see some examples in section 4.4.2). But that doesn't translate back to EPUB without bringing in the refines attribute. You need a new property, like "translation" that chains the translations together:

<dc:title id="title">My Book Title</dc:title>
<meta property="translation" refines="#title" xml:lang="it">Il Titolo del mio Libro</meta>
<meta property="translation" refines="#title" xml:lang="ru">Название Mоей Kниги</meta>

That binds the related translations without breaking how reading systems currently process metadata (i.e., they won't see three unique titles because there are three dc:title elements).

dejudicibus commented 1 year ago

I find nothing strange in the fact that dc:title is used even when the title is not in English. It's the same thing we do with programming languages. The language instructions are always English words, even if the application is NLS-enabled and is provided in several languages, for example if...then...else. dc:title is in effect an instruction, and therefore it is perfectly fine that it is in English. The important thing is to be able to specify the language of the element content. Other metadata, such as keywords and tags, should be available in several languages. In some cases, even dc:creator might be written in different alphabets, for example Russian or Chinese.

dejudicibus commented 1 year ago

You need a new property, like "translation" that chains the translations together:

<dc:title id="title">My Book Title</dc:title>
<meta property="translation" refines="#title" xml:lang="it">Il Titolo del mio Libro</meta>
<meta property="translation" refines="#title" xml:lang="ru">Название Mоей Kниги</meta>

That binds the related translations without breaking how reading systems currently process metadata (i.e., they won't see three unique titles because there are three dc:title elements).

Ok, but if I use

<dc:title id="title">My Book Title</dc:title>
<meta property="translation" refines="#title" xml:lang="it">Il Titolo del mio Libro</meta>
<meta property="translation" refines="#title" xml:lang="ru">Название Mоей Kниги</meta>

which title is shown in the catalogs? The English one? If a book, for example an appliance manual, contains the same text in 5 languages but is a single ePub, it really has five titles, in five different languages. So I must be able to search for it using any of its titles. Furthermore, in my case, the original work is in Italian, so it would be correct to write:

<dc:title id="title" xml:lang="it">Il Titolo del mio Libro</dc:title>
<meta property="translation" refines="#title" xml:lang="en">My Book Title</meta>
<meta property="translation" refines="#title" xml:lang="ru">Название Mоей Kниги</meta>

sueneu commented 1 year ago

Could the proposed translation property also apply to text/content?
Would it then display the book text in only one language? Would a reader be able to select the language for their instructions? How would they know which languages are available? I make bilingual picture books, where multiple languages are displayed on the print page. The publisher intends for multiple languages to be displayed on the ebook screen as well. Technically the text is a translation: the same story in two languages.

mattgarrish commented 1 year ago

Could the proposed translation property also apply to text/content?

No, this is where multiple renditions was supposed to come in. That specification defines a way to include multiple versions of a work, allowing the user to open the one they prefer. The metadata would all be in their language. It's not supported anywhere that I know of, though.

dejudicibus commented 1 year ago

Well, in multi-language paper books, one sees both languages. For example, many books of works in ancient Greek or Latin have the text in the original language on one page and the translation on a facing page. The same happens for some books in modern Arabic, where on one side there is the page in Arabic and on the other the translation into another language, for example in French. This is done for those who still want to read the original as well but need the translation close by, in order to better understand the original text. There are other cases, like mine, where the three stories, written in three different languages, are one after the other because each reader will only read the one in their own language.

mattgarrish commented 1 year ago

Because, if a book, for example an appliance manual, contains the same text in 5 languages, but it is a single ePub, it has really five titles, in five different languages.

This again sounds like a case for multiple renditions: being able to provide the user the choice of which version to open. Trying to have all five versions in one rendition doesn't work out well in practice, for the reason that you can't associate the metadata well with each version.

For reference, that specification is currently available as a W3C note: https://www.w3.org/TR/epub-multi-rend-11/
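For readers unfamiliar with that note, here is a rough sketch of how a multiple-rendition container could list one package document per language. The file paths are invented for illustration, and the rendition:language attribute and its namespace are given as best understood from the note, so they should be checked against it before use. The note also defines further container-level requirements that are not shown here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- META-INF/container.xml; every full-path value below is hypothetical. -->
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container"
           xmlns:rendition="http://www.idpf.org/2013/rendition"
           version="1.0">
  <rootfiles>
    <!-- The first rootfile acts as the default rendition. -->
    <rootfile full-path="EPUB/it/package.opf"
              media-type="application/oebps-package+xml"
              rendition:language="it"/>
    <rootfile full-path="EPUB/en/package.opf"
              media-type="application/oebps-package+xml"
              rendition:language="en"/>
    <rootfile full-path="EPUB/ru/package.opf"
              media-type="application/oebps-package+xml"
              rendition:language="ru"/>
  </rootfiles>
</container>
```

Each package document would then carry its own title, language and navigation document in a single language. As noted in this thread, reading system support for this is not known to exist.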

But this is where we've heard that publishers will create five separate EPUBs rather than bundle the content all together. (You also need the navigation document in separate languages, etc.)

wareid commented 1 year ago

This could be a chicken/egg issue in regards to the dc:title element, so for the purposes of CG discussion I'd like to step back a little bit. We know we can't do anything in EPUB 3.3 in regards to this (which is why we're in the CG).

The use cases being described here are:

As a content creator, I would like to publish a multi-lingual edition of a book where there are multiple translations of the same content in a single package, and the metadata represents all languages.

As a reader, I would like to be able to search for, manage, and read the content in the language or languages of my choice.

This is at least the gist I am getting, so feel free to elaborate or correct these.

Matt's pointed out we have standards capable of this already to a certain extent (multiple renditions), but I think it's important to look at the fact that we rarely see multiple rendition content in market as publishers tend to just produce separate editions. In addition to that knowledge, I think we need to look at the end-user perspective here as well. What value are we providing to users by having ebook editions containing multiple languages, with associated metadata and navigation for each of those languages? Is there a strong demand for a single edition containing several languages over separate editions for each language? Bilingual books (I'm Canadian, so bilingual content is not foreign to me) I do see as a use case here, but I think it's different for works in translation in academic contexts, because even in those cases, there is a "primary" language for the work.

It's helpful for us and the CG to understand the goals and use cases and then work toward a solution from there, because there definitely are gaps in the specs, but there are also industry realities we need to consider. We are also now at a point where we have enough end-user information to add that to our considerations.

dejudicibus commented 1 year ago

A publisher creates different renditions if it is fiction or an essay, but often, for manuals or technical works, there is only one rendition. At least this is what publishers do for paper books. In theory, digital books should be able to replicate any paper work.

In other cases, for example religious books or ancient works, the translation pages face each other. This is also true for many diplomatic, legal, and regulatory books, especially in the European Union.

dejudicibus commented 1 year ago

As a reader, I would like to be able to search for, manage, and read the content in the language or languages of my choice.

Not necessarily. There are cases where I want to see both languages at the same time. For example, Latin on even pages and English on odd ones. This is common in poetry, for example.

wareid commented 1 year ago

As a reader, I would like to be able to search for, manage, and read the content in the language or languages of my choice.

Not necessarily. There are cases where I want to see both languages at the same time. For example, Latin on even pages and English on odd ones. This is common in poetry, for example.

Which is why I added "languages" but I actually think that use case needs to be further broken down because the rendering implications for each requirement are very different.

More appropriately:

As a reader, I would like to be able to search for the content in the language or languages of my choice.

As a reader, I would like to be able to manage the content in the language or languages of my choice.

As a reader, I would like to be able to read the content in the language or languages of my choice.

mattgarrish commented 1 year ago

We're also starting to stray in the direction of interlinear text here, too, which we briefly discussed with the APA group at TPAC.

While you can interleave pages in different languages, it's complicated by having to do fixed layouts.

Jeffxz commented 1 year ago

We had a discussion in today's PCG meeting. Here are the action items.

  1. Looking for a volunteer to compare how ONIX and OPF handle language.
  2. Looking for a volunteer to test a multi-language book across different reading systems to see how it is rendered.
  3. Dig a little more into what EPUB 3.3 defines about how a reading system should display metadata that is available in several languages (such as a book title) relative to the local system language setting, and what the default should be when, for example, the local system language is French but the book has no French metadata, only Chinese and Japanese.
  4. Define the use cases more specifically so that further discussion and investigation are easier.

Note: we will not go down the path of separate renditions per language.

Please correct me if I am wrong or missing anything. @wareid @TzviyaSiegman @WSchindler @liisamk

gautierchomel commented 1 year ago

I volunteer for

  2. Looking for a volunteer to test a multi-language book across different reading systems to see how it is rendered.

Actually, Thorium displays the title and author whose language code corresponds to the user's language setting.

As far as I've seen, this does not affect the rendering of the content, only the book information panel.