Open jccr opened 4 years ago
IMO there is no default language embedded in the publication. There is instead a preferred language (or a list of) in the reading app.
I'm looking at this from the Shared Models API perspective.
I'm trying to deal with two types of data, for example in TypeScript:
interface LocalizedString {
  [key: string]: string
}
interface Metadata {
  title: string | LocalizedString
}
When I want to grab a value from title... I have to deal with the union type first with some "unwrapping" code that IMO is too cumbersome.
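For illustration, the kind of unwrapping the union type forces on every caller might look like the sketch below. `titleFor` is a hypothetical helper, not part of any Readium API.

```typescript
interface LocalizedString {
  [key: string]: string;
}

interface Metadata {
  title: string | LocalizedString;
}

// Hypothetical helper: returns the title for `language`, handling both shapes.
function titleFor(metadata: Metadata, language: string): string | undefined {
  const title = metadata.title;
  if (typeof title === "string") {
    // Bare string: no language information is available, return it as is.
    return title;
  }
  // Language map: look up the requested language (undefined when absent).
  return title[language];
}
```

Every consumer of `title` needs the same `typeof` check before it can do anything useful.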
Ideally I think I want this:
interface Metadata {
  title: LocalizedString
}
Where all data is normalized to that structure.
{
  "metadata": {
    "title": {
      "fr": "Vingt mille lieues sous les mers",
      "en": "Twenty Thousand Leagues Under the Sea",
      "ja": "海底二万里"
    }
  }
}
would work fine as is, and would fit into LocalizedString nicely.
But what about the case where the data is just a simple bare string? Like this:
{
  "metadata": {
    "title": "Twenty Thousand Leagues Under the Sea"
  }
}
In my interface design it would end up being parsed like this:
title = {
  "": "Twenty Thousand Leagues Under the Sea"
}
Still ugly, but it's normalized. (Is it better? I'm asking myself.)
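A minimal sketch of that normalization step, assuming the empty-string key stands for "no language given" as in the example above. `normalizeTitle` is a hypothetical name.

```typescript
type LocalizedString = { [key: string]: string };

// Hypothetical helper: wraps a bare string under the empty-string key and
// returns language maps unchanged, so callers only ever see one shape.
function normalizeTitle(raw: string | LocalizedString): LocalizedString {
  return typeof raw === "string" ? { "": raw } : raw;
}
```

The union type is then confined to the parsing layer, and the rest of the API can rely on `LocalizedString` alone.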
Alright, my thinking is now I'm moving towards your suggestion @llemeurfr
I'd still like to draft up a design for a convenient API though, and IMO it's easier with normalization of the data.
@jccr have you looked at the APIs in the Swift version?
@HadrienGardeur I have actually. I'll go back and iterate my thoughts on that approach too.
Actually the Kotlin version is more up-to-date now. But thank you for raising this issue, we improvised a bit there when this should be specified and shared among platforms.
Here's how it works on Kotlin:
- A LocalizedString object holding a Map<String?, Translation>.
- LocalizedString.Translation only contains a String for now, but could be extended to support text direction, for example.
- If we don't know the language, then the key can be null (e.g. when parsing a RWPM). But with EPUB, we try to use the xml:lang attribute, or fall back on the publication's language (@qnga might chime in on this).
- When serializing a LocalizedString to JSON, if a key is null then we use the BCP-47 language code und, which is made for that. From the IETF: "The 'und' (Undetermined) primary language subtag identifies linguistic content whose language is not determined."
- For Metadata.title, we actually have two properties:
  - localizedTitle, which is the LocalizedString object.
  - title, which is an alias to localizedTitle.string.
Here's the API of LocalizedString:
- translations: Map<String?, Translation>
- getOrFallback(language: String?): Translation? (the fallback keys listed are null, und and en)
- defaultTranslation: Translation? = getOrFallback(null)
- string: String = defaultTranslation.string
- fromJSON(json): LocalizedString? creates a LocalizedString from a JSON string or JSON BCP-47 language map.
- fromString(strings: Map<String?, String>): LocalizedString creates a LocalizedString from a map of strings. It's convenient when parsing a package.
- […] LocalizedString, since it is immutable.
So as you can see, metadata.title is actually an alias to metadata.localizedTitle.getOrFallback(null).string, which ideally returns the translation matching the user's locale. Which matches what @llemeurfr said:
IMO there is no default language embedded in the publication. There is instead a preferred language (or a list of) in the reading app.
One thing we might want to discuss is the heuristics to decide how to fall back on the default translation. It would be nice to be able to use the publication's first language instead of null or en, but we don't have access to it in LocalizedString, unless we provide it at construction.
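To make the lookup order concrete, here is a rough TypeScript transposition of the API described above. It is a sketch, not the Kotlin code: the fallback order (requested language, then null, und, en), the empty-string result when nothing matches, and the name `fromStrings` are all assumptions.

```typescript
interface Translation {
  readonly string: string;
}

class LocalizedString {
  // Keys are BCP 47 language tags, or null when the language is unknown.
  constructor(readonly translations: ReadonlyMap<string | null, Translation>) {}

  // Assumed order: the requested language first, then null, "und" and "en".
  getOrFallback(language: string | null): Translation | undefined {
    for (const key of [language, null, "und", "en"]) {
      const translation = this.translations.get(key);
      if (translation !== undefined) return translation;
    }
    return undefined;
  }

  get defaultTranslation(): Translation | undefined {
    return this.getOrFallback(null);
  }

  // Assumption: an empty string is returned when no translation matches.
  get string(): string {
    return this.defaultTranslation?.string ?? "";
  }

  // Builds an instance from plain strings, e.g. while parsing a package.
  static fromStrings(strings: ReadonlyMap<string | null, string>): LocalizedString {
    const translations = new Map<string | null, Translation>();
    strings.forEach((value, key) => translations.set(key, { string: value }));
    return new LocalizedString(translations);
  }
}
```

Using the publication's first language, as suggested above, would amount to passing one more key to the constructor and trying it before the fixed ones.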
If we don't know the language, then the key can be null (e.g. when parsing a RWPM). But with EPUB, we try to use the xml:lang attribute, or fall back on the publication's language (@qnga might chime in on this).
Sure, I can chime in. I think I already suggested somewhere to drop this fallback on the publication's language. This behaviour looks like an unjustified and unnecessary assertion, since RWPM supports an unspecified language. When directly parsing a RWPM title with no specified language, no such assertion is made, and as far as I know, this interpretation is in no way favoured by the EPUB specification.
I think I already suggested somewhere to drop this fallback on the publication's language.
I agree with you, and it would lead to simpler parsing. I think only the Kotlin implementation falls back on the publication language right now.
In the TypeScript implementation, for "contributors" metadata (e.g. author), as well as for title and subtitle metadata, we use the underscore _ pseudo-language-key as a fallback for cases where there are "alternative scripts" declared in the package OPF (as per the EPUB3 definition), and when the parser cannot determine the language of the string based on the XML lang attribute (on the meta itself, or on the package OPF root element), or failing that, use the "primary" package OPF meta language instead (i.e. "primary" = first item in the array). Obviously, _ is not a great solution, so I will migrate to und instead. Thanks Mickael for pointing this out.
Current parser algo inspired from: https://github.com/readium/architecture/blob/master/streamer/parser/metadata.md#title
Man! I was looking for something like und.
Thanks for the analysis, everyone! 👍
I think I already suggested somewhere to drop this [language of the publication] fallback on the publication's language.
This is exactly what I did in the Go implementation for the LCP server when parsing W3C Manifests: otherwise the low-level JSON unmarshalling of a localizable string would rely on a global variable (the global language of the publication), and this would lead to a terrible implementation.
As qnga said, in EPUB the language of the publication (which may be multiple) is not directly related to the language of its metadata.
W3C Publications are slightly different because there are two different properties: inLanguage for the publication, and a top-level language for the manifest, and therefore for its metadata. But we can be pretty certain that the latter will not be used any time soon, and there is no corresponding property in the RWPM.
In conclusion, I think we can rephrase Mickaël's wording as: If we don't know the language (because the property is expressed as a plain string), then the key is "und".
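Sketched in TypeScript, that rule could look as follows. `parseLocalized` and `UNDETERMINED` are hypothetical names, and silently dropping non-string values is an assumption of this sketch.

```typescript
type LanguageMap = { [language: string]: string };

// BCP 47 tag for content whose language is not determined.
const UNDETERMINED = "und";

// Hypothetical parser: accepts a plain string or a JSON language map.
function parseLocalized(json: unknown): LanguageMap | undefined {
  if (typeof json === "string") {
    // Plain string: the language is unknown, so file it under "und".
    return { [UNDETERMINED]: json };
  }
  if (typeof json !== "object" || json === null || Array.isArray(json)) {
    return undefined;
  }
  const record = json as Record<string, unknown>;
  const result: LanguageMap = {};
  for (const language of Object.keys(record)) {
    const value = record[language];
    // Assumption: entries that are not strings are ignored.
    if (typeof value === "string") result[language] = value;
  }
  return result;
}
```

No global state is needed here: the result depends only on the value being parsed.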
W3C Publication are slightly different because ...
For all intents and purposes, isn't EPUB OPF's xml:lang the same as W3C WebPub's @context language? (and EPUB OPF's metadata dc:language the same as W3C WebPub's inLanguage)
@danielweck you're right, xml:lang in EPUB has the same use as the @context / language in JSON-LD.
Given I have a publication with a title in metadata like this:
What would the default language be if all I want is just any string, without having a localization preference? Would it be the first in the "list", i.e. the value of "fr"?
If so, the order of the keys might be a problem.
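For what it's worth, in JavaScript the "first" entry is at least well defined: `JSON.parse` and `Object.keys` keep string keys in source order (integer-like keys are the exception, and language tags are not integer-like). JSON itself describes objects as unordered, though, so other platforms may not preserve that order. A sketch, with `firstTranslation` as a hypothetical helper:

```typescript
type LanguageMap = { [language: string]: string };

// Hypothetical helper: returns the value of the first key, if there is one.
function firstTranslation(title: LanguageMap): string | undefined {
  const [first] = Object.keys(title);
  return first === undefined ? undefined : title[first];
}

// Keys come back in the order they appear in the JSON text.
const parsedTitle: LanguageMap = JSON.parse(
  '{"fr": "Vingt mille lieues sous les mers", "en": "Twenty Thousand Leagues Under the Sea"}'
);
const anyTitle = firstTranslation(parsedTitle);
```

Relying on this is a platform-specific convenience rather than something the manifest format guarantees.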