w3c / wpub

W3C Web Publications
https://w3c.github.io/wpub/

Expressing metadata in multiple scripts/languages #124

Closed HadrienGardeur closed 6 years ago

HadrienGardeur commented 6 years ago

The current WP infoset is fairly consistent with the WAM:

This is quite different from EPUB 3.x, where each metadata item can be expressed in multiple scripts/languages. Here's an example from the 3.1 spec:

<dc:creator opf:alt-rep-lang="ja" opf:alt-rep="村上 春樹">
    Haruki Murakami
</dc:creator>

Readium and the RWPM also provide support for multiple scripts/languages per property:

"author": {
  "name": {
    "ru": "Михаил Афанасьевич Булгаков",
    "en": "Mikhail Bulgakov",
    "fr": "Mikhaïl Boulgakov"
  }
}

Since the Japanese publishing industry (ping @frivoal) told us multiple times in the past that this is very important for them, I'm wondering if the current direction for WP is on purpose or not.

TzviyaSiegman commented 6 years ago

@murata2makoto please comment on behalf of JEPA

murata2makoto commented 6 years ago

Expressing each piece of metadata (e.g., titles and author names) in multiple scripts (CJK ideographs and a phonetic script, e.g., Kana) is a must for Japanese. This is because some names have multiple, equally reasonable readings. For example, 智子 might be pronounced as either Tomoko or Satoko. Without the phonetic form, we really do not know.

HadrienGardeur commented 6 years ago

Thank you @murata2makoto for this feedback.

I think that we need to carefully reconsider what's in our infoset. Here are a few notes:

I'll also open a separate issue about reading direction, this is a related problem but it's more difficult to solve IMO.

iherman commented 6 years ago

My approach would be:

I think we should include the consideration on directions into this issue right from the start.

HadrienGardeur commented 6 years ago

RWPM has a slightly different approach:

In the RWPM context we're currently using a language map, which works fine when we only define a language but can't include a direction as well (it MUST use a string or an array of strings).

This is an example on a title:

"title": {
  "fr": "Vingt mille lieues sous les mers",
  "en": "Twenty Thousand Leagues Under the Sea",
  "ja": "海底二万里"
}

We could of course adopt a different approach instead of a language map, for example:

"title": [
  {
    "@value": "Vingt mille lieues sous les mers", 
    "@language": "fr"
  },
  {
    "@value": "Twenty Thousand Leagues Under the Sea", 
    "@language": "en"
  }
]

This would allow the inclusion of an additional direction as well (ignored by RDF parsers). Are these two semantically the same from an RDF output perspective @iherman?
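
A minimal sketch of how the two shapes relate, assuming Python and a hypothetical helper named `language_map_to_values` (it is not part of RWPM or of any JSON-LD library). It only rewrites a language map into the array-of-objects form and makes no claim about the resulting RDF:

```python
def language_map_to_values(language_map):
    """Convert a language map into a list of @value/@language objects."""
    values = []
    for language, text in language_map.items():
        # An entry may hold one string or an array of strings.
        texts = text if isinstance(text, list) else [text]
        for item in texts:
            values.append({"@value": item, "@language": language})
    return values


title = {
    "fr": "Vingt mille lieues sous les mers",
    "en": "Twenty Thousand Leagues Under the Sea",
    "ja": "海底二万里",
}
expanded_title = language_map_to_values(title)
```

Once a title is in the array form, each entry is an object, which is where an extra member such as a direction could be attached.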

iherman commented 6 years ago

@HadrienGardeur yes, this is roughly what I had in mind, and I believe what you wrote are equivalents, but adding the direction may lead to some problems. Can we keep this on hold? I will look at it later this week.

iherman commented 6 years ago

@HadrienGardeur The problem is that the following JSON-LD:

"title": {
  "@value": "Vingt mille lieues sous les mers",
  "@language": "fr",
  "dir": "ltr"
}

is invalid JSON-LD. Which means that, e.g., the JSON-LD playground rejects it.

We could come up with a hack. Another possibility is that we wait for the JSON-LD WG to be formed and raise an issue. Yet another is that we raise an issue with the CG that delivers JSON-LD 1.1. In any case, I would feel bad coming up with some sort of a hack ourselves...

BigBlueHat commented 6 years ago

This is why I like HTML. 😸

iherman commented 6 years ago

Well, actually... you gave me an idea. We may go in this direction (but not necessarily the way you think :-).

The problem with directionality arises when things get mixed up, i.e., when the language itself is not enough for a proper interpretation of the characters and the BIDI algorithm needs some extra "help". The texts of @r12a, like bidirectional text in HTML or the bidi algorithm description (https://www.w3.org/International/articles/inline-bidi-markup/uba-basics), explain this much better than I will ever be able to. The situation may come up in titles.

The problem, if we use JSON-LD, is that RDF strings do not have built in constructions to handle that. The only thing you can do is to assign a language or (and that is an exclusive "or") a special datatype. On the other hand, HTML gives all the tools that are necessary to describe the intricacies which, let us face it, are not the majority of our usages.

But... RDF (and therefore JSON-LD) offers a hack: the rdf:HTML datatype. Essentially, it says "consider this text an HTML fragment, and interpret it accordingly". So it is perfectly possible to do the following in JSON-LD (using an example from Richard's text):

{
  "@context": {
    "rHTML" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML",
    "ex" : "http://example.org/"
  },

  "ex:title" : {
    "@value" : "<p>The title is <cite dir=\"rtl\">مدخل إلى <span dir=\"ltr\">C++</span></cite> in Arabic.</p>",
    "@type":"rHTML"
  }
}

This is perfectly fine JSON-LD (and RDF). It feels and smells like a hack, but may give a direction (sic!) for thoughts...

(I was wondering about a slightly different direction, namely to define some parametrized datatypes that would combine a language tag and a writing direction, but we would end up reproducing the HTML semantics...)

Cc: @r12a


(Note that JSON-LD 1.1 has a type-based indexing just like the indexing along language tags, as used in the example of @HadrienGardeur. We may have a use case for a 'datatype' indexing for the evolution of JSON-LD 1.1)

HadrienGardeur commented 6 years ago

I really dislike this solution, it will make things extremely complicated for User Agents and I don't think it's making things easier for authoring either.

Forcing UAs to implement on every string:

... is not my definition of a good solution for a super specific problem.

We've already solved the issue of multiple languages/scripts with the language map in RWPM, I'd rather wait for the UTF-8, RDF or JSON-LD community to solve this issue with reading direction than implement a hack that will deeply impact everything we do.

BigBlueHat commented 6 years ago

@iherman interesting approach. We didn't quite go that far with Web Annotation, but certainly spec'd out how one would express HTML in JSON-LD as well as textDirection for non-HTML content, etc: https://www.w3.org/TR/annotation-model/#embedded-textual-body

There are several different ways to model this for moving HTML around inside JSON-LD. However, I think the whole thing has a bad smell.

What's regrettable is that we are re-recording information which is also likely to have HTML representation. Because of the needs of i18n (which are legitimate needs!), we're now likely to need to put HTML into JSON...

Consequently, I think we need to get the JSON out of the way, and reappraise the needs of an "infoset" serialization.

HTML is far more expressive, human readable, extensible, displayable, and has far more multi-lingual work done for it. JSON has nearly none of those features, and is likely simply to be used as a "transport" format ultimately to be put back into some HTML-based UI.

I'd like us to reconsider the decision made in #7, because I'm hopeful the reasoning behind using HTML (rather than JSON) is increasingly clear to more folks.

iherman commented 6 years ago

@HadrienGardeur I did not say I like it :-) But I do not see, at this moment, any better solution within the existing specifications. As I mentioned, my other option was to define a number of datatypes and use those, but that would require extra specification work, and we would need to get at least some of the RDF environments to accept the datatypes.

If we use JSON but not JSON-LD, then the problem does not arise, in fact. We can easily add a direction to any structure, and the only complication for a JSON parser would be to accept, for a key, either a string or an object that includes a string value with additional information about it. A pretty standard way of operating in the JSON world.
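
A minimal sketch of that "string or object" handling, assuming Python, a hypothetical helper named `normalize_localizable`, and member names `value`, `language` and `direction` chosen for illustration only:

```python
def normalize_localizable(item, default_language=None, default_direction=None):
    """Return a dict with 'value', 'language' and 'direction' keys."""
    if isinstance(item, str):
        # A bare string carries no metadata of its own, so use the defaults.
        return {
            "value": item,
            "language": default_language,
            "direction": default_direction,
        }
    if isinstance(item, dict) and isinstance(item.get("value"), str):
        # An object may override the defaults member by member.
        return {
            "value": item["value"],
            "language": item.get("language", default_language),
            "direction": item.get("direction", default_direction),
        }
    raise ValueError("expected a string or an object with a string 'value'")


plain = normalize_localizable("Moby Dick", default_language="en")
rich = normalize_localizable({"value": "海底二万里", "language": "ja", "direction": "ltr"})
```

After normalization the rest of a parser can treat every metadata value as the same kind of object, whichever form the author used.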

HadrienGardeur commented 6 years ago

BTW, I've looked at what we have in EPUB right now, and while it partially solves the problem for language, it doesn't handle the issue completely for direction:

<package dir="ltr">
  <metadata>
    <dc:creator opf:alt-rep-lang="ja" opf:alt-rep=" 樹春上村">Haruki Murakami</dc:creator>
  </metadata>
</package>

I can't express the direction of the opf:alt-rep if it's different from package or dc:creator.

It does provide a little more flexibility than our infoset though, since a number of elements allow dir for their text node (but not their attributes).

BigBlueHat commented 6 years ago

@iherman curious how you see this being that much "cleaner" in JSON vs. JSON-LD.

iherman commented 6 years ago

I can do in JSON what I cannot do in JSON-LD, see https://github.com/w3c/wpub/issues/124#issuecomment-361943023. What stands in the way are the JSON-LD restrictions or, to be more precise, the RDF restrictions...

BigBlueHat commented 6 years ago

@iherman that looks a lot like the format we made for Web Annotation (which is JSON-LD): https://www.w3.org/TR/annotation-model/#example-4

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type": "Annotation",
  "body": {
    "type" : "TextualBody",
    "value" : "<p>j'adore !</p>",
    "format" : "text/html",
    "language" : "fr"
  },
  "target": "http://example.org/photo1"
}

What am I missing here? 😃

HadrienGardeur commented 6 years ago

@iherman you're 100% right that this is a JSON-LD issue rather than a JSON issue.

Here's an example in pure JSON:

"title": [
  {"language": "fr", "value": "Vingt mille lieues sous les mers"},
  {"language": "en", "value": "Twenty Thousand Leagues Under the Sea"},
  {"language": "ja", "value": "海底二万里", "direction": "ltr"}
]

This example goes beyond what EPUB 3.x supports:

We just need to figure out how we could avoid parsing direction in the JSON-LD context.
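
A rough sketch of how a consumer might read such an array, assuming Python and a hypothetical `pick_title` helper; the fallback to the first entry is an assumption made here, not something the draft or RWPM specifies:

```python
def pick_title(titles, preferred_languages):
    """Return the first entry matching a preferred language, else the first entry."""
    for language in preferred_languages:
        for entry in titles:
            if entry.get("language") == language:
                return entry
    # No preferred language matched: fall back to the first entry, if any.
    return titles[0] if titles else None


titles = [
    {"language": "fr", "value": "Vingt mille lieues sous les mers"},
    {"language": "en", "value": "Twenty Thousand Leagues Under the Sea"},
    {"language": "ja", "value": "海底二万里", "direction": "ltr"},
]
chosen = pick_title(titles, ["ja", "en"])
fallback = pick_title(titles, ["de"])
```

Because each alternative is a full object, the chosen entry brings its own direction along when one is present.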

HadrienGardeur commented 6 years ago

@iherman I just tried the following example in JSON-LD playground and it works fine:

{
  "@context": {"title": "http://schema.org/name"},
  "title": [
    {"@language": "fr", "@value": "Vingt mille lieues sous les mers"},
    {"@language": "en", "@value": "Twenty Thousand Leagues Under the Sea"},
    {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}
  ]
}

It ignored dir but the RDF output looks fine.
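
One way to picture why `dir` disappears is the rough simulation below. It assumes a processor that silently ignores any key that is neither a keyword, nor mapped by the context, nor IRI-like. The helper name `drop_unmapped_keys` is hypothetical, and this is not the actual JSON-LD expansion algorithm:

```python
def drop_unmapped_keys(node, context):
    """Keep only keywords, context-defined terms and IRI-like keys."""
    kept = {}
    for key, value in node.items():
        if key.startswith("@") or key in context or ":" in key:
            kept[key] = value
        # Anything else (such as "dir" here) is silently dropped.
    return kept


context = {"title": "http://schema.org/name"}
entry = {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}
surviving = drop_unmapped_keys(entry, context)
```

Under that assumption, the direction is still available to a plain JSON consumer but never reaches the RDF output.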

iherman commented 6 years ago

@iherman you're 100% right that this is a JSON-LD issue rather than a JSON issue.

Here's an example in pure JSON:

"title": [
  {"language": "fr", "value": "Vingt mille lieues sous les mers"},
  {"language": "en", "value": "Twenty Thousand Leagues Under the Sea"},
  {"language": "ja", "value": "海底二万里", "direction": "ltr"}
]

This example goes beyond what EPUB 3.x supports:

we're not limited to 2 languages, we can include as many alt representations as we need
the direction can be expressed on the alt representation as well

We just need to figure out how we could avoid parsing direction in the JSON-LD context.

The only way I see now (and I would be happy to be proven wrong) is to define the terms 'value', 'language', and 'direction' in our own "namespace", so to speak, as terms defined in our own @context, and ignore their native JSON-LD/RDF meaning. But that would not be a really good direction either, I guess...

HadrienGardeur commented 6 years ago

@iherman we already redefine "@id" in the RWPM default context, but that's entirely for cosmetics.

We could also use "@id", "@value" and "@language" as-is and everything would work fine.

Have you closed this issue on purpose?

BigBlueHat commented 6 years ago

Wrong button @iherman. 😄

Also, introducing two disparate processing models is likely a Bad Thing.

@HadrienGardeur there is a JSON-LD WG in the offing (we hope!), so now would be a great time for stating the need for text direction expression to the JSON-LD Community Group. Specifically, send an email to this mailing list https://lists.w3.org/Archives/Public/public-linked-json/

Beyond what might be available, the way the Web Annotation WG did it still seems viable, and could do with some proper consideration.

iherman commented 6 years ago

Wrong button indeed, sorry :-(

iherman commented 6 years ago

@iherman we already redefine "@id" in the RWPM default context, but that's entirely for cosmetics.

We could also use "@id", "@value" and "@language" as-is and everything would work fine.

I am not sure that would work; it is worth a try with the JSON-LD playground. I think redefining "@value" is just cosmetics in this sense; it would not let us avoid the problem in https://github.com/w3c/wpub/issues/124#issuecomment-361943023.

HadrienGardeur commented 6 years ago

@BigBlueHat two separate processing models? Do you mean string, array or object for each metadata like in RWPM?

It does introduce a bit of extra processing, but nothing compared to the processing of the default reading order (I'm working on this currently for the draft and it's much much worse than anything we're discussing here).

@iherman I tried both options (keeping "@value" and "@language" as is, or redefining them in the context) and they both work fine in the JSON-LD playground.

iherman commented 6 years ago

@HadrienGardeur: redefining "@value" and "@language" of course works. But the following does not:

{
  "@context" : {
    "language": "@language",
    "value": "@value",
    "direction" : "http://ex.org/direction",
    "title" : "http://ex.org/title"
  },
  "title" : {
    "value" : "something",
    "language": "en",
    "direction": "ltr"
  }
}

On the playground this leads to the error message:

jsonld.SyntaxError: Invalid JSON-LD syntax; an element containing "@value" may only have an "@index" property and at most one other property which can be "@type" or "@language".
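
The constraint quoted in that message can be sketched as a small check. This assumes Python and a hypothetical `check_value_object` helper applied after aliases such as "value" have been resolved to keywords and "direction" to its IRI. It reproduces only the rule stated in the error message and is not the jsonld.js implementation:

```python
def check_value_object(node):
    """Raise ValueError if a value object carries disallowed members."""
    if "@value" not in node:
        return  # Not a value object, so the rule does not apply.
    # "@index" is always allowed next to "@value".
    extras = set(node) - {"@value", "@index"}
    # At most one other member, and it must be "@type" or "@language".
    if len(extras) > 1 or not extras <= {"@type", "@language"}:
        raise ValueError(
            'an element containing "@value" may only have an "@index" property '
            'and at most one other property which can be "@type" or "@language"'
        )


# Accepted: a value plus a language tag.
check_value_object({"@value": "something", "@language": "en"})

# Rejected: the direction term expands to a third member.
try:
    check_value_object(
        {"@value": "something", "@language": "en", "http://ex.org/direction": "ltr"}
    )
    rejected = False
except ValueError:
    rejected = True
```

This is why mapping `direction` in the context triggers the error: once it expands to a real property, the value object has one member too many.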

HadrienGardeur commented 6 years ago

@iherman don't include anything about the direction in the "@context" and it works fine in JSON-LD playground:

{
  "@context": {
    "title": "http://schema.org/name", 
    "value": "@value", 
    "language": "@language"
  },
  "title": [
    {"@language": "fr", "@value": "Vingt mille lieues sous les mers"},
    {"@language": "en", "@value": "Twenty Thousand Leagues Under the Sea"},
    {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}
  ]
}

plinss commented 6 years ago

Please see https://w3c.github.io/string-meta/ and coordinate needs with i18n rather than invent something new in isolation here (apologies if that conversation is already happening).

HadrienGardeur commented 6 years ago

@plinss thanks for pointing that out, we should definitely reach out to them.

I just did a quick scan of the document and one of the best-practice examples is almost exactly what I've proposed: https://w3c.github.io/string-meta/#bestPractices

@iherman what would be the best way to coordinate with that group?

iherman commented 6 years ago

@HadrienGardeur

@iherman don't include anything about the direction in the "@context" and it works fine in JSON-LD playground

Oh yes, you are right, I forgot about this trick. However, it is a trick which means that the resulting RDF metadata will not be proper. But we may have no other solution.

Personally, I am fine if, for the time being, this is the way we go, but I would think that @BigBlueHat and I will have to raise this issue at the (hopefully upcoming) JSON-LD WG to see if there is a better solution.

iherman commented 6 years ago

@plinss I think all the approaches we have been discussing here are in line with that document. The problem is that the document you refer to does not deal with the fact that direction cannot be represented in RDF, which means it cannot (properly) be done in JSON-LD either.

That being said, the RDF issue should indeed be solved outside this group, too.

HadrienGardeur commented 6 years ago

I've extracted a few points from the best practice section of https://w3c.github.io/string-meta/

The document also considers that the best practice is to use a language map + Localizable dictionary, which IMO is a little problematic:

iherman commented 6 years ago

@HadrienGardeur, I have raised an issue in the JSON-LD CG and also commented on the string-meta document.

I would propose that, at this point:

  1. We should modify the draft to make clear that every metadata item should be "localizable", to use the terminology of the string-meta document. (That needs a PR on the draft that I will come up with at some point.)
  2. If we indeed use JSON-LD, we temporarily use the trick you have above, with the hope that JSON-LD 1.1 will eventually make it official.

@BigBlueHat @plinss

HadrienGardeur commented 6 years ago

I also created an additional issue at https://github.com/w3c/string-meta/issues/13

HadrienGardeur commented 6 years ago

I will also update the lifecycle branch to include the Localizable dictionary in WebIDL, but this means that we can't rely on the ES to WebIDL dictionary algorithm for metadata.

llemeurfr commented 6 years ago

There is still an aspect that is badly covered by Unicode (bidi controls), which is mixing ltr and rtl scripts in a single string in a data format like JSON. Such a representation of information is rare, and difficult to author, manage in a database, and display in applications (e.g., using native code). But it's fairly easy to manage as HTML, as we said.

I suggest considering that each metadata item (even the title) will be expressed in a single language and direction. A more complex expression of the information can be expressed as HTML in the content itself, which seems to be more than enough.

iherman commented 6 years ago

Propose closing: this is now part of the latest draft (per https://github.com/w3c/wpub/pull/129). The JSON serialization may be tricky, but this should be looked at, when the time comes, via separate issues...

iherman commented 6 years ago

Closing per https://www.w3.org/publishing/groups/publ-wg/Meetings/Minutes/2018/2018-03-12-minutes.html#resolution13