Proposal for handling localizable texts (writeup of the F2F discussions)

iherman commented 5 years ago

This is an attempt at a write-up of the results of the discussion on language and direction at the F2F in Lyon, on how to handle localizable texts in the Web Publication Manifest. (Note that this issue is really a JSON-LD one, hence the cc below to people outside of the WG.)

Global setting of language and base directions

The current draft remains unchanged: we use the schema.org inLanguage and our own inDirection term. The only change is that the latter may eventually become a schema.org term, in which case it should be removed from the WPM Context.

Item specific language (ie, Localizable text)

The mechanism that we agreed upon is as follows. There are three ways of setting a (localizable) literal. These literals are used for the title (i.e., name) property, the names of creators, names of publishers, or (alt text) descriptions.

The three possibilities are, in an increasing level of complexity:

1. Simple string, i.e.,
"name":"some text"
where the value is a text in UTF-8.
2. String with explicit language setting, i.e.,
```
"name": {
   "value": "some text",
   "language": "en"
}
```
where value is a text in UTF-8, and language may use any valid BCP 47 tag.
3. Strings making use of the HTML control for language features, i.e.,
```
"name" : {
   "value": "some text with <span>HTML syntax</span>",
   "datatype": "HTML"
}
```
where value is an HTML snippet. We will have to define a very minimal subset of HTML that fits the purpose of internationalization, and require that value MUST NOT go beyond that. The exact set of allowed HTML tags and attributes is still to be decided (hopefully in cooperation with other parties), but it will probably mean restricting to the usage of span, ruby, rt, rb, bdi, and bdo elements, and the dir and lang attributes.
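To make the idea of a "very minimal subset" concrete, here is a rough sketch of a checker written with Python's standard `html.parser`. The element and attribute lists are the tentative ones named above, not a decided set, and the names `SubsetChecker` and `check_html_subset` are hypothetical. It only inspects start tags and attributes; it is not a sanitizer and does not look at comments or entities.

```python
from html.parser import HTMLParser

# Tentative lists taken from the proposal text; the final set is undecided.
ALLOWED_ELEMENTS = {"span", "ruby", "rt", "rb", "bdi", "bdo"}
ALLOWED_ATTRIBUTES = {"dir", "lang"}


class SubsetChecker(HTMLParser):
    """Collects a message for every element or attribute outside the subset."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_ELEMENTS:
            self.problems.append(f"element <{tag}> is not allowed")
        for name, _value in attrs:
            if name not in ALLOWED_ATTRIBUTES:
                self.problems.append(f"attribute '{name}' on <{tag}> is not allowed")


def check_html_subset(snippet):
    """Return a list of problems; an empty list means the snippet passed."""
    checker = SubsetChecker()
    checker.feed(snippet)
    checker.close()
    return checker.problems
```

A snippet such as `some text with <span lang="fr">HTML syntax</span>` yields no problems, while `<b>bold</b>` or an `onclick` attribute would each be reported.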

Notes:

Both in (1) and (2) Unicode's directional control characters may be used to control directionality, if necessary (see appendix for examples).
BCP 47 tags may give more information than 'just' the language, and may also indicate, say, the script used. Examples are zh-Hant and zh-Hans for Chinese written in traditional and simplified characters, respectively.
The WebIDL dictionary in the WPM for localizable strings must be adapted, allowing the presence of a datatype.
The manifest canonicalization step of the WPM should turn a string representation in (1) into (2), using the global inLanguage value. That makes the mapping to WebIDL's representation in Javascript more uniform (this is already the case in the current draft).
It is not possible to use both the datatype and the language terms within the same object. If HTML is used and the language must be set explicitly, this should be done by enclosing the content into a span like this: <span lang="...">...</span>
datatype is an alias for @type in JSON-LD (to be added to the WPM context). This anticipates JSON-LD 1.1 where @datatype is planned to become a new keyword.
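The canonicalization note above (turning form (1) into form (2) using the global inLanguage value) could be sketched roughly as follows. This is only an illustration under assumptions: the function name `canonicalize_localizable` is hypothetical, objects are passed through unchanged, and the only extra check is the rule that datatype and language cannot appear together.

```python
def canonicalize_localizable(value, default_language=None):
    """Turn a plain string (form 1) into an object (form 2).

    `default_language` stands for the manifest's global inLanguage value.
    Objects (forms 2 and 3) are returned as shallow copies.
    """
    if isinstance(value, str):
        result = {"value": value}
        if default_language is not None:
            result["language"] = default_language
        return result
    if isinstance(value, dict) and "value" in value:
        # Per the notes above, the two terms are mutually exclusive.
        if "language" in value and "datatype" in value:
            raise ValueError("'language' and 'datatype' cannot be used together")
        return dict(value)
    raise TypeError("expected a string or an object with a 'value' term")
```

For instance, `canonicalize_localizable("some text", "en")` gives an object with value "some text" and language "en", so downstream code only has to deal with the object form.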

Cc in the group: @BigBlueHat @HadrienGardeur @TzviyaSiegman @GarthConboy @wareid

Cc in I18n: @r12a @aphillips

Cc in JSON-LD: @azaroth42 @gkellogg

Cc in WoT: @mkovatsc

iherman commented 5 years ago

If there is an agreement on this writeup, a separate PR will have to be prepared (probably by me...)

toshiakikoike commented 5 years ago

On the title of metadata, about the necessity of ruby in Japanese. I believe that ruby is not necessary for the metadata title or the author's name.

Reason 1: The use of HTML tags as a metadata format at bookstores is not permitted. Amazon, Google Play Books, Apple Books Store, Rakuten kobo, honto, Kinokuniya, BookLive!, ...

In these cases, HTML tags can not be used for metadata title and author name. The use of HTML tags is permitted only for content descriptions.

Garth, my understanding is that Google Play Books metadata works as described above. Is that right?

Reason 2: In the above-mentioned bookstores, an input item of "title_pronunciation" is provided. For titles that are difficult to read, you do not need to add ruby to the title if you fill in this item.

HadrienGardeur commented 5 years ago

I'm very uncomfortable with the HTML proposal for reasons that have been pointed out before, as well as new ones based on your writeup:

the fact that datatype and language can't be used together IMO means that a lot of UAs won't be able to process the language at all for such terms
I don't think that just adopting a subset will work, authors will always include more HTML if any HTML is allowed
overall, this makes the processing much more complex for UAs, most of them won't be able to work with HTML at all and will have to sanitize all these strings
if we go down that road, we won't be able to rely on language maps (#299) which would make the expression of most multi-lingual strings easier to author
there are not that many strings that need to be localized in the first place and not all of them would require such complex use cases (only the title of the publication IMO)

Overall, I would be much more comfortable reframing the current discussion just in the context of the title.

iherman commented 5 years ago

I would propose to leave #299 out of the discussion for now. The impression I got from the discussion with @danbri is that schema.org processors may not process language maps anyway, in which case that question may be moot.

iherman commented 5 years ago

there are not that many strings that need to be localized in the first place and not all of them would require such complex use cases (only the title of the publication IMO)

That may depend on areas of the world, but I agree the HTML version will not be used widely. But the three-way choice means that, in most cases, the HTML version will not be necessary, i.e., this is not a real load on users/authors. (I recognize it is a load on UAs.)

I don't think that just adopting a subset will work, authors will always include more HTML if any HTML is allowed

If we really reduce the set to just a few attributes and elements, it may be relatively easy to handle these, including in a checker.

Overall, I would be much more comfortable reframing the current discussion just in the context of the title.

I am not sure what you mean.

HadrienGardeur commented 5 years ago

To go back to what I mentioned about the title of the publication, currently in our spec:

the title is not required in the infoset
a UA may fall back to the <title> of an entry page if it's missing from the manifest (ref)
in our canonicalization, <title> is extracted from the entry page if it's missing from the manifest (ref)

There's an inconsistency in our spec that we need to solve, but if we decide that using <title> from the entry page is more than a MAY for UAs, I think this pretty much solves our problem regarding i18n:

title is the only term for which we need such complex use cases
if a title is too complex to express using a BCP 47 language tag + Unicode control characters, it can be omitted from the manifest and expressed strictly on the entry page

murata2makoto commented 5 years ago

In the RDF vocabulary of the Japanese National Diet Library, they introduce dcndl:transcription everywhere. For example:

<dc:title>
   <rdf:Description>
     <rdf:value>デジタル時代における図書館の変革: 課題と展望 : 公開シンポジウム記録</rdf:value>
     <dcndl:transcription>デジタル ジダイ ニ オケル トショカン ノ ヘンカク : カダイ ト テンボウ : コウカイ シンポジウム キロク</dcndl:transcription>
   </rdf:Description>
</dc:title>

llemeurfr commented 5 years ago

I found a document (*) that explains very clearly, in the scope of the Arabic language, how Unicode bidi characters can solve the issue of mixing languages in a Unicode string. http://www.emi.ma/ntounsi/COURS/TechWeb/ScriptArabe/scriptArabe.html

I'm therefore not convinced that inventing an XML syntax (even if it's a subset of HTML, this is in practice an XML dialect) inside a JSON value would be a better solution for production tools, databases and parsers all together, for a use case which may represent only 1% of the total production of metadata.

@iherman, note that in your list, it seems that if the HTML tagging was accepted, there would also be a need to define the &lrm; and &rlm; entities, which is yet another complexity.

(*) in French, sorry; I guess English texts can be also found easily

iherman commented 5 years ago

@HadrienGardeur (on https://github.com/w3c/wpub/issues/354#issuecomment-433852370)

You are right that the title is the most problematic use case and that the <title> element may solve things. (Although there are still open issues on the MUST-s and MAY-s and SHOULD-s on the usage of that element.) I am not sure about person and company names, though.

But we also have the description entry for a Publication Link, which can be, essentially, the "alt text" for images, for example.

(That being said, the 'alt text' in HTML for images is also restricted to simple texts...)

HadrienGardeur commented 5 years ago

@iherman You'd need a very twisted mind to include bold text, ruby and multiple languages in a company name.

For person names, this was handled just fine in EPUB. In Japanese for instance, we could have both Kanji and Hiragana/Katakana versions of the same string.

For description, simple text strings and alignment with what HTML does is IMO the right choice.

iherman commented 5 years ago

Actually, @HadrienGardeur, what you wrote in https://github.com/w3c/wpub/issues/354#issuecomment-433852370 may not work 100%. Indeed, the definition of the title element defines it to have "text" as a content model which, in my reading, means that it is possible to set the base direction and the language for the title, but no other HTML tag based language trick is possible.

It may be "good enough" for us, but we should know that.

@r12a ?

HadrienGardeur commented 5 years ago

@iherman quite frankly, I'm fine with that as well.

It means that we're aligned with the Web, which is one of the purposes of this group in the first place.

BigBlueHat commented 5 years ago

I found a document (*) that explains very clearly, in the scope of the Arabic language, how Unicode bidi characters can solve the issue of mixing languages in a Unicode string.

According to the I18N group which met with us at TPAC Unicode isn't always sufficient. See this document for more: https://w3c.github.io/string-meta/#unicode-enough

Additionally, there is an ongoing TAG review of the broader issues around multiple languages in JSON formats.

There's also some WICG exploration around "purification" of HTML content down to specific tag and attribute sets: https://github.com/WICG/purification Browser defined HTML subsets are in their list of intended interests/use cases, so having a defined "language-focused subset" would have value there also.

datatype is an alias for @type in JSON-LD (to be added to the WPM context). This anticipates JSON-LD 1.1 where @datatype is planned to become a new keyword.

I'd prefer we not alias anything until it's official.

HadrienGardeur commented 5 years ago

According to the I18N group which met with us at TPAC Unicode isn't always sufficient. See this document for more: https://w3c.github.io/string-meta/#unicode-enough

No one is suggesting that Unicode is enough.

Right now we have Unicode + BCP 47 language tags with script subtags in our draft, which would correspond to the following note: https://w3c.github.io/string-meta/#script_subtag

HTML/XML is also covered in there and they do include some important cons to that approach:

The downside of this approach is that many data values are just strings. As with adding Unicode tags or Unicode bidi controls, the addition of markup to strings alters the original string content. Producers are required to introspect strings and add markup as needed. Consumers must likewise remove any additional markup introduced by the producer.

The addition of markup also requires consumers to guard against the usual problems with markup insertion, such as XSS attacks.

aphillips commented 5 years ago

@toshiakikoike, @murata2makoto I want to briefly address this comment and Murata-san's reply. I agree that it is not necessary to add a transcription/pronunciation field to every natural language string. However, publication metadata should have a slot for each title and author to include this information. That's because sorting Japanese and Chinese depends on this information, which cannot generally be computed (particularly in Japanese) from the string itself. Without this information, sorting sets of content is generally reduced to radical based sorting, which is much less natural for the user. Overall, this is an aside from this issue's topic, but is still important for users of East Asian ideographic languages.

aphillips commented 5 years ago

In today's I18N teleconference I was tasked with replying to Ivan's email as well as this thread. I'm going to do that primarily with this comment.

The current set of proposals (and the draft's text) covers language metadata reasonably well. The use of BCP47 language tags (which is also to say Unicode locale identifiers) at the publication and LocalizedString level provides the necessary granularity for setting and overriding the language of the document and individual values.

What remains at issue is the provisioning of direction metadata. We continue to feel that the construct LocalizedString should include an (optional) direction value for use in setting/overriding the base direction. Each localized string, which includes titles, authors, descriptions, and other natural language fields, will encounter the need for this value, both as a way of handling strings whose direction cannot be computed and as a way for implementers (like my employer) who provide internally for this type of metadata to send it unscathed through a document.

One of the sticking points at TPAC was whether the resulting construct was compatible with JSON-LD or with RDF. From the above thread, it seems like this is something that could be worked around? What specific pros/cons need addressing here?

Using markup for this, as championed by @BigBlueHat, would do this fine. Our sense is that this is not currently the consensus choice, however.

We also note that the current draft contains language describing how to perform "first strong" detection of direction. We have a nit about the current text. It should say that the Unicode Bidirectional Algorithm is used to determine the base direction. In practice this means "usually the first strongly directional character", but there are certainly cases where the first strong character doesn't determine the direction (notably, when that character is surrounded by an "isolating" control character).
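As a rough illustration of why "first strong" is only an approximation, the sketch below looks for the first strongly directional character while skipping anything enclosed in isolates, which is the case mentioned above. It uses Python's standard `unicodedata` module; the function name `base_direction` and the fallback to "ltr" are assumptions made for the example, and it is a simplification rather than an implementation of the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Guess the base direction from the first strong character.

    Characters between an isolate initiator (LRI, RLI, FSI) and its
    matching PDI are skipped, so a strong character inside an isolate
    does not decide the direction of the whole string.
    """
    depth = 0  # current nesting level of isolates
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("LRI", "RLI", "FSI"):
            depth += 1
        elif bidi == "PDI":
            if depth > 0:
                depth -= 1
        elif depth == 0:
            if bidi == "L":
                return "ltr"
            if bidi in ("R", "AL"):
                return "rtl"
    return default  # no strong character found outside isolates
```

With this sketch, a string that starts with an isolated Hebrew letter followed by Latin text is reported as "ltr", although its very first strong character is right-to-left.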

@HadrienGardeur noted:

No one is suggesting that Unicode is enough.

I think what is important is understanding why Unicode (by which I mean the use of Unicode bidi control characters) is insufficient.

The Unicode embedding bidi controls do not solve all textual problems in a "wrong base direction" context, particularly with neutrals and directionally sensitive paired punctuation (such as parentheses). While LRM/RLM can be used with the controls to fix things up, quite a bit of introspection must be done by the content author to ensure the right result--this is inconvenient (and difficult to do automatically) for plain strings coming from databases or content management systems. Having the base direction for the string is often available to the producer and solves the problem (without having to edit invisible control characters not naturally occurring in the text!).

Using "Bidi isolation" could help with this problem to some degree, although knowing the base direction is still necessary. The isolating controls make it much easier to construct, store, present, and exchange mixed direction text dynamically. However, because these controls are new, implementation support at the operating system and user agent level lags. Currently the controls are often just some "invisible junk" that don't have the desired effect.
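For readers unfamiliar with the isolating controls, a minimal sketch of what a producer would have to do when the base direction is known: wrap the string in LRI or RLI plus PDI, or in FSI plus PDI when it is unknown. The helper name `isolate` is hypothetical. Note that this is exactly the kind of string alteration discussed above: the invisible characters become part of the stored value.

```python
# Unicode isolate controls (code points as defined by Unicode).
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap `text` in an isolate matching `direction` ('ltr', 'rtl' or None)."""
    if direction == "ltr":
        opener = LRI
    elif direction == "rtl":
        opener = RLI
    elif direction is None:
        opener = FSI  # let the renderer apply first-strong detection
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return opener + text + PDI
```

A consumer that wants the original string back has to strip these characters again, which is part of the cost being weighed in this thread.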

Finally, BCP47 tags are useful for identifying the language and presentation of content, but the artificial introduction of script subtags I think is a Bad Idea. The script is a property of the text itself. The script subtag exists to externally identify language/locale variations between content items. For most languages and most content--including the preponderance of languages using a bidirectional script--the script subtag is not recommended. In order to use a script subtag to set a base direction, content providers would have to determine what direction they wanted and infect the language tag with it. This might be fine if we follow common practice (as with CLDR/ICU) and infer suppressed scripts for text that follows its language (e.g. the tag ar implies ar-Arab and thus dir=rtl), but becomes problematic when we need to set the opposite effect (ar-Latn so we get dir=ltr). And languages that actually use multiple scripts (az-Latn vs. az-Cyrl vs. az-Arab) are at a disadvantage, since the content needs to indicate its actual script and also set its base direction. While these are unlikely to be in conflict, we are asking external processors (which want to just serialize and deserialize content) to introspect a lot of data.

Wouldn't it be better to just insert a metadata field than write code at the serialization layer (where I don't think it belongs) to inspect and insert additional bidi controls (which persist downstream), add script subtags, or wrap things in markup?
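To show what inferring direction from a language tag would involve, here is a deliberately small sketch. The tables are tiny illustrative excerpts (a real implementation would need the full subtag registry or CLDR data), the parsing of the tag is simplified, and the name `direction_from_tag` is hypothetical. It is meant to illustrate the introspection described above, not to recommend the approach.

```python
# Illustrative excerpts only; not complete data.
SUPPRESSED_SCRIPT = {"ar": "Arab", "he": "Hebr", "fa": "Arab", "ur": "Arab", "en": "Latn"}
RTL_SCRIPTS = {"Arab", "Hebr"}


def direction_from_tag(tag):
    """Return 'ltr', 'rtl', or None when the script cannot be determined."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break  # extension or private-use section starts here
        if len(subtag) == 4 and subtag.isalpha():
            script = subtag.title()  # script subtags are four letters
            break
    if script is None:
        script = SUPPRESSED_SCRIPT.get(language)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

Here `ar` resolves to "rtl" through its suppressed script and `ar-Latn` to "ltr", while a bare `az` returns None because the language is written in several scripts, which is the disadvantage noted above.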

iherman commented 5 years ago

@aphillips

One of the sticking points at TPAC was whether the resulting construct was compatible with JSON-LD or with RDF. From the above thread, it seems like this is something that could be worked around? What specific pros/cons need addressing here?

Unfortunately, it is not possible. The JSON-LD rules are fairly strict, because they reflect the strict rules surrounding RDF Literals. This was discussed in the JSON-LD Working Group, and the Working Group decided that JSON-LD 1.1 would not deviate from RDF either. In other words, something like

{
    "@value" : "Textual content",
    "@language": "en",
    "@dir": "ltr"
}

is not possible: it would be invalid JSON-LD and would therefore be rejected by JSON-LD processors.

As a consequence, I do not see any possible workaround at this moment...
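The restriction can be pictured with a small stand-alone check. This is a simplified reading of the JSON-LD 1.0 rules for value objects (an object with @value may also carry @language, @type, @index or @context, and not both @language and @type); it is not how any particular JSON-LD processor is implemented, and the name `value_object_problems` is hypothetical.

```python
# Keys permitted on a value object under the simplified reading above.
ALLOWED_KEYS = {"@value", "@language", "@type", "@index", "@context"}


def value_object_problems(obj):
    """Return a list of reasons why `obj` is not an acceptable value object."""
    problems = []
    if "@value" not in obj:
        problems.append("missing @value")
    for key in sorted(set(obj) - ALLOWED_KEYS):
        problems.append(f"unexpected key {key}")
    if "@language" in obj and "@type" in obj:
        problems.append("@language and @type cannot be combined")
    return problems
```

Applied to the object shown above, the check reports the "@dir" key as unexpected; the same rule is why language and datatype cannot be combined in the proposal.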

Note, however, that there has been some interesting discussion on RDF that may affect this issue. A long and complex discussion has been started recently on the future of RDF; see, e.g., the recently set up github repository which collects lots of issues around RDF, including issues related to literals (see, e.g., https://github.com/w3c/EasierRDF/issues/22, https://github.com/w3c/EasierRDF/issues/21). It is unclear where this discussion will be heading and whether there will be a new version of RDF at some point, but that is certainly the long-term goal. If that happens, it will, eventually, affect the future evolution of JSON-LD as well. I think it would be worthwhile for the i18n community to be involved in that discussion on RDF, and make this recurring directionality issue very clear, hoping that this long-lasting problem may be solved once and for all.

iherman commented 5 years ago

@aphillips

We also note that the current draft contains language describing how to perform "first strong" detection of direction. We have a nit about the current text. It should say that the Unicode Bidirectional Algorithm is used to determine the base direction. In practice this means "usually the first strongly directional character", but there are certainly cases where the first strong character doesn't determine the direction (notably, when that character is surrounded by an "isolating" control character).

Would it be possible for you to provide a clear text, either in the form of a PR or simply sending us a replacement text? That would make the changes faster and cleaner.

iherman commented 5 years ago

I try to summarize my thoughts...

JSON-LD 1.0 and 1.1 and, ultimately, RDF 1.1, do not allow setting the base direction for a text (literal, in RDF parlance). Also, JSON-LD does not allow any extra, non-RDF terms either (per JSON-LD Working Group resolutions); see also https://github.com/w3c/wpub/issues/354#issuecomment-447258790.
Using HTML literals (again in RDF parlance) alongside plain old literals is possible, but there is a considerable push back in the group in doing so. In particular:
- If one uses an HTML literal, then one has to do that all the way, i.e., it is not possible (per RDF restrictions again) to express the language using the same formalism as for plain texts. That is, one must use the HTML syntax embedded into the literal to do so. This may become the source of frequent authoring confusion and errors.
- Processing the HTML as a completely separate branch alongside traditional literals is a drag for User Agents (as well as for authors)
- Usage of HTML Literals is unknown for schema.org. Ie, all the names, titles, etc, using this formalism would be ignored by schema.org processors, which goes against the very reason why schema.org has been chosen.
All these issues reflect the i18n deficiencies of RDF. JSON-LD is only a serialization of RDF, and schema.org is a user of JSON-LD. Finally, the Web Publication Manifest is just a user of schema.org and of JSON-LD. Problems should be solved at the root, i.e., in RDF, everything else sounds like a hack.
The metadata entries where directionality comes into the picture are the name of persons and organizations, title of the publication, and the "description" of links which are, essentially, the equivalent of alt texts for HTML images. Looking at these terms the issues around base direction represent an extremely small number of cases. The fact is that these metadata items have been in use in EPUB (with other syntaxes) for many years now and the community has never really faced major problems.
- Note also that, among those, the "title" may also come as a copy of the <title> HTML element (which is "lifted" into the manifest metadata). This is interesting because the content of the <title> element is (HTML) text (i.e., it is not possible to include further HTML markup). How come this issue has never come up for HTML so far? Why would a Web Publication be different in this respect?
On the positive side, there is now work starting on a possible revision of RDF (see again https://github.com/w3c/wpub/issues/354#issuecomment-447258790). Part of the discussion is whether literals may be allowed in a "subject" position in a future RDF. If that is allowed, it would be possible to add various attributes to a specific text, including directions but also transcription, pronunciation, etc. In other words, there is hope that this issue will finally be solved in RDF.

Based on these facts, my personal proposal is:

For the current version of WP we leave the draft text as is, acknowledging, and documenting, the fact that there might be edge cases that are not covered.
We (where "we" is mostly the I18N community, but I am also happy to help) should participate in the discussion on RDF and ensure that the new RDF work solves the base direction issue (and other possible I18N issue) in some way or other. If everything goes well, RDF 2.0 (or whatever the name will be) should not have this issue any more.
It can be expected that, if RDF 2.0 is published, there will be an upgrade of JSON-LD to a version that encompasses RDF 2.0. I presume, schema.org will also follow, eventually.
A future version of Web Publication may smoothly upgrade by adopting the new features for directions and other aspects, and would then close the loophole of the current version.

r12a commented 5 years ago

Would it be possible for you to provide a clear text, either in the form of a PR or simply sending us a replacement text? That would make the changes faster and cleaner.

I think we were envisaging something along the lines of:

"auto: indicates that the textual values are explicitly directionally set to the direction of the first character with a strong directionality , following the rules of the Unicode Bidirectional Algorithm." (bold added to highlight changes only)

Note also that, among those, the "title" may also come as a copy of the <title> HTML element (which is "lifted" into the manifest metadata). This is interesting because the content of the <title> element is (HTML) text (i.e., it is not possible to include further HTML markup). How come this issue has never come up for HTML so far? Why would a Web Publication be different in this respect?

Two points there:

it did come up, repeatedly
you can actually add a dir attribute to the title element to set the overall base direction, which is the exact equivalent of what we're discussing here. What is not possible, is the use of markup within the title element to deal with further changes in direction inside the string (and so you have to resort to unicode formatting characters).

iherman commented 5 years ago

@r12a,

Thanks for the change in the text. We are reorganizing the draft right now (that is orthogonal to the direction issues) and we will take care of making that change then.
Yes, it is correct that dir can be set on the title element. But the really controversial point of the discussion is the usage of an HTML RDF Datatype, which would be equivalent to having a rich content within the title element.

r12a commented 5 years ago

What i'm still not clear about is this: You added inDirection to give the overall direction of the resource, and the spec says "When specified, these properties are also used as defaults for textual values in the manifest." No distinction is made in that statement between inLanguage and inDirection.

This makes me assume that it is possible for a consumer of the string to figure out that the base direction of a string should be RTL, as long as it knows enough about the structure of the WPub metadata to recognise inDirection, and enough about the rules for using WPub metadata to understand that the inDirection value provides a (default) base direction for all strings.

The thing we appear to be stuck on is the use of a mechanism to indicate the item-specific base direction. And my understanding of the reason is that, when it comes to base direction, JSON-LD doesn't have a construct equivalent to that used for language.

So here's why i'm confused. It seems to me that either:

(a) we could add an item-specific field for base direction just like the one for language that may not be representable in JSON-LD, but presumably could still pass useful metadata to the consumer in a similar way to the use of the inDirection data, if the consumer knows how to get at that data, or

(b) there isn't actually a way of using the information provided by inDirection to pass metadata about base direction to the consumer, so what's the value of having inDirection at all?

I'm fully prepared to be told that these questions expose large chasms separating my (mis)understanding and the way this all works, so please help me get that straight.

iherman commented 5 years ago

@r12a

I agree it is a bit confusing, sorry about that.

From a purely JSON-LD (and RDF) point of view inDirection is meaningless, in the sense that it will be in the generated data but the strings themselves will not be marked up (because they can't be). This means that the application/consumer will have to do some extra processing that is on top of the clean JSON-LD level. So far we agree I guess.
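The "extra processing" could look something like the sketch below, where a consumer reads the manifest as plain JSON and applies the global values as defaults. This is only an assumption about how an application might do it; the function name `names_with_defaults` and the sample manifest shape are hypothetical.

```python
def names_with_defaults(manifest):
    """Attach the global inLanguage/inDirection values to each name entry."""
    default_language = manifest.get("inLanguage")
    default_direction = manifest.get("inDirection")
    names = manifest.get("name", [])
    if not isinstance(names, list):
        names = [names]
    resolved = []
    for entry in names:
        if isinstance(entry, str):
            entry = {"value": entry}  # treat a plain string as form (1)
        resolved.append({
            "value": entry["value"],
            # An item-level language overrides the global one.
            "language": entry.get("language", default_language),
            # There is no item-level direction, so the global value is all we have.
            "direction": default_direction,
        })
    return resolved
```

Note that a name explicitly tagged as English inside a manifest whose inDirection is "rtl" still ends up with direction "rtl" here, which is the gap being discussed in this thread.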

However... what does not work is this statement of yours:

…but presumably could still pass useful metadata to the consumer in a similar way to the use of the inDirection data, if the consumer knows how to get at that data

the problem being that there is no way to do that per JSON-LD syntax. Something like:

{"@value" : "something", "@dir" : "ltr" }

does not work. A (JSON) object using "@value" cannot have any other additional term except "@language", and a JSON-LD processor will reject such data altogether. In other words, we cannot define this extra item level metadata that the consumer could interpret.

(This was raised as a JSON-LD WG issue (https://github.com/w3c/json-ld-syntax/issues/11) and was closed by the Working Group as a "won't fix".)

The introduction of inDirection was included into the WP draft as a suboptimal solution allowing for at least some level of control for base directions. We could remove it altogether to reduce confusion, but that may be a radical move...

r12a commented 5 years ago

Thanks Ivan, but that still doesn't really answer my question.

does not work. A (JSON) object using "@value" cannot have any other additional term except "@language", and a JSON-LD processor will reject such data altogether. In other words, we cannot define this extra item level metadata that the consumer could interpret.

Yes, i understand that. It has been said many times. But if the inDirection data "was included into the WP draft as a suboptimal solution allowing for at least some level of control for base direction", then there must be some way of the consumer getting at that information, even though "From a purely JSON-LD (and RDF) point of view inDirection is meaningless".

So why can't we have an additional item-specific field (direction) which is also meaningless in JSON-LD and RDF, but which an application/consumer can get at to override the information it has already consumed from inDirection?

That's what i don't understand.

azaroth42 commented 5 years ago

My 2c: Any validation or processing of the JSON as JSON-LD would raise an error or throw away that information. Meaning that you've turned what was JSON-LD into something that isn't. It's not just meaningless, it's wrong. You could call it just JSON ... but that would be a shame, instead of gathering together the use cases for this feature as a core part of the RDF data model.

r12a commented 5 years ago

@azaroth42 , does "it's wrong" apply here to inDirection as well as item-specific fields?

r12a commented 5 years ago

For maximum clarity, btw (should have said this earlier), the scenario i'm asking about wrt item-specific data is

String with explicit language setting, i.e.,

"name": {
   "value": "some text",
   "language": "en",
   "direction": "rtl"
}

azaroth42 commented 5 years ago

Apologies, to be clearer by way of example:

It is fine to use predicates in the graph to manage this data, by assigning them to a full resource. Any other information could be added as well. This is valid:

{
  "@id": "http://example.org/.../mytitlefield",
  "@type": "eg:TitleField",
  "content": "Here is some text",
  "language": "en",
  "direction": "ltr"
}

It does not work with individual strings, as the extra properties beyond value and language are not allowed due to the RDF data model. This (and your example) is (thus) invalid:

{
  "@value": "Here is some text",
  "@language": "en",
  "@direction": "ltr"
}

As it would try to generate a literal: "Here is some text"@en ... and then blow up on @direction as invalid.

And thus does not work with language maps on resources, which rely on the language tag of the string:

{
  "type": "TitleField",
  "content": {
    "en": "Here is some text",
    "ar": "(here arabic that should be rtl)"
  },
  "direction": " ... errr ... both?..."
}

The language map is a short cut for the more verbose, less familiar @value/@language construct, not an actual resource.

Instead you would need to have multiple title fields, each with exactly one content string, one language and one direction.

[
{
  "@id": "http://example.org/.../mytitlefield-en",
  "@type": "eg:TitleField",
  "content": "Here is some text",
  "language": "en",
  "direction": "ltr"
},
{
  "@id": "http://example.org/.../mytitlefield-ar",
  "@type": "eg:TitleField",
  "content": "(arabic content here)",
  "language": "ar",
  "direction": "rtl"
}
]
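A small helper can generate this "one resource per string" shape from per-language data. It is a sketch under assumptions: the function name `title_fields` and the `base_id` parameter are hypothetical, and the `eg:TitleField` type and term names are simply the ones used in the examples above.

```python
def title_fields(base_id, strings):
    """Build one title resource per language.

    `strings` maps a language tag to a (content, direction) pair.
    """
    fields = []
    for language, (content, direction) in strings.items():
        fields.append({
            "@id": f"{base_id}-{language}",
            "@type": "eg:TitleField",
            "content": content,
            "language": language,
            "direction": direction,
        })
    return fields
```

Each resulting object carries exactly one content string, one language and one direction, so nothing has to be attached to a literal.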

It would be a significant improvement to RDF if the direction form were allowed, and by demonstrating the lack here (and in the JSON-LD group, plus elsewhere) I feel that there's a better chance to fix it properly rather than patching over it in a non-standard way, thereby reducing the desire for a prompt solution.

iherman commented 5 years ago

What you have below as a value of "name" is an object with some properties, but it is not a literal in terms of JSON LD. That distinction means that, e.g., schema.org will not understand it.

If you replace "value" with "@value" and "language" with "@language" then you do get a valid representation of a literal (which is also ok with schema.org)... except that this will not work (per JSON LD) if one keeps any other term, ie, "direction" :-(

Ie: what you propose may be valid JSON LD but means something fundamentally different and would not work for us...

(Written on my iPad. Excuses for brevity and misspellings...)

On 12 Apr 2019, at 19:48, r12a notifications@github.com wrote:

For maximum clarity, btw (should have said this earlier), the scenario i'm asking about wrt item-specific data is

String with explicit language setting, i.e., "name": { "value": "some text", "language": "en", "direction": "rtl" }

iherman commented 5 years ago

I just realized a source of terrible confusion, sorry about that. In the previous comment, the references to value and name were taken in isolation. However, it is a widespread pattern in the JSON LD world to define aliases for @value to be... value, for @language to be... language. In other words your example, as well as the examples in the document, are meant to be literals indeed, but then you hit the issue with @value and JSON LD...
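A rough sketch of the aliasing pattern, for illustration only. The context fragment and the helper `resolveAliases` are assumptions made for this example; a real JSON-LD processor does considerably more.

```javascript
// Assumed context fragment: plain keys declared as aliases of JSON-LD keywords.
const context = { value: "@value", language: "@language" };

// Hypothetical helper: rewrites aliased keys to the keywords they stand for.
function resolveAliases(obj, ctx) {
  const resolved = {};
  for (const [key, val] of Object.entries(obj)) {
    const target = ctx[key];
    const isKeywordAlias = typeof target === "string" && target.startsWith("@");
    resolved[isKeywordAlias ? target : key] = val;
  }
  return resolved;
}

console.log(resolveAliases({ value: "some text", language: "en", direction: "rtl" }, context));
// "value" and "language" become "@value" and "@language"; "direction" stays
// behind as an extra key inside what is now a value object.
```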

Sorry I did not realize this before.

I.

—— Ivan Herman

r12a commented 5 years ago

Thanks to @azaroth42 and a chat with Ivan yesterday, i finally understand that the suggestion in https://github.com/w3c/wpub/issues/354#issuecomment-482663740 involves, by convention, a special JSON-LD construct (because it contains a mapping to @value) that represents an RDF string rather than an object, and that for this to hold, an object containing an @value cannot contain anything other than language and type. Adding anything else breaks its meaning for automated tools.

I then found myself wondering whether (just for the specific instances where we know that 1st strong heuristics will fail, eg a title like "HTML و CSS: تصميم و إنشاء مواقع الويب") we could use a different approach, such as

{
  ...
  "content": "HTML و CSS: تصميم و إنشاء مواقع الويب",
  "language": "ar",
  "direction": "rtl"
}

for which this spec would provide advice about how applications can convert the relevant parts to RDF (minus the direction info), or otherwise spot that they need to override the 1st strong heuristics.

Ivan told me that this would create too much repetition in the manifest, and @azaroth42 appears to be saying that it would create a getout that takes the pressure off the JSON-LD/RDF folks to properly address direction. And i readily admit that it's not an elegant solution.

Therefore i conclude that, for the manifests created per this spec, wpub's preferred way of dealing with such a problematic title is to prepend an RLM to the string, like this:

"name": [
        {
            "value": "&rlm;HTML و CSS: تصميم و إنشاء مواقع الويب",
            "language": "ar"
        }
],

Is that correct?
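For readers unfamiliar with the first-strong heuristic, the following sketch shows why the title above is misdetected and why a leading RLM changes the result. It is a deliberate simplification: it treats only RLM, ALM and Hebrew or Arabic letters as strong right-to-left characters rather than consulting the full Unicode bidirectional character data, and the function name `firstStrongDirection` is made up for this example.

```javascript
const LETTER = /^\p{L}$/u;
// Simplification: only Hebrew and Arabic script letters count as strong RTL.
const RTL_LETTER = /^[\p{Script=Hebrew}\p{Script=Arabic}]$/u;

// Returns "ltr" or "rtl" for the first strongly directional character, or null if none.
function firstStrongDirection(text) {
  for (const ch of text) {
    if (ch === "\u200E") return "ltr"; // LRM
    if (ch === "\u200F" || ch === "\u061C") return "rtl"; // RLM, ALM
    if (LETTER.test(ch)) return RTL_LETTER.test(ch) ? "rtl" : "ltr";
  }
  return null;
}

const title = "HTML و CSS: تصميم و إنشاء مواقع الويب";
console.log(firstStrongDirection(title)); // "ltr", because "H" is the first strong character
console.log(firstStrongDirection("\u200F" + title)); // "rtl", because the RLM comes first
```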

iherman commented 5 years ago

@r12a, yes, your summary is correct.

The current editors' draft does have a paragraph (the paragraph after the note in section 2.6.4.4.2) that refers to this possibility. We would appreciate it, however, if you could have a look and give us feedback that would make this clearer to the reader. If necessary, we can also add some more examples in the BiDi example table.

(It may be a good idea to provide such feedback in a separate issue, though, and close this one which came out of the TPAC F2F discussion.)

Cc: @mattgarrish @TzviyaSiegman

aphillips commented 5 years ago

@r12a I kind of disagree that this should be "wpub's preferred way of dealing with this". This is a workaround. For it to be automatic it would require wpub implementations to introspect string values and insert an RLM (or LRM in selected cases) marker (changing the data, which is generally a bad idea). I don't think this should be normative.

This is really more "advice to content authors" for dealing with RDF (etc.'s) shortcomings. To that I would add advice to wpub consumers that they can use other mechanisms described in string-meta, such as inferring the direction from the language tag.
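As an illustration of the consumer-side fallback, a direction guess from the language tag might look like the sketch below. The subtag lists are assumptions chosen for this example and are not exhaustive, and `directionFromLanguageTag` is a hypothetical name; a guess like this cannot replace real direction metadata.

```javascript
// Assumed, non-exhaustive list of primary language subtags usually written right-to-left.
const RTL_LANGUAGES = new Set(["ar", "dv", "fa", "he", "ps", "sd", "ug", "ur", "yi"]);
// Assumed, non-exhaustive list of right-to-left script subtags (lowercased).
const RTL_SCRIPTS = new Set(["arab", "hebr", "nkoo", "syrc", "thaa"]);

// Guesses a base direction from a BCP 47 tag: an explicit script subtag wins,
// otherwise the primary language subtag decides.
function directionFromLanguageTag(tag) {
  const subtags = tag.toLowerCase().split("-");
  // Simplification: treat the first four-letter alphabetic subtag after the language as the script.
  const script = subtags.slice(1).find((subtag) => /^[a-z]{4}$/.test(subtag));
  if (script) return RTL_SCRIPTS.has(script) ? "rtl" : "ltr";
  return RTL_LANGUAGES.has(subtags[0]) ? "rtl" : "ltr";
}

console.log(directionFromLanguageTag("ar")); // "rtl"
console.log(directionFromLanguageTag("ar-Latn")); // "ltr"
```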

Effectively, we have no solution to the problem and no short term path to a solution. We (I18N) should engage RDF-NG, JSON-LD, and schema folks about building a long term solution.

@iherman I agree with your resolution. Consider referencing the examples in string-meta to save space in your document. @r12a and I will review what you have and raise individual specific new issues.

iherman commented 5 years ago

@aphillips I see your point about the difficulties for implementers.

Just a layperson's informative question, though. If (and I agree that is a big 'if') the author/editor of a book produces a proper string using, e.g., &lrm;, and the user agent "just" uses current Web Technologies to display the text, would that be properly handled by today's browsers? The reason this may be important is that wpub user agents may rely, internally, on the same core rendering engines as the major browsers; i.e., the question is whether they have to do something special or not.

I am looking forward to your comments on the text in the current draft; we may then have an editorial run at those paragraphs and/or the examples.

I am happy to participate in the work around RDF & directions if RDF-NG work is indeed initiated.

aphillips commented 5 years ago

@iherman If a string includes one of the strongly directional markers and the string is displayed in a dir=auto context, then current Web Technologies will display it correctly. If it is displayed in an opposite direction context (e.g. dir=ltr when the text has an &rlm; on it and is a mixed direction rtl text--as in the example above) then it might (depending on the text) be displayed incorrectly. Here (from a default ltr page) is "no dir", dir=auto, dir=rtl, and dir=ltr of the example string Richard gave above (with an RLM as the first character):

One of the reasons to have metadata is so that implementations can supply the @dir correctly. Note that adding the isolating controls around strings gets a better result in several (but not all) browsers, but has the downside that not all native controls support these Unicode characters yet (and may display them as "tofu" boxes):

(above has dir=ltr so displayed left aligned--but the isolating controls cause proper RTL display)
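For reference, the isolating controls in question are U+2066 LEFT-TO-RIGHT ISOLATE, U+2067 RIGHT-TO-LEFT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE. A minimal sketch of wrapping a string with them follows; the helper name `isolate` is hypothetical, and whether a consumer should modify strings this way is exactly the concern raised earlier in this thread.

```javascript
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Wraps text in an isolate pair for the given base direction ("rtl" or "ltr").
function isolate(text, direction) {
  const opener = direction === "rtl" ? RLI : LRI;
  return opener + text + PDI;
}

const wrapped = isolate("HTML و CSS: تصميم و إنشاء مواقع الويب", "rtl");
console.log(wrapped.codePointAt(0).toString(16)); // "2067"
```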

BigBlueHat commented 5 years ago

In @r12a's proposal, the &rlm; "entity" shows up...

    {
        "value": "&rlm;HTML و CSS: تصميم و إنشاء مواقع الويب",
        "language": "ar"
    }

Earlier, @llemeurfr points out that if we're not doing HTML processing (which would not be the case here because a language is included--so it has to be a string...), then...

there would also be a need to define the &lrm; and &rlm; entities, which is yet another complexity.

@iherman are we planning to require implementing those "magic" entities in WPUB strings? or are they just "stand-ins" for the otherwise invisible Unicode characters that @aphillips mentioned?

iherman commented 5 years ago

@BigBlueHat

@iherman are we planning to require implementing those "magic" entities in WPUB strings? or are they just "stand-ins" for the otherwise invisible Unicode characters that @aphillips mentioned?

as far as I am concerned, the latter. Ie, we would have

{
        "value": "U+200EHTML و CSS: تصميم و إنشاء مواقع الويب",
        "language": "ar"
 }

BigBlueHat commented 5 years ago

@r12a @aphillips can y'all confirm that the above is "sufficient" for what's been discussed here?

Also, these approaches don't handle multi-language strings...for which we'd need HTML (and ideally a limited subset).

aphillips commented 5 years ago

@BigBlueHat Including strongly directional characters, such as the RLM mark, is something authors do in their content. You could have an advisory message to do this, but I would oppose normative language requiring wpub applications to provide these characters.

No approach handles multi-language plain-text strings. That does require markup or other mechanisms. These are relatively rare.

r12a commented 5 years ago

In @r12a's proposal, the &rlm; "entity" shows up...

Just to be clear, this was not my proposal: it was my understanding of the wpub proposal after talking with Ivan.

@iherman are we planning to require implementing those "magic" entities in WPUB strings? or are they just "stand-ins" for the otherwise invisible Unicode characters that @aphillips mentioned?

For plain text strings i believe you'd be looking at either the invisible formatting character itself, or perhaps, since this is Javascript,

{
        "value": "\u200FHTML و CSS: تصميم و إنشاء مواقع الويب",
        "language": "ar"
 }

The escape could also be written \u{200F}. (Note, Ivan, that it's 200F, not 200E, and that the U+ notation would not be appropriate.)

I wouldn't expect to see &rlm; in plain text strings.
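A quick way to see the difference between these notations, sketched with `JSON.parse`. One caveat worth stating: the `\u{200F}` form is JavaScript string-literal syntax, whereas JSON itself only accepts the four-digit `\u200F` escape, so a manifest stored as a JSON file would need the latter (or the raw character). The shortened title and the helper `parsesAsJson` are just for this example.

```javascript
// The manifest text as it would appear on disk: String.raw keeps the backslash
// escape as written, so JSON.parse is the one that decodes it.
const manifestText = String.raw`{ "value": "\u200FHTML و CSS", "language": "ar" }`;
const entry = JSON.parse(manifestText);
console.log(entry.value.codePointAt(0).toString(16)); // "200f"

// Helper for this example: reports whether a text is valid JSON.
function parsesAsJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (error) {
    return false;
  }
}

console.log(parsesAsJson(String.raw`"\u{200F}HTML"`)); // false: not a JSON escape
// An HTML entity is not decoded in a plain string: it stays as six literal characters.
console.log(JSON.parse('"&rlm;HTML"').startsWith("&rlm;")); // true
```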

iherman commented 5 years ago

https://github.com/w3c/wpub/issues/440 is now a separate issue on the editorial aspect of the document.
