Setting the (text) direction #220

iherman commented 6 years ago

This is a completely open issue at this moment, both for JSON-LD and Schema.org... The only (incomplete) approach would be to rely on, and base everything, on the UTF-encoding of the text...

iherman commented 6 years ago

See also the separate discussion on the JSON-LD 1.1 CG: https://github.com/json-ld/json-ld.org/issues/583

iherman commented 6 years ago

Another reference: https://w3c.github.io/string-meta/

danielweck commented 6 years ago

The only (incomplete) approach would be to rely on, and base everything, on the UTF-encoding of the text...

Do you mean Unicode Bidi? http://unicode.org/reports/tr9/

iherman commented 6 years ago

@danielweck

More specifically, see https://github.com/json-ld/json-ld.org/issues/583#issuecomment-364141212 : which referred to this. The following discussion (which was, as far as I am concerned, inconclusive) gave some pros and cons to that approach.

Note that the JSON-LD CG decided to defer that issue to the JSON-LD WG which has just been formed; I hope that the discussion will re-start with some more people involved (eg, Schema.org people as well). We may want to defer this issue to see where that discussion goes.

iherman commented 6 years ago

@TzviyaSiegman reminded me that there is another approach that is perfectly viable, namely use the HTML datatype. What this means in practice is that, if a text has bidirectional issues, it could use HTML syntax and the result would be considered to be a string of an HTML datatype in RDF parlance. Here is what it would mean in JSON-LD:

{
    "name" : {
        "@value" : "We find the phrase '<span dir="rtl" lang="he">פעילות הבינאום</span>' 5 times on the page.",
        "@type" : "rdf:HTML"
    }
}

(The trick is to ensure that the character '5' appears on the right hand side of the Hebrew text. If the span is not used, the number will be treated as if it were part of the Hebrew text and will appear on the left of it!)

From an internationalization point of view, that is much better, because it gives a better control. We could therefore say that, for example for the term name, the author should use that approach. I see two problems with that, too:

It may be an extra load on authors (and maybe not; I am not sure how frequent these occurrences are)
Just as for the case #219, while the above JSON-LD is perfectly fine, the google structured data tester does not accept it :-(

laudrain commented 6 years ago

When written as:

{
    "name" : {
        "@value" : "We find the phrase '<span dir='rtl' lang='he'>פעילות הבינאום</span>' 5 times on the page.",
        "@type" : "rdf:HTML"
    }
}

the google structured data tester seems to validate it. I have replaced the double quotes by single quotes for the attribute values which is ok in HTML5.

The following HTML5 document is valid in https://validator.w3.org/:

<!DOCTYPE html>
<title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
<p>We find the phrase '<span dir='rtl' lang='he'>פעילות הבינאום</span>' 5 times on the page.</p>
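
The difference between the two variants can be checked with any JSON parser. Below is a minimal Python sketch using only the standard `json` module; the variable and function names are illustrative. It also shows a third option not mentioned above: keeping double quotes in the HTML and escaping them with a backslash.

```python
import json

# Unescaped double quotes inside the value end the JSON string early.
broken = '{"name": {"@value": "<span dir="rtl" lang="he">פעילות הבינאום</span>", "@type": "rdf:HTML"}}'

# Single-quoted HTML attribute values need no escaping inside a JSON string.
single = '{"name": {"@value": "<span dir=\'rtl\' lang=\'he\'>פעילות הבינאום</span>", "@type": "rdf:HTML"}}'

# Double quotes also work when each one is escaped as \" in the JSON text.
escaped = '{"name": {"@value": "<span dir=\\"rtl\\" lang=\\"he\\">פעילות הבינאום</span>", "@type": "rdf:HTML"}}'


def parses(text: str) -> bool:
    """Return True if text is well-formed JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(broken), parses(single), parses(escaped))  # False True True
```

This only checks JSON well-formedness; whether a given structured-data tool accepts the `rdf:HTML` datatype is a separate question.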

iherman commented 6 years ago

@laudrain oops:-) That was my mistake. But then... this looks good as a solution for direction.

However, you as a potential author: how would you like it?

laudrain commented 6 years ago

I like it. The question is: will it be possible to repeat the name of the author with multiple scripts and directions?

Taking an example from EPUB 3.1 spec[1] with a Japanese name:

{
    "name" : {
        "@value" : "Haruki Murakami",
        "@type" : "rdf:HTML"
    }
}
{
    "name" : {
        "@value" : "<span dir='rtl' lang='ja'>村上 春樹</span>",
        "@type" : "rdf:HTML"
    }
}

Is this correct? Even possible?

[1] http://www.idpf.org/epub/31/spec/epub-packages.html#sec-shared-attrs

iherman commented 6 years ago

@laudrain which checker did you use? I just tested something that is based on what you ask on https://search.google.com/structured-data/testing-tool and I get an error:

[screenshot of the error, 2018-06-13 at 15:15:28]

iherman commented 6 years ago

(The previous example is accepted by the JSON-LD playground...)

laudrain commented 6 years ago

I used: https://search.google.com/structured-data/testing-tool/u/0/?hl=fr

[screenshot, 2018-06-13 at 15:30:35]

iherman commented 6 years ago

It is the same tool, @laudrain. However, try to add a "@context":"http://schema.org" :-(

BigBlueHat commented 6 years ago

https://schema.org/name is only defined as https://schema.org/Text, so it can't contain HTML. Sorry folks.

laudrain commented 6 years ago

End of the game ?

iherman commented 6 years ago

@BigBlueHat

I could argue that "text", at least in RDF land (though it is called "Literal"), may have a datatype, and this is all that the HTML stuff does but... me arguing does not make any sense, obviously.

Sigh...

iherman commented 6 years ago

@laudrain rather back to square one.

Putting Unicode directionality control characters into the text works; see the examples in the Activity Streams spec. It is just ugly and may create problems with search.

laudrain commented 6 years ago

Why problems with search? The characteristics of this code should prevent them: https://codepoints.net/U+2067

iherman commented 6 years ago

@laudrain I think (and I am on a bit of a slippery slope, because I am not an expert on these things) the problem is that search (or a query in a database) is based on comparing Unicode code points, and it is way too easy to make a mistake and give a search term that does not include those extra characters. That may be the issue. This is certainly the case when doing a database search in a graph database (e.g., using SPARQL).

danielweck commented 6 years ago

I am not a specialist, but my understanding is that "text search" typically operates on multiple layers of abstractions over Unicode code units and/or code points, in ways that are quite domain-specific. Typically, both the query and input strings need to be normalized (language-specific handling of accented characters, punctuation removal, etc.) and are subject to further heuristic interpretation (conjugation, synonyms, logical combinators, etc.) I do not have a clear picture of how Unicode BiDi markers affect these processing steps. I would also be interested to know how hard/easy it is to edit such RTL markers into strings in the first place (i.e. in authored metadata properties, and in user-provided search / form input fields).
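
To make the comparison problem concrete, here is a small Python sketch. It assumes a naive search that compares strings code point by code point, and a hypothetical normalization step (`strip_bidi_controls`, a name invented for this example) that removes the bidi formatting characters before comparing. It does not describe what any particular search engine or SPARQL store actually does.

```python
# Unicode bidi formatting characters: LRM, RLM, ALM, the embeddings and
# overrides (U+202A..U+202E) and the isolates (U+2066..U+2069).
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def strip_bidi_controls(text: str) -> str:
    """Remove bidi formatting characters so strings compare by visible content."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


stored = "\u2067村上 春樹"  # value as authored, with a leading U+2067
query = "村上 春樹"         # what a user would most likely type

print(stored == query)                       # False
print(strip_bidi_controls(stored) == query)  # True
```

Whether stripping these characters is acceptable depends on the application, since it discards the direction information the author added.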

laudrain commented 6 years ago

For language direction, this one seems ok:

{
    "@context": "http://schema.org",
    "@type": "Book",
    "author": {
        "@type": "Person",
        "name": "Haruki Murakami",
        "alternateName": "\u2067村上 春樹"
    }
}

but it lacks the language tag.

iherman commented 6 years ago

For language direction, this one seems ok:

{ "@context":"http://schema.org", "@type":"Book", "author": { "@type":"Person", "name": "Haruki Murakami", "alternateName": "\u2067村上春樹" } }

but it lacks the language tag.

Yes, that should work, but the missing language tag is a problem (hopefully we can sort that out with schema.org).

The question is: how much of a drag is it for authors to add the \u2067? @tzviyasiegman may be in a better position than others to answer this: the problem really arises when there is a mixture of left-to-right and right-to-left scripts, otherwise those tags are not necessary. (E.g., in the example above, the 'ltr' flag is not really necessary for the Japanese name of Murakami, because the Japanese characters convey that information by default in Unicode).
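
The directionality that characters carry "by default" can be inspected with Python's standard `unicodedata` module, which exposes each character's Unicode bidi class. The helper name `bidi_classes` is invented for this sketch. Note that U+2067 is the RIGHT-TO-LEFT ISOLATE (bidi class `RLI`).

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidi class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


print(bidi_classes("村上"))                  # ['L', 'L']   strong left-to-right
print(bidi_classes("פע"))                   # ['R', 'R']   strong right-to-left
print(unicodedata.bidirectional("5"))       # EN           European number, not strong
print(unicodedata.bidirectional("\u2067"))  # RLI          right-to-left isolate
```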

llemeurfr commented 6 years ago

I may be offbeat, but feel that using some alternateName for xlang properties is an issue. Why would one language be a primary one and other subsidiary, as 'alternate' suggests in practice?

Also, For property values in one single language (i DON'T speak about strings using a mix of LTR and RTL), don't you think that the language attribute is enough for what UA have to do, i.e. filter the proper variant and display the value?

iherman commented 6 years ago

@llemeurfr we have not addressed this alternateName issue at all so far, this is only to explore the I18N issues...

Also, For property values in one single language (i DON'T speak about strings using a mix of LTR and RTL), don't you think that the language attribute is enough for what UA have to do, i.e. filter the proper variant and display the value?

Depends on what we expect from the UA for alternate names which, again, we have not discussed so far. But I believe you are right on a more general level: having the language information available is a necessity.

llemeurfr commented 6 years ago

@iherman, yes, my main question here is: what would be the practical use of a direction attribute for property values that are in a single language? I think the answer is none. If this is the case, we should make it easy to express a property expressed in multiple languages, each value being in a single language, i.e repeatable property with a language attribute, i.e what is possible today with JSON-LD. This may correspond to 99% of the use cases.

After this is settled, if we find a way to express values containing a mix of LTR & RTL using Unicode bidi characters or any other markup, fine.
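
As an illustration of the "repeatable property with a language attribute" pattern, here is a Python sketch. It assumes the JSON-LD value-object form (`@value` plus `@language`) and a hypothetical helper `pick_value` that a user agent might use to filter the proper variant. As noted elsewhere in this thread, schema.org processors may not accept language-tagged values today.

```python
manifest = {
    "@context": "http://schema.org",
    "@type": "Book",
    "author": {
        "@type": "Person",
        # The same property repeated, each value in a single language.
        "name": [
            {"@value": "Haruki Murakami", "@language": "en"},
            {"@value": "村上 春樹", "@language": "ja"},
        ],
    },
}


def pick_value(values, preferred):
    """Return the first @value matching a preferred language tag, in order of preference.

    Falls back to the first listed value, or None for an empty list.
    """
    for lang in preferred:
        for item in values:
            if item.get("@language") == lang:
                return item["@value"]
    return values[0]["@value"] if values else None


names = manifest["author"]["name"]
print(pick_value(names, ["ja"]))        # 村上 春樹
print(pick_value(names, ["fr", "en"]))  # Haruki Murakami
```

The exact-match comparison is a simplification; real language-tag matching (for example "en" against "en-GB") needs more care.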

iherman commented 6 years ago

@iherman, yes, my main question here is: what would be the practical use of a direction attribute for property values that are in a single language? I think the answer is none.

I believe that is correct.

If this is the case, we should make it easy to express a property expressed in multiple languages, each value being in a single language, i.e repeatable property with a language attribute, i.e what is possible today with JSON-LD. This may correspond to 99% of the use cases.

Again, that is correct. See #219 .

After this is settled, if we find a way to express values containing a mix of LTR & RTL using Unicode bidi characters or any other markup, fine.

Correct again. We can indeed put forward a resolution whereby we rely on the Unicode bidi characters like the Activity Stream Recommendation does: the advantage is that we can adopt it right away and we do not hit any obstacle with JSON-LD and/or Schema.org. The disadvantage is that it is a bit complex to author the metadata...

There are some people in the group who may have some experience with authoring mixed setups; would be good to hear whether that approach could work...

iherman commented 6 years ago

I try to offer an approach to close this issue. I am motivated by:

The inherent difficulty of solving this issue on the same level as the language tag; at its core, this is due to the fact that the underlying RDF model (alas!) does not offer any means to represent bidirectional texts properly. JSON-LD does not (and cannot really) solve this, as shown also by the comments on the relevant issue (issue 11 in the JSON-LD WG, originally issue 583 in the JSON-LD CG).
The solution adopted by the Activity Stream standard which has many similarities to ours: it is a JSON-LD based encoding of some information which also includes language and direction marks for the terms that are meant to use natural language.

The core of the proposal is to rely exclusively on the strong directionality information of the "usual" Unicode characters, as well as on the special Unicode control characters. Editorially, this would mean something like:

We keep the default base direction infoset item. In its manifest expression we introduce a term of our own with values "ltr" or "rtl" (and we try to get that term into schema.org). The default is "ltr". This term is important to keep because UA-s may use it for the general display of information of the WP (e.g., justify text to the right for Hebrew, Persian, Arabic, etc.)
We do not talk about the direction of individual (textual) values any more. Instead, we put a general text in the introduction of the infoset (or as part of the language item) as follows (I am blatantly stealing text from the Activity Stream standard:-):

When specifying bidirectional text for a natural language value, and the base direction of the text cannot be correctly identified by the first strong directional character of that text, the value should explicitly identify the default direction by prefixing the value with an appropriate Unicode bidirectional control character.

User agents, when using Web Publication manifests that contain bidirectional text should identify the base direction of any given natural language value by either scanning the text for the first strong directional character not contained within a markup tag; or by utilizing directional markup where provided. Once the base direction has been identified, consumers must determine the appropriate rendering and display of natural language values, according to the Unicode Bidirectional Algorithm [BIDI]. This may necessitate wrapping additional control characters or markup around the string prior to display, in order to apply the base direction.

It is worth looking at the first four rows of the table in Activity Stream standard to show what is happening (we could maybe take over that table as an example).
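
The first-strong scan described in the proposed text can be sketched in Python with the standard `unicodedata` module. This is a simplified illustration, not the full Unicode Bidirectional Algorithm: it ignores markup tags and the isolate and embedding controls, and the names `first_strong_direction` and `with_base_direction` are made up for this example.

```python
import unicodedata
from typing import Optional

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def first_strong_direction(text: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' based on the first strong character, or None if there is none."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return None


def with_base_direction(text: str, direction: str) -> str:
    """Prefix text with a direction mark only when first-strong detection would disagree."""
    if first_strong_direction(text) == direction:
        return text
    return (RLM if direction == "rtl" else LRM) + text


value = "W3C פעילות הבינאום"              # starts with Latin letters
print(first_strong_direction(value))     # ltr
fixed = with_base_direction(value, "rtl")
print(first_strong_direction(fixed))     # rtl
```

An author-side tool could apply something like `with_base_direction` when the metadata is saved, so that the control character does not have to be typed by hand.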

WDYT?

Cc @mattgarrish @llemeurfr @laudrain @danielweck

Note that the Activity Stream examples (and the text) also talk about the possibility of using HTML markup in the text, an avenue we discussed earlier in this issue (and we stopped by the fact that it would be very difficult, if not impossible, to get that into Schema.org).

iherman commented 6 years ago

As an "adjacent" change proposal: at the moment, the default language for textual information is set through the JSON-LD trick of setting the language in the context. This is fairly asymmetric with the fact that we would have a separate term for the default direction. I would also propose, therefore, to use a WP specific "defaultTextDirection" and "defaultTextLanguage" term, respectively...

iherman commented 6 years ago

To make the proposal clear(er), I have created a separate branch with a first draft implementing this proposal, see https://rawgit.com/w3c/wpub/solve-directionality-issue/index.html#language-and-dir

This is not a Pull Request at this point, just a way of making the proposal clearer...

(Note that I will be on vacation starting tonight for about 10 days, so if I do not react, this is the reason. I trust @mattgarrish to do the right thing with it...)

Cc @TzviyaSiegman @llemeurfr @laudrain

TzviyaSiegman commented 6 years ago

Looks good. I think we need to provide some explanation around "und". We could take that from Activity Streams too.

llemeurfr commented 6 years ago

I'm not a fan of the names proposed for these properties ( "defaultTextDirection" and "defaultTextLanguage") but this is a bikeshedding detail that can be treated later. Apart from that detail the proposal is good.

mattgarrish commented 6 years ago

Not to bikeshed, but for a bit of brevity could we just use textDirection and textLanguage? Default-iness can be determined from the description.

Otherwise, looks fine to me.

iherman commented 6 years ago

I'm certainly not bound to those names. textDirection and textLanguage is fine with me.

iherman commented 6 years ago

Unfortunately, I realized that I have fallen into a trap, and the proposed solution for the default direction is not really clean:-( The problem is with the semantics of what JSON-LD/Schema.org really expresses.

In general, when we have, in the manifest, something like

"id" : "http://www.the.book.id",
"author": "John Doe"

What that means, in English, is that

The author of the publication, whose identifier is http://www.the.book.id, is "John Doe"

Ie, every statement is something we say about the publication with the identifier (or address). However, when we have a statement like "defaultLanguage":"fr", what we want to express is not that the default language of the publication is French, but that the default language of the "metadata" about the publication is French. This is the reason that, in the current draft (not in my proposal) we used the extension of the @context to express the default language for the manifest statements.

Expressing all this properly, though possible, would involve other notions in JSON-LD (i.e., Datasets) that are (a) probably too complicated for most of our users/readers and (b) probably not understood by the schema.org processors. We should not go down that route, imho.

Sigh.

I can see two approaches:

1. We move back to the @context extension using "@language":"fr" for the default language (i.e., we keep what is in the current draft), and we accept that there is no simple way to express the global, default direction and, therefore, we remove that notion from our infoset. Bidirectional texts are solely expressed by their UTF encoding. (After all, EPUB3 cannot express this either, and it did not seem like a major drawback, although we simply may not have heard of Hebrew, Arabic, Farsi, etc, publishers.)
2. We accept that there is a semantic impurity, but we keep to the proposal and we consider it as a pragmatic solution. The terms put forward are indeed not schema.org but our "own", and we can try to provide a decent rationale in their formal definition in the vocabulary itself. And we move on:-)

Under the adage that usability and authors'/users' interest has a higher priority than theoretical purity, I am mildly in favor of (2) above. But if we do that, we have to realize what is happening, ie, that we are cheating...

(My apologies not to have realized this when I made the proposal.)

Cc @TzviyaSiegman @mattgarrish @llemeurfr

mattgarrish commented 6 years ago

After all, EPUB3 cannot express this either, and it did not seem like a major drawback

EPUB does allow the default directionality to be specified through the dir attribute on the package element. You can also override it on each text-carrying element.

The problem with minting stuff ourselves is that we'll be stuck supporting it for as long as the format exists. It might be useful to add our own solution and highlight it as an issue we need feedback on in the next working draft.

iherman commented 6 years ago

@mattgarrish that is fine. Unless there are major objections you should then add a note to the draft (maybe also referring to the problems outlined above) and merge to the main branch...

(I’m on vacation for 10 days, I won’t do it now...)

llemeurfr commented 6 years ago

Reading the https://w3c.github.io/wpub/#language-and-dir section with fresh eyes, I feel that we'll face a huge misunderstanding of what these 2 properties are for, from implementers. On language because what publishers want to express is mostly the "language of the book" (dc:language). On base direction because most will confuse this with the page progression direction.

So I would rather suppress the whole section and state that the language of the metadata will be inferred from the language of the book itself (i.e. the content), unless specified on the metadata value itself. This is short and pragmatic (the border between content and metadata is thin).

And we must acknowledge that there is no perfect solution today on the Web (and in JSON-LD) for expressing the base direction of metadata values in edge cases, therefore we'll stick with https://w3c.github.io/string-meta/ recommendations and JSON-LD specification.

iherman commented 6 years ago

@llemeurfr, I just want first to have a clear understanding of what you propose. Is it so that:

1. The language setting and base directions of the "primary entry page" (if any) also provides the (default) language and directionality for all the textual manifest items. In other words, if the manifest is embedded in the entry page, it "inherits" those values
2. There is no way to set the default language/base direction for the manifest textual items in the manifest itself.
3. Using the JSON-LD facilities (hoping that, eventually, schema.org processors will accept these) it is possible to set a specific language for an individual textual manifest item.
4. It is not possible to set the directionality of an individual textual manifest item except through the UTF direction marks.

Provided this is indeed what you propose, my 2 cents:

First of all, I can live with (1) (even if we maintain the rest of what we have); we already "inherit" the title from the enclosing primary entry page, ie, one could also be fine with the language. Although, we have to realize that it is not crystal clear semantically: just like the <meta> elements in the enclosing HTML file provide metadata for the enclosing document only (and not for a collection), and hence it is semantically erroneous to mix these two, the same holds (in my view) for language tags. We may have to transgress purity in favor of author's ease...
For (2): I simply do not know how often one has a situation whereby the language of the publication and the language of the manifest would differ. The message I got early on in the WG discussions was that these two may be different, and hence it is important to separate these. If that is not the case (this is the decision of the group, obviously) then, of course, the combination of (1) and (2) is fine with me.
All that being said, putting aside the deficiencies of schema.org processors, setting the language in the context for all textual values is a legal and existing JSON-LD facility. Does it mean that we would have to explicitly disallow its usage?
I presume we are in agreement on (3) and (4).

iherman commented 6 years ago

@llemeurfr is it o.k. if I prepare a separate draft (not necessarily a PR yet) that is based on the idea that the language/dir is inherited from the primary entry page, and we can then look at that? Thinking about it further since yesterday this may be a much better option indeed, with the least of the semantic issues...

If you are fine, I can try to do this before our call on Monday.

Cc @mattgarrish

llemeurfr commented 6 years ago

@iherman this is not what I have in mind. I'll try to express it in a clearer manner:

https://w3c.github.io/wpub/#properties-intro states that Descriptive properties describe aspects of a Web Publication, such as its title, creator, and language.
the section corresponding to this language metadata is 4.4.5, which describes the language of "Each textual property in the Web Publication's infoset", which is inconsistent with the introduction.
So, let's rename 4.4.5 "Language" and define here "a language of the publication". This is what publishers will expect.
Let's add in this section that the default language of the textual properties associated to the publication can be inferred from this value (which is intuitive, and is the reverse of point 2 in your list) and that individual textual properties can override this default language by expressing a value using the JSON-LD format (this is point 3 in your list).
And add your point 4 about the directionality of an individual textual manifest item.

nb: I would be against point 1 in your list, the inference is too remote.

mattgarrish commented 6 years ago

which describes the language of "Each textual property in the Web Publication's infoset", which is inconsistent with the introduction.

We never did resolve that issue - how epub uses dc:language for the publication and xml:lang for the package metadata values.

If we require that the first language code listed be the default language of the publication and manifest values (i.e., the property is either a single value or an array of values), then it probably makes as much sense as any other approach for now.

iherman commented 6 years ago

@llemeurfr that is indeed radically different, just as I got to like 'inheriting' the language/dir settings from the HTML level...:-)

However... I see a serious problem with what you propose. You give the manifest a primary role in setting the language. However, that information will be invisible to vanilla (ie, not WP aware) browsers. This means that the language for the real (HTML) content will be considered as "und" unless the language is set on an HTML element as well. A source of redundancy. And then, of course, we may have an issue if the two are in conflict: English is set in the manifest and French in the content. What happens then?

Unfortunately, for me, that is a serious flaw and I would not be in favour of that approach...

I would actually argue for what I thought you had convinced me about:-): The case of the embedded manifest is particularly attractive: the language and direction is set on the, say, <html> element and the manifest automatically inherits it (unless it is explicitly set otherwise). Actually, one can also use the HTML facilities, and use <script lang="fr"> as well or, even, <script lang="ar" dir="rtl">, which would be a way to set the default language and direction for the manifest. It sounds fairly clean to me.

It is indeed a bit more 'distant' in the case of a separate manifest file but, there again, we could say that the language and dir on the <link> element (which, per HTML, is inherited from its ancestors unless explicitly set) is the one valid for the manifest (again, unless the manifest changes it explicitly). It is a bit less clean than the embedded case, but works.

In both cases the advantage is that a vanilla browser understands the language setting from the HTML, ie, there will be no possible discrepancy in the rendering. That is a major plus. (And is better than the current draft, actually!)
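
For the embedded case, the inheritance idea can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions: the manifest is assumed to sit in a `<script type="application/ld+json">` element, `lang` and `dir` are resolved by walking the open ancestors, and the names `ManifestLangDir` and `manifest_lang_dir` are invented here. It is not a conforming HTML tree builder.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ManifestLangDir(HTMLParser):
    """Record the lang/dir in effect on the first embedded JSON-LD script."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, lang, dir) for currently open elements
        self.result = None  # (lang, dir) once the manifest script is seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Inherit from the nearest open ancestor unless set on this element.
        parent = self.stack[-1][1:] if self.stack else (None, None)
        lang = attrs.get("lang", parent[0])
        direction = attrs.get("dir", parent[1])
        if (tag == "script" and attrs.get("type") == "application/ld+json"
                and self.result is None):
            self.result = (lang, direction)
        if tag not in VOID:
            self.stack.append((tag, lang, direction))

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating omitted end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def manifest_lang_dir(html: str):
    """Return (lang, dir) for the embedded manifest, or None if there is none."""
    parser = ManifestLangDir()
    parser.feed(html)
    return parser.result


page = """<!DOCTYPE html>
<html lang="fr" dir="ltr">
<head>
<meta charset="utf-8">
<script type="application/ld+json">{"name": "..."}</script>
</head>
</html>"""

print(manifest_lang_dir(page))  # ('fr', 'ltr')
override = page.replace("<script type", '<script lang="ar" dir="rtl" type')
print(manifest_lang_dir(override))  # ('ar', 'rtl')
```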

llemeurfr commented 6 years ago

@iherman I consider it required to set the language on each HTML resource individually, as it is the practice on the Web. Voice engine and other tools will make good use of it. I don't see any issue if the two (publication metadata in the manifest and resource level information) are in conflict, as they will be used by different tools. **it happens, but it won't break the experience of the user.

@mattgarrish I agree that it should be the first language value, as Jiminy advocated in its internationalization paper.

mattgarrish commented 6 years ago

I consider it required to set the language on each HTML resource individually

Yes, the language specified in the manifest is not used to set the language of the resources, just as it isn't in EPUB. It's there to provide context. The usual examples are to preload tts languages, offer to download dictionaries, etc.

llemeurfr commented 6 years ago

the language specified in the manifest is not used to set the language of the resources; it's there to provide context. The usual examples are to preload tts languages, offer to download dictionaries, etc.

... a very good editorial note to add to the spec of this language property.

iherman commented 6 years ago

the language specified in the manifest is not used to set the language of the resources; it's there to provide context. The usual examples are to preload tts languages, offer to download dictionaries, etc.

... a very good editorial note to add to the spec of this language property.

But isn't that against what you propose, @llemeurfr ? The language specified in the manifest is, in your proposal, considered to be the language of the content, too. Ie, it does (much) more than setting the text in the context...

Even if we consider the possible conflict as a negligible issue I think that we would introduce a source of further confusion. And, per @mattgarrish

Yes, the language specified in the manifest is not used to set the language of the resources, just as it isn't in EPUB.

ie, what you propose would be the contrary of what EPUB does...

llemeurfr commented 6 years ago

@iherman no, it's "a language of the publication" and by inference also the default language of descriptive metadata if in first position in a list. If there is only one publication language and it's not what the UA finds when getting the language of the HTML resource, there is an editorial discrepancy. But so what?

iherman commented 6 years ago

@llemeurfr,

I am trying to see what you propose (putting aside how this should be edited into the document).

1. We use the schema.org inLanguage term as defined. This means it defines the language of the publication (which is the "subject" of the manifest's statements), and we consider this as the (default) language of the manifest's textual terms as well.
2. Using the JSON-LD facilities (hoping that, eventually, schema.org processors will accept these) it is possible to set a specific language for an individual textual manifest item.
3. It is not possible to set the directionality (neither globally nor locally) of a textual manifest item except through the UTF direction marks.
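
The language rules summarized in this list can be sketched as a small Python resolver. It assumes that `inLanguage` is either a single value or an array whose first entry is the default (as suggested earlier in the thread), and that an individual value may override it with a JSON-LD `@language`. The function names are hypothetical.

```python
def default_language(manifest: dict):
    """Return the first inLanguage value (single value or array), or None if absent."""
    value = manifest.get("inLanguage")
    if isinstance(value, list):
        return value[0] if value else None
    return value


def effective_language(manifest: dict, value):
    """Language of one textual value: its own @language if present, else the default."""
    if isinstance(value, dict) and "@language" in value:
        return value["@language"]
    return default_language(manifest)


manifest = {
    "@context": "http://schema.org",
    "@type": "Book",
    "inLanguage": ["ja", "en"],
    "author": {
        "@type": "Person",
        "name": "村上 春樹",
        "alternateName": {"@value": "Haruki Murakami", "@language": "en"},
    },
}

author = manifest["author"]
print(effective_language(manifest, author["name"]))           # ja
print(effective_language(manifest, author["alternateName"]))  # en
```

Direction is deliberately absent from this sketch: under the rules above it is carried only by the characters of the value itself.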

An alternative to (3) is that we do introduce our own term for direction for setting the global base textual direction for the publication (going in pair with inLanguage and hoping that, at some point, this will become a bona fide schema.org term). Which means that it remains impossible to set the directionality of an individual text item, but at least we have something as a global value.

Does this reflect your proposal? If so, we do have two fairly distinct proposals to (finally) close this issue: this one, and the one I described in https://github.com/w3c/wpub/issues/220#issuecomment-405898657

llemeurfr commented 6 years ago

@iherman, items 1,2 and 3 reflect my position, yes (thank you for pointing at inLanguage).

Re. the alternative to 3 you're proposing, my issue is that I don't know what a direction property would be used for. Not for categorizing publications, not for displaying property values ...

iherman commented 6 years ago

@llemeurfr

It is the same as the dir attribute in HTML. User agents may choose to put the table of contents popup on the right of the screen instead of the left as customary.

JayPanoz commented 6 years ago

... a very good editorial note to add to the spec of this language property.

Indeed, because in EPUB-land, some people assume that you only have to set the one in the manifest and you’re good to go. And resources are then missing xml:lang or lang, and some reading systems then use the manifest’s as a fallback and append the attributes because TTS but also default fonts, hyphenation, some CSS props like text-transform, how to break lines, etc. all depend on the language of the resource…
