w3c / rdf-dir-literal

Proposal to add base direction to RDF Literals

Do we need directional metadata in the first place? #7

Open · iherman opened this issue 5 years ago

iherman commented 5 years ago

(Carried over from https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-496994496 into its own issue)

Side thoughts not expressed in the call. There was some question about whether we need direction at all. I think it is helpful to summarize the use cases. I think these might be:

  1. Strings with NO language and NO direction metadata.
    • Best Practice: display in an isolating context; use "first strong" semantics
    • NOTE: this includes data strings such as MAC addresses, ISBNs, etc.
  2. Strings with language metadata but NO direction metadata.
    • Best Practice: display in an isolating context; estimate direction from the language tag.
  3. Strings with BOTH language and direction metadata.
    • Best Practice: display in an isolating context; use direction metadata for base direction.
  4. Strings with direction metadata and NO language metadata (or an indeterminate language such as und)
    • Best Practice: display in an isolating context; use direction metadata

3 and 4 are unsolved problems in the Linked Data space and what (I think) we're discussing here. A key problem is that language estimation, particularly for short strings, is difficult and relies on flawed heuristics or on contextual data. There exist many contexts in which the base direction the customer experienced can be determined but where language metadata is harder to obtain. This is particularly true of UGC on the Web.

Would it help to write realistic user scenarios?

Originally posted by @aphillips in https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-496994496

r12a commented 5 years ago

There was some question about whether we need direction at all.

I think the question was more along the lines of: Do we need to specify direction separately at all, if we can infer the direction from the language?

r12a commented 5 years ago

[reply from aphillips]

@r12a Yes, and my point is that there are cases where we either do not have the language metadata to infer from (4), or where we have separate direction data (3) and would lose information if we don't have a separate slot to transmit the direction metadata.

iherman commented 5 years ago

@aphillips,

The question for both (3) and (4): isn't it the case that the author of the string CAN express that metadata, if needed, via existing BCP47 tags, albeit maybe in a more complex way through additional script references (ar-Arab)?

aphillips commented 5 years ago

@iherman The "author" of most strings is a machine--a "producer" in string-meta's parlance. The producer has a string with associated language and direction metadata. The question is how to transmit the string and its associated data such that the eventual consumer can display and process it properly--and bearing in mind that interstitial processes may be transforming or storing the data.

For example, consider a Web form in which the user enters a bit of text. The user-agent can't reliably determine the language of the string (but might infer it from the language of the page), but it must estimate the base direction (and most user-agents offer features for RTL users to control the base direction). A Web form can send the estimated/user-set base direction in a field named with the dirname attribute.

Suppose in a form the dirname of a field foo is mydir. I might send this to the server as:

http://example.org?foo=some%20string&mydir=ltr

I want to serialize this into JSON, where it might become:

{
   "foo": "some string",
   "@language": "x-dir-ltr"
}

... because I don't have language metadata (although I might estimate en here). Does that make sense?
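For concreteness, here is a minimal sketch (Python assumed; the field names foo and mydir come from the example above, and the "@language": "x-dir-ltr" value is the private-use workaround being discussed, not an established mechanism) of how a receiving server might carry the form's direction field into that JSON:

from urllib.parse import urlparse, parse_qs

def form_to_json(url):
    params = parse_qs(urlparse(url).query)
    value = params["foo"][0]         # the text the user typed
    direction = params["mydir"][0]   # "ltr" or "rtl", as sent via dirname
    return {
        "foo": value,
        "@language": "x-dir-" + direction,   # no real language metadata is available
    }

form_to_json("http://example.org?foo=some%20string&mydir=ltr")
# -> {'foo': 'some string', '@language': 'x-dir-ltr'}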

iherman commented 5 years ago

The problem I have is that we are receiving conflicting messages here, to be honest.

One message we are getting (see also the last part of our meeting minutes) is that, in theory, BCP47 can express the overall base direction of a piece of text if the pure language information is extended with other fields that are already part of BCP47. An example I heard from @r12a is to use az-Arab or az-Cyrl for variants of Azeri which would also express whether the text is right-to-left or left-to-right, respectively, whilst the language is identical. Which also means that the canonical example we used could be expressed as:

    <p lang="he-Hebr">פעילות הבינאום, W3C</p>

If this is correct then it fundamentally re-shapes our discussion, so it would be good if we had an agreement that this is indeed correct. Indeed, in that case, I think we should put aside the whole of the core document of this repo: a general agreement may be to say that there is no need to change anything in JSON-LD/Turtle/RDF, because the possibility to use BCP47 covers the problem as a whole (setting aside the fact that we are not dealing with the really complex cases described by @r12a in https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-494868090). I.e., this is not a 5th option in our document: the document is moot in the first place!

iherman commented 5 years ago

@aphillips,

I understand the issue you are raising in https://github.com/w3c/rdf-dir-literal/issues/7#issuecomment-497009046, and I am commenting with the assumption that the answer to my question in https://github.com/w3c/rdf-dir-literal/issues/7#issuecomment-497272338 is 'yes, the document may be unnecessary'.

Yes, the issue of full data exchange between an essentially HTML environment and the data world (represented, in this case, by RDF) is real, but it may not be the typical use case for RDF/JSON-LD data. Such data mostly originates from various types of databases, CSV data, etc., where there is no established way of representing direction in the first place, or is curated by hand or by dedicated scripts. These can very well be instructed to add a script subtag to the language tag in case of doubt. E.g., consider the schema.org data for a Publication Manifest: the typical example is the title of a book. The publisher should know, in any case, whether the title is potentially problematic because it is in Arabic with some Latin text in it, and should be instructed to do something anyway. Whether this means setting a direction tag separately or adding -Arab to the language tag does not make much difference.

So yes, there are some corner cases, like the one you describe. I do not know whether such corner cases are more frequent than the "The country names include مصر, البحرين, ישראל and others" example of @r12a that we do not handle anyway, and for which (probably) the only acceptable way is to allow for the inclusion of HTML (and RDF/JSON-LD is already prepared for that). Does this justify going through any of the (clearly problematic and controversial) roads outlined in the document?

aphillips commented 5 years ago

@iherman No, I'm still in disagreement. Full data exchange doesn't just mean the HTML environment. I used that as a useful analogy and a realistic example.

Amazon has tens of thousands of ASINs that are in Arabic but start with a strong left-to-right character sequence (such as a brand name). While I do have language data for these strings, I also have base direction. I need to be able to set (with low latency) the dir attribute in HTML or the direction of native Android and iOS controls and I'd prefer to do it using the metadata I've collected--not by introspecting the language tag.

Is it possible to guess the base direction of a string from a (hopefully accurate) language tag? Yes, of course. That's why we suggest it as a fallback strategy (including my (2)).

Mutating the language tag I consider to be an abomination (and unnecessary in any event). In most cases the script subtag is "suppressed" in BCP47 on purpose. Adding the script subtag means that the implementation is changing my data. Linked data specifications should, in my opinion, go out of their way not to mess with my data but to transmit it with fidelity. It turns out that language tags can be used to determine the base direction without adding the script subtag in the majority of cases---but again assuming that the tag is accurate, correct, and complete! The Azerbaijani example is instructive: if I have the tag az I don't really know the base direction and I'm back to first-strong or string introspection.

Management of direction in strings such as the "The country names..." example you give above is out of scope for this discussion and irrelevant. You don't need HTML for these: the author can use Unicode controls to get the right outcome (and probably should in a string-based environment).

What I'm trying to get at is that there are systems with large data sets containing mixed-direction data. In many such cases, first-strong gets the wrong result, language information may or may not be complete, and yet direction metadata is available. Across the breadth of Web specifications, we have consistently treated language and direction as separate data (dating back many years). This information is usually consistent (i.e. lang=ur (Urdu) generally goes with dir=rtl) when both are present. But it is also often the case that getting or computing accurate language metadata is difficult. So I don't want to foreclose the effort to provide a means of storing and communicating direction metadata simply because it is inconvenient to contemplate.

iherman commented 5 years ago

@aphillips o.k., I understand your position.

Trying to move forward, though, it would be good to have some sort of a general feeling (from all of us) about which of the ~three~ ~four~ five options has more pros than cons in your view, i.e.,

  1. Change the core of RDF
  2. Define a new RDF datatype
  3. Extend BCP47 via -d-*
  4. Use the current BCP47 mechanism via -x-d-*
  5. Do nothing, i.e., rely on today's BCP47 entirely

At this moment I do not see any other possible solution; i.e., these are our options as far as I can see.


My personal list, from most favorite to least favorite, is: 2-5-4-1-3

r12a commented 5 years ago

[from Ivan (incorrectly deleted)]

(Admin question) @aphillips, would it help if I made an update of the document to spell out these 5 options? I am happy to do that tomorrow.

r12a commented 5 years ago

Which also means that the canonical example we used could be expressed as: <p lang="he-Hebr">פעילות הבינאום, W3C</p> If this is correct

That's not correct if you're using it for HTML. It should be:

<p lang="he" dir="rtl">פעילות הבינאום, W3C</p>

r12a commented 5 years ago

At the end of our post-telecon telecon i said we need to work through the details of what we had been discussing, viz. "Do we need to specify direction separately at all, if we can infer the direction from the language?". I've been doing that, and to capture my thoughts I began to draft text in a wiki page at https://github.com/w3c/rdf-dir-literal/wiki/Draft-ideas-related-to-string-metadata-storage-options

In those thoughts i also tried to incorporate Addison's comments above.

duerst commented 5 years ago

Copied from https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-497625233, because discussion has been redirected here.

I think the list below is a very good start. It's mostly written from the viewpoint of a consumer (data is arriving with/without additional information; how to treat it). But we should also look at things from a producer side.

(Carried over from #3 (comment) into its own issue)

Side thoughts not expressed in the call. There was some question about whether we need direction at all. I think it is helpful to summarize the use cases. I think these might be:

  1. Strings with NO language and NO direction metadata.
  • Best Practice: display in an isolating context; use "first strong" semantics
  • NOTE: this includes data strings such as MAC addresses, ISBNs, etc.

"display in an isolated context" is good as a default, but in many cases may not be needed. If the string contains only letters of a single directionality (e.g. all Latin, all Arabic,...), then isolation shouldn't be needed (assuming the outside context knows what it wants).

  2. Strings with language metadata but NO direction metadata.
  • Best Practice: display in an isolating context; estimate direction from the language tag.

This is probably the point where we have to be most careful. It assumes that the language tag is correct. On the Web, I'm not sure what percentage of language tagging is correct; my recollection is that it's lower for "en" or "en-us" than for other languages, because "en" and "en-us" are often used in templates. That, together with the original argument that there are many, many languages that are RTL (not only Hebrew and Arabic), made sure this was never the way that HTML worked.

It would be good to know the track record for language tagging correctness in RDF and related technologies. That may help give a hint as to which way to go.

Again, same as for 1), if it's a "directionally homogeneous" string, probably no need for isolation. That actually applies to the next two items, too.

  3. Strings with BOTH language and direction metadata.
  • Best Practice: display in an isolating context; use direction metadata for base direction.
  4. Strings with direction metadata and NO language metadata (or an indeterminate language such as und)
  • Best Practice: display in an isolating context; use direction metadata

3 and 4 are unsolved problems in the Linked Data space and what (I think) we're discussing here. A key problem is that language estimation, particularly for short strings, is difficult and relies on flawed heuristics or on contextual data. There exist many contexts in which the base direction the customer experienced can be determined but where language metadata is harder to obtain. This is particularly true of UGC on the Web.

Yes indeed. It's true of UGC, and it's also true of a large percentage of short label strings that I think make up a large percentage of the strings in RDF and related technologies. That's because the shorter a string, the higher the chance that it's directionally uniform.

Would it help to write realistic user scenarios?

I think we would benefit from input to such scenarios from the RDF community. I think such scenarios should range from input through storage/transport/processing to output.

Originally posted by @aphillips in #3 (comment)

r12a commented 5 years ago

"display in an isolated context" is good as a default, but in many cases may not be needed. If the string contains only letters of a single directionality (e.g. all Latin, all Arabic,...), then isolation shouldn't be needed (assuming the outside context knows what it wants).

While that may sometimes be true, i think you'll find that you do commonly need isolation when inserting strings into other content. For example, insert 3 single-word strings into a paragraph of HTML, such as "The country names include مصر, البحرين, ישראל and others.", and if you don't isolate each one the order of the items in the list will be incorrect. Also, if you drop a string into the pattern "Restaurant: ישראל 5 reviews", you need to isolate the Hebrew name from the number so that the number appears just before 'reviews' as expected. That's why we recommend always isolating strings when inserting them - it causes no harm if not needed.
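As a rough sketch (Python assumed), this is the kind of isolation meant here, using the Unicode isolate controls (the markup-level equivalents being dir="auto" or <bdi>):

FSI, PDI = "\u2068", "\u2069"   # FIRST STRONG ISOLATE / POP DIRECTIONAL ISOLATE

def isolate(s):
    return FSI + s + PDI

name = "ישראל"
"Restaurant: " + name + " 5 reviews"            # unisolated: the '5' can end up on the wrong side of the name
"Restaurant: " + isolate(name) + " 5 reviews"   # isolated: the number stays next to 'reviews'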

I think we would benefit from input to such scenarios from the RDF community. I think such scenarios should range from input through storage/transport/processing to output.

I attempted something along those lines at https://github.com/w3c/rdf-dir-literal/wiki/Draft-ideas-related-to-string-metadata-storage-options. That wiki page also mentions a scenario where the language stored may be incorrect, though not for the same reason as you mention.

iherman commented 5 years ago

@duerst

It would be good to know the track record for language tagging correctness in RDF and related technologies. That may help give a hint as to which way to go.

I am not sure what you mean by "language tagging correctness in RDF". RDF regards language tags as black boxes, so to say: it does not consider, check, or interpret their semantics (that is true for RDF as well as for OWL 2, including the 'direct' semantics). See, e.g., the definition of literal equality, which simply compares language tags verbatim (in the value space, language tags are all lowercase).

RDF based applications may or may not do more than that, but that would probably be much more difficult to check...

r12a commented 5 years ago

I think Martin is asking how often people label RDF with the wrong language.

[There was a time, some years back, when people didn't understand the lang attribute in HTML so well, and editing tools automatically inserted lang="en" at the top of the page. My experience indicates that is no longer anywhere near the problem it used to be: partly because people are better educated, and partly because language tagging in HTML has much more of a useful effect on content than it used to (eg. for hyphenation, for font assignment, for spellchecking, for line-breaking in some languages, for case conversion, etc.)]

duerst commented 5 years ago

I think Martin is asking how often people label RDF with the wrong language.

Exactly.

iherman commented 5 years ago

@r12a @duerst

I understand. And I really do not know how to get this information, to be honest.

I picked up, a bit randomly, a page from dbpedia which gives a bunch of triples. The reason this is representative is that it is a query into dbpedia, which contains all the statements of all the Wikipedias in various languages. The query correctly adds the language tags wherever relevant, because it comes from a well-identified linguistic community. But it may not be representative because most of the data on the linked data cloud (of which dbpedia is one of the main hubs) comes from various databases, sites like Wikipedia, etc., and it just takes what it gets in terms of language tags. I would expect the extraction processes to treat language tags as immutable strings.

macchiati commented 5 years ago

At the end of our post-telecon telecon i said we need to work through the details of what we had been discussing, viz. "Do we need to specify direction separately at all, if we can infer the direction from the language?". I've been doing that, and to capture my thoughts I began to draft text in a wiki page at https://github.com/w3c/rdf-dir-literal/wiki/Draft-ideas-related-to-string-metadata-storage-options

In those thoughts i also tried to incorporate Addison's comments above.

I couldn't add notes to that page.

My view is that from the consumer side, the best practices to follow are:

  1. If there is language information, use that to get the base direction.
    a. If there is an explicit script*, use it.
    b. Otherwise, infer it from the language**.
  2. If there is no language information, use the standard BIDI algorithm (first strong) to get the base direction.

Notes:

* It is possible that more would be added, but extremely likely that they'd be obsolete.

** The languages have a long tail. Simplest would be to have a fixed list (which could be smaller than the following), and everything else would use an explicit script. [aeb, ar, arq, ary, arz, az-IQ, az-IR, bal, bej, bgn, bqi, brh, ckb, dcc, doi, fa, glk, ha-CM, ha-SD, haz, he, hno, kk-AF, kk-CN, kk-IR, kk-MN, ks, ku-LB, ky-CN, lah, lrc, mfa, ms-CC, ms-ID, mzn, pa-PK, ps, rmt, sd, sdh, skr, tg-PK, ug, ur, uz-AF, yi]

Once the consumer BPs are specified, then the producer knows what the degrees of freedom are, and can decide how best to tag in order to convey both the language and base direction.
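As a rough sketch of these consumer-side best practices (Python assumed; the RTL script and language sets below are abbreviated, illustrative subsets rather than normative lists):

import unicodedata

RTL_SCRIPTS = {"arab", "hebr", "syrc", "thaa", "nkoo", "adlm"}   # illustrative subset of RTL script subtags
RTL_LANGS = {"ar", "fa", "he", "ps", "sd", "ug", "ur", "yi"}     # subset of the list above (region-qualified entries omitted)

def first_strong(text):
    # BP 2: standard first-strong heuristic (roughly UBA rules P2/P3)
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"   # no strong character at all

def base_direction(lang_tag, text):
    if lang_tag:
        subtags = lang_tag.lower().split("-")
        for st in subtags[1:]:
            if len(st) == 4 and st.isalpha():                 # BP 1a: explicit script subtag
                return "rtl" if st in RTL_SCRIPTS else "ltr"
        return "rtl" if subtags[0] in RTL_LANGS else "ltr"    # BP 1b: infer from the language
    return first_strong(text)                                 # BP 2: no language information

base_direction("az-Arab", "...")   # -> 'rtl' (explicit script)
base_direction("az", "...")        # -> 'ltr' (az alone is not on the short list)
base_direction(None, "مرحبا")      # -> 'rtl' (first-strong fallback)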

r12a commented 5 years ago

It is possible that more would be added, but extremely likely that they'd be obsolete.

By 'obsolete' do you mean 'archaic'? (I don't think we can ignore archaic languages.)

The languages have a long tail. Simplest would be to have a fixed list (which could be smaller than the following), and everything else would use an explicit script .... Once the consumer BPs are specified, then the producer knows what the degrees of freedom are, and can decide how best to tag in order to convey both the language and base direction.

So here we are introducing format-specific rules for use of BCP47. Although it's theoretically possible for language-detection algorithms to help (if the string text is long enough and if algorithms exist for the language in question), i think that in reality producers will in a large percentage of cases ultimately rely on people to provide metadata somewhere in the production process. The implication of that is that a person producing string text would (1) need to be aware that it is going to be used by RDF or similar technology, which has special rules for use of BCP47 tags, and (2) would need to know what languages are on the list that an application knows about and what are not (so that they can add script subtags where needed).

And btw, the list appears to be missing dv (Dhivehi), which is for me an important language, and is one for which there's a suppress-script rule in BCP47. Also missing is aii (Assyrian Neo-Aramaic). I suspect i could find others fairly easily. For example, just ar as a macrolanguage subtag encompasses the following primary language tags: aao abh abv acm acq acw acx acy adf aeb aec afb ajp apc apd arb arq ars ary arz auz avl ayh ayl ayn ayp bbz pga shu ssh, very few of which appear in the list above.

These things make me feel uneasy about relying on this approach.

r12a commented 5 years ago

btw, fwiw, anyone can get more information about the languages in Mark's list by following this link: https://r12a.github.io/app-subtags/?lookup=aeb,ar,arq,ary,arz,az-IQ,az-IR,bal,bej,bgn,bqi,brh,ckb,dcc,doi,fa,glk,ha-CM,ha-SD,haz,he,hno,kk-AF,kk-CN,kk-IR,kk-MN,ks,ku-LB,ky-CN,lah,lrc,mfa,ms-CC,ms-ID,mzn,pa-PK,ps,rmt,sd,sdh,skr,tg-PK,ug,ur,uz-AF,yi

And, fwiw, here's my own list of RTL scripts: http://r12a.github.io/blog/201512.html#20160825

iherman commented 5 years ago

would need to know what languages are on the list that an application knows about and what are not (so that they can add script subtags where needed).

Is this a new problem? If I add a string to a syntax that has the notion of language tag, and that language is Azeri, then isn't it true that I do have to add the script subtag if that string is supposed to be displayed (regardless of the direction issue)?

macchiati commented 5 years ago

It is possible that more would be added, but extremely likely that they'd be obsolete.

By 'obsolete' do you mean 'archaic'? (I don't think we can ignore archaic languages.)

Remember, I was talking about obsolete scripts, not languages.

The languages have a long tail. Simplest would be to have a fixed list (which could be smaller than the following), and everything else would use an explicit script .... Once the consumer BPs are specified, then the producer knows what the degrees of freedom are, and can decide how best to tag in order to convey both the language and base direction.

So here we are introducing format-specific rules for use of BCP47. Although it's theoretically possible for language-detection algorithms to help (if the string text is long enough and if algorithms exist for the language in question), i think that in reality producers will in a large percentage of cases ultimately rely on people to provide metadata somewhere in the production process. The implication of that is that a person producing string text would (1) need to be aware that it is going to be used by RDF or similar technology, which has special rules for use of BCP47 tags, and (2) would need to know what languages are on the list that an application knows about and what are not (so that they can add script subtags where needed).

It would not currently be feasible to identify, for all of the 7K+ languages, which ones are RTL. SIL does have more data mapping languages to scripts, but that data is not open.

Luckily, it is not necessary to identify all such languages. Any time a producer is unsure whether a language would be recognized as RTL by consumers, the fix is to add an explicit script.

Actually, the cleanest way to define an algorithm would be to have a fixed list of languages whose script is RTL. Consumers (once upgraded) would support that list. Producers would know that anything outside of that list is not guaranteed to have mappings, and would include explicit scripts where it mattered.

Remember again, what we are talking about is improving the rendering of BIDI text in a limited set of circumstances:

The reason I say improving is that the base direction alone is not sufficient for the most complex BIDI; that would require interior embeddings and so forth.

So if you wanted to use this mechanism for conveying the base direction of a language using a RTL script, but one that had a smaller speaker count or even was historic, you could still do it.

And btw, the list appears to be missing dv (Dhivehi), which is for me an important language, and is one for which there's a suppress-script rule in BCP47. Also missing is aii (Assyrian Neo-Aramaic). I suspect i could find others fairly easily. For example, just ar as a macrolanguage subtag encompasses the following primary language tags: aao abh abv acm acq acw acx acy adf aeb aec afb ajp apc apd arb arq ars ary arz auz avl ayh ayl ayn ayp bbz pga shu ssh, very few of which appear in the list above.

CLDR doesn't have data for all RTL languages. (We could talk to SIL to see if they would be willing to share their larger list.)

Moreover, to keep the list shorter (because some had concerns about the complexity) I also filtered out languages with smaller speaker counts.

These things make me feel uneasy about relying on this approach.

As I said, whenever the producer has doubts, there is always the option of including the explicit script.

r12a commented 5 years ago

What follows is my attempt to summarise the discussion in today's telecon. It's quite possible that i am misrepresenting some things, in which case i hope that other attendees will correct me.


Most of the time, string-level base direction can be detected correctly simply by using first-strong heuristics. This is a relatively efficient approach, and one that is in widespread use.

Such heuristics, however, sometimes fail, and it is our goal here to identify and fix those failures.

We don’t need to specify direction for every string, only those that would otherwise fail when heuristics are used. (This is unlike language metadata: every string should be associated with a language, either by inheritance or explicitly.)

Essentially we need a way to provide a hint for potentially problematic strings that indicates the appropriate base direction. These cases tend to be strings that have mixed script content where the ‘wrong’ script begins the string, and strings with no strong characters at all (such as telephone numbers).
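A small illustration of those failure cases (Python assumed; first_strong_class is just a helper name for this sketch):

import unicodedata

def first_strong_class(text):
    # bidi class of the first strong character, or None if there is none
    return next((unicodedata.bidirectional(ch) for ch in text
                 if unicodedata.bidirectional(ch) in ("L", "R", "AL")), None)

first_strong_class("HTML و CSS: تصميم و إنشاء مواقع الويب")   # 'L': first-strong guesses LTR, though the intended base is RTL
first_strong_class("+971 4 123 4567")                          # None: no strong character at all (a telephone number)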

Much of the recent discussion has focused on whether it’s possible to express direction as part of the language information provided for strings (thereby avoiding the need to extend the RDF/JSON-LD format).

Normally, script subtags are not expected as part of a language tag unless they are specifically needed to distinguish some feature. Relying on inference of base direction from ordinary BCP47 language tags (without script tags) is problematic for a couple of reasons. First, the number of languages that may be written with RTL base direction is large, and difficult to bound. This makes it difficult to establish reliable rules for inference from language to base direction unless script subtags are used. Secondly, we expect language metadata to be provided with every string: this effectively rules out the use of first-strong heuristics (since metadata has to always trump heuristics), and instead requires that every language tag be parsed and checked to determine whether the consumer must infer base direction for the string (which is much more expensive).

Various ways to circumvent the problems of mappings between language and direction have been proposed. These include use of script subtags, use of -d- extensions to BCP47, and use of BCP47's -x- private use tags. In reality, these approaches remove the reliance on mapping from simple language tags by simply appending (or in the case of script tags, integrating) directional metadata to the language tag.

We disagree with the use of the -d- extension, because as a general extension to BCP47 it would cause problematic spillover effects for other formats that would be undesirable.

The use of private use tags after -x- is a slightly better solution, partly because it is clearly signalled that this is a private usage in the context of RDF/JSON-LD, and implies a temporary solution. Ideally, this -x- extension would only be applied to those strings where we need to override the base direction that would otherwise be indicated by first-strong heuristics. However, the consumer would still need to check the language tag of every string for the existence of -x- before deciding on whether or not to use heuristics or metadata to determine the base direction.

Relying on script subtags requires a similar approach, ie. the consumer would need to scan each language tag for script subtags, and then map the script found to direction. In some cases, eg. az-Arab, the script tag may already be part of the language information rather than explicitly inserted to indicate direction. The previously mentioned difficulties in mapping ordinary language tags to scripts/direction mean that (possibly with the exception of a short list of specific tags, assuming those are specified in a way that leads to interoperable implementations) producers would need to apply script subtags to their language tags. This essentially applies a code of practice that is specific to RDF/JSON-LD and which breaks the normal guidelines for use of subtags in BCP47. Also, the cost of determining the base direction from the language tag is higher in this case than for the previous approaches. These considerations significantly reduce the appeal of using script subtags to encode direction information for a string.

With any of these approaches, there are also other issues where language tags have simply not been provided (especially when direction information is available), where language tags are incorrect, or where the string data is not in a specific language.

Taking all of the above considerations into account, the group on today’s telecon preferred to establish an end goal of making it possible to provide separate direction metadata for strings, where needed to override first strong heuristics.

An aspect that didn’t lead to a firm conclusion was whether to, in the meantime, adopt a disposable workaround, specifically use of -x-, to provide a near term mechanism for overriding the heuristics which would eventually be superseded by the separate metadata.

Ivan made the point that much of the data used by Web Publishing and Verifiable Claims is likely to be produced by humans entering string data directly into the data format described by the specs. Another possibility that arises in that case is for the content author to use RLM/LRM at start of string where the default behaviour of the first-strong heuristics needs to be overridden. This would provide the needed behaviour, efficiently, without the need to use -x-. (This was not discussed in the call.)

iherman commented 5 years ago

Another possibility that arises in that case is for the content author to use RLM/LRM at start of string where the default behaviour of the first-strong heuristics needs to be overridden.

True, but that requires some extra steps; besides, the RLM/LRM approach has its own difficulties (e.g., equality of strings with and without those extra, though invisible characters).
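For example (Python assumed), an RLM-prefixed string is no longer equal to the bare string, even though the two render identically:

RLM = "\u200F"                  # RIGHT-TO-LEFT MARK, invisible when rendered
(RLM + "W3C") == "W3C"          # False
len(RLM + "W3C"), len("W3C")    # (4, 3)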

an end goal of making it possible to provide separate direction metadata for strings, where needed to override first strong heuristics.

To make it a little bit more precise: the goal is to extend the current RDF syntax and semantics to integrate this possibility (as opposed to a separate, new datatype on top of RDF). I.e., the group would prefer the 1st option rather than the 2nd.

dlongley commented 5 years ago

I've been watching the discussion over how much work will need to be done to add dir to RDF and how many different specs will need to be touched to address this problem. It sounds like a lot of effort -- and while it seems that you all came to a decision that it was the theoretically best way to go on the last call (sorry, I couldn't attend) it also sounds like there's now a lot of nervousness around the amount of work that would need to be done.

All that being said, just reading this thread still doesn't cause me to think that it's the "theoretically best" path forward to add dir to RDF. I'd like to get some clarifications to figure out where I'm going wrong.

@r12a said:

Normally, script subtags are not expected as part of a language tag unless they are specifically needed to distinguish some feature. Relying on inference of base direction from ordinary BCP47 language tags (without script tags) is problematic for a couple of reasons. First, the number of languages that may be written with RTL base direction is large, and difficult to bound.

How "large" is large? You also said this:

And, fwiw, here's my own list of RTL scripts: http://r12a.github.io/blog/201512.html#20160825

Which says that there are:

6 in modern use, 3 in modern limited use, 17 that are archaic

There are a total of 26 (in your list). This does not strike me as a "large" data set in the context of modern computing. So under what context should I be evaluating the meaning of "large"?

You also said that they are "difficult to bound". How is that possible? The number of languages is bounded to something like 7,111 (of which 3,995 are written) per the first google result I get. That is a very small data set for modern computing systems to handle, especially if you're going to turn it into a binary tree for look ups. But we're not even talking about that number, but rather the number that are RTL -- which looks to be at least an order of magnitude smaller, and two per your list.

@aphillips said:

Amazon has tens of thousands of ASINs that are in Arabic but start with a strong left-to-right character sequence (such as a brand name). While I do have language data for these strings, I also have base direction. I need to be able to set (with low latency) the dir attribute in HTML or the direction of native Android and iOS controls and I'd prefer to do it using the metadata I've collected--not by introspecting the language tag.

You mean Amazon Standard Identification Number (also), right? Doesn't your application know these are ASINs and can always render them properly regardless? If not, shouldn't the data be transformed so the application can do so? It sounds like maybe there's a data cleaning issue here?

Also, your preference to use the meta data you've collected is simply because you already have it, right? I can understand that: why do extra work when you don't have to? However, it seems to me that doing the extra work should not cause latency issues given how small the data set is and how simple the parsing is. So I'd like to throw that out as a red herring. Is that fair or not?

@r12a said:

Secondly, we expect language metadata to be provided with every string: this effectively rules out the use of first-strong heuristics (since metadata has to always trump heuristics), and instead requires that every language tag be parsed and checked to determine whether the consumer must infer base direction for the string (which is much more expensive).

This sounds like either:

  1. An argument against parsing language tags entirely and instead using dir everywhere. And it seems like the parsing/lookup performance difference would be negligible for modern systems. Why is it wrong to think that?

  2. An argument for setting defaults wherever possible. Ok -- that's fine, but why would it matter if that default were a language tag or a direction? It seems like we're trying to bake a special optimization into the data when the information was otherwise already there. And, of course, when deviating from the default you'd still have to specify something that would have to be parsed and appropriately processed.

So, how expensive are we talking? For what systems? What kind of restricted hardware are we talking about where this would be an issue in the modern age? Is there really a performance issue for parsing hyphenated strings and doing a lookup in a tree of 100 items? I feel that I'm missing something as a non-expert here. But with that perspective it sounds like performance/latency is a non-issue here. But I don't have any performance numbers (nor have any been provided here by anyone else) to show that there's an important difference.

To me, the only viable argument here seems to be: "I already have independent direction meta data, why can't you just preserve that for me so I can use it?" If that's the real argument, it's fine, but I'd like us to say so and stick with it so we can focus and better evaluate the trade offs for various solutions.

We disagree with the use of the -d- extension, because as a general extension to BCP47 it would cause problematic spillover effects for other formats that would be undesirable.

First, it sounds like the need for -d- would be very rare as existing tags/subtags provide that information already. I'd actually like to see the list of language tags for languages that can be either RTL or LTR that would need this extension because there's no other way to carry it in the language tag (maybe a new subtag would be sufficient for these cases?). The intent of adding that extension was never to "smuggle" direction information that is in conflict with a language (i.e., ar-d-ltr), rather it was to provide a way to indicate direction because the base language tag was insufficient and there was no subtag that would work either. Maybe this isn't even a thing, I don't know. But hearing about this "Arabic" ASIN use case just leaves me confused ... is the goal for "Arabic" ASINs to be marked as Arabic even though they aren't? Or to leave them "as is" even though that's incorrect? What is correct? It sounds like they should be a non-language string to me.

Second, I keep hearing about these undesirable spillover effects but they seem ill-defined to me. In fact, it sounds to me like "Arabic" ASINs are an undesirable spillover effect.

Anyway, what I've heard is that there's a general feeling that -d- would spill into HTML and it would cause headaches. But HTML already has dir and a previously established order of precedence that puts dir first. It seems to me that HTML processors could ignore -d- unless/until the i18n community put pressure on HTML processor implementers to make changes to consider it. Otherwise, we'd just have the status quo for syntaxes like HTML that already have dir. Application authors that understand -d- can parse that information and set it to dir in HTML. Why is this not a valid line of reasoning? Of course, at this point, whether or not I'd support -d- depends on the answers to the above questions.

I'm no stranger to the concept of technical debt. In fact, that's the main reason I'm so concerned about adding dir to RDF. It sounds like we're repeating a mistake that HTML made and compounding it across many other syntaxes. I'm also one of the implementers that would be asked to support this stuff.

I wonder, if dir had never been added to HTML would we still think that it should be added to RDF? From my perspective, we're currently discussing the "undesirable spillover effects" of that decision.

So I've now heard that the main data sets we want to add support for are ones where dir is set but is somehow in conflict with language tag meta data, either because the language tag is missing, it is incorrect, or it's present when it shouldn't be. This sounds to me like a data cleaning/corruption issue -- and one that would not exist, or would take on a different form, had dir never been added to HTML. All of these things are signals to me that dir was a mistake and that adding it elsewhere would be a proliferation of that mistake. Why is that wrong?

Furthermore, if we're talking about writing a function that will parse a hyphenated string and check a small lookup table for a direction vs. updating RDF and all of its syntaxes and implementations ... one of these seems to be obviously more practical. Again, this could be a strawman argument and for that I apologize, it's not my intent. I am really just trying to get on the same page here.

iherman commented 5 years ago

@dlongley

I am certainly not more of an expert than you are in internationalization, but let me chime in anyway. I'll let the experts react on the technical details...

From a technical point of view I have sympathy with the argument that it may make sense to use a -d-* extension to BCP; after all, the current BCP mixes different concepts already. If I say zh, that refers to the language as a general, "social" term (let us put aside the fact that there are many versions of Chinese, so "zh" is actually not precise), which may determine, for example, how a text-to-speech engine works, what dictionary to offer to the reader, or how the user interface should look in order to abide by cultural practices. Adding -Hans or -Hant to the zh tag brings in a very different notion, namely the fact that the same language should be "rendered" in simplified, resp. traditional Chinese characters. It is a different type of information added to the pure language tag and, in this sense, I indeed would not be shocked (as a layperson, of course) to have something like -d-* for yet another type of metadata added to the pure language information if and when it is strictly necessary. (Except, of course, that a string may not really have a language in the first place (for an acronym, for example) but may have a base direction for displaying the string, and using und-d-rtl does look a bit awkward...)

Also, would it be o.k. to define HTML without @dir but make use of a BCP tag with -d-*? Maybe. Although we should not forget that the @dir value is not only used for a single string (like we have in RDF) but for a string embedded in another string embedded in yet another string… a use case that we have ruled out from the RDF issue but is very much at the center of concerns in HTML and where @dir values without a language are probably much more frequent. But o.k, that can be handled by und-d-rtl.

However. My fundamental point is: both of these boats have already sailed a long time ago. BCP is out and direction is not part of it; HTML has been defined with a separate @dir. Both of these standards are fundamental, have a huge deployment out there and, frankly, RDF, JSON-LD, Turtle, etc., are all small fishes compared to those. You referred to the difficulties we face if we try to modify RDF by adopting a base direction: the changes are actually tiny, as you can see from the separate document. Nevertheless, it is like the oft-quoted butterfly whose flapping wings yield a storm somewhere else: following the changes through creates a significant amount of work, as you can see in the charter draft. Well, modifying HTML or BCP would create tornadoes and not only light storms! Though each individual modification is tiny by today's standards (and you have outlined some yourself), both BCP and HTML are just, well, everywhere. We would have to modify CSS, the DOM, EPUB, just to cite some ubiquitous standards, let alone all the implementations, the tools for the user communities, tutorials, courses… I just do not even want to think of going there.

I.e.: I am just pragmatic. For me, BCP47 and HTML are, from our point of view, immutable. We may use a -x-d-*, but that is just cosmetic that most of the people will not appreciate and will use all over the place in HTML as well; it is not that different from -d-*. We have to find a solution in the RDF domain. I.e., we should keep away from that, too.

The only question that does come to my mind is: how important is it to solve this issue in the RDF (or JSON) world? Aren't the use cases for RDF very different from the HTML usage? After all, we are talking about simple literal strings without an internal structure and without a surrounding textual context: aren't those use cases actually solved by "just" relying on today's BCP? However, I do hear the (compelling!) example of data exchange among systems for strings like brand names, acronyms, and the like (I have also heard a similar case in WoT) which may need pure directional information without a clear indication of language, and it seems that today's BCP cannot cope with this. I.e., not just specification purity but also practice requires that we provide a solution for this, although we know that, in real life, this will not be widely used because the language tag does the trick most of the time…

dlongley commented 5 years ago

@iherman,

Thank you for your detailed response! I will try and focus on what I'm struggling to understand.

I.e.: I am just pragmatic. For me, BCP47 and HTML are, from our point of view, immutable. We may use a -x-d-, but that is just cosmetic that most of the people will not appreciate and will use all over the place in HTML as well; it is not that different from -d-. We have to find a solution in the RDF domain. I.e., we should keep away from that, too.

I understand this perspective. Where I struggle is that I see adding -d- as similar to adding some new datatype on top of RDF (rather than modifying the core). Neither of these would make me think there's a butterfly that's going to drown us all. It must be worse than I realize, in ways I cannot see without more domain knowledge.

However, I'd like to move away from even talking about -d- because my latest thinking, based on what I've been hearing and the only use case I've been able to somewhat tease out so far, is that changes/extensions to RDF, BCP47, and HTML maybe aren't even necessary. Is the use case we're trying to solve "Arabic" ASINs? You mention a use case about acronyms below...

Except, of course, that a string may not really have a language in the first place (for an acronym, for example) but may have a base direction for displaying the string, and using und-d-rtl does look a bit awkward...)

If we have a use case that needs direction for an acronym that has no language, how would updating langString help? It is not a langString to begin with...

However, I do hear the (compelling!) example of data exchange among systems for strings like brand names, acronyms, and the like (I have also heard a similar case in WoT) which may need pure directional information without a clear indication for language and it seems that today's BCP cannot cope with this.

I guess I'm still trying to pin down the use cases we've got. So far, I've heard about the "Arabic" ASIN one. Here, to be clear, you're saying that there are use cases for things like brand names and acronyms that:

  1. Do not have a language or datatype (they are xsd:string),
  2. That need to be understood as RTL, AND
  3. Using unicode/heuristics would cause them to be incorrectly understood as LTR.

i.e., therefore, only "dir" can help. (Again, I would say that this is a problem whereby modifying langString wouldn't help because of number 1).

I really feel like we should call out the use cases that we're trying to solve:

"Arabic" ASINs

From @aphillips:

Amazon has tens of thousands of ASINs that are in Arabic but start with a strong left-to-right character sequence (such as a brand name). While I do have language data for these strings, I also have base direction. I need to be able to set (with low latency) the dir attribute in HTML or the direction of native Android and iOS controls and I'd prefer to do it using the metadata I've collected--not by introspecting the language tag.

Acronyms and Brand Names with no language

???

Some IoT Use Case Ivan mentioned

???

What else?

aphillips commented 5 years ago

@dlongley noted:

All of these things are signals to me that dir was a mistake and that adding it elsewhere would be a proliferation of that mistake. Why is that wrong?

I think clearly dir was not a mistake on the part of HTML. We (the I18N community) did a poor job of communicating requirements back when it would have helped RDF/Linked Data the most. But that doesn't make direction metadata the wrong thing.

Language and direction metadata are, as you point out, related. We don't expect "lang=ar but dir=ltr" nor do we expect "lang=en but dir=rtl". It is possible to introspect the direction from language metadata (in cases where the language metadata exists and is correct) and, indeed, we allow this and to some degree encourage it as one of the means of determining base direction (as a stronger hint, in fact, than the first strongly directional character in the string).

However, there are many data sources where we can know the base direction (at least the one that the content creator was using) and where the language is at best a guess (and the shorter the content, the less accurate the guess and the harder it is to guess with any accuracy). Having separate metadata allows a system to collect language and direction and transmit them through various interstitial systems and get the right result at the end.

If HTML didn't have dir, much of the text layout and stylistic presentation for right-to-left languages would be broken. Recall that this goes beyond just the text composition: it extends to things like display mirroring and determining where start and end are, etc. There are many problems that are well handled by the presentation layers of the Web (HTML, CSS, etc.) precisely because we have dir.

However, it seems to me that doing the extra work should not cause latency issues given how small the data set is and how simple the parsing is. So I'd like to throw that out as a red herring. Is that fair or not?

I think that's not entirely fair. I mention my day job as an example of a linked data application. The extra work you mention entails doing language detection for cases where there is no language metadata (or the data might be incorrect), which is not small or fast. It means introspecting the language tag or even the string itself.

To be clear, I'm not against involving language tags in a solution (in fact, I think it was my idea in the first place). My concern is that, on the Web, one often has direction metadata (but no way to transmit it end-to-end). The availability of a language tag (which might already have a value assigned to it) doesn't help me transmit the separate bit of data I have.

aphillips commented 5 years ago

@dlongley My point about ASINs was really more "I have a large catalog of data with many APIs". A given item can have attributes like "title" or "description" which consist of text and which may start with a strongly directional character in opposition to the base direction. I can also have many kinds of content related to items---reviews, for example---for which I can only infer language (but for which I can have the browser base direction when the customer composed the text).

That is, talking about "ASINs" is just me trying to give an example of a distributed Web-based application that uses structured data and which might need separate attributes for language and direction in various wire formats.

This has nothing to do with substrings. An acronym or a brand name or other Latin script inclusion into an Arabic string is the province of the Unicode bidirectional algorithm: indeed our example in String-Meta (‫HTML و CSS: تصميم و إنشاء مواقع الويب‬) illustrates this (notice that I've used Unicode bidi controls to get the right display here). We are only concerned with the base direction of strings because that is the separate information needed to render the text properly.

One "solution" for strings is to do as I've done in this comment and use Unicode bidi controls. The problem with this is that the controls are invisible and take room in storage. Implementations that impose length restrictions or do truncation need to be aware of the (paired) controls. Specifications might be tempted to require the addition of controls (which means looking inside the data and changing my data on the wire, which I regard with horror).

We can infer direction from the language tag (when we have the language tag), but if I have the metadata in hand (which I already do for much of the web thanks to support in HTML and XML), I can't transmit it separately. I can just drop it on the floor.

dlongley commented 5 years ago

@aphillips,

Thank you, I have a better understanding of your perspective and some of the problems at hand now.

I think that's not entirely fair. I mention my day job as an example of a linked data application. The extra work you mention entails doing language detection for cases where there is no language metadata (or the data might be incorrect), which is not small or fast. It means introspecting the language tag or even the string itself.

Ok. When you say it's not small or fast -- are you saying that there's a difference that is perceptible to users? If not, what measure are you using to determine whether or not the difference is acceptable?

If HTML didn't have dir, much of the text layout and stylistic presentation for right-to-left languages would be broken. Recall that this goes beyond just the text composition: it extends to things like display mirroring and determining where start and end are, etc. There are many problems that are well handled by the presentation layers of the Web (HTML, CSS, etc.) precisely because we have dir.

Hmm, I agree with what you say here that there are many uses for dir. So my previous comments should have been more constrained so as to not lose my point.

I do think dir is useful, for example, as a display directive for reversing icons, images, timelines, and so on, but I was questioning its use as the primary source for directional meta data on language-tagged strings. I wonder if we should have always assumed this information came from the language tag and then set dir accordingly to assist with the presentation of other elements. Instead it seems that the implication is sometimes flipped: that dir is set for presentation purposes and then that is used as primary meta data for the strings that are, for example, entered by the user into a form on a Web page.

While that may be a convenient or useful conflation, it could also be seen as the root cause of the "problem" where we have extra meta data that we're considering preserving as a core part of the RDF data model.

We can infer direction from the language tag (when we have the language tag), but if I have the metadata in hand (which I already do for much of the web thanks to support in HTML and XML), I can't transmit it separately. I can just drop it on the floor.

I certainly understand that frustration. But, it makes me think: is there any other meta data that is gathered via HTML or another tech that would be useful to "bring along" in the RDF core data model? What if HTML invents "foo" and then we've got all this extra "foo" meta data that we just have to drop on the floor? Why should "direction" be given special consideration?

Part of the trouble with answering that seems to be precisely that "direction" and "language" are linked by implication, reducing one avenue for considering it special as we already have "language" in core. As you said:

Language and direction metadata are, as you point out, related. We don't expect "lang=ar but dir=ltr" nor do we expect "lang=en but dir=rtl". It is possible to introspect the direction from language metadata (in cases where the language metadata exists and is correct) and, indeed, we allow this and to some degree encourage it as one of the means of determining base direction (as a stronger hint, in fact, than the first strongly directional character in the string).

So, when "language" is present, it already implies the direction for all of the cases we seem to be considering (we've ruled out embedded strings and so forth). Here your motivation for adding the direction metadata is that it spares you the inference work and avoids having to "drop [direction] on the floor". Again, I'm not unsympathetic to that. However...

What else could be inferred from "language"? Suppose "foo" can be inferred but people have collected metadata for it... should it be added to core? How could the RDF community say "no"? Is your argument that they should never say "no", or that "direction" is special enough? What do you think the metric for saying "yes" under these conditions should be?

I think a better argument for adding "foo" to core would be that there are valid use cases in which "foo" conflicts with "language" and we therefore have no other way to support them... not merely that we expect the two to agree but that there's a lost opportunity for optimization. This brings me to:

I can also have many kinds of content related to items (reviews, for example) for which I can only infer language (but for which I can have the browser base direction when the customer composed the text).

Here I think you're saying that you have use cases where you have direction metadata but not language. This is more convincing to me as an argument for adding something to core. At the same time, I have to wonder how safe the assumptions are that lead to having "direction" but not "language".

Just because the browser base direction was set to RTL, does that really mean that the entered string "W3C" should carry that direction? In this circumstance, wouldn't you want to infer language from the characters and not assume the browser base direction applied to each string independently? Couldn't doing so be considered creating potentially bad data?

I would think you'd want to infer the language at the point of capture, or store the browser base direction (and whatever other signals you have) independently of the individual strings and include that in your application-level data model, rather than assuming it applied to all strings without running some intelligent algorithm.

In fact, the way you described it makes it sound like that's how you stored the data (with the browser base direction independent). Is that not true or are you looking to change that? Are you saying you can assign a direction to a language string with certainty but you can't assign its language tag? I'm at a bit of a loss here since it's not my domain and I don't have sufficient experience with this problem to understand its prevalence.

r12a commented 5 years ago

@dlongley i'm in the process of writing something much longer to address the question of whether it's possible to use language metadata to derive base direction info, so please bear with me while i do so. In the meantime, i thought it might be helpful to respond to a small part of your previous comment at https://github.com/w3c/rdf-dir-literal/issues/7#issuecomment-500956469, specifically the following questions:

How "large" is large? You also said this:

And, fwiw, here's my own list of RTL scripts: http://r12a.github.io/blog/201512.html#20160825

Which says that there are:

6 in modern use, 3 in modern limited use, and 17 that are archaic

There are a total of 26 (in your list). This does not strike me as a "large" data set in the context of modern computing. So under what context should I be evaluating the meaning of "large"?

You also said that they are "difficult to bound". How is that possible? The number of languages is bounded at something like 7,111 (of which 3,995 are written), per the first Google result I get. That is a very small data set for modern computing systems to handle, especially if you're going to turn it into a binary tree for lookups. But we're not even talking about that number, rather the number that are RTL -- which looks to be at least an order of magnitude smaller, and two orders of magnitude smaller per your list.

First, i should mention that the names in my list of RTL scripts are script names, not script tags. So for example, the Syriac script is associated with 4 script tags (syrc, syre, syrj, syrn). But the much longer list is the list of languages that use those scripts: for example, the language tag ar represents a macrolanguage in BCP47 which encompasses the following primary language subtags: aao abh abv acm acq acw acx acy adf aeb aec afb ajp apc apd arb arq ars ary arz auz avl ayh ayl ayn ayp bbz pga shu and ssh. These then all have their historical and dialectal variants, or just alternate ways of describing a language (for example, RTL Azeri can be az-Arab, az-IR, az-IQ, azb, azj, etc.). So the list of language tags is reasonably large, and is multiplied further by combining language subtags with variant or other subtags.

In addition, users may use language tags ambiguously, e.g. they may just use uz to mean Uzbek in the Arabic script, rather than auz, or uzn, or uzs, or uz-Arab, etc., and because subtags after the language subtag in BCP47 language tags are intended for contrastive use, they may feel, rightly, that uz alone is adequate for their needs. Of course, uz could represent either an RTL or an LTR orthography.

Also, bear in mind that the BCP47 list is constantly growing, not only because of the addition of new language subtags (and note, btw, that there are currently 8,152 language subtags in BCP47, well beyond the seven thousand odd listed by your source), but also because combinations of existing language tags with new variant subtags may associate them for the first time with RTL text direction (such as a possible new variant that indicates the Arabic orthography for Malayalam used at the start of the 20th century, or the Syriac orthography for Suriyani Malayalam, used widely by Christians in the 19th century, or the 'Yekgirtú' Kurdish Unified Alphabet being promoted by the Kurdish Academy of Language for all Kurdish dialects, including ckb, which is currently usually written in Arabic script, etc.). Then there are transliteration schemes, which may be generated using a potentially growing list of variant subtags, such as ar-fonipa, or by the -t extension tag, and which can change the direction for a language tag from RTL to LTR. And so on.

I just wanted to give a sense of how (a) the set of languages for which RTL base direction applies is not that straightforward to list, (b) it's a growing list, with plenty of potential to expand as BCP47 covers new variant and historic orthographies, and (c) human producers can't be relied upon to always produce unambiguous arrangements of subtags as input to the process.
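
To make that concrete, here is a rough sketch (illustrative only; the subtag table below is an example, not an authoritative or complete list) of what a naive lookup keyed on the primary language subtag would do with the cases above:

// Illustrative only: a naive table keyed on the primary language subtag.
const naiveRtlPrimarySubtags = new Set(["ar", "he", "fa", "ur"]);

function naiveDirection(languageTag: string): "rtl" | "ltr" {
  const primary = languageTag.toLowerCase().split("-")[0];
  return naiveRtlPrimarySubtags.has(primary) ? "rtl" : "ltr";
}

// Cases from the discussion above where the naive lookup fails or cannot decide:
naiveDirection("az-Arab");   // "ltr", but Azeri written in the Arabic script is RTL
naiveDirection("uz");        // "ltr", but the user may have meant the Arabic-script orthography
naiveDirection("ar-fonipa"); // "rtl", but an IPA transliteration is LTR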

Hope that helps a little. More to come.

r12a commented 5 years ago

Second, I keep hearing about these undesirable spillover effects but they seem ill-defined to me.

Does this help clarify the issue? https://www.w3.org/International/articles/inline-bidi-markup/uba-basics.en#isolation

For (a little) more technical detail, see https://www.w3.org/International/articles/inline-bidi-markup/index.en#usecase2 and https://www.w3.org/International/articles/inline-bidi-markup/index.en#usecase3

iherman commented 5 years ago

I looked into the details of one of @r12a's examples in the article. I took one of his examples and put it into a small HTML file:

<html>
<head>
    <meta charset="UTF-8">
    <title>dir example</title>
</head>
<body>
    <p lan=en><span>פיצה סגלה</span> - 5 reviews</p>
    <p lan=en><span lan=he>פיצה סגלה</span> - 5 reviews</p>
    <p lan=en><span dir=rtl lan=he>פיצה סגלה</span> - 5 reviews</p>
</body>
</html>

What I get on the screen is (see also online):

[Screenshot of the three lines as rendered by Firefox]

The first two lines are wrong, the last one is fine. This is a Firefox screenshot; Chromium-based browsers produce the same results. Safari gets it wrong, though: all lines are displayed incorrectly.

The result looked counterintuitive because I expected that isolating the text as explicitly Hebrew (i.e., setting lang=he in the second line) would make things right. I suspect that the HTML BiDi algorithm actually ignores the lang tag and the span element when working out the directional run, and only considers the directional information of the Unicode characters themselves. I.e., the lang tag is actually ignored as far as the BiDi algorithm is concerned. @r12a, is that correct?

(We may say: the algorithm is badly defined for HTML, and it should take the language into account. That may be so, but this is one of those boats that have sailed…)

The fact is that the explicit dir value is necessary. If I translate this into RDF/JSON-LD terms for a hypothetical application that receives a bunch of review data in JSON-LD and converts it into HTML for display, that data set could be:

{
    "review1" : {
        "note" : 5,
        "name" : "פיצה סגלה"
     },
    "review2" : {
        "note" : 5,
        "name" : {
            "@value" : "פיצה סגלה",
            "@language" : "he"
        }
    },
    "review3" : {
        "note" : 5,
        "name" : {
            "@value" : "פיצה סגלה",
            "@language" : "he",
            "@direction" : "rtl"
        }
    }
}

The values of review1 or review2 can only work if we expect the application (which would, in this hypothetical case, translate this metadata into HTML) to work out that the addition of the rtl value is necessary; the language tag alone does not provide enough data. This is clearly too much to expect from an application (see @r12a's comment above).
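
To illustrate the consumer side, here is a minimal sketch (assuming value objects shaped like the ones above; the names LocalizedName and toSpan are hypothetical, and HTML escaping is omitted for brevity) of an application turning these values into HTML:

// A minimal sketch of a consumer turning the JSON-LD values above into HTML.
interface LocalizedName {
  "@value": string;
  "@language"?: string;
  "@direction"?: "ltr" | "rtl";
}

function toSpan(name: string | LocalizedName): string {
  if (typeof name === "string") {
    // review1: no metadata at all; the consumer can only fall back to heuristics.
    return `<span>${name}</span>`;
  }
  const lang = name["@language"] ? ` lang="${name["@language"]}"` : "";
  // Only review3 carries "@direction"; for review2 the consumer would have to
  // work out the dir value itself, which is the extra work in question.
  const dir = name["@direction"] ? ` dir="${name["@direction"]}"` : "";
  return `<span${lang}${dir}>${name["@value"]}</span>`;
}

Only review3 gives the consumer enough to emit dir=rtl directly; for review1 and review2 it would have to fall back to guessing.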

r12a commented 5 years ago

I looked into the details of one of @r12a's examples in the article. I took one of his examples and put it into a small HTML file:

The lang attribute is written lan in the examples. (Not that i expect it to change anything vis-à-vis the results.)

Safari gets it wrong, though: all lines are displayed incorrectly.

Yes, Safari is lagging here, which is strange because the behaviour is supported by Safari when applied by CSS (dir isolates if you apply the shim we mention in the articles), and it should be just a question of flipping a switch to make it work for the dir attribute. I've asked for that several times, and i don't know why there has been no action.

I suspect that the HTML BiDi algorithm actually ignores the lang tag and the span element when working out the directional run, and only considers the directional information of the Unicode characters themselves. I.e., the lang tag is actually ignored as far as the BiDi algorithm is concerned. @r12a, is that correct?

Read the introduction to the bidi algorithm again. Your examples all use a single directional run; it's when you have bidirectional text that things become more difficult. Btw, it's not the HTML bidi algorithm, it's the Unicode Bidirectional Algorithm at work here. Inside a Hebrew word containing only Hebrew characters, it's just the character properties that matter. When you have multiple words with mixed direction or neutrals, that's when things become more complicated, and the bidi algorithm jumps in to assess the context (as far as it can). The actual language of the text is not a factor for the bidi algorithm.

iherman commented 5 years ago

The lang attribute is written lan in the examples. (Not that i expect it to change anything vis-à-vis the results.)

Oops. Am I stupid or what... For the record, the file is now:

<html>
<head>
    <meta charset="UTF-8">
    <title>dir example</title>
</head>
<body>
    <p lang=en><span lang=he>פיצה סגלה</span> - 5 reviews</p>
    <p lang=en><span dir=rtl lang=he>פיצה סגלה</span> - 5 reviews</p>
    <p lang=en><span dir=rtl>פיצה סגלה</span> - 5 reviews</p>
</body>
</html>

and the second and third display fine, the first does not.

iherman commented 5 years ago

@r12a, I believe this is the important point:

The actual language of the text is not a factor for the bidi algorithm.

Sorry for not using the right terminology; that was the core of my explanation of what is happening. This is the decisive conclusion for this thread...

dlongley commented 5 years ago

@r12a,

Thanks for your response.

(and note, btw, that there are currently 8,152 language subtags in BCP47, well beyond the seven thousand odd listed by your source)

Well, I would say we could go ahead and round up to 10,000 and I still wouldn't consider it a problem. In my view, we're talking about a difference in the number of microseconds it takes to execute a search.

As for the rest of your comment, there are a lot of details that I would like to boil down to this question:

Do you think a function could be constructed and made available to applications that:

  1. Takes one input, a language tag, and outputs either RTL or LTR (in a way that is reliable and useful).
  2. Runs quickly enough to be imperceptible to users (or could use a caching mechanism that would enable this).
  3. Doesn't need constant updating to incorporate new language tags?

I get the impression that you would answer "no". If the answer is "no", then I believe I'm in agreement that we'd be asking applications to do too much work to derive direction from the language tag.

To my mind, the only exception to this would be a trade-off that weighs the amount of work necessary to let directional metadata travel separately against how often it is actually needed to cover all of the cases: if we could construct the function above to handle 99% of the use cases, that could easily be the more practical way to go.
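
For concreteness, here is a rough sketch of the shape such a function might take, under the assumption of hand-maintained tables (the entries below are illustrative, not a registry); it also shows where the maintenance and ambiguity problems discussed above would land:

// Illustrative only: the rough shape of the function described above.
// Both tables would need ongoing maintenance as BCP47 grows.
const rtlScripts = new Set(["arab", "hebr", "thaa", "syrc", "nkoo"]);
const rtlPrimaryLanguages = new Set(["ar", "he", "fa", "ur", "dv"]);

function baseDirection(languageTag: string): "rtl" | "ltr" {
  const subtags = languageTag.toLowerCase().split("-");
  // A 4-letter alphabetic second subtag is a script subtag; trust it if present.
  const script = subtags[1] && /^[a-z]{4}$/.test(subtags[1]) ? subtags[1] : undefined;
  if (script !== undefined) {
    return rtlScripts.has(script) ? "rtl" : "ltr";
  }
  // Otherwise fall back to the primary language subtag, which is where the
  // ambiguity (uz, az) and the variant/extension cases (ar-fonipa, -t) bite.
  return rtlPrimaryLanguages.has(subtags[0]) ? "rtl" : "ltr";
}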

r12a commented 5 years ago

hi @dlongley. My considered answer would indeed be 'no', as you surmised.