w3c / rdf-dir-literal

Proposal to add base direction to RDF Literals

In defense of extending language tags #3

Open pchampin opened 5 years ago

pchampin commented 5 years ago

BCP47 is indeed a complex specification, but extending it looks relatively lightweight: the U extension is 7 pages long, counting a lot of boilerplate text...

The regular expression mentioned in the document is very complex because it aims at identifying every part of the language tag. But unless I'm misreading the ABNF grammar, if you are only interested in the direction, the regular expression -d-(ltr|rtl|auto) would do the trick.
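A minimal sketch of that simpler check, in Python. The helper name `direction_from_tag` is hypothetical, and the `-d-` extension is only the proposal discussed in this thread, not a registered BCP47 extension. The lookahead is an addition so that a longer subtag such as `ltrx` is not mistaken for a direction. The sketch does not check where in the tag the sequence occurs.

```python
import re
from typing import Optional

# Matches the proposed "-d-" extension followed by a direction subtag.
# The lookahead requires the direction to end the tag or be followed by "-".
_DIR_EXT = re.compile(r"-d-(ltr|rtl|auto)(?=-|$)", re.IGNORECASE)


def direction_from_tag(tag: str) -> Optional[str]:
    """Return 'ltr', 'rtl' or 'auto' if the tag carries the proposed extension."""
    match = _DIR_EXT.search(tag)
    return match.group(1).lower() if match else None
```

For a tag without the extension, such as `en-US`, the helper returns `None`, which callers could treat as "auto".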

gkellogg commented 5 years ago

It seems to me that text direction is equivalent to a style attribute. To my mind, two language-tagged strings that differ only in direction at least have the same value, but can’t be considered to be the same term.

I could see that -auto would be a syntax-level consideration similar to xsd:string and rdf:langString. Any language-tagged string without an explicit direction is implicitly “auto”, and serializers should not emit this.

pchampin commented 5 years ago

@gkellogg I included auto because it was part of the two other proposals, but I tend to agree with you: the extension d to BCP47 should only define ltr and rtl. The auto mode is conveyed by the absence of the d extension.

r12a commented 5 years ago

I'm quite reluctant to see this as a good solution, because it bends the language tag to something it isn't designed for.

(Note that the original hack suggestion in the i18n document was to deduce the direction from the language, rather than to specify it explicitly. (And we're not even very keen on that idea.))

(I'm also quite worried about the running battles i'm sure we'll have to face where people working with other formats (such as HTML) might (in my mind, erroneously) think this is a good idea (i already know some who have suggested it), and might want to use language tags everywhere instead of direction tags. Not to mention that we'd then have 4 ways of applying direction in HTML.)

dlongley commented 5 years ago

@r12a,

I think that if the direction of a language can be gleaned from the language tag (e.g., you can already do this with English and most? other languages), then it seems to me wholly appropriate that you can specify a direction explicitly using the same mechanism. It also seems like the simplest and least disruptive solution across various implementations and platforms.

As you mentioned, HTML already has a number of different ways of applying direction, what's the harm in adding another one if it finally gets this issue solved across the board? It seems like any system that already supports language tags would get direction support with this change and it would remove the need to argue any further about this for the variety of syntaxes in the world. What would be so bad if people did start using language tags everywhere instead of direction tags? What are the other drawbacks besides a feeling that it isn't "the best fit"?

As the old adage goes, let's not let the best be the enemy of the good.

chaals commented 5 years ago

@dlongley you cannot necessarily get the direction information just from the language.

You asked

HTML already has a number of different ways of applying direction, what's the harm in adding another one

The first harm in adding another way to check direction in HTML is that it requires all HTML implementations to implement it, which is generally quite difficult to arrange. Browsers are nervous about changes to the layout engines because those are already very complex and can have major performance impact. Other software (e.g. editing software) often takes a long time to work out what to do and then implement it.

The second harm is that we have to specify what to do in each possible combination of methods, including working out what should happen if there is conflicting information. Apart from being a complex task with potential for us to make mistakes that don't get picked up until the first round of implementations make them happen in practice, the more complex the system the more likely that implementations will confuse something and introduce a bug even in the historically unlikely situation that the specification is perfect first time.

gkellogg commented 5 years ago

We should consider not doing a blanket change to language tags for direction, but allow them for certain syntaxes, such as Turtle/SPARQL. Other serializations, such as HTML and JSON-LD (with @dir) can express it differently.

chaals commented 5 years ago

Forking language tagging according to what spec is being applied seems like a pretty bad idea to me.

I don't think it will take long for confusion to arise and implementors who were promised they wouldn't get language tags with directions in them will find that they are appearing, and in practical terms they need to work out how to deal with them.

That seems something worth avoiding if we can.

dlongley commented 5 years ago

@chaals,

you cannot necessarily get the direction information just from the language.

Sorry, I meant that this is already true for many languages, but not all. For those languages where it is possible, that information comes from the language tag (e.g., "en" means the direction is "ltr"). That indicates to me that a sensible solution to this problem is to add the direction information to the language tags that do not imply it already (via an -d- language tag extension). This would unify how it is dealt with today across all languages and it would work across all syntaxes.

The first harm in adding another way to check direction in HTML is that it requires all HTML implementations to implement it, which is generally quite difficult to arrange.

This is true, but it seems an even more onerous task to update all implementations of all syntax processors except HTML. My understanding is that various efforts have been made to shoehorn HTML into other syntaxes because it's the only syntax with the necessary direction support. It seems to me that we should solve the problem then not by having a change to every other syntax and processor, but by extending the language tags that they already process.

The second harm is that we have to specify what to do in each possible combination of methods, including working out what should happen if there is conflicting information.

This is a valid concern, however, a language tag extension seems like the easier path than having a separate debate for how to update every syntax followed by updating all processors for all syntaxes (sans HTML). Instead, we could focus on a clear precedence order for dealing with this in HTML. That would be the only thing we need to concern ourselves with as opposed to having to sort out what to do with everything else in the world. That seems to be the more intractable problem to me -- and the lack of a solution in the other direction despite many passing years seems to be evidence of this as well.

So, for example, for HTML, we could state that if direction is specified in one of the three current ways that HTML handles it today -- those ways take precedence over any conflicting information (according to whatever precedence order for those three there is today). Only if those mechanisms are not used, would the language tag -d- extension be used. This seems, at least naively, to be a good start for upgrading browsers to support it. I imagine browsers already guess at the direction when no information is given today -- and this behavior would not change at all when encountering a language tag with a -d- extension until they are upgraded to understand it.

Even if browsers never support the -d- extension, we'd be in a very similar situation to today: HTML does one thing, other syntaxes do something else. The only difference would be that the "something else" would become "express and preserve language direction information" instead of "do not express or preserve language direction information".

All that to say that I'm still in the camp that a language tag extension is the simplest fix here. It is not without drawbacks, but the stated challenges seem far more surmountable than the alternatives so far.

iherman commented 5 years ago

We are facing an RDF problem. I believe we have to solve it within the realm of RDF, and even that is not obvious in terms of deployment.

By touching BCP47 we would change a standard that is used, referred to, and deployed by a way larger palette of technologies including, but not limited to, HTML. We would have to convince a huge community, or more exactly a huge number of different communities, that this is a way to solve a problem which they may not even have (the last remark also includes HTML, which has solved the problem of directions for a while now). I do not think we should go there.

There was a remark in an earlier mail (not on this issue) to extend BCP47 for RDF usage only. I think forking a standard this way is also a bad idea, and would create a bad precedent in the way standards are treated....

pchampin commented 5 years ago

We are facing an RDF problem. I believe we have to solve it within the realm of RDF

This problem arose in the RDF community, but is it intrinsically an RDF problem? Could it not arise in other contexts?

That being said, I admit that I didn't consider the complex interactions that the -d- extension could have on HTML (or other technologies having their own way of encoding direction). But maybe this could be mitigated by carefully writing the RFC, making it clear that the -d- extension is merely a hint (no MUST involved). In other words, any technology specifying its own way to convey direction (e.g. HTML) MAY supersede this hint.

pchampin commented 5 years ago

I realize that the proposal in my comment above is a "feature, not a bug" kind of solution. If we make the -d- extension available, people will immediately want to write something like this

<p lang="he-d-rtl">פעילות הבינאום, W3C</p>

and expect the hint to have an effect...

sigh

iherman commented 5 years ago

This problem arose in the RDF community, but is it intrinsically an RDF problem? Could it not arise in other contexts?

The problem, ie, to indicate the base direction, obviously does arise elsewhere. But, e.g., in a pure JSON context it can simply be solved by adding an extra tag. See https://www.w3.org/TR/string-meta that touches upon that. The problems come when we are bound by the JSON-LD rules that are bound by the RDF rules...

r12a commented 5 years ago

And, btw, one very important reason that using -d in HTML would be bad is that it doesn't directionally isolate the tag's contents from the text outside the tag (which dir="rtl" does, and which you can do with unicode characters and CSS). Such isolation is an important aspect of working with bidi text, but one of the ways in which language tags differ in use from direction.

iherman commented 5 years ago

@r12a, I am not sure I understand what you say. Can you give an example?

r12a commented 5 years ago

The country names include <span lang="ar-dir-rtl">مصر</span>, <span lang="ar-dir-rtl">البحرين</span>, <span lang="he-dir-rtl">ישראל</span> and others.

gives:

The country names include مصر, البحرين, ישראל and others.

whereas

The country names include <span lang="ar" dir="rtl">مصر</span>, <span lang="ar" dir="rtl">البحرين</span>, <span lang="he" dir="rtl">ישראל</span> and others.

gives the correct:

The country names include مصر, البحرين, ישראל and others.

The difference in ordering comes about because the dir attribute isolates the contents of the element from the text outside the element wrt the bidi algorithm. Similar common examples include situations where a number appears after a name. Without isolation the number will appear to the right of a latin-script name in LTR text, but to the left of an rtl-script name. This is because of 'spillover effects' associated with the bidi algorithm (which are useful in some circumstances, but not those just mentioned).

dlongley commented 5 years ago

@r12a,

To my knowledge, @dir in JSON-LD would not solve the problem with mixed directions within a single statement. It similarly would not be solved by the proposed changes to RDF. This is why, to my mind, the proposed changes would be just as effective at addressing the language direction problem as a -d- language tag extension, but would be more disruptive.

iherman commented 5 years ago

@dlongley

To my knowledge, @dir in JSON-LD would not solve the problem with mixed directions within a single statement. It similarly would not be solved by the proposed changes to RDF.

That is correct (and probably worth emphasizing in the final document). It solves the probably most frequent case only. For a really complex case (like the above) the usage of HTML datatype is probably the only viable solution, but that is a real sledgehammer (requires at least a specialized HTML parser).

This is why, to my mind, the proposed changes would be just as effective at addressing the language direction problem as a -d- language tag extension, but would be more disruptive.

This is correct, except that using -d- would mean touching an existing standard whose deployment is significantly larger (probably by an order of magnitude) than RDF & Co.

r12a commented 5 years ago

@dlongley yes. My comment just above was part of a subthread talking about problems that would arise if people started trying to take the existence of a BCP47 -d extension tag and try applying it to HTML lang attributes (because changing BCP47 would not only affect JSON-LD). The point being that in the more general case, adding directional information to the language tag doesn't achieve what's needed for directional control.

r12a commented 5 years ago

And just to hopefully make things even clearer, the handling of directional and language changes within a string is the big white elephant in the room in all recent discussions. What we've been focusing on so far is only related to establishing the overall directional context for a given string.

To change the direction inside a complex string such as "The country names include مصر, البحرين, ישראל and others." one would have to use Unicode formatting characters in plain text strings, or markup in HTML format strings.
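A minimal sketch of the plain-text route, in Python, using the Unicode isolate formatting characters (U+2066 LRI, U+2067 RLI, U+2068 FSI, U+2069 PDI). The helper name `isolate` is hypothetical, and wrapping every country name as RTL is an assumption made for this example only.

```python
# Unicode bidi isolate formatting characters.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap text in an isolate; fall back to FSI when no direction is given."""
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return f"{opener}{text}{PDI}"


names = ["مصر", "البحرين", "ישראל"]
sentence = (
    "The country names include "
    + ", ".join(isolate(name, "rtl") for name in names)
    + " and others."
)
```

Each name is then treated as its own directional run, so the commas between names stay part of the surrounding LTR sentence. This says nothing about the language of each run, which remains unaddressed as noted in this comment.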

To set language appropriately for such a string (ie. arabic here, hebrew there) cannot actually be done at all, afaik. (The use of Unicode tag codepoints is problematic and is strongly deprecated by the Unicode Consortium.)

i here hand you back to our usual programming... (such string-internal considerations are best handled in a separate issue).

aphillips commented 5 years ago

@r12a and I just were discussing this adjacent to the I18N WG teleconference.

I agree that this is a Bad Idea because of the impact it would have outside of the RDF space. The right way to solve this for JSON-LD is to add a context variable, just as we've done for language.

The RDF (SPARQL/Turtle/etc.) problem is how to serialize the direction in a non-breaking/backwards compatible way onto the existing syntax, e.g. "My string goes here"@en-US.

We've previously speculated about adding direction by adding a new delimiter, e.g. "My string goes here"@en-US^ltr, but that would be a breaking change. Using -d- is effectively a hack to make the delimiter be something permitted in language tags, e.g. "My string goes here"@en-US-d-ltr, where the -d-ltr part is not part of the language tag, even though existing processors treat it as if it were. But the leakage would be confusing. Just reading this thread illustrates the range of assumptions that people might make about the relationship of language metadata to base direction: there is a tenuous linkage. It's not wholly unrelated. But we shouldn't emphasize it. The spread of misinformation about -d- would be difficult to contain, particularly if we made it an actual extension (and not just "RDF's hack").

A better workaround might be to use the private use syntax. "My string goes here"@en-US-x-dir-ltr would be opaque to older processors, valid in the BCP47 syntax, and not introduce a language tag extension that is "unrecommended for general interchange". (Also note that it is safe to postpend an -x- sequence to any language tag, while it is not safe to postpend another singleton without checking the contents of the tag). There would be some infelicities related to those tags leaking into the wild, but perhaps less danger of other specifications abusing it in turn: it is private use after all.
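A rough sketch of that workaround in Python. The function names are hypothetical, and the `-x-dir-ltr` / `-x-dir-rtl` suffix is just the convention floated in this comment, not anything BCP47 defines.

```python
from typing import Optional, Tuple

_DIRECTIONS = ("ltr", "rtl")


def with_private_direction(tag: str, direction: str) -> str:
    """Append the private-use direction suffix to a language tag."""
    if direction not in _DIRECTIONS:
        raise ValueError(f"unsupported direction: {direction!r}")
    return f"{tag}-x-dir-{direction}"


def split_private_direction(tag: str) -> Tuple[str, Optional[str]]:
    """Split a tag into (plain tag, direction); direction is None if absent."""
    lowered = tag.lower()
    for direction in _DIRECTIONS:
        suffix = f"-x-dir-{direction}"
        if lowered.endswith(suffix):
            return tag[: -len(suffix)], direction
    return tag, None
```

A processor that does not know the convention would still see a syntactically valid tag, while one that does can recover both parts with `split_private_direction`.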

iherman commented 5 years ago

On en-US-x-dir-ltr: well... I would hardly call this a 'private' use, because it could be spread in millions of datasets around the globe (e.g., in schema.org data embedded in HTML pages). I.e., it would be used for 'general interchange'.

It reminds me of the usage of data-* attributes in HTML which is a good example for "private" use and whose usage would not be appropriate for something like that.

I would prefer to explore the two other solutions.

r12a commented 5 years ago

I would prefer to explore the two other solutions.

Which other two solutions are you referring to ?

r12a commented 5 years ago

'Private use' means using private agreements between a particular technology and its consumers, rather than use by only a few people. There's a description here. Note also that we suggest this as a better alternative, using appropriate BCP47 syntax, rather than the use of -d extensions, in case there is a strong desire in this community to go down the path of using language tags; however, Addison and i would prefer to avoid this approach based on language tags altogether, if possible.

pchampin commented 5 years ago

@aphillips wrote

about the relationship of language metadata to base direction: there is a tenuous linkage. It's not wholly unrelated. But we shouldn't emphasize it.

I humbly recognize that I am no i18n specialist. Thanks for those clarifications. I guess I come round to your (more informed) opinion.

About using private extensions (-x-), I had thought about it, but I share @iherman's concerns. Endorsing by a W3C document (even a non-normative one) hardly qualifies as "private agreement" in my mind.

dlongley commented 5 years ago

@pchampin,

About using private extensions (-x-), I had thought about it, but I share @iherman's concerns. Endorsing by a W3C document (even a non-normative one) hardly qualifies as "private agreement" in my mind.

I agree -- and still think that if -x- would be acceptable, then -d- really is the way to go. This seems to me to be a problem that has persisted far too long to be (forever?) anticipating future fixes that may use language tags as a reason to not define -d- and solve it now.

iherman commented 5 years ago

Which other two solutions are you referring to ?

Either amend the current langString of RDF or define a new datatype.

iherman commented 5 years ago

Thank you @r12a for the pointer to description here. If I understand it well, the private extension (i.e., -x-dir-ltr and -x-dir-rtl) can be defined in a way that says: “this subtag is used exclusively for RDF Literals in the RDF model and for the serialization syntaxes of RDF.”

If this is indeed the case, then I must admit it takes away most of my bad feelings about using bcp47: because it is restricted to RDF & Co, HTML processors (for example) are not supposed to use it, nor any other, non-RDF technology. Ie, the “spilling over” of a feature to unknown areas is contained, in contrast to the -d-ltr approach. Taking into account the simplicity of deployment over the JSON-LD/Turtle/RDF worlds this makes this approach attractive.

(I am not sure yet what the proper procedural approach is to ‘register’ such a usage, though.)

@r12a @aphillips I understand you have reservations with this approach. I think it would be helpful if these could be spelled out concisely in the document itself, to improve that section and help the community make the final choice.

(Note that we owe ourselves to consider the other two approaches seriously, too, to make a proper pro/con cases for them, too.)

msporny commented 5 years ago

I'm the unfortunate editor that needs to convert this very long thread into concrete specification text for the Verifiable Credentials data model. It sounds to me like the group is settling on some sort of language tag extension (i.e., -x-dir-ltr and -x-dir-rtl with or without the -x)... either that, or having to rewrite multiple RDF specs and RDF syntaxes (which sounds insane... that'll take years).

I'm going to write something up with the language tag extension for the VC spec because it makes progress while not locking us into that mechanism if @dir finds its way into RDF 1.2 some time in the year 2025. :P

If the folks active in this conversation can't agree soon, the VCWG won't be able to pull the PR in, and this will remain an area of non-interoperability... and that would be a real loss to the i18n community. Can we have a special call between @r12a, @aphillips, a subset of the VCWG, and a subset of the JSON-LD WG? I feel like we could hash this out in an hour call?

msporny commented 5 years ago

PR for Verifiable Credentials spec is in: https://github.com/w3c/vc-data-model/pull/641 please review.

aphillips commented 5 years ago

@msporny I'd be very pleased to have a call with VCWG and JSON-LD on this topic. Do you want to make the invite or should I?

I will look at the PR above. I don't think VCWG or others should "freestyle" this. I think what we discussed with wpub a few weeks ago is unsatisfying-but-temporarily-the-best-we-can-do. That was effectively: "use a first strong heuristic in the absence of metadata; use metadata where you can; allow implementations (MAY, not SHOULD) to infer base direction from the language tag".
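A minimal sketch of the "first strong" part of that heuristic, in Python. The function name is hypothetical. It scans for the first character with a strong bidi class (L, R or AL) and ignores everything else. Skipping over isolate or embedding sequences, as the full Unicode rules do, is left out of this sketch.

```python
import unicodedata
from typing import Optional


def first_strong_direction(text: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' from the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found (for example digits or punctuation only).
    return None
```

A caller would consult explicit metadata first and fall back to this function only when none is available.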

I do believe that we need standardization, even if it takes years to become a de jure standard; we should free implementers to do the right thing; and we should clearly document the path forward quickly and clearly.

msporny commented 5 years ago

@aphillips wrote:

Do you want to make the invite or should I?

Please send the invite... I can do Tue/Wed 9am ET or after 12pm ET.

msporny commented 5 years ago

I will look at the PR above. I don't think VCWG or others should "freestyle" this.

Agreed. Not trying to "freestyle", trying to say more than "We can't suggest anything right now that is going to get traction, come back in a few years." We have a unified proposal that could work in JSON, JSON-LD, RDF, and HTML... I'm baffled why there is so much hemming and hawing over it. I clearly don't understand something, but given that I've spent 15 hours now reading i18n specs, and looking at implementations, and I can find no good reason not to do '-x-DIR' or '-x-d-DIR' or '-d-DIR', I'm pushing hard for someone to show me why that is a terrible idea.

I think what we discussed with wpub a few weeks ago is unsatisfying-but-temporarily-the-best-we-can-do. That was effectively: "use a first strong heuristic in the absence of metadata; use metadata where you can; allow implementations (MAY, not SHOULD) to infer base direction from the language tag".

This is insufficient for our use case, we want to provide stronger guidance (given our limitations... we're in CR).

I do believe that we need standardization, even if it takes years to become a de jure standard; we should free implementers to do the right thing; and we should clearly document the path forward quickly and clearly.

Let's jump on a call and try to sort it out. Worst case is that we determine that there is nothing we can do at present.

msporny commented 5 years ago

Also, and I know I'm probably going to drive a few of you crazy with this -- most notably @r12a and @aphillips -- :)... but I put together a language direction extension for language tags and published it:

https://tools.ietf.org/html/draft-msporny-d-langtag-ext-00

This whole "-x-" vs. "-d-" preference makes no sense to me. We should just deprecate "dir" across all syntaxes and use the "-d-" extension.

msporny commented 5 years ago

Let's jump on a call and try to sort it out. Worst case is that we determine that there is nothing we can do at present.

We're meeting 9am ET on Wednesday here to discuss:

https://chime.aws/5719267133

aphillips commented 5 years ago

Invite sent. Please forward as needed.

:: sigh:: Why do we need to rush into an ID? A letter extension should be a last resort.

gkellogg commented 5 years ago

@msporny If we come up with a solution using -x-dir-rtl, say, I think that this is restricted to the RDF model as represented in N-Quads; given the support for directionality, then specs such as JSON-LD, should use other metadata for specifying the direction. I would advocate the use of an @dir (or equivalent) property in a value object to handle this. When serializing to RDF, or deserializing the other way, it would set/extract any -x-dir-xxx element from the language-tag to set @dir.
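A sketch of that round trip in Python, under stated assumptions: `@dir` is the keyword proposed in this thread rather than an established JSON-LD keyword, the `-x-dir-xxx` suffix is the convention under discussion, and both function names are hypothetical.

```python
from typing import Dict, Tuple

_DIRECTIONS = ("ltr", "rtl")


def value_object_to_literal(value_object: Dict[str, str]) -> Tuple[str, str]:
    """Fold a proposed @dir entry into the language tag of an RDF literal."""
    language = value_object["@language"]
    direction = value_object.get("@dir")
    if direction in _DIRECTIONS:
        language = f"{language}-x-dir-{direction}"
    return value_object["@value"], language


def literal_to_value_object(value: str, language: str) -> Dict[str, str]:
    """Extract the -x-dir-xxx suffix, if any, back into a proposed @dir entry."""
    result = {"@value": value, "@language": language}
    for direction in _DIRECTIONS:
        suffix = f"-x-dir-{direction}"
        if language.lower().endswith(suffix):
            result["@language"] = language[: -len(suffix)]
            result["@dir"] = direction
            break
    return result
```

With this shape, the language-tag form would only ever appear at the RDF level, and JSON-LD authors would see plain `@language` plus `@dir`.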

If we were to retrofit RDFa, I would expect to do the same thing using the dir attribute from HTML. It's only for serializations that don't have such a mechanism that the language-tag hack is necessary.

In the fullness of time, the RDF model should be updated to provide for a distinct directional attribute on language-tagged literals.

Alternatively, if we were to describe separate rdf:langStringLTR and rdf:langStringRTL datatypes, this could also be handled the same way by serializations, as rdf:langString doesn't appear explicitly in any RDF serialization as it is.

pchampin commented 5 years ago

@msporny

We're meeting 9am on Wednesday here to discuss: https://chime.aws/5719267133

Speaking of I18n, would you mind specifying a timezone? :-)

pchampin commented 5 years ago

@gkellogg let me rephrase, to be sure I get it right: you suggest that -x-dir-xxx would be a temporary fallback plan to encode direction in RDF, until a proper "fix" for RDF is standardized. I think I like this idea: it allows JSON-LD and VC to go forward, while keeping doors open for a cleaner solution.

The only drawback that I can foresee is that, once said cleaner solution is standardized, we may have a mix of "hello"@en-x-dir-ltr and "hello"@en^ltr in deployed data, raising some slight interoperability issues. But that seems manageable.

iherman commented 5 years ago

Sigh. Much has happened while I was out since Friday... Let me react on various remarks here, in no particular order.

First of all, yes, a telco would help. Just to be on the safe side (@aphillips, your invite did not specify this): 9am means 9am ET, right? At least that is what I deduce from https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-495949448. Which works for me, b.t.w.; the downside is that this would probably not work for @swickr although his presence would be helpful.


@msporny,

If the folks active in this conversation can't agree soon, the VCWG won't be able to pull the PR in

This is over-dramatizing. If there is no proper solution for this issue, then I do not believe the PR would be stopped. We do have specifications that do not solve this issue, due to the fundamental problem we are trying to sort out here (e.g., activity streams). We (meaning the I18N review) accepted a status quo in the Web Publication work (as noted by @aphillips in https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-495945572). I do not think the use cases of the VCWG are more stringent. I do not believe this issue would stop the PR transition. (But, of course, this is only my personal opinion, it is up to the Director to decide on this.)

...or having to rewrite multiple RDF specs and RDF syntaxes (which sounds insane... that'll take years).

Again, this is an over-reaction. Yes, alternative (1) in the document would probably take years. The very reason I've put in alternative (2) is because defining a new datatype is, comparatively, a piece of cake, not much more complicated than extending the language tag.


All that being said, as I said in https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-495338451, I am warming up to the -x-d-... option. I am not “hemming and hawing” over the -d-... option: I (and others) simply do not find it acceptable to define an extension in haste to a standard that is more widely used than any of the standards we refer to, or we are working on. In this respect, I am not in favor of your IETF draft, @msporny. Whatever we do, it should be restricted to RDF & Co, and we SHOULD NOT expect HTML processors, browsers, etc., to understand and interpret the -d-... tag described in that draft. “We should just deprecate "dir" across all syntaxes and use the "-d-" extension.”: now that sounds insane to me. Sorry.


We should be mindful of the fact that (with all solutions, not only with this one) we have to specify what we expect browser extensions, ebook readers, etc, to do with the data. E.g., the WPUB (and maybe the VC) recommendation should specify that if the term value is displayed using HTML then the WPUB/VC specific processor should remove that -x-d-... tag from the language part, and convert the result into the relevant <span dir='...' lang='...'>...</span> structure for display. Not a big deal, but requires extra processing for specialized processors using, for example, a Web Publication Manifest which must be documented.
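As an illustration only, such a display step might look like the following Python sketch. The function name is hypothetical, and the `-x-d-...` suffix is the convention discussed here, matched in lowercase only for brevity.

```python
from html import escape


def literal_to_html(text: str, language_tag: str) -> str:
    """Render a language-tagged string as a span, moving direction into dir."""
    base, separator, direction = language_tag.rpartition("-x-d-")
    if separator and direction in ("ltr", "rtl"):
        # Strip the private-use suffix from lang and expose it as dir instead.
        return (
            f'<span dir="{direction}" lang="{escape(base)}">'
            f"{escape(text)}</span>"
        )
    # No recognized suffix: emit the tag unchanged and no dir attribute.
    return f'<span lang="{escape(language_tag)}">{escape(text)}</span>'
```

Because the result uses the HTML dir attribute, the displayed text also gets the directional isolation that a language tag alone would not provide.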


@gkellogg

If we come up with a solution using -x-dir-rtl, say, I think that this is restricted to the RDF model as represented in N-Quads; given the support for directionality, then specs such as JSON-LD, should use other metadata for specifying the direction.

I think "should" is the operative term here. What I mean is that it should be perfectly valid to use, say,

{
    "@value" : "פעילות הבינאום, W3C",
    "@language" : "he-x-d-rtl"
}

But JSON-LD should also define the @dir keyword. If that extra tag is defined, the genie is out of the bottle, and I am sure there will be people using it in JSON-LD, too. Let alone the fact that if we do allow this in JSON-LD, such language tags can be used in language maps, too, which is an extra bonus.


To be more forward looking: by all means we should discuss the -x-d-... option in more details. I would like to understand from @r12a and @aphillips what the possible problems are and, if it works after all, what the operational route should be. If the -x-d... does not require an IETF specification (which I understand is the case), is it enough if, say, we publish a document in the W3C Internationalization Activity? Should we create a dedicated CG (that is a matter of a few hours) and create a CG community report? Clearly, this issue is not a private problem of the VCWG or the JSON-LD WG, i.e., I do not believe it is appropriate to define this behavior through any of these groups.

I also believe we should share this document, and these discussions, with the semantic web mailing list asap. After all, that is the community that is really affected by any action here.

msporny commented 5 years ago

If there is no proper solution for this issue, then I do not believe the PR would be stopped.

To be clear, I was talking about the Pull Request :) -- none of this is going to stop us from going to Proposed Recommendation, especially if the i18n and JSON-LD WG cannot provide clear guidance for these sets of use cases.

msporny commented 5 years ago

@iherman wrote:

Whatever we do, it should be restricted to RDF & Co, and we SHOULD NOT expect HTML processors, browsers, etc., to understand and interpret the -d-... tag described in that draft.

I think a number of you are failing to grasp that this is not an RDF-only / JSON-LD only / Linked Data-only issue. There are three extra requirements above and beyond what we typically discuss when we talk about base direction.

  1. We are specifically defining how to do base direction across JSON-LD and JSON such that the markup a developer would use is the same across both syntaxes (and whether or not you use a JSON-LD processor).
  2. We want markup that is likely to be adopted vs. something that developers see as overly complex.
  3. We care deeply about how the solution is represented when RDF Dataset Canonicalization runs because all Verifiable Credentials are digitally signed and thus need to be canonicalized. We need to do this in a way that doesn't require the entire RDF stack to be upgraded and ideally a solution that works w/ existing deployments w/o any changes necessary.

The -x-dir-* or the -d-* approach is the only approach I've seen that meets these three additional requirements.

msporny commented 5 years ago

@iherman wrote:

“We should just deprecate "dir" across all syntaxes and use the "-d-" extension.”: now that sounds insane to me. Sorry.

It does sound insane, doesn't it? :) ... but it's not. More below...

What's insane is the number of choices that developers have across syntaxes and how inconsistent they are. HTML uses dir=rtl, CSS uses direction: rtl;, JSON-LD doesn't have anything, RDF doesn't have anything, RDFa failed to use dir, JSON doesn't have anything, Android uses android:textDirection="rtl", CBOR doesn't have anything, and so on!

This is a mess, and there is one mechanism that would enable us to fix the mess and stay backwards compatible, and that's the -x-dir or -d- BCP47 extension. If we go that route, we fix the problem for HTML, CSS, JSON-LD, RDF, RDFa, JSON, Android XML, CBOR, and we do so in a backwards compatible manner by asserting the order in which you should infer base direction (something like this):

  1. If you detect BiDi, use it.
  2. If no BiDi, and your language has dir and it's specified, use that.
  3. If no BiDi, and no dir, then see if -x-dir or -d- is available via lang/@language, use that.

That algorithm would be backwards compatible and work for every syntax specified above... and it would work today.
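A rough sketch of that precedence order, for illustration only. It rests on several assumptions that are not settled anywhere in this thread: "detect BiDi" is read as "the string contains an explicit Unicode directional formatting character", the function name `infer_base_direction` and its parameters are hypothetical, and the tag forms `-d-ltr|rtl` and `-x-dir-ltr|rtl` are the ones proposed here, not registered BCP47 extensions.

```python
import re
from typing import Optional

# Explicit Unicode directional formatting characters and the direction
# they signal. Treating these as "detecting BiDi" is an assumption.
_BIDI_CONTROLS = {
    "\u200e": "ltr",  # LEFT-TO-RIGHT MARK
    "\u202a": "ltr",  # LEFT-TO-RIGHT EMBEDDING
    "\u202d": "ltr",  # LEFT-TO-RIGHT OVERRIDE
    "\u2066": "ltr",  # LEFT-TO-RIGHT ISOLATE
    "\u200f": "rtl",  # RIGHT-TO-LEFT MARK
    "\u202b": "rtl",  # RIGHT-TO-LEFT EMBEDDING
    "\u202e": "rtl",  # RIGHT-TO-LEFT OVERRIDE
    "\u2067": "rtl",  # RIGHT-TO-LEFT ISOLATE
}

# Matches the proposed "-d-ltr" / "-d-rtl" or "-x-dir-ltr" / "-x-dir-rtl"
# subtag sequences inside a language tag (hypothetical, not registered).
_TAG_DIRECTION = re.compile(r"-(?:d|x-dir)-(ltr|rtl)(?=-|$)", re.IGNORECASE)


def infer_base_direction(
    text: str,
    dir_attr: Optional[str] = None,
    lang: Optional[str] = None,
) -> Optional[str]:
    """Return "ltr", "rtl", or None when nothing indicates a direction."""
    # Step 1: explicit bidi controls in the string win.
    for ch in text:
        if ch in _BIDI_CONTROLS:
            return _BIDI_CONTROLS[ch]
    # Step 2: a syntax-level dir value (e.g. HTML dir), if present.
    if dir_attr is not None and dir_attr.lower() in ("ltr", "rtl"):
        return dir_attr.lower()
    # Step 3: fall back to a direction carried in the language tag.
    if lang:
        match = _TAG_DIRECTION.search(lang)
        if match:
            return match.group(1).lower()
    return None
```

A consumer that knows nothing about the extra subtags would skip step 3 and behave as it does today, which is the backwards-compatibility argument being made here.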

I assert that that's what we should've done 15 years ago... or rather, we never should have introduced dir.

So, if we do that, what breaks?

dlongley commented 5 years ago

Given that this could be solved with a language tag that would work cross-syntax, I'm -1 to introducing yet another syntax-specific "direction" to JSON-LD or anything else. That's just going to accrue more technical debt.

Either every syntax has to adopt their own rules, which is clearly a major part of the problem we're dealing with today and it means you have to do point-to-point conversion to transform between them and you have to define some onerous precedence rules for every syntax (like we're so currently concerned about with just HTML)... or we solve it once and for all with a language tag extension. Furthermore, a language tag extension can be done in a backwards compatible way, allowing asynchronous upgrading to support it.

iherman commented 5 years ago

If there is no proper solution for this issue, then I do not believe the PR would be stopped.

To be clear, I was talking about the Pull Request :) -- none of this is going to stop us from going to Proposed Recommendation, especially if the i18n and JSON-LD WG cannot provide clear guidance for these sets of use cases.

Wow. This is the kind of misunderstanding that may start wars :-)

But then I think we can agree on one point: solving this issue properly is not under extreme time pressure, i.e., not under the time pressure of getting VC out of the door. We should take the time to do this properly.

(Do not get me wrong: I want to solve this as soon as possible, too!)

iherman commented 5 years ago

I think a number of you are failing to grasp that this is not an RDF-only / JSON-LD only / Linked Data-only issue. There are three extra requirements above and beyond what we typically discuss when we talk about base direction.

  1. We are specifically defining how to do base direction across JSON-LD and JSON such that the markup a developer would use is the same across both syntaxes (and whether or not you use a JSON-LD processor).
  2. We want markup that is likely to be adopted vs. something that developers see as overly complex.
  3. We care deeply about how the solution is represented when RDF Dataset Canonicalization runs because all Verifiable Credentials are digitally signed and thus need to be canonicalized. We need to do this in a way that doesn't require the entire RDF stack to be upgraded and ideally a solution that works w/ existing deployments w/o any changes necessary.

The -x-dir- or the -d- approach is the only approach I've seen that meets these three additional requirements.

I agree that the -x-dir-* approach meets these requirements. I disagree that this is the only one. I believe the extra datatype approach also does it:

  1. It allows us to use the idiom in JSON-LD that can be used in JSON in general. JSON-LD can define the idiom as in the draft; it corresponds to the general Localizable structure as defined in the string-meta document of the I18N.
  2. JSON developers already have the complication of using the extra @language (whether we drop the @ character or not is a detail). I do not believe the extra @direction is what would make it unacceptable.
  3. The mapping of the JSON-LD to the datatype is deterministic. RDF datatypes are already a fact of life that RDF canonicalization must take into account. I.e., there is no change there.
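To make point 3 concrete, here is a minimal sketch of what such a deterministic mapping could look like. Everything specific in it is an assumption made for illustration: the namespace `https://example.org/i18n#` is a placeholder, the `language_direction` IRI layout is invented, lowercasing the tag is one possible normalization, and `literal_for` is a hypothetical helper, not part of any JSON-LD processor.

```python
from typing import Optional, Tuple

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
# Placeholder namespace for direction-aware datatypes (hypothetical).
I18N_NS = "https://example.org/i18n#"


def literal_for(value_object: dict) -> Tuple[str, str, Optional[str]]:
    """Map a JSON-LD-style value object to
    (lexical form, datatype IRI, language tag or None)."""
    value = value_object["@value"]
    language = value_object.get("@language")
    direction = value_object.get("@direction")

    if direction is None:
        # No direction: plain string or ordinary language-tagged string.
        if language is None:
            return (value, XSD_STRING, None)
        return (value, RDF_LANGSTRING, language.lower())

    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unsupported direction: {direction!r}")

    # Language and direction are folded into the datatype IRI, so the
    # result is an ordinary typed literal. The same input always yields
    # the same IRI, which is all canonicalization needs.
    lang_part = (language or "").lower()
    return (value, f"{I18N_NS}{lang_part}_{direction}", None)
```

Under these assumptions, `{"@value": "x", "@language": "ar-EG", "@direction": "rtl"}` becomes a literal typed `https://example.org/i18n#ar-eg_rtl`, and existing RDF tooling handles it like any other datatype it does not recognize.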

I am not saying that the extra datatype must be the solution to be adopted; I am not fully decided at this point. But we have to weigh and document all the pros and cons objectively.

(In fact, even updating the langString core RDF type would, I think, abide by your three points. The cons against that solution are strong, however, because it would require changes at the core of the RDF implementations; hence I do not think it would fly.)

iherman commented 5 years ago

@iherman wrote:

“We should just deprecate "dir" across all syntaxes and use the "-d-" extension.”: now that sounds insane to me. Sorry.

It does sound insane, doesn't it? :) ... but it's not. More below... ... I assert that that's what we should've done 15 years ago... or rather, we never should have introduced dir.

Maybe we should have, maybe not, I do not know. Clearly, we should have taken care of directions for RDF language literals, but we didn't. These ships have sailed. The reason I do think it is insane is not technical; it is social.

Note that, just like @r12a put it in https://github.com/w3c/rdf-dir-literal/issues/3#issuecomment-494868090, setting the base direction is just the tip of the iceberg, and it does not solve all the BIDI issues. I.e., HTML does need that dir attribute separately from the language tag, it needs extra markup and, from that point of view, there is no need to extend the BCP47 space. Good luck trying to convince the browser manufacturers to change the DOM and the CSS-DOM, the algorithms producing and handling those in return for… well, nothing; from the HTML/DOM/CSS point of view this issue wouldn't be solved because the extra markup is necessary whereas the current solution works.

If -x-d-* solves the issue for a number of environments (which may also include CBOR or pure JSON, although JSON would have to introduce some new structure for language tags anyway), that is fine with me. If we define -d-*, without the -x, we essentially start that fight around HTML/DOM/CSS. I am personally not ready to get into this fight; life is too short.

chaals commented 5 years ago

The main problem I see with -x-d-rtl is that I am convinced it will get copied into data outside the planned use case, and that means it will keep biting people if it only works in some places until either it dies, or it works everywhere. Either case will take years, which means there is a big cost.

If there is a will to change the RDF stack to fix this particular problem with a solution which is agreed in say 6 weeks, that probably could be done in less than a year, and it's a reasonable expectation that VC would be allowed to move forward with the solution that is expected to be generalised.

I'm very sympathetic to the argument that it is important to get something that could be acceptable for broad adoption by people who use "plain" JSON - which is one of the big i18n pain points I keep coming across.

If we achieve that with the -x-d-rtl approach I expect more pain to arise from copy-pasting that into places it wasn't meant to go.

Conversely, if we already require an object to get language information in (and an array for actually multilingual terms), then for the relatively far smaller set of cases that also require direction information I actually believe that the patterns are pretty much equivalent in acceptability.

iherman commented 5 years ago

If there is a will to change the RDF stack to fix this particular problem with a solution which is agreed in say 6 weeks, that probably could be done in less than a year, and it's a reasonable expectation that VC would be allowed to move forward with the solution that is expected to be generalised.

If defining a new datatype does solve the issue (and this is something we will have to decide, where 'we' should be a larger community than ours) then we can do that properly, e.g., via a W3C CG report or something like that. That may provide the necessary stability and can be done in less than a year, that is for sure. It has the advantage of offering the least social resistance, so to say. I think we should at least consider this.

(Updating the langString does certainly take more than a year, alas!)

I am convinced it will get copied into data outside the planned use case

Yes, indeed, this is a compelling argument, I agree. We are between a rock and a hard place here.

aphillips commented 5 years ago

Call details for Wednesday 1300Z:

Per discussion with Manu on the RDF repo, let’s discuss the path forward for JSON-LD, RDF, and other Linked Data standards. We’ll use the W3C IRC and publish notes.

W3C IRC channel: #i18n
Meeting Time: https://www.timeanddate.com/worldclock/meetingtime.html?iso=20190529&p1=283&p2=136&p3=179

United States toll free: +1 855-552-4463
International: https://chime.aws/dialinnumbers/

andjc commented 5 years ago

My gut reaction is the use of -d- or -x-dir- is a non-starter for RDFa or JSON-LD

A lot of linked data from libraries use ALA-LC romanisation. The correct mechanism for tagging transliterated content using BCP47 is to use the -t- extension.

My understanding of the -t- extension is that it prohibits the use of -x- and other extensions when the -t- extension is used.

My interpretation could be wrong, but that seems to be a restriction specified in the relevant RFC.