w3c / rdf-dir-literal

Proposal to add base direction to RDF Literals

Literal equality, and examples in appendix #1

Open chaals opened 5 years ago

chaals commented 5 years ago

As I read the proposal for extending the existing datatype,

"1"^^xs:integer != "01"^^xs:integer as noted already, but

"פעילות הבינאום, W3C"@he^auto == "פעילות הבינאום, W3C"@he

and

"פעילות הבינאום, W3C"@he^auto != "פעילות הבינאום, W3C"@he^rtl

and

"hello world"@en^rtl != "hello world"@en

Is that correct? The last two seem unintuitive to me, and thus a likely source of mistakes. The alternative would be to define when auto means something. That also has issues (although IMHO fewer than trying to redefine language tags with the -d-dir suffix)...

chaals commented 5 years ago

Thinking about what is intuitive, "מאַזל-טאָוו W3C, what do we do now?"@en^ltr leapt to mind.

One could argue it is only partially tagged, but there are places where it would seem a reasonable English expression. ("Mazl-tov W3C, what do we do now?" seems likely to be understood by some large number of people as idiomatic English.)

pchampin commented 5 years ago

"hello world"@en^rtl != "hello world"@en (...) seems unintuitive to me

Well, I can see your point, but it could be argued that we already have this kind of unintuitive case in current RDF. Consider that

"hello world"@en-US != "hello world"@en

pchampin commented 5 years ago

Regardless of my response above, your comment made me realize something.

First, let me clarify a point:

"1"^^xs:integer != "01"^^xs:integer as noted already

That is only partially true. In the lexical space (abstract syntax), those two literals are indeed distinct. In the value space (semantics), they are equal.
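To make the two levels concrete, here is a minimal Python sketch. The `Literal` class and the helper names are hypothetical, made up for illustration, and only xsd:integer is given a value mapping; it is not how any particular RDF library models literals.

```python
from dataclasses import dataclass

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal: lexical form plus datatype IRI."""
    lexical: str
    datatype: str


def term_equal(a: Literal, b: Literal) -> bool:
    # Abstract-syntax equality: lexical forms and datatypes must match exactly.
    return a == b


def value_of(lit: Literal):
    # Lexical-to-value mapping, sketched for xsd:integer only (an assumption).
    if lit.datatype == XSD_INTEGER:
        return int(lit.lexical)
    # Any other datatype is left uninterpreted in this sketch.
    return (lit.lexical, lit.datatype)


def value_equal(a: Literal, b: Literal) -> bool:
    # Semantic equality: compare the values the literals denote.
    return value_of(a) == value_of(b)


one = Literal("1", XSD_INTEGER)
padded = Literal("01", XSD_INTEGER)
print(term_equal(one, padded))   # distinct terms
print(value_equal(one, padded))  # same value
```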

Now, in the current proposal, we would have

"hello world"@en^auto != "hello world"@en^ltr

I can live with that at the abstract syntax level, but at the semantics level, that seems wrong.

I suggest that the auto direction should not be part of the value space. Instead, the lexical-to-value mapping (either in the updated langString or in the new LocalizableString) should "resolve" auto to ltr or rtl, according to the method described in the HTML spec.
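A rough Python sketch of what such a mapping could look like. It assumes a simplified reading of the first-strong-character heuristic (scan for the first character whose bidi class is L, R or AL, and fall back to ltr if none is found); the function names and the tuple used as a "value" are hypothetical, not part of the proposal.

```python
import unicodedata


def resolve_auto(text: str) -> str:
    """Resolve 'auto' from the first strongly directional character (simplified)."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    # No strong character found: falling back to ltr is an assumption of this sketch.
    return "ltr"


def value_of(lexical: str, lang: str, direction: str):
    # Hypothetical lexical-to-value mapping: 'auto' never reaches the value space.
    if direction == "auto":
        direction = resolve_auto(lexical)
    return (lexical, lang.lower(), direction)


# With this mapping the two literals from the comment denote the same value:
print(value_of("hello world", "en", "auto") == value_of("hello world", "en", "ltr"))
```

Under this sketch the abstract syntax still distinguishes ^auto from ^ltr; only the value-level comparison collapses them.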

ericprud commented 5 years ago

One way to look at it is that there is a stack of semantic sophistication (aka developer burden).

RDF works only with lexical values and doesn't require any particular datatype schema, though it sort of plays favorites by "recommending" most of XSD for use in stuff higher in the stack. For example, the RDF storage engine, and typically the parser supplying it, don't have to know anything about datatypes.

Many RDF specs basically import XSD, which maps lexical forms to a value space. ShEx (and maybe the elusive fully-compliant OWL engine?) understands numeric XSD datatypes well enough to test for {Min,Max}{In,Ex}clusive on a subset of XSD datatypes. SPARQL (via XPath) and SHACL use type promotion on numeric XSD datatypes to compare homogeneous (1 < 2) and heterogeneous (1.0 < 2) values, as well as booleans and dateTimes. Even in SPARQL, "1" and "01" are distinct until you explicitly invoke comparison.

If XSD had stated that the only valid form was the canonical form, RDF would probably have followed this lead, which would have made "1"^^xsd:integer the only form of 1 you would find in RDF.

Likewise RDF knows nothing about the semantics of BCP-47, asserting only that:

The language tag must be well-formed according to section 2.2.9 of [BCP47].

The only incursion it makes into BCP47 is to know that language tags are compared case-insensitively:

The value space of language tags is always in lower case.
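As a small illustration of that one rule, here is a Python sketch in which language tags are compared after lowercasing and nothing else from BCP 47 is interpreted. The function name is hypothetical.

```python
def lang_tags_equal(a: str, b: str) -> bool:
    # Case-insensitive comparison only; no knowledge of subtags or regions.
    return a.lower() == b.lower()


print(lang_tags_equal("en-US", "en-us"))  # same tag, different case
print(lang_tags_equal("en-US", "en"))     # different tags, so different literals
```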

chaals commented 5 years ago

At an intuitive level, it makes sense that processors look further into BCP-47. But at a standards level it's less clear.

There are some reasonable arguments to insist en != en-us != en-au semantically, as a default (since in reality they only sort of match). So should we table the question?

I would guess that ultimately we end up in a world where processors may offer enhanced but approximate matching, allowing for the selection of regional variants to match as a user preference. (/en.*/ is not a complicated thing to do in many cases).
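One possible shape for that kind of approximate matching, sketched in Python along the lines of RFC 4647 basic filtering (the scheme SPARQL's langMatches is defined in terms of). The function name is hypothetical. Note that it matches on whole subtags, which is slightly stricter than the /en.*/ regex: "eng" would not match the range "en".

```python
def matches_range(tag: str, language_range: str) -> bool:
    """Basic-filtering style match: equal, or the range followed by '-' is a prefix."""
    tag = tag.lower()
    language_range = language_range.lower()
    if language_range == "*":
        return True
    return tag == language_range or tag.startswith(language_range + "-")


print(matches_range("en-US", "en"))  # regional variant matches the broader range
print(matches_range("en", "en-US"))  # the broader tag does not match a narrower range
```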

Which leaves me thinking that we should probably expand the examples from the numeric one to explicitly cover language and direction information.

I suggest opening a separate issue for @pchampin's suggestion that dir = auto be resolved in mapping from a literal to a value...

iherman commented 5 years ago

I suggest that the auto direction should not be part of the value space. Instead, the lexical-to-value mapping (either in the updated langString or in the new LocalizableString) should "resolve" auto to ltr or rtl, according to the method described in the HTML spec.

I think that sounds like a great solution.

iherman commented 5 years ago

There are some reasonable arguments to insist en != en-us != en-au semantically, as a default (since in reality they only sort of match). So should we table the question?

'table' as in en-us or 'table' as in en-gb? Speaking about linguistic differences...

r12a commented 5 years ago

bikeshedding alert! fwiw (wishing i could attach comments to other comments here)

Thinking about what is intuitive, "מאַזל-טאָוו W3C, what do we do now?"@en^ltr leapt to mind.

The phrase פעילות הבינאום, W3C means "Internationalisation Activity, W3C". It's taken from my business card. Notice how 'W3C' appears on the wrong side of the phrase if you get the base direction wrong. (It should appear to the left.)

The phrase you suggest is suboptimal for me because it doesn't show up wrongly in a LTR context.

pchampin commented 5 years ago

@chaals FTR, I was not suggesting that en-US and en should be equal! I was just pointing out that this subtle inequality could also be considered unintuitive to the untrained eye.