w3c / rdf-concepts

https://w3c.github.io/rdf-concepts/

Compare language tags after normalizing to lower case. #55

Closed gkellogg closed 10 months ago

gkellogg commented 1 year ago

Mixed in with #48, which has since been removed from that PR, is text to compare language tags after normalizing to lower case. This is consistent with the suggestion that language tags can be converted to lower case when language-tagged strings are introduced, but was never part of RDF 1.0 nor RDF 1.1. It arguably intrudes on D-entailment where "foo"@en and "foo"@EN could be considered to have the same value but still be separate terms.

The key commit which reverted the wording is c45d9470785a9c681c1ad2cf8cf47906b4d62c75.

afs commented 1 year ago

Reference to RDF Semantics: https://www.w3.org/TR/rdf12-semantics/#D_interpretations

The issue with c45d9470785a9c681c1ad2cf8cf47906b4d62c75 is that it is a required part of term-equals, whereas earlier it was "MAY" followed by "The value space of language tags is always in lower case."

I think this is the only case where two things would be RDF term-equals without them being character-by-character equals (after escape processing).

Antoine-Zimmermann commented 1 year ago

I think the specification is quite clear (quote from RDF 1.1 Concepts):

Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term.

The tags en-US and en-us use different characters, therefore are different, regardless of semantics.

Normalising language tags has the same effect as normalising lexical forms: it changes the graph. The fact that the normalised graph means the same does not imply that they are the same.
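To make the distinction concrete, here is a minimal Python sketch of literal term equality as quoted above, next to a case-insensitive comparison of the tags. The `Literal` type and the function names `term_equal` and `same_language` are hypothetical, introduced only for illustration; they are not taken from any RDF library.

```python
from typing import NamedTuple, Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (illustration only)."""
    lexical_form: str
    datatype_iri: str
    language_tag: Optional[str] = None


def term_equal(a: Literal, b: Literal) -> bool:
    # RDF 1.1 wording: all three components compare equal,
    # character by character.
    return (
        a.lexical_form == b.lexical_form
        and a.datatype_iri == b.datatype_iri
        and a.language_tag == b.language_tag
    )


def same_language(a: Literal, b: Literal) -> bool:
    # Case-insensitive tag comparison (BCP47 tags are ASCII, so
    # str.lower() is assumed to be sufficient here).
    if a.language_tag is None or b.language_tag is None:
        return a.language_tag is b.language_tag
    return a.language_tag.lower() == b.language_tag.lower()


upper = Literal("foo", RDF_LANGSTRING, "en-US")
lower = Literal("foo", RDF_LANGSTRING, "en-us")
```

Under these assumptions `term_equal(upper, lower)` is false while `same_language(upper, lower)` is true, which is the gap this issue is about.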

gkellogg commented 1 year ago

This was discussed at TPAC

discussion

gkellogg: last issue is about BCP47 case issue; do we want to take this after the break?

addison: this one seems easy

gkellogg: the problem is: are two triples differing only by the language tag case two separate triples or a single triple?
… this raises issues for RDF C14N.

<AndyS> pfps - Issue: w3c/rdf-concepts#9 // PR w3c/rdf-concepts#48

gkellogg: no PR on this, only an issue.

addison: BCP47 is clearly made to be case insensitive
… it is perfectly valid to normalize things or to XXX
… I would not recommend people to only use the lowercase form -- many people want to make the tags pretty.

gkellogg: currently, literal term equality is case sensitive
… we could change that to make the comparison of the language tag case-insensitive
… this has consequences when you insert triples in a store

AndyS: what is the approach in XML?
… I believe there is a SPARQL test related to case-sensitivity with language tags.
… Should we push this to the meaning domain or the value domain?

addison: from what you described earlier, this is probably one triple

AndyS: then we need to decide which normalization to use

<AZ> The fact that the lower case and upper case mean the same does not imply that they are the same tag in the syntax

<ora> Thanks Addison!

ktl: I think we have what we need from i18n, thank you very much.
… we'll continue the discussion between us.

addison: I will share some reference material

My takeaway is that RDF was wrong to interpret "foo"@en-US and "foo"@en-us as different literals. If we updated the language to require the internal representation to be in lower case, then serializations would be free to represent them as originally specified, in lower case, or based on the suggested BCP47 formatting, without changing their lexical value.

Antoine-Zimmermann commented 1 year ago

@gkellogg Ok, but if this change is made, that would be a backward-incompatible change. If a SPARQL query counts the number of literals in the data, then in SPARQL 1.1, with "foo"@en and "foo"@EN, the answer would be 2, and in SPARQL 1.2 the answer would be 1. Maybe it is not a big deal, but backward compatibility is taken very seriously in W3C standards.
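The counting difference can be illustrated with a small Python sketch rather than an actual SPARQL query. The function `count_distinct_literals` and the `(lexical form, language tag)` tuple representation are assumptions made for this example only.

```python
def count_distinct_literals(literals, fold_tag_case=False):
    """Count distinct (lexical form, language tag) pairs.

    With fold_tag_case=False, tags are compared character by character.
    With fold_tag_case=True, tags are lower-cased before comparison.
    """
    seen = set()
    for lexical_form, tag in literals:
        seen.add((lexical_form, tag.lower() if fold_tag_case else tag))
    return len(seen)


data = [("foo", "en"), ("foo", "EN")]

exact_count = count_distinct_literals(data)
folded_count = count_distinct_literals(data, fold_tag_case=True)
```

Here `exact_count` is 2 and `folded_count` is 1, matching the two answers described above.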

gkellogg commented 1 year ago

I agree that we need to consider this seriously. But, the tacit advice in RDF concepts that implementations may normalize to lower case gives us cover. AFAIK, many implementations follow this option (my own does).

Needs more discussion.

TallTed commented 1 year ago

@gkellogg -- In your https://github.com/w3c/rdf-concepts/issues/55#issuecomment-1721887864, I think you should wrap the "foo"@en-us and "foo"@en-us" (the latter of which should probably be "foo"@en-US, i.e., capital US and no trailing ") in backticks, so the @en-us user is not pinged about this thread, and so your meaning is clearer...

gkellogg commented 11 months ago

After the discussion on this week's call, I believe we agreed to separate this into two issues:

1) Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Currently this is only permitted for implementations, leaving a gray area that this change would close, at the cost of being a breaking change for some implementations.

2) To the degree that we suggest how graphs are serialized, provide some guidance on the form of language tags. For N-Quads/Triples canonicalization, if this were lower-case (e.g., "foo"@en-us), it would be consistent with the RDF Dataset Canonicalization Candidate Recommendation (see note in introduction). Alternatively, it could be changed to use the recommended format from BCP47/RFC4646 (e.g., "foo"@en-US), but this would immediately conflict with RDF Canonicalization, even though it is based on RDF 1.1 and not RDF 1.2.

Proposed changes

TallTed commented 11 months ago

I'll have some text tweaks... but these proposed changes look like the right direction.

afs commented 11 months ago

  1. Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Now this is only permitted by implementations, leaving a gray area that this would close, at the cost of being a breaking change for some implementations.

I believe we agreed that, up to case differences, parsing two literals that differ only in the language-tag case would result in just a single literal. Consistent formatting is one way of doing this; there are other ways (e.g. dictionaries).
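One possible reading of the dictionary approach, sketched in Python: key a table on the lower-cased tag and hand back whichever spelling was seen first, so the store stays consistent without committing to a particular format. `LanguageTagTable` and the tuple-based triples are hypothetical and only meant to illustrate the idea.

```python
class LanguageTagTable:
    """Maps every case variant of a language tag to one stored spelling."""

    def __init__(self):
        self._by_folded = {}

    def intern(self, tag: str) -> str:
        # The first spelling seen for a given case-folded tag wins;
        # later variants are mapped onto it.
        return self._by_folded.setdefault(tag.lower(), tag)


table = LanguageTagTable()
graph = set()

# Parse two triples that differ only in the case of the language tag.
for lexical_form, tag in [("foo", "en-US"), ("foo", "en-us")]:
    literal = (lexical_form, table.intern(tag))
    graph.add(("ex:s", "ex:p", literal))
```

With this sketch the graph ends up holding a single triple whose literal keeps the `en-US` spelling, because that variant was parsed first.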

We have the opportunity to get away from RDF preferring "lower-case" when BCP-47 says something different.

afs commented 11 months ago

BTW the BCP47 terminology is "format" (although in one place later on, about extensions, it slips in "normalize").

2.1.1. Formatting of Language Tags

afs commented 11 months ago

As for Dataset canonicalization, it only has to add that language tags are lower-cased during canonicalization.

Systems exist which today do not lower-case ("EN-gb" becomes "en-GB") and have unique language tags - they are not wrong.
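For reference, a rough Python sketch of the BCP47 formatting convention (RFC 5646, section 2.1.1): subtags are lower case, except that two-letter subtags are upper case and four-letter subtags are title case when they are not at the start of the tag and do not occur after a singleton. The function name `format_bcp47` is hypothetical, the sketch does no validation of the tag, and treating every subtag after the first singleton as lower case is my interpretation of "after singletons".

```python
def format_bcp47(tag: str) -> str:
    """Apply BCP47-style case formatting to a language tag (no validation)."""
    formatted = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        folded = subtag.lower()
        if index > 0 and not after_singleton and len(folded) == 2:
            formatted.append(folded.upper())        # region, e.g. "GB"
        elif index > 0 and not after_singleton and len(folded) == 4:
            formatted.append(folded.capitalize())   # script, e.g. "Latn"
        else:
            formatted.append(folded)
        if len(folded) == 1:
            # Singleton (extension or private use): the rest stays lower case.
            after_singleton = True
    return "-".join(formatted)
```

Under these assumptions `format_bcp47("EN-gb")` gives `en-GB`, while a canonicalizer that lower-cases would simply use `tag.lower()` on output; both map all case variants of a tag to one spelling.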