Closed gkellogg closed 10 months ago
Reference to RDF Semantics: https://www.w3.org/TR/rdf12-semantics/#D_interpretations
The issue with c45d9470785a9c681c1ad2cf8cf47906b4d62c75 is that it makes lower-casing a required part of term-equals, whereas earlier it was a "MAY" followed by "The value space of language tags is always in lower case."
I think this is the only case where two things would be RDF term-equals without them being character-by-character equals (after escape processing).
I think the specification is quite clear (quote from RDF 1.1 Concepts):
Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term.
The tags `en-US` and `en-us` use different characters, and are therefore different, regardless of semantics.
Normalising language tags has the same effect as normalising lexical forms: it changes the graph. The fact that the normalised graph means the same does not imply that they are the same.
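To make the distinction concrete, here is a minimal sketch of the two readings of term equality. The `Literal` tuple and both function names are made up for illustration and are not taken from any RDF library; the first function follows the character-by-character rule quoted from RDF 1.1 Concepts, the second models the case-insensitive comparison under discussion.

```python
from typing import NamedTuple, Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (not a real library type)."""
    lexical_form: str
    datatype: str
    language: Optional[str] = None


def term_equal_rdf11(a: Literal, b: Literal) -> bool:
    # RDF 1.1 reading: every component compares character by character.
    return (a.lexical_form, a.datatype, a.language) == (
        b.lexical_form, b.datatype, b.language)


def term_equal_case_insensitive(a: Literal, b: Literal) -> bool:
    # Alternative reading: language tags compare without regard to case.
    def key(lit: Literal):
        lang = lit.language.lower() if lit.language is not None else None
        return (lit.lexical_form, lit.datatype, lang)
    return key(a) == key(b)


upper = Literal("foo", RDF_LANGSTRING, "en-US")
lower = Literal("foo", RDF_LANGSTRING, "en-us")
print(term_equal_rdf11(upper, lower))             # False
print(term_equal_case_insensitive(upper, lower))  # True
```

Under the first function the two literals are distinct terms even though they denote the same value; under the second they are one term.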
This was discussed at TPAC
gkellogg: last issue is about BCP47 case issue; do we want to take this after the break?
addison: this one seems easy
gkellogg: the problem is: are two triples differing only by the language tag case two separate triples or a single triple?
… this raises issues for RDF C14N.
<AndyS> pfps - Issue: w3c/
gkellogg: no PR on this, only an issue.
addison: BCP47 is clearly made to be case insensitive
… it is perfectly valid to normalize things or to XXX
… I would not recommend people to only use the lowercase form -- many people want to make the tags pretty.
gkellogg: currently, literal term equality is case sensitive
… we could change that to make the comparison of the language tag case-insensitive
… this has consequences when you insert triples in a store
AndyS: what is the approach in XML?
… I believe there is a SPARQL test related to case-sensitivity with language tags.
… Should we push this to the meaning domain or the value domain?
addison: from what you described earlier, this is probably one triple
AndyS: then we need to decide which normalization to use
<AZ> The fact that the lower case and upper case mean the same does not imply that they are the same tag in the syntax
<ora> Thanks Addison!
ktl: I think we have what we need from i18n, thank you very much.
… we'll continue the discussion between us.
addison: I will share some reference material
My takeaway is that RDF was wrong to interpret `"foo"@en-US` and `"foo"@en-us` as different literals. If we updated the language to require the internal representation to be in lower case, then serializations would be free to represent them as originally specified, in lower case, or using the suggested BCP47 formatting, without changing their lexical value.
@gkellogg Ok, but if this change is made, it would be a backward-incompatible change. If a SPARQL query counts the number of literals in the data, then with `"foo"@en` and `"foo"@EN` the answer would be 2 in SPARQL 1.1 and 1 in SPARQL 1.2. Maybe it is not a big deal, but backward compatibility is taken very seriously in W3C standards.
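The counting difference can be modelled without a SPARQL engine. This is only a sketch of the set semantics involved, not an actual query; `count_distinct` and its `normalize_tags` flag are hypothetical names.

```python
def count_distinct(literals, normalize_tags: bool) -> int:
    """Count distinct (lexical form, language tag) pairs.

    With normalize_tags=False, tags compare character by character.
    With normalize_tags=True, tags are lower-cased before comparison.
    """
    seen = set()
    for lexical_form, tag in literals:
        key_tag = tag.lower() if normalize_tags else tag
        seen.add((lexical_form, key_tag))
    return len(seen)


data = [("foo", "en"), ("foo", "EN")]
print(count_distinct(data, normalize_tags=False))  # 2
print(count_distinct(data, normalize_tags=True))   # 1
```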
I agree that we need to consider this seriously. But, the tacit advice in RDF concepts that implementations may normalize to lower case gives us cover. AFAIK, many implementations follow this option (my own does).
Needs more discussion.
@gkellogg -- In your comment (https://github.com/w3c/rdf-concepts/issues/55#issuecomment-1721887864), I think you should wrap the `"foo"@en-us` and `"foo"@en-us"` (the latter of which should probably be `"foo"@en-US`, i.e., capital `US` and no trailing `"`) in backticks, so the `@en-us` user is not pinged about this thread, and so your meaning is clearer...
After the discussion on this week's call, I believe we agreed to separate this into two issues:
1) Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Now this is only permitted by implementations, leaving a gray area that this would close, at the cost of being a breaking change for some implementations.
2) To the degree that we suggest how graphs are serialized, provide some guidance on the form of language tags. For N-Quads/Triples canonicalization, if this were lower case (e.g., `"foo"@en-us`), it would be consistent with the RDF Dataset Canonicalization Candidate Recommendation (see note in introduction). Alternatively, it could be changed to use the recommended format from BCP47/RFC4646 (e.g., `"foo"@en-US`), but this would immediately conflict with RDF Canonicalization, even though it is based on RDF 1.1 and not RDF 1.2.
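The two candidate output forms in (2) can be sketched as follows. Both function names are hypothetical. `format_bcp47` paraphrases the casing convention as I read it in RFC 5646 section 2.1.1 (all lower case, except that two-letter subtags are upper-cased and four-letter subtags title-cased when they are neither first nor after a singleton); it does no well-formedness checking, so treat it as an illustration rather than a conforming implementation.

```python
def format_lower(tag: str) -> str:
    """Lower-case form, as RDF Dataset Canonicalization would emit."""
    return tag.lower()


def format_bcp47(tag: str) -> str:
    """Casing convention recommended (not required) by BCP47."""
    out = []
    after_singleton = False
    for i, sub in enumerate(tag.split("-")):
        if i > 0 and not after_singleton and len(sub) == 2:
            out.append(sub.upper())        # region-like subtag: "gb" -> "GB"
        elif i > 0 and not after_singleton and len(sub) == 4:
            out.append(sub.capitalize())   # script-like subtag: "latn" -> "Latn"
        else:
            out.append(sub.lower())
        if len(sub) == 1:
            # Everything after a singleton (extension / private use) stays lower case.
            after_singleton = True
    return "-".join(out)


print(format_lower("en-US"))           # en-us
print(format_bcp47("EN-gb"))           # en-GB
print(format_bcp47("az-latn-x-latn"))  # az-Latn-x-latn
```

Either function gives a consistent form; the choice only affects what a serializer writes, not which tags are treated as the same.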
if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, a non-empty language tag as defined by [BCP47]. The language tag MUST be well-formed according to section 2.2.9 of [BCP47] and MUST be case normalized consistently (e.g., to lower case).
A literal is a language-tagged string if the third element is present and the fourth element is not present. Lexical representations of language tags MUST be case normalized and MAY be converted to lower case. The value of language tags is always treated as being in lower case.
Language tags were previously allowed to be normalized to lower case, which made it ambiguous whether two literals with language tags differing only by case represented the same literal or distinct literals. RDF 1.2 requires that language tags be case normalized, but does not specify exactly how this is to be performed. Implementations can either follow the advice to normalize to lower case, use the recommended BCP47 format, or something else, as long as it is performed consistently.
I'll have some text tweaks... but these proposed changes look like the right direction.
- Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Now this is only permitted by implementations, leaving a gray area that this would close, at the cost of being a breaking change for some implementations.
I believe we agreed that, up to case sensitivity, parsing two literals that differ only in the language-tag case would result in just a single literal. Consistent formatting is one way of doing that; there are other ways (e.g. dictionaries).
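As a rough illustration of the dictionary approach, a store could intern language tags under a case-insensitive key without reformatting them. `LanguageTagTable` is a hypothetical name, and keeping the first-seen spelling is an assumption of this sketch, not something agreed on the call.

```python
class LanguageTagTable:
    """Hypothetical per-store table mapping tags to one stored spelling."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def intern(self, tag: str) -> str:
        # The lower-cased tag is only a lookup key; the stored value keeps
        # whichever spelling was seen first (an assumption of this sketch).
        return self._by_key.setdefault(tag.lower(), tag)


table = LanguageTagTable()
print(table.intern("en-GB"))  # en-GB
print(table.intern("EN-gb"))  # en-GB
print(table.intern("en-gb"))  # en-GB
```

All three inputs resolve to the same stored tag, so literals differing only in tag case collapse to one term without the store committing to a particular output format.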
We have the opportunity to get away from RDF preferring "lower-case" when BCP-47 says something different.
BTW the BCP47 terminology is "format" (although in one place later on, about extensions, it slips in "normalize").
As for Dataset canonicalization, it only has to add that language tags are lower-cased during canonicalization.
Systems exist which today do not lower-case ("EN-gb" becomes "en-GB") and have unique language tags - they are not wrong.
Mixed in with #48, which has since been removed from that PR, is text to compare language tags after normalizing to lower case. This is consistent with the suggestion that language tags can be converted to lower case when language-tagged strings are introduced, but was never part of RDF 1.0 nor RDF 1.1. It arguably intrudes on D-entailment, where `"foo"@en` and `"foo"@EN` could be considered to have the same value but still be separate terms. The key commit which reverted the wording is c45d9470785a9c681c1ad2cf8cf47906b4d62c75.