Closed (afs closed this issue 8 months ago)
From https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
If the literal is a language-tagged string, then the literal value is a pair consisting of its lexical form and its language tag, in that order.
So no ill-typed language-tagged strings.
From The Unicode Standard, Version 15.0 (PDF download)
A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form.
The net result is that a Unicode string encodes a sequence of Unicode code points, each of which can be any integer between 0 and 0x10FFFF.
It seems that there should be some cleanup in RDF Concepts here. Instead of "Unicode string", I suggest "finite sequence of Unicode code points" to eliminate a potential source of confusion. I also suggest forbidding the surrogate code points and the noncharacter code points ending in FFFE and FFFF, all of which are currently allowed.
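To make the code unit versus code point distinction concrete, here is a minimal Python sketch. The helper name `code_unit_counts` is made up for illustration, and it relies on Python's `str` being indexed by code point.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and the code units needed in each encoding form."""
    return {
        "code_points": len(text),                           # str is a sequence of code points
        "utf8_units": len(text.encode("utf-8")),            # 8-bit code units
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# 'a' (U+0061), 'e with acute' (U+00E9), and an emoji outside the BMP (U+1F600)
counts = code_unit_counts("a\u00e9\U0001F600")
print(counts)
# {'code_points': 3, 'utf8_units': 7, 'utf16_units': 4, 'utf32_units': 3}
```

The same three code points need a different number of code units in each encoding form, which is why "sequence of code points" is the less ambiguous wording.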
From https://www.w3.org/TR/xmlschema-2/#string
The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)].
Following the link:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
(The description isn't correct - these are Unicode code points, not characters. Also all ASCII control codes are valid Unicode code points.)
But there still would be strings that can be in rdf:langString but not in xsd:string.
RDF Semantics links to XML 1.1
XML 1.0:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML 1.1
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
(still no U+0000)
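The two Char productions above can be compared mechanically. This is a rough sketch that transcribes them as Python predicates; the function names are illustrative and not from any library.

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production, transcribed from the grammar quoted above."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char production, transcribed from the grammar quoted above."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Code points that XML 1.1 admits but XML 1.0 does not.
only_in_xml11 = [cp for cp in range(0x110000)
                 if is_xml11_char(cp) and not is_xml10_char(cp)]
print(len(only_in_xml11))                 # 28
print(is_xml11_char(0x0), is_xml11_char(0xFFFF))  # False False
```

The difference is exactly the C0 control codes U+0001 to U+001F other than tab, line feed, and carriage return. U+0000, the surrogates, U+FFFE, and U+FFFF are excluded by both.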
Update: Unicode 15.0 3.9 Unicode Encoding Forms defines the notion of a Unicode scalar value, which excludes surrogate code points. UTF-8, UTF-16, and UTF-32 strings that would encode any of these surrogate code points are ill-formed. As RDF 1.1 Concepts doesn't mention UTF much this feature of Unicode might not be worth mentioning, but it does motivate the exclusion of surrogate code points from rdf:langString.
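As a quick check of the scalar value point, the sketch below tries to encode a surrogate code point in each encoding form. It assumes CPython 3, whose `str` type tolerates lone surrogates but whose UTF codecs reject them by default. The helper names are illustrative.

```python
def is_scalar_value(cp: int) -> bool:
    """A Unicode scalar value is any code point except the surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encodable(cp: int, encoding: str) -> bool:
    """Report whether the codec accepts a one-code-point string for `cp`."""
    try:
        chr(cp).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("utf-8", "utf-16", "utf-32"):
    # U+D800 is a surrogate; U+E000 is the first code point after the surrogate range.
    print(encoding, encodable(0xD800, encoding), encodable(0xE000, encoding))
```

Under that assumption, each codec should refuse U+D800 and accept U+E000, matching `is_scalar_value`.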
Indeed, XML 1.1 is more permissive. I don't know what links I followed to get to the 1.0 definition. Another strange thing about xsd:string is that it excludes FFFE and FFFF but not the other non-characters.
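The observation about noncharacters can be checked by enumeration. The sketch below assumes the usual Unicode definition of noncharacters (U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes) and tests them against the upper ranges that both XML Char productions share. The names are illustrative.

```python
def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, or a code point whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def in_shared_char_ranges(cp: int) -> bool:
    """Ranges above U+DFFF that XML 1.0 and XML 1.1 Char both allow."""
    return 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
excluded = [cp for cp in noncharacters if not in_shared_char_ranges(cp)]

print(len(noncharacters))                  # 66
print([f"U+{cp:04X}" for cp in excluded])  # ['U+FFFE', 'U+FFFF']
```

So of the 66 noncharacters, the Char production rejects only two and lets the remaining 64 through.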
The xs:string definition goes to XML 1.0. XML Schema 1.1 Part 2 is from 2012.
https://www.w3.org/TR/xpath-datamodel-3/#xml-and-xsd-versions has some related text.
It seems that there should be some cleanup in RDF Concepts here. Instead of "Unicode string", I suggest "finite sequence of Unicode code points" to eliminate a potential source of confusion. I also suggest forbidding the surrogate code points and the noncharacter code points ending in FFFE and FFFF, all of which are currently allowed.
We have w3c/rdf-concepts#51 which relates to this.
From https://www.w3.org/TR/xmlschema-2/#string
The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)].
RDF Concepts references https://www.w3.org/TR/xmlschema11-2/#string, which is based on XML 1.1, as is RDF Semantics, as @afs indicated.
Update: Unicode 15.0 3.9 Unicode Encoding Forms defines the notion of a Unicode scalar value, which excludes surrogate code points. UTF-8, UTF-16, and UTF-32 strings that would encode any of these surrogate code points are ill-formed. As RDF 1.1 Concepts doesn't mention UTF much this feature of Unicode might not be worth mentioning, but it does motivate the exclusion of surrogate code points from rdf:langString.
RDF Concepts only informatively mentions UTF-8, but most serialization forms we define are limited to UTF-8 (RDF/XML seems to allow any Unicode variation XML can use). Describing the space of literals to be Unicode scalar values would be an improvement, but could raise some compatibility concerns. This would apply to any kind of literal, not just xsd:string or rdf:langString.
Should this issue be closed? My vote is "yes".
Might be worth a connection to the PR that addressed it... which I guess I've just added.
In section https://www.w3.org/TR/rdf12-semantics/#D_interpretations
If there are ill-typed literals with datatype xsd:string, then the same string plus a language tag would be ill-typed for rdf:langString, so rdf:langString does have ill-typed literals.
"\uFFFF" is not in the Char production of XML 1.1 (other code points in the "non-character" block are, oddly). So "\uFFFF"@en would be ill-typed. U+FFFF is EF BF BF in UTF-8.
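The UTF-8 bytes for the last two code points of the BMP are easy to confirm. This is a small Python sketch with an illustrative helper name; `bytes.hex` with a separator needs Python 3.8 or later.

```python
def utf8_hex(cp: int) -> str:
    """Return the UTF-8 encoding of one code point as spaced upper-case hex."""
    return chr(cp).encode("utf-8").hex(" ").upper()


for cp in (0xFFFE, 0xFFFF):
    print(f"U+{cp:04X} -> {utf8_hex(cp)}")
# U+FFFE -> EF BF BE
# U+FFFF -> EF BF BF
```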