w3c / rdf-semantics

https://w3c.github.io/rdf-semantics/

Ill-typed literals for rdf:langString #41

Closed afs closed 8 months ago

afs commented 1 year ago

In section https://www.w3.org/TR/rdf12-semantics/#D_interpretations

The special datatype rdf:langString has no ill-typed literals. Any syntactically legal literal with this type will denote a value in every D-interpretation where D includes rdf:langString. The only ill-typed literals of type xsd:string are those containing a Unicode code point which does not match the Char production in [XML11]. Such strings cannot be written in an XML-compatible surface syntax.

If there are ill-typed literals with datatype xsd:string, then the same string plus a language tag would be ill-typed for rdf:langString, so rdf:langString does have ill-typed literals.

"\uFFFF" is not in the Char production of XML 1.1 (other code points in the "non-character" block are, oddly). So "\uFFFF"@en would be ill-typed. U+FFFF is EF BF BF in UTF-8.
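A minimal sketch of the check described above, assuming the XML 1.1 Char production is the only constraint on xsd:string lexical forms. The helper names `is_xml11_char` and `violates_xml11_char` are hypothetical and come from no RDF library. The last two lines print the UTF-8 bytes of U+FFFE and U+FFFF for reference.

```python
def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.1 Char production:
    [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def violates_xml11_char(lexical: str) -> bool:
    """True if any code point of the string falls outside Char (hypothetical helper)."""
    return any(not is_xml11_char(ord(ch)) for ch in lexical)


print(violates_xml11_char("\uFFFF"))   # True: U+FFFF is outside Char
print(violates_xml11_char("\uFDD0"))   # False: this noncharacter is inside Char
print("\uFFFE".encode("utf-8").hex(" ").upper())  # EF BF BE
print("\uFFFF".encode("utf-8").hex(" ").upper())  # EF BF BF
```

Whether a language-tagged string with such a lexical form counts as ill-typed is exactly the question this issue raises; the sketch only tests membership in Char.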

pfps commented 1 year ago

From https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

If the literal is a language-tagged string, then the literal value is a pair consisting of its lexical form and its language tag, in that order.

So no ill-typed language-tagged strings.

From The Unicode Standard, Version 15.0 (PDF download)

A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form.

The net result is that a Unicode string encodes a sequence of Unicode code points, each of which can be any integer between 0 and 0x10FFFF.

It seems that there should be some cleanup in RDF Concepts here. Instead of Unicode string, I suggest finite sequence of Unicode code points to eliminate a potential source of confusion. I also suggest forbidding the noncharacter code points (surrogates and code points ending in FFFE and FFFF), which are currently allowed.

From https://www.w3.org/TR/xmlschema-2/#string

The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)].

Following the link:

[2]  Char  ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

(The description isn't correct - these are Unicode code points, not characters. Also all ASCII control codes are valid Unicode code points.)

But there still would be strings that can be in rdf:langString but not in xsd:string.

afs commented 1 year ago

RDF Semantics links to XML 1.1

XML 1.0:

[2]  Char  ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

XML 1.1

[2]  Char  ::=  [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

(still no U+0000)
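To make the difference between the two productions concrete, here is a small sketch that encodes each as a list of inclusive ranges and enumerates the code points accepted by one but not the other. The names `XML10_CHAR`, `XML11_CHAR` and `matches` are illustrative only.

```python
# Inclusive code point ranges copied from the two Char productions quoted above.
XML10_CHAR = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
              (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]


def matches(cp: int, ranges) -> bool:
    """True if cp lies in any of the inclusive ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)


all_code_points = range(0x110000)
only_in_11 = [cp for cp in all_code_points
              if matches(cp, XML11_CHAR) and not matches(cp, XML10_CHAR)]
only_in_10 = [cp for cp in all_code_points
              if matches(cp, XML10_CHAR) and not matches(cp, XML11_CHAR)]

print(len(only_in_11))                      # 28
print([f"U+{cp:04X}" for cp in only_in_11])  # C0 controls other than TAB, LF, CR
print(only_in_10)                           # []
print(matches(0x0, XML11_CHAR))             # False
```

Under these ranges, XML 1.1 differs from XML 1.0 only by admitting the remaining C0 control codes U+0001 to U+001F; U+0000, the surrogates, U+FFFE and U+FFFF are excluded by both.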

pfps commented 1 year ago

Update: Unicode 15.0 3.9 Unicode Encoding Forms defines the notion of a Unicode scalar value, which excludes surrogate code points. UTF-8, UTF-16, and UTF-32 strings that would encode any of these surrogate code points are ill-formed. As RDF 1.1 Concepts doesn't mention UTF much this feature of Unicode might not be worth mentioning, but it does motivate the exclusion of surrogate code points from rdf:langString.
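As an illustration of the scalar-value point, the sketch below uses Python, whose `str` type can hold a lone surrogate code point but whose strict UTF-8, UTF-16 and UTF-32 encoders refuse to encode one. The helpers `is_scalar_value` and `encodable` are hypothetical names introduced here.

```python
def is_scalar_value(cp: int) -> bool:
    """Unicode scalar value: any code point except the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encodable(text: str, encoding: str) -> bool:
    """True if the string can be encoded with Python's strict encoder."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


lone_surrogate = "\ud800"   # a sequence of code points, but not of scalar values
noncharacter = "\uffff"     # a noncharacter, yet still a scalar value

for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, encodable(lone_surrogate, enc), encodable(noncharacter, enc))
```

So a definition in terms of "sequences of code points" admits values that no well-formed UTF serialization can carry, whereas noncharacters such as U+FFFF encode without complaint.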

pfps commented 1 year ago

Indeed, XML 1.1 is more permissive. I don't know what links I followed to get to the 1.0 definition. Another strange thing about xsd:string is that it excludes FFFE and FFFF but not the other non-characters.
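The asymmetry can be checked by enumerating the 66 Unicode noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) and testing each against the XML 1.1 Char production. This is only a sketch; `is_xml11_char` and `NONCHARACTERS` are illustrative names.

```python
def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# U+FDD0..U+FDEF, then U+nFFFE and U+nFFFF for planes 0 through 16.
NONCHARACTERS = list(range(0xFDD0, 0xFDF0)) + [
    (plane << 16) | low for plane in range(17) for low in (0xFFFE, 0xFFFF)
]

excluded = [cp for cp in NONCHARACTERS if not is_xml11_char(cp)]
allowed = [cp for cp in NONCHARACTERS if is_xml11_char(cp)]

print(len(NONCHARACTERS))                  # 66
print([f"U+{cp:04X}" for cp in excluded])  # ['U+FFFE', 'U+FFFF']
print(len(allowed))                        # 64
```

Only the two plane-0 noncharacters fall outside Char; the other 64, including U+1FFFE and U+10FFFF, match it.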

afs commented 1 year ago

The xs:string definition in XML Schema Part 2 (1.0) points to XML 1.0. XML Schema 1.1 Part 2 dates from 2012.

https://www.w3.org/TR/xpath-datamodel-3/#xml-and-xsd-versions has some related text.

gkellogg commented 1 year ago

It seems that there should be some cleanup in RDF Concepts here. Instead of Unicode string, I suggest finite sequence of Unicode code points to eliminate a potential source of confusion. I also suggest forbidding the noncharacter code points (surrogates and code points ending in FFFE and FFFF), which are currently allowed.

We have w3c/rdf-concepts#51 which relates to this.

From https://www.w3.org/TR/xmlschema-2/#string

The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)].

RDF Concepts references https://www.w3.org/TR/xmlschema11-2/#string, which is based on XML 1.1, as is RDF Semantics, as @afs indicated.

Update: Unicode 15.0 3.9 Unicode Encoding Forms defines the notion of a Unicode scalar value, which excludes surrogate code points. UTF-8, UTF-16, and UTF-32 strings that would encode any of these surrogate code points are ill-formed. As RDF 1.1 Concepts doesn't mention UTF much this feature of Unicode might not be worth mentioning, but it does motivate the exclusion of surrogate code points from rdf:langString.

RDF Concepts only informatively mentions UTF-8, but most serialization forms we define are limited to UTF-8 (RDF/XML seems to allow any Unicode variation XML can use). Describing the space of literals to be Unicode scalar values would be an improvement, but could raise some compatibility concerns. This would apply to any kind of literal, not just xsd:string or rdf:langString.

pfps commented 8 months ago

Should this issue be closed? My vote is "yes".

TallTed commented 8 months ago

Might be worth a connection to the PR that addressed it... which I guess I've just added.